Massoud Mazar

Sharing The Knowledge


Azure HDInsight performance benchmarking

A couple of months ago I ran a brief benchmark of Spark execution time on Azure HDInsight, and the results were very disappointing. Recently I did a much deeper investigation, with benchmarking and a cost analysis of Azure HDInsight, to see whether it makes ANY sense to use it. The results do not surprise me at all.

tl;dr

If you seriously use Hadoop/Spark/Hive/... for 24/7 operation of real workloads, and not just for short periods of a few hours per day, a self-installed cluster (even in the same Azure cloud) performs at least 5 times faster at half the cost.

Use Case

We have 4 different HDInsight clusters which deal with about 70 TB of data stored in Azure blobs. Our total cost of operating this specific setup, according to the Microsoft Azure pricing calculator, is about $30,000 per month. We rely on Hive to query our data. Query performance is so bad that it got us thinking about alternatives.

Comparison

I tried different types of queries to compare performance and settled on a query that is optimized to use the partition ID defined on the Hive table. The cost calculation is based on 100 TB of data.

                                    HDInsight + Blob Storage    Self Install + Managed Disks
Number of clusters                  4                           1
Monthly cost of datanodes           $24,000                     $9,000
Monthly cost for 100 TB of storage  $5,000                      $4,000
Query 1.5 million rows              50 seconds                  9 seconds
Query 50 million rows               344 seconds                 13 seconds
Total monthly cost                  $29,000                     $13,000

 

5x-25x performance improvement for less than half the price!
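
For anyone who wants to check the arithmetic behind that headline, here is a minimal sketch that recomputes the ratios from the table figures. The dictionary keys and the speedup helper are just names I made up for illustration; the numbers are the ones in the table.

```python
# Figures copied from the comparison table above.
hdinsight = {"query_1_5m_s": 50, "query_50m_s": 344, "monthly_cost": 29_000}
self_install = {"query_1_5m_s": 9, "query_50m_s": 13, "monthly_cost": 13_000}


def speedup(slow_seconds, fast_seconds):
    """Return how many times faster the second run is than the first."""
    return slow_seconds / fast_seconds


small = speedup(hdinsight["query_1_5m_s"], self_install["query_1_5m_s"])
large = speedup(hdinsight["query_50m_s"], self_install["query_50m_s"])
cost_ratio = self_install["monthly_cost"] / hdinsight["monthly_cost"]

print(f"1.5 million rows: {small:.1f}x faster")
print(f"50 million rows: {large:.1f}x faster")
print(f"Self install cost: {cost_ratio:.0%} of HDInsight")
```

The small query comes out at roughly 5.6x and the large one at roughly 26.5x, with the self install costing about 45% of the HDInsight total.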

Then why use HDInsight?

HDInsight only makes sense when you have a lot of data and do not need to keep a large cluster running 24/7.
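
To put a rough number on "not 24/7", here is a back-of-the-envelope sketch using the table's figures. It rests on assumptions that are mine and not measured: HDInsight datanode cost scales linearly with the hours per day the clusters exist, blob storage is billed all month regardless, and the self install runs around the clock at the table's $13,000 total. The function names are hypothetical.

```python
HOURS_PER_DAY = 24
SELF_INSTALL_MONTHLY = 13_000  # table total, assumed to run 24/7


def hdinsight_monthly_cost(hours_per_day, compute_full=24_000, storage=5_000):
    """Assumed model: datanodes billed pro rata by hours up, storage billed always."""
    return storage + compute_full * hours_per_day / HOURS_PER_DAY


def break_even_hours(compute_full=24_000, storage=5_000,
                     target=SELF_INSTALL_MONTHLY):
    """Hours per day at which HDInsight matches the always-on self install."""
    return (target - storage) * HOURS_PER_DAY / compute_full


hours = break_even_hours()
print(f"Break-even at about {hours:.0f} hours per day")
print(f"HDInsight at that usage: ${hdinsight_monthly_cost(hours):,.0f} per month")
```

Under those assumptions the break-even lands at about 8 hours per day. Below that, HDInsight would be cheaper on cost alone, although the query times in the table would still apply.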
