Massoud Mazar

Azure HDInsight performance benchmarking

A couple of months ago I ran a brief benchmark of Spark execution time on Azure HDInsight, and the results were very disappointing. Recently I did a much deeper investigation, including benchmarking and a cost analysis of Azure HDInsight, to see whether it makes ANY sense to use it. The results did not surprise me at all.

tl;dr

If you seriously use Hadoop/Spark/Hive/... for 24/7 operation of real workloads, and not just for a few hours per day, a self-installed cluster (even in the same Azure cloud) performs at least 5 times faster at less than half the cost.

Use Case

We have 4 different HDInsight clusters that deal with about 70 TB of data stored in Azure blobs. According to the Microsoft Azure pricing calculator, operating this specific setup costs more than $32,000 per month. We rely on Hive to query this data. Query performance is disappointing, to say the least, and it got us thinking about alternatives.

Comparison

I compared performance using several types of queries and ended up with a query that is optimized to use the partition ID defined on one of the Hive tables. The cost calculation is based on 100 TB of data.

                                   | HDInsight + Blob Storage | Self Install + Managed Disks
Versions                           | HDFS+Yarn 2.7.3          | HDFS+Yarn 3.1.0
Number of Clusters                 | 4                        | 1
Monthly cost of VMs                | $27,000                  | $9,000
Cost for 100 TB                    | $5,000                   | $4,000
Search few rows in 1.5 million rows | 50 seconds               | 9 seconds
Search few rows in 50 million rows  | 344 seconds              | 13 seconds
Total monthly cost                 | $32,000                  | $13,000

5x-25x performance improvement for less than half the price!
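
As a quick sanity check on that headline, here is a minimal sketch that recomputes the speedups and the cost ratio from the table figures. The dictionary keys and function names are my own illustrative choices, not part of any Azure tooling.

```python
# Figures copied from the comparison table (monthly USD, query times in seconds).
# The key names below are illustrative, chosen only for this sketch.
hdinsight = {"vm_cost": 27_000, "storage_cost": 5_000,
             "small_query_s": 50, "large_query_s": 344}
self_install = {"vm_cost": 9_000, "storage_cost": 4_000,
                "small_query_s": 9, "large_query_s": 13}


def total_monthly_cost(setup):
    """VM cost plus storage cost for one setup."""
    return setup["vm_cost"] + setup["storage_cost"]


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


small_speedup = speedup(hdinsight["small_query_s"], self_install["small_query_s"])
large_speedup = speedup(hdinsight["large_query_s"], self_install["large_query_s"])
cost_ratio = total_monthly_cost(self_install) / total_monthly_cost(hdinsight)

print(f"1.5M-row query speedup: {small_speedup:.1f}x")
print(f"50M-row query speedup:  {large_speedup:.1f}x")
print(f"Self install cost as a fraction of HDInsight: {cost_ratio:.0%}")
```

With the table's numbers this works out to roughly 5.6x and 26.5x, at about 41% of the monthly cost.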

Another comparison showed that even the same version of HDFS+Yarn used by Azure HDInsight performs much better when it uses managed disks instead of Azure Blob Storage:

                                                          | Self Install + Blob Storage | Self Install + Managed Disks
Versions                                                  | HDFS+Yarn 2.7.3             | HDFS+Yarn 2.7.3
Search few rows in 1.5 million rows and return few fields | 85 seconds                  | 12 seconds
Search few rows in 50 million rows and return all fields  | More than an Hour           | 138 seconds

To reiterate, the experiment above showed that Azure Blob Storage is a terrible choice when querying a large amount of data and returning many columns. Amazingly, all Microsoft performance comparisons show that storing HDFS files on blobs is as performant as local managed disks.

Then why use HDInsight?

HDInsight only makes sense when you have a lot (petabytes) of data and you do not use this data much, meaning you don't need to keep a large cluster running 24/7. In other words, a research scenario and not an operational, mission-critical environment.
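
To put a rough number on "not much", here is a back-of-the-envelope sketch of the break-even point, using the monthly figures from the first comparison table. It rests on simplifying assumptions that are mine and not measured: the HDInsight VM cost scales linearly with the hours the clusters are up, blob storage is billed all month regardless, and the self-installed cluster runs 24/7. The function names are hypothetical.

```python
# Monthly figures (USD) from the first comparison table.
HDINSIGHT_VM_MONTHLY_24X7 = 27_000   # VM cost if the clusters run 24/7
HDINSIGHT_STORAGE_MONTHLY = 5_000    # blob storage, assumed billed all month
SELF_INSTALL_TOTAL_MONTHLY = 13_000  # self install running 24/7, VMs + disks
HOURS_PER_DAY = 24


def hdinsight_monthly_cost(hours_per_day):
    """Estimated HDInsight cost when clusters run only part of each day.

    Assumption: VM cost is prorated linearly by uptime.
    """
    uptime_fraction = hours_per_day / HOURS_PER_DAY
    return HDINSIGHT_STORAGE_MONTHLY + HDINSIGHT_VM_MONTHLY_24X7 * uptime_fraction


def break_even_hours_per_day():
    """Daily uptime at which HDInsight costs the same as the 24/7 self install."""
    vm_budget = SELF_INSTALL_TOTAL_MONTHLY - HDINSIGHT_STORAGE_MONTHLY
    return HOURS_PER_DAY * vm_budget / HDINSIGHT_VM_MONTHLY_24X7


print(f"Break-even uptime: {break_even_hours_per_day():.1f} hours per day")
for hours in (2, 8, 24):
    print(f"{hours:>2} h/day -> ${hdinsight_monthly_cost(hours):,.0f} per month")
```

Under those assumptions, HDInsight is only cheaper when the clusters are up for less than about 7 hours per day, and that is before accounting for the slower queries.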
