Massoud Mazar

Azure HDInsight performance benchmarking

11. September 2018 10:08 / Administrator

A couple of months ago I ran a brief benchmark of Spark execution time on Azure HDInsight, and the results were very disappointing. Recently I did a much deeper investigation, with benchmarks and a cost analysis, to see whether it makes ANY sense to use HDInsight. The results did not surprise me at all.

tl;dr

If you seriously use Hadoop/Spark/Hive/... for 24/7 operation of real workloads, and not just for a few hours per day, a self-installed cluster (even in the same Azure cloud) performs at least 5 times faster at less than half the cost.

Use Case

We have 4 different HDInsight clusters that deal with about 70 TB of data stored in Azure blobs. According to the Microsoft Azure pricing calculator, the total cost of operating this setup is more than $32,000 per month. We rely on Hive to query this data. Query performance is disappointing, to say the least, and it got us thinking about alternatives.

Comparison

I used different types of queries to compare performance and ended up with a query that is optimized by filtering on the partition ID of one of the Hive tables. The cost calculation is based on 100 TB of data.

	HDInsight + Blob Storage	Self Install + Managed Disks
Versions	HDFS+Yarn 2.7.3	HDFS+Yarn 3.1.0
Number of clusters	4	1
Monthly cost of VMs	$27,000	$9,000
Monthly storage cost for 100 TB	$5,000	$4,000
Search for a few rows in 1.5 million rows	50 seconds	9 seconds
Search for a few rows in 50 million rows	344 seconds	13 seconds
Total monthly cost	$32,000	$13,000

5x-25x performance improvement for less than half the price!
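As a quick sanity check, the totals and ratios behind that claim can be recomputed from the table. This is only a sketch in Python: the dictionary keys and the `total_cost` helper are names I made up for illustration, and the numbers are copied from the table as-is.

```python
# Figures copied from the comparison table above (monthly USD, query time in seconds).
hdinsight = {"vms": 27_000, "storage": 5_000, "q_small": 50, "q_large": 344}
self_install = {"vms": 9_000, "storage": 4_000, "q_small": 9, "q_large": 13}


def total_cost(setup):
    """Monthly total: VM cost plus storage cost for 100 TB."""
    return setup["vms"] + setup["storage"]


# Fraction of the HDInsight bill that the self-installed cluster costs.
cost_ratio = total_cost(self_install) / total_cost(hdinsight)

# How many times faster each query ran on the self-installed cluster.
speedup_small = hdinsight["q_small"] / self_install["q_small"]
speedup_large = hdinsight["q_large"] / self_install["q_large"]

print(f"total: ${total_cost(hdinsight):,} vs ${total_cost(self_install):,}")
print(f"cost ratio: {cost_ratio:.2f}")
print(f"speedup: {speedup_small:.1f}x (1.5M rows), {speedup_large:.1f}x (50M rows)")
```

The self-installed cluster comes out at roughly 41% of the HDInsight cost, with speedups of about 5.6x and 26.5x on the two queries.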

Another comparison showed that even the same version of HDFS+Yarn used by Azure HDInsight performs much better when it uses managed disks instead of Azure Blob Storage:

	Self Install + Blob Storage	Self Install + Managed Disks
Versions	HDFS+Yarn 2.7.3	HDFS+Yarn 2.7.3
Search for a few rows in 1.5 million rows and return a few fields	85 seconds	12 seconds
Search for a few rows in 50 million rows and return all fields	More than an hour	138 seconds

To reiterate, the above experiment showed that Azure Blob Storage is a terrible choice when querying a large amount of data and returning many columns. Amazingly, all Microsoft performance comparisons show storing HDFS files on blobs to be as performant as local managed disks.

Then why use HDInsight?

HDInsight only makes sense when you have a lot (petabytes) of data and you do not use this data much, meaning you don't need to keep a large cluster running 24/7. In other words, it fits a research scenario, not an operational, mission-critical environment.
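To put a rough number on "not 24/7", here is a back-of-the-envelope break-even sketch using the monthly figures from the comparison. It rests on assumptions that I did not measure: HDInsight VM cost is taken to scale linearly with the hours the clusters exist (they are deleted when idle), blob storage is billed all month, and the self-installed cluster runs around the clock. It also ignores the performance gap entirely. The constant and function names are just illustrative.

```python
# Monthly figures from the comparison (USD, for 100 TB of data).
HDI_VMS_FULL_TIME = 27_000   # HDInsight VMs if kept running 24/7
HDI_STORAGE = 5_000          # blob storage, assumed billed whether or not a cluster exists
SELF_INSTALL_TOTAL = 13_000  # self-installed cluster running 24/7, disks included


def hdinsight_monthly_cost(hours_per_day: float) -> float:
    """Assumed model: VM cost scales linearly with uptime, storage cost is fixed."""
    return HDI_VMS_FULL_TIME * hours_per_day / 24 + HDI_STORAGE


# Daily uptime at which both options cost the same under this model.
break_even_hours = (SELF_INSTALL_TOTAL - HDI_STORAGE) / HDI_VMS_FULL_TIME * 24

for hours in (2, 4, 8, 24):
    print(f"{hours:>2} h/day: ${hdinsight_monthly_cost(hours):,.0f}")
print(f"break-even: {break_even_hours:.1f} h/day")
```

Under these assumptions the break-even lands at about 7 hours of cluster uptime per day. Below that HDInsight is cheaper on paper, and above it the self-installed cluster wins on cost as well as speed.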
