11. September 2018 10:08
I ran a brief benchmark of Spark execution time on Azure HDInsight a couple of months ago, and the results were very disappointing. Recently I did a much deeper investigation, with benchmarks and a cost analysis, to see whether it makes ANY sense to use HDInsight, and the results did not surprise me at all.
If you seriously use Hadoop/Spark/Hive/... for 24/7 operation of real workloads, and not just for a few hours per day, a self-installed cluster (even in the same Azure cloud) performs at least 5 times faster at half the cost.
We have 4 different HDInsight clusters, which deal with about 70 TB of data stored in Azure blobs. According to the Microsoft Azure pricing calculator, the total cost of operating this setup is about $30,000 per month. We rely on Hive to query our data. Query performance is so bad that it got us thinking about alternatives.
I used different types of queries to compare performance and ended up with a query that is optimized to filter on the partition ID used in the Hive table. The cost calculation is based on 100 TB of data.
|| || HDInsight + Blob Storage || Self Install + Managed Disks ||
| Number of Clusters | | |
| Monthly cost of Datanodes | | |
| Cost for 100 TB | | |
| Query 1.5 million rows | | |
| Query 50 million rows | | |
| Total monthly cost | | |
5x-25x performance improvement for less than half the price!
Then why use HDInsight?
HDInsight only makes sense when you have a lot of data, and you do not need to keep a large cluster running 24/7.
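To make the 24/7 argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not a pricing model. It assumes that an on-demand cluster is billed linearly by the hour, it ignores storage and data transfer costs, and it takes the $30,000 per month figure and the "half the cost" ratio from above as inputs. The function names are made up for illustration.

```python
HOURS_PER_DAY = 24


def transient_monthly_cost(full_time_monthly_cost, hours_per_day):
    """Monthly cost of a cluster that only runs part of each day.

    Assumption: compute is billed linearly per hour, so running
    hours_per_day out of 24 costs that fraction of the 24/7 price.
    """
    if not 0 <= hours_per_day <= HOURS_PER_DAY:
        raise ValueError("hours_per_day must be between 0 and 24")
    return full_time_monthly_cost * hours_per_day / HOURS_PER_DAY


def break_even_hours_per_day(full_time_monthly_cost, always_on_alternative_cost):
    """Daily runtime at which the transient cluster costs as much as
    an always-on alternative (under the same linear-billing assumption)."""
    return HOURS_PER_DAY * always_on_alternative_cost / full_time_monthly_cost


# Inputs taken from the post: about $30,000/month for HDInsight running 24/7,
# and a self-installed cluster at roughly half that price.
hdinsight_24x7 = 30_000.0
self_install_24x7 = hdinsight_24x7 / 2

print(break_even_hours_per_day(hdinsight_24x7, self_install_24x7))
print(transient_monthly_cost(hdinsight_24x7, 6))
```

Under these assumptions, an on-demand HDInsight cluster only beats the always-on self-installed cluster on price if it runs for less than about half of each day, and that is before taking the performance difference into account.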