From my recent few posts you can see I'm experimenting with a small Spark cluster I built on my server at home. Although this machine was built with server grade parts, it was built 4 years ago, so not top of the line by any standard. One Xeon processor running at 3.1 GHz with 4 cores, 32 GB of DDR3 RAM and consumer (not server grade) SSD. I'm running 3 VMs on this machine, each one using only one core. Naturally I did not expect Spark processing on my cluster to be performant, but to my surprise, performance of these one core machines beats an Azure's HDInsight cluster with D12 v2 machines which have 4 cores each.
The test I ran was to run CrossValidator on LogisticRegression of a dataset of only 5242, with only 2 features:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
.addGrid(lr.regParam, [0.01, 0.5, 2.0])
.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
#.addGrid(lr.maxIter, [5, 10])
.build())
start = time.time()
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, \
evaluator=evaluator, numFolds=5, parallelism=1)
# Run cross validations
cvModel = cv.fit(train)
end = time.time()
print("Elapsed {} sec".format(end - start))
The only timing I'm comparing is the shown above between start and end timers. After multiple runs to get cached data in place, I can see consistent values for execution time. I'm setting the parallelism=1 so I can measure performance of only one node. Azure nodes are 1.5-2x slower. Here is summary of the comparison:
|
Desktop Server (Spark 2.3) |
Azure HDInsight (Spark 2.0) |
Azure HDInsight (Spark 2.3) |
Number of Cores |
1 |
4 |
4 |
Average execution time |
29 s |
57 s |
41 s |
Cost |
$500 (one time) |
$500 (per month) |
$500 (per month) |
I cannot explain why this is the case and why Azure Spark cluster performs poorly. We have a dozen clusters with more than 100 nodes, at list price of almost $500 per month. This adds up to $50,000 per month, for a bunch of VMs which perform 1.5-2X slower than my home made server from 4 years ago.
Conclusion
If you are considering using cloud based clusters of Spark nodes (or whatever else), maybe you should do some cost calculation and see if you are getting what you are paying for.
0b015931-91f7-41cf-86c9-047abe7991d0|1|5.0|96d5b379-7e1d-4dac-a6ba-1e50db561b04