Massoud Mazar

Sharing The Knowledge

Custom Hadoop RecordReader to read JSON with no line breaks

This past week I had to load a few terabytes of data into our Spark cluster. The data is stored as a single JSON array, with no line breaks separating the individual JSON objects. Spark handles JSON easily, but it expects one object per line, so I had to write a custom Hadoop RecordReader to work around this. More...
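To make the problem concrete, here is a minimal Python sketch of the underlying idea: pulling individual objects out of one long JSON array without loading the whole thing, then writing them one per line. This is not the Hadoop RecordReader from the post (which plugs into Hadoop's input format machinery); the function names `iter_json_array` and `to_json_lines` are made up for illustration, and the sketch assumes every array element is a JSON object or array.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_json_array(stream: TextIO, chunk_size: int = 65536) -> Iterator[Any]:
    """Yield the elements of one top-level JSON array, reading in chunks.

    Assumption: elements are JSON objects or arrays. A bare number split
    across a chunk boundary could be decoded too early by this sketch.
    """
    decoder = json.JSONDecoder()
    buf = stream.read(chunk_size).lstrip()
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        # Drop whitespace and the comma separating two elements.
        buf = buf.lstrip(" \t\r\n,")
        if buf.startswith("]"):
            return
        try:
            obj, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            # Most likely an incomplete element: read more and retry.
            chunk = stream.read(chunk_size)
            if not chunk:
                raise ValueError("truncated or malformed JSON array")
            buf += chunk
            continue
        yield obj
        buf = buf[end:]


def to_json_lines(src: TextIO, dst: TextIO) -> int:
    """Rewrite a JSON array as one object per line; return the count."""
    count = 0
    for obj in iter_json_array(src):
        dst.write(json.dumps(obj) + "\n")
        count += 1
    return count
```

A single-process converter like this would not scale to terabytes the way a RecordReader running inside the cluster does; it only illustrates how object boundaries can be found when there are no newlines to split on.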

Azure HDInsight performance benchmarking

A couple of months ago I ran a brief benchmark of Spark execution time on Azure HDInsight, and the result was very disappointing. Recently I did a much deeper investigation, with benchmarking and a cost analysis, to see whether it makes ANY sense to use HDInsight, and the results do not surprise me at all. More...

GPU assisted Machine Learning: Benchmark

A recent project at work involved binary classification using a Keras LSTM layer with 1000 nodes, and training took almost an hour. That prompted my effort to speed up this type of problem. In my previous post, I explained the hardware and software configuration I'm going to use for this benchmark. Now I'm going to run the same training exercise with and without a GPU and compare the runtimes. More...

GPU and ML: Setting up CUDA + Ubuntu 18.04 on Supermicro X10 server board

There are lots of blog posts explaining how to set up a Machine Learning system with GPU support, but I could not find anywhere what I ended up going through. Because of the specific hardware and software combination I'm using, I had to figure out what to do, and in what order, for this to work. I may have gone through a dozen full reinstalls before I got a stable, working setup. That's why I'm writing it down here, so it may save someone else a lot of time. More...

Azure Spark (HDInsight) performance is terrible, here is why

From my last few posts you can see I'm experimenting with a small Spark cluster I built on my server at home. Although this machine was built with server-grade parts, it was built 4 years ago, so it is not top of the line by any standard: one Xeon processor running at 3.1 GHz with 4 cores, 32 GB of DDR3 RAM and a consumer (not server-grade) SSD. I'm running 3 VMs on this machine, each one using only one core. Naturally I did not expect Spark processing on my cluster to be performant, but to my surprise, these one-core machines beat an Azure HDInsight cluster of D12 v2 machines, which have 4 cores each. More...

Correct way of setting up Jupyter Kernels for Spark

In my post a few days ago, I provided an example kernel.json file to get PySpark working with Jupyter notebooks. Then I realized magics like %%sql were not working for me. It turns out I was missing some other configuration and code which is already provided by the SparkMagic library. Their GitHub repository has great instructions on how to install it, but since it took me a little while to get it to work, I'm sharing what I learned. More...

matplotlib charts in Jupyter notebook

When displaying graphs and charts in a PySpark Jupyter notebook, you have to jump through some hoops. To demonstrate, I'm assuming I have my K-Means clustering results as follows:

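from pyspark.ml.clustering import KMeans

# 'features' is assumed to be a DataFrame with an assembled 'features' vector column
model = KMeans(k=5, seed=1).fit(features.select('features'))
predictions = model.transform(features)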

You have to create a Temp View for this data, so you can run SQL on it: More...

Getting Jupyterhub 0.9.1 to work with my spark cluster and Python 3.6

My 4th of July week project was to build a Spark cluster on my home server so I can start experimenting with PySpark. I built a small server in 2014 which I have not been using recently, so I decided to use that. It has 32 GB RAM, a 1 TB SSD and a quad-core Xeon processor. I decided to use the latest software, so I upgraded everything, from the IPMI firmware and server firmware to the VMware ESXi server. Then I created 3 CentOS 7 VMs, each with 1 CPU, 8 GB RAM and 50 GB of SSD storage. More...

Show progress bar when pre-loading data in Shiny app

When finishing my capstone project for Coursera's Data Science Certificate track, I needed to load a relatively large amount of data (more than 400 MB when loaded). This load operation took a few seconds, so to let the users of my Shiny app know they need to wait, I decided to use a progress bar. More...

IoT on Azure: Why you should not use EventHub as shock absorber

In a common IoT scenario, millions of devices send data to your back end. It is possible that a large percentage of these devices could flood your back end for whatever reason. A few years back, I experienced this firsthand when a bad software update created a tsunami of requests towards a relatively scalable back end and caused an outage which lasted a whole weekend.

One approach to preventing such a flood on the back end (which I have been dealing with for the past few months) is to create a so-called "shock absorber" using a queue-like message delivery system. More...