Massoud Mazar

Sharing The Knowledge

Azure Spark (HDInsight) performance is terrible, here is why

As you can see from my last few posts, I'm experimenting with a small Spark cluster I built on my server at home. Although this machine was built with server-grade parts, it was built 4 years ago, so it is not top of the line by any standard: one Xeon processor with 4 cores running at 3.1 GHz, 32 GB of DDR3 RAM and a consumer (not server-grade) SSD. I'm running 3 VMs on this machine, each one using only one core. Naturally I did not expect Spark processing on my cluster to be performant, but to my surprise, these one-core machines beat an Azure HDInsight cluster with D12 v2 machines, which have 4 cores each. More...

Correct way of setting up Jupyter Kernels for Spark

In my post a few days ago, I provided an example kernel.json file to get PySpark working with Jupyter notebooks. Then I realized magics like %%sql were not working for me. It turns out I was missing some other configuration and code which is already provided by the SparkMagic library. Their GitHub repository has great instructions on how to install it, but since it took me a little while to get it to work, I'm sharing what I learned. More...

matplotlib charts in Jupyter notebook

When displaying graphs and charts in a PySpark Jupyter notebook, you will have to jump through some hoops. To demonstrate, I'm assuming I have my K-Means clustering results as follows:

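from pyspark.ml.clustering import KMeans

# Fit 5 clusters on the 'features' column, then assign a cluster to every row
model = KMeans(k=5, seed=1).fit(features.select('features'))
predictions = model.transform(features)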

You have to create a Temp View for this data, so you can run SQL on it: More...
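As a minimal sketch of that step, assuming `predictions` is the DataFrame produced above and that the model kept the default `prediction` output column: the helper name `register_view` and the view name `kmeans_predictions` are made up for illustration and are not from the original post.

```python
def register_view(predictions, view_name="kmeans_predictions"):
    """Register a DataFrame as a temp view and return SQL that counts rows per cluster.

    `predictions` is expected to be a Spark DataFrame; the view name is a
    hypothetical choice made for this sketch.
    """
    # createOrReplaceTempView makes the DataFrame queryable by name in Spark SQL
    predictions.createOrReplaceTempView(view_name)
    return (
        f"SELECT prediction, COUNT(*) AS points "
        f"FROM {view_name} GROUP BY prediction ORDER BY prediction"
    )
```

In a notebook, the returned string could be passed to `spark.sql(...)`, or the same statement could be typed into a `%%sql` cell once the view exists.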

Getting Jupyterhub 0.9.1 to work with my spark cluster and Python 3.6

My 4th of July week project was to build a Spark cluster on my home server so I can start doing experiments with PySpark. I built a small server in 2014 which I have not been utilizing recently, so I decided to use that. It has 32 GB RAM, a 1 TB SSD and a quad-core Xeon processor. I decided to use the latest software, so I upgraded everything, from the IPMI firmware and server firmware to the VMware ESXi server. Then I created 3 CentOS 7 VMs, each with 1 CPU, 8 GB RAM and 50 GB of SSD storage. More...

Show progress bar when pre-loading data in Shiny app

When finishing my capstone project for Coursera's Data Science Certificate track, I needed to load a relatively large amount of data (more than 400 MB when loaded). This load operation was taking a few seconds, so to let the users of my Shiny app know they need to wait, I decided to use a progress bar. More...

IoT on Azure: Why you should not use EventHub as shock absorber

In a common IoT scenario, millions of devices will be sending data to your back end. It is possible that a large percentage of these devices could flood your back end for whatever reason. A few years back, I experienced this firsthand when a bad software update created a tsunami of requests towards a relatively scalable back end and caused an outage which lasted a whole weekend.

One approach (which I have been dealing with in the past few months) to prevent such a flood on the back end is to create a so-called "shock absorber" using a queue-like message delivery system. More...

How replacing ElasticSearch with Azure DocumentDB (CosmosDB) turned out to be a bad idea

Disclaimer: this is my personal opinion and not the opinion of my colleagues or my employer.

History

We used to store terabytes of data in ElasticSearch in the form of JSON documents. As the size of the data stored in the cluster grew, we had to create new clusters with lots of nodes, and it turned into a maintenance and cost nightmare. The Microsoft Azure team suggested we move to DocumentDB to reduce the cost, and since it can scale infinitely, there wouldn't be any maintenance needed. More...

Azure DocumentDB (Cosmos DB): Cross partition query does not return data if TOP is not specified

When querying Azure DocumentDB (recently renamed to Cosmos DB, to make you really believe it can handle anything you throw at it), the best practice is to have a Partition Id. In high-volume scenarios, partitioning your data is mandatory.

You may select a good partition key, but there are always scenarios where you need to query your data without knowing the Partition Id. These cross-partition queries, although slower, are possible by specifying EnableCrossPartitionQuery = true in your FeedOptions when you are creating your query: More...
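The excerpt above refers to the .NET SDK's FeedOptions. As a rough Python counterpart, here is a sketch that assumes the `azure-cosmos` package (v4), whose `query_items` method accepts an `enable_cross_partition_query` flag. The function name, the `deviceId` field and the default `top` value are hypothetical, and the query includes TOP only because this post's title points at it.

```python
def query_all_partitions(container, device_id, top=100):
    """Run a TOP-limited query across every partition of a Cosmos DB container.

    `container` is assumed to be an azure-cosmos ContainerProxy; `deviceId`
    is a hypothetical document field used only for illustration.
    """
    # TOP is inlined as an integer; the filter value is passed as a parameter
    query = f"SELECT TOP {int(top)} * FROM c WHERE c.deviceId = @deviceId"
    items = container.query_items(
        query=query,
        parameters=[{"name": "@deviceId", "value": device_id}],
        enable_cross_partition_query=True,  # no partition key is supplied
    )
    return list(items)
```

Because no partition key is given, the service has to fan the query out to every partition, which is why such queries are slower than single-partition ones.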

Develop Go App Engine API using Visual Studio Code

If you are using the Go extension for Visual Studio Code to develop Google App Engine backend code on a Windows machine, you may have encountered strange problems related to the environment variables GOROOT and GOPATH, especially when you want to use 3rd-party libraries and expect the extension to correctly highlight code errors.

After lots of experiments, I ended up with the following setup which seems to be working as desired: More...

Using D3 charting components with ReactJS

ReactJS seems to have picked up some die-hard fans, and recently I was looking at how to use D3 for charting in a React-based UI. There are a few implementations of some of the D3 libraries, and I picked reactd3 for my experiment. The documentation site has some examples of how this can be done, but they all use ReactDOM.render to render directly to a DOM element you have put in your HTML template. My preferred approach is not to rely on the existence of a predefined HTML tag in the template, but to use an element which is rendered by my chart component. Here is what I ended up doing: More...