Massoud Mazar

Sharing The Knowledge


Correct way of setting up Jupyter Kernels for Spark

In my post a few days ago, I provided an example kernel.json file to get PySpark working with Jupyter notebooks. Then I realized magics like %%sql were not working for me. It turned out I was missing some other configuration and code which is already provided by the SparkMagic library. Their GitHub repository has great instructions on how to install it, but since it took me a little while to get it to work, I'm sharing what I learned.

sparkmagic dependencies

The first step was to install sparkmagic using pip, but I soon got errors due to other missing libraries. I ended up installing the following before I could install sparkmagic, though your system may already have them:

yum install krb5-devel
python2.7 -m pip install decorator --upgrade

Then I was able to install sparkmagic (I did it for both Python 2 and 3, though I'm not sure both were necessary):

python2.7 -m pip install sparkmagic
python3.6 -m pip install sparkmagic
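Since each interpreter has its own site-packages, it's worth confirming the package actually landed in both. A minimal sketch using only the standard library (shown with `json` as a stand-in module, since it's guaranteed to be present anywhere; substitute `sparkmagic`):

```python
import importlib.util

# Returns True if the named module can be found by this interpreter,
# without importing it (and without triggering its import side effects).
def is_installed(name):
    return importlib.util.find_spec(name) is not None

print(is_installed("json"))         # → True
print(is_installed("no_such_pkg"))  # → False
```

Run it under each interpreter you installed into. (`find_spec` is Python 3.4+ only; on Python 2 the older `imp.find_module` serves the same purpose.)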

In my case, I was getting a new error:

cannot import name 'DataError'

This is caused by an incompatibility between the latest version of the pandas library (0.23.0) and sparkmagic. After downgrading pandas to 0.22.0, things started working:

python2.7 -m pip install pandas==0.22.0
python3.6 -m pip install pandas==0.22.0

Make sure to follow the instructions on the sparkmagic GitHub page to set up and configure it. It creates the kernels needed for Spark and PySpark, and even R.

Environment Variables

Another issue I had to fix was correctly defining the PYTHONPATH environment variable. As you can see in my previous post, I was defining a bunch of these path variables in the kernel.json file, but since I'm now using sparkmagic, I had to make sure the needed variables are defined somewhere else. Looking at my "~/.bash_profile" I noticed the only variable missing (compared to my old kernel.json) was PYTHONPATH, so I added it:

export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/
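To see what this variable actually does: directories listed in PYTHONPATH are added to sys.path in any new interpreter, which is how the pyspark package under $SPARK_HOME/python becomes importable. A quick sketch of the mechanism (the /home/hadoop/spark path is just an assumed value of $SPARK_HOME):

```python
import os
import subprocess
import sys

# Launch a child interpreter with PYTHONPATH set and confirm the
# directory shows up on its sys.path (so "import pyspark" would work
# if the pyspark sources lived there).
spark_python = "/home/hadoop/spark/python"  # assumed $SPARK_HOME/python
env = dict(os.environ, PYTHONPATH=spark_python)
code = "import sys; print(%r in sys.path)" % spark_python
out = subprocess.run(
    [sys.executable, "-c", code],
    env=env, capture_output=True, text=True,
)
print(out.stdout.strip())  # → True
```

This is also why the variable belongs in ~/.bash_profile rather than kernel.json: every process started from your shell, including the kernel, inherits it.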

Running on Yarn

If Yarn is used instead of local mode to run the Spark job from your notebook, add the following to your ~/.sparkmagic/config.json:

    "livy_session_startup_timeout_seconds": 60,
    "livy_server_heartbeat_timeout_seconds": 0,
    "session_configs": {
        "driverMemory": "2G",
        "conf": {
            "spark.master": "yarn-cluster"
        }
    }
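Rather than hand-editing the JSON, the settings above can be merged into an existing config programmatically. A sketch using a temp file in place of ~/.sparkmagic/config.json (the pre-existing `kernel_python_credentials` entry is just an assumed example of content already in the file):

```python
import json
import os
import tempfile

# Yarn-related settings from the snippet above.
yarn_settings = {
    "livy_session_startup_timeout_seconds": 60,
    "livy_server_heartbeat_timeout_seconds": 0,
    "session_configs": {
        "driverMemory": "2G",
        "conf": {"spark.master": "yarn-cluster"},
    },
}

# Stand-in for ~/.sparkmagic/config.json with one pre-existing key.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump({"kernel_python_credentials": {"url": "http://localhost:8998"}}, f)

# Merge: overwrite the Yarn keys, keep unrelated top-level keys intact.
with open(config_path) as f:
    config = json.load(f)
config.update(yarn_settings)
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)

print("spark.master =", config["session_configs"]["conf"]["spark.master"])
# → spark.master = yarn-cluster
```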

Oh, BTW, you need Livy

sparkmagic needs Livy to talk to Spark, so if you do not have Livy installed, do that first. I installed Livy in the same /home/hadoop folder where I installed Hadoop and Spark, then renamed the extracted folder:

mv livy-0.5.0-incubating-bin livy

If running in Yarn mode, modify livy.conf:

mv livy/conf/livy.conf.template livy/conf/livy.conf
nano livy/conf/livy.conf

and set correct execution mode:

livy.spark.master = yarn-cluster

And finally start livy:

cd livy
./bin/livy-server start

