My 4th of July week project was to build a Spark cluster on my home server so I can start doing experiments with PySpark. I built a small server in 2014 that I have not been utilizing much recently, so I decided to use that. It has 32 GB RAM, a 1 TB SSD, and a quad-core Xeon processor. I decided to use the latest software, so I upgraded everything from the IPMI firmware and server firmware to the VMware ESXi hypervisor. Then I created 3 CentOS 7 VMs, each with 1 CPU, 8 GB RAM, and 50 GB of SSD storage.
I ended up installing the following for my cluster:
- Java 8
- Hadoop 3.1.0
- Spark 2.3.1
- Python 3.6
- Jupyterhub 0.9.1
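If you want to double-check what ended up installed, something like the following should do it (this assumes the Hadoop binary is on your PATH; otherwise use the full paths to your installs):
java -version
hadoop version
/home/hadoop/spark/bin/spark-submit --version
python3.6 -V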
Configuring Hadoop and Spark was not too complicated. I had done a Hadoop setup back when Hadoop was at version 0.20.0 and, interestingly, it is not much different now. I followed instructions from the following sources (with slight modifications) to get my Hadoop and Spark cluster up and running:
It is a lot of steps, but nothing really complicated. One big difference is that the "slaves" config file is now called "workers" (I guess the reason was to be politically correct).
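For reference, the workers file (under your Hadoop config directory, e.g. /home/hadoop/hadoop/etc/hadoop/workers in my layout) simply lists one worker hostname per line, something like this (the hostnames here are placeholders):
worker1.example.com
worker2.example.com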
Problems started when I decided to add Jupyterhub so I could use PySpark in Jupyter notebooks. There are some good instructions in the link above on how to configure Jupyterhub for Spark, but somehow it took me a long time to get everything working correctly. Having multiple versions of Python resulted in some conflicts, and I overlooked the fact that the py4j library in my Spark version had a different file name (duh!!).
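If you run into the same thing, check the exact py4j file name that ships under your Spark install and use that name in the kernel config below:
ls /home/hadoop/spark/python/lib/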
Here is what I did on my master node:
# build tools and the IUS repo (provides the python36u packages)
yum install yum-utils
yum groupinstall development
yum install https://centos7.iuscommunity.org/ius-release.rpm
# Python 3.6 from IUS
yum install python36u
python3.6 -V
yum install python36u-pip
yum install python36u-devel
# Node.js proxy used by Jupyterhub
yum install npm
npm install -g configurable-http-proxy
# Jupyter and Jupyterhub under Python 3.6
python3.6 -m pip install jupyter
python3.6 -m pip install jupyterhub
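A quick sanity check that everything landed where expected (assuming pip put the entry points on your PATH):
jupyter --version
jupyterhub --version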
Don't forget to install python 3.6 on all worker nodes:
yum install https://centos7.iuscommunity.org/ius-release.rpm
yum install python36u
python3.6 -V
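The kernel config below points PYSPARK_PYTHON at /usr/bin/python3.6, so it is worth confirming that is where the IUS package put the interpreter on every worker:
which python3.6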
On the master node, create the PySpark kernel for Jupyter:
mkdir -p /usr/share/jupyter/kernels/pyspark2
nano /usr/share/jupyter/kernels/pyspark2/kernel.json
and make sure to adjust the contents of your kernel.json according to the paths of your installations:
{
  "argv": [
    "python3.6",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Python3.6 + PySpark (Spark 2.3.1)",
  "language": "python",
  "env": {
    "PYSPARK_PYTHON": "/usr/bin/python3.6",
    "SPARK_HOME": "/home/hadoop/spark",
    "HADOOP_CONF_DIR": "/home/hadoop/hadoop/etc/hadoop",
    "PYTHONPATH": "/home/hadoop/spark/python/lib/py4j-0.10.7-src.zip:/home/hadoop/spark/python/",
    "PYTHONSTARTUP": "/home/hadoop/spark/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn --deploy-mode client pyspark-shell"
  }
}
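Before involving Jupyterhub, it is worth confirming that the kernel is registered and that Spark can actually talk to YARN; the Pi example that ships with Spark makes an easy smoke test (paths assume the same SPARK_HOME as above):
jupyter kernelspec list
/home/hadoop/spark/bin/spark-submit --master yarn --deploy-mode client /home/hadoop/spark/examples/src/main/python/pi.py 10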
To run Jupyterhub using sudo, I ended up doing this:
mkdir /etc/jupyterhub
sudo jupyterhub --generate-config -f /etc/jupyterhub/jupyterhub_config.py
sudo jupyterhub -f /etc/jupyterhub/jupyterhub_config.py
I noticed that the first login to Jupyterhub works, but if I log out and try to log in again, it fails. It turns out you need to set the following in your jupyterhub_config.py:
c.PAMAuthenticator.open_sessions = False
At this point Jupyterhub works nicely. To run it as a service on my CentOS 7 machine, I followed the instructions from the link above. Create a systemd service file:
nano /lib/systemd/system/jupyterhub.service
with the following content (fill in the User= line with the account that should run Jupyterhub):
[Unit]
Description=Jupyterhub
After=network-online.target
[Service]
User=
ExecStart=/usr/bin/jupyterhub --JupyterHub.spawner_class=sudospawner.SudoSpawner
WorkingDirectory=/etc/jupyterhub
[Install]
WantedBy=multi-user.target
And set up my service:
python3.6 -m pip install sudospawner
sudo systemctl daemon-reload
sudo systemctl enable jupyterhub
sudo systemctl start jupyterhub
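To confirm the service came up and to watch its logs:
sudo systemctl status jupyterhub
sudo journalctl -u jupyterhub -f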