Spark Single Node setup with Docker and Jupyter Notebook

This article is a sequel to the previous post, Setup Single Node Spark Cluster using Docker, where I described how to set up a single-node Spark cluster with Docker, ran a sample Spark job from the pyspark shell, and tracked the job's progress in the Spark UI.

Here I describe how to set up a Jupyter notebook as an interactive layer on top of Apache Spark. The source code for this blog is available on GitHub in the branch named jupyter_notebook. The Docker image is built from scratch and does not use a prebuilt image.

Most of the steps to build and launch the container are the same as in the previous post, but I have repeated them here. The difference lies in how the Jupyter notebook is connected to Spark.

Dockerfile:

This Dockerfile differs from the one in the previous post: instead of the pyspark shell, it launches a Jupyter notebook server.

CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=4041", "--no-browser", "--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''"]

Build the docker image

docker build -t spark-with-jupyter .

Launch the docker container

hostfolder="$(pwd)"
dockerfolder="/home/sam/app"
docker run --rm -it \
-p 4040:4040 -p 4041:4041 \
-v ${hostfolder}:${dockerfolder} \
--entrypoint bash spark-with-jupyter:latest
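Note that `--entrypoint bash` drops you into a shell inside the container instead of running the image's CMD, so the notebook server does not start automatically. Once inside, start Jupyter yourself with the same command the Dockerfile's CMD uses:

```shell
# start the notebook server on the port mapped above (4041)
jupyter notebook --ip=0.0.0.0 --port=4041 --no-browser --allow-root \
  --NotebookApp.token='' --NotebookApp.password=''
```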

Connecting the Jupyter notebook with Apache Spark

Open your web browser and go to http://localhost:4041 (the port mapped when launching the container). This launches the Jupyter environment as shown below.

Now create a new Jupyter notebook or open the existing one named first_notebook.ipynb.

— First, run some basic Python code to make sure Python and the Jupyter notebook are set up properly.
— Then run the following code to locate the Spark installation and create a Spark session:

import findspark
findspark.init()  # add pyspark to sys.path based on the SPARK_HOME environment variable

import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# create spark session
spark = SparkSession.builder.appName("SparkSample").getOrCreate()

# read text file
df_text_file = spark.read.text("textfile.txt")
df_text_file.show()

# count the number of words in each line
df_total_words = df_text_file.withColumn('wordCount', f.size(f.split(f.col('value'), ' ')))
df_total_words.show()

# Word count example: split each line into words, explode to one row per word,
# then group by word and count occurrences
df_word_count = (df_text_file
    .withColumn('word', f.explode(f.split(f.col('value'), ' ')))
    .groupBy('word')
    .count()
    .sort('count', ascending=False))
df_word_count.show()
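If the split/explode/groupBy chain feels opaque, the same word count can be sketched in plain Python (no Spark needed). This is just an illustration; the in-memory list of lines below is a made-up stand-in for the contents of textfile.txt:

```python
from collections import Counter

# sample lines, standing in for the rows of the 'value' column
lines = [
    "spark makes big data simple",
    "big data needs big tools",
]

# split each line on spaces and count every word, mirroring the
# explode/split/groupBy chain in the Spark example above
counter = Counter(word for line in lines for word in line.split(" "))

# most_common() sorts by count descending, like sort('count', ascending=False)
print(counter.most_common(2))  # -> [('big', 3), ('data', 2)]
```

Spark performs the same computation, but distributes the splitting and counting across partitions instead of holding everything in one process.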

Screenshot of the above code running:

You are now ready to run your Spark code and do your exploration using the Jupyter notebook.

As in part 1, Setup Single Node Spark Cluster using Docker, you can use docker-compose or a shell script as you prefer. I suggest following the GitHub link and the README file to get the latest information, and setting up a rocking Spark cluster.

In upcoming posts I will cover a multi-node Spark cluster with Airflow, a data lake, and other complex systems, along with DevOps to automate the code deployment process. Feel free to share your thoughts and feedback in the comments section.
