This article is a sequel to my previous article, Create Spark Single Node Cluster With Docker, where I described how to set up a Spark single-node cluster using Docker, run a sample Spark job from the PySpark shell, and track the job's progress in the Spark UI.
Here I show how to set up Jupyter Notebook as an interactive layer on top of Apache Spark. The source code for this blog is available on GitHub in the branch named jupyter_notebook. The Docker container is built from scratch and does not use a prebuilt image. https://github.com/experientlabs/spark_playground/tree/main/spark-single-node
Most of the steps to build and launch the container are the same as in the previous blog, but I have repeated them here. The key difference is in how to connect Jupyter Notebook with Spark.
Dockerfile:
https://github.com/experientlabs/spark_playground/blob/main/spark-single-node/Dockerfile
This Dockerfile differs from the one in the previous post: it launches a Jupyter Notebook server.
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=4041", "--no-browser", "--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''"]
Build the docker image:
docker build -t spark-with-jupyter .
Launch the docker container:
hostfolder="$(pwd)"
dockerfolder="/home/sam/app"
docker run --rm -it \
-p 4040:4040 -p 4041:4041 \
-v ${hostfolder}:${dockerfolder} \
--entrypoint bash spark-with-jupyter:latest
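Note that `--entrypoint bash` overrides the image's CMD, so the notebook server does not start automatically; you land in a shell inside the container. From there you can start Jupyter manually. A minimal sketch, mirroring the flags from the Dockerfile's CMD above:

```shell
# Inside the container shell: start the notebook server manually,
# since the bash entrypoint overrides the image's CMD.
jupyter notebook --ip=0.0.0.0 --port=4041 --no-browser --allow-root \
  --NotebookApp.token='' --NotebookApp.password=''
```

Alternatively, drop `--entrypoint bash` from the `docker run` command and let the image's CMD start the server on its own.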
Connecting Jupyter Notebook with Apache Spark
Open http://localhost:4041 in your browser. It will launch the Jupyter environment as shown below.

Now create a new Jupyter notebook, or open the existing one named first_notebook.ipynb.

— First, run some basic Python code to make sure that Python and Jupyter Notebook are set up properly.
— Then run the following code to locate Spark (via findspark) and create a Spark session.
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
# create spark session
spark = SparkSession.builder.appName("SparkSample").getOrCreate()
# read text file
df_text_file = spark.read.text("textfile.txt")
df_text_file.show()
df_total_words = df_text_file.withColumn('wordCount', f.size(f.split(f.col('value'), ' ')))
df_total_words.show()
# Word count example
df_word_count = df_text_file.withColumn('word', f.explode(f.split(f.col('value'), ' '))).groupBy('word').count().sort('count', ascending=False)
df_word_count.show()
Screenshot of running the above code:
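As a quick sanity check, the same word-count logic can be reproduced in plain Python with collections.Counter. This is only an illustrative sketch: the sample lines below are my own, not the contents of textfile.txt, but the split-explode-group-count pipeline is the same as in the Spark code above.

```python
from collections import Counter

# Illustrative sample text (in the notebook you would read textfile.txt instead).
lines = [
    "spark makes big data simple",
    "spark runs on a single node too",
]

# Mirror the Spark pipeline: split each line on spaces, flatten, then count.
counts = Counter(word for line in lines for word in line.split(" "))

# 'spark' appears once per line, so it tops the ranking.
top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count)  # spark 2
```

If the counts from this sketch and from `df_word_count` disagree on the same input, the usual culprit is the delimiter: `split(f.col('value'), ' ')` splits on single spaces only, so tabs or repeated spaces produce empty-string "words".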


You are now ready to run your Spark code and explore data using Jupyter Notebook.
As in part 1, Setup Single Node Spark Cluster using Docker, you can use docker-compose or a shell script, as you wish. I suggest following the GitHub link and the README file for the latest information, and setting up a rocking Spark cluster.
In upcoming posts I will cover a Spark multi-node cluster with Airflow, a data lake, and other complex systems, along with DevOps practices to automate the code deployment process. Feel free to share your thoughts and feedback in the comments section.