Spark Single Node setup with Docker and Jupyter Notebook

This article is a sequel to the previous post, Setup Single Node Spark Cluster using Docker, where I described how to set up a single-node Spark cluster with Docker, ran a sample Spark job from the pyspark shell, and tracked the job's progress in the Spark UI.

Here I describe how to set up a Jupyter notebook as an interactive layer on top of Apache Spark. The source code for this blog is available on GitHub in the branch named jupyter_notebook. The Docker image is built from scratch and does not use a prebuilt image.

Most of the steps to build and launch the container are the same as in the previous post, but I have repeated them here. The difference lies in how the Jupyter notebook is connected to Spark.

Dockerfile:

This Dockerfile differs from the one in the previous post: instead of the pyspark shell, it launches a Jupyter notebook server.

CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=4041", "--no-browser", "--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''"]

Build the docker image

docker build -t spark-with-jupyter .

Launch the docker container

hostfolder="$(pwd)"
dockerfolder="/home/sam/app"
docker run --rm -it \
-p 4040:4040 -p 4041:4041 \
-v ${hostfolder}:${dockerfolder} \
--entrypoint bash spark-with-jupyter:latest
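Note that `--entrypoint bash` drops you into a shell inside the container instead of running the image's CMD, so the notebook server does not start automatically. Once inside, start Jupyter yourself with the same command the Dockerfile's CMD uses:

```shell
# start the notebook server on the port mapped above (4041)
jupyter notebook --ip=0.0.0.0 --port=4041 --no-browser --allow-root \
  --NotebookApp.token='' --NotebookApp.password=''
```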

Connecting the Jupyter notebook with Apache Spark

Open your web browser and go to http://localhost:4041 (the port mapped when launching the container). This launches the Jupyter environment as shown below.

Now create a new Jupyter notebook or open the existing one named first_notebook.ipynb.

— First, run some basic Python code to make sure Python and the Jupyter notebook are set up properly.
— Then run the following code to locate the Spark installation and create a Spark session:

import findspark
findspark.init()  # add pyspark to sys.path based on the SPARK_HOME environment variable

import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# create spark session
spark = SparkSession.builder.appName("SparkSample").getOrCreate()

# read text file
df_text_file = spark.read.text("textfile.txt")
df_text_file.show()

# count the number of words in each line
df_total_words = df_text_file.withColumn('wordCount', f.size(f.split(f.col('value'), ' ')))
df_total_words.show()

# Word count example: split each line into words, explode to one row per word,
# then group by word and count occurrences
df_word_count = (df_text_file
    .withColumn('word', f.explode(f.split(f.col('value'), ' ')))
    .groupBy('word')
    .count()
    .sort('count', ascending=False))
df_word_count.show()
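If the split/explode/groupBy chain feels opaque, the same word count can be sketched in plain Python (no Spark needed). This is just an illustration; the in-memory list of lines below is a made-up stand-in for the contents of textfile.txt:

```python
from collections import Counter

# sample lines, standing in for the rows of the 'value' column
lines = [
    "spark makes big data simple",
    "big data needs big tools",
]

# split each line on spaces and count every word, mirroring the
# explode/split/groupBy chain in the Spark example above
counter = Counter(word for line in lines for word in line.split(" "))

# most_common() sorts by count descending, like sort('count', ascending=False)
print(counter.most_common(2))  # -> [('big', 3), ('data', 2)]
```

Spark performs the same computation, but distributes the splitting and counting across partitions instead of holding everything in one process.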

Screenshot of the above code running:

You are now ready to run your Spark code and do your exploration using the Jupyter notebook.

As in part 1, Setup Single Node Spark Cluster using Docker, you can use docker-compose or a shell script as you prefer. I suggest following the GitHub link and the README file to get the latest information, and setting up a rocking Spark cluster.

In upcoming posts I will cover a multi-node Spark cluster with Airflow, a data lake, and other complex systems, along with DevOps to automate the code deployment process. Feel free to share your thoughts and feedback in the comments section.
