Setup Single Node Spark Cluster using Docker

Have you ever wondered how to set up a Spark cluster and tweak its settings like a real production environment, without spending a dime on costly cloud services? Many of us have thought about this while starting a career in data engineering.

Many people still think that setting up a full-fledged Spark cluster is a daunting task, requiring multiple machines and intricate configuration. Thankfully, there is an accessible alternative that lets you experiment and develop with Spark without the complexity of a full cluster: Docker.

So I will teach you how to set up your own personal Spark cluster from the comfort of your laptop. This two-part blog post describes how to set up a single-node Spark cluster and then run Spark code from a Jupyter notebook:
— Part-1 (Spark cluster setup)
— Part-2 (Jupyter Notebook)

Prerequisite: 

A basic understanding of Docker. If you are completely new to it, plenty of introductory material is available online.

Let's start: 

Let's get a fully functional Spark cluster up and running on a single machine, enabling you to experiment with Spark's features, test your applications, and gain valuable insight into big data processing.

You can find the code on . In this article we will create the Spark setup in the following steps:

  1. Write a Dockerfile that builds the Spark container image.
  2. Write a docker-compose.yml file to launch the container with the docker-compose command.
  3. Launch the spark-shell or a Jupyter notebook to run PySpark code.

Before we start, let's have a look at the Spark architecture:
This is the multi-node architecture of a standard Spark application.

But for the sake of our setup, we deviate from this standard architecture and keep only one machine, which acts as both Driver and Executor, as shown below.
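In Spark terms, this single-machine mode is what the `local[*]` master gives you: the driver and the executors share one JVM, with one worker thread per available core. If you wanted it as a default rather than a command-line flag, a line like the following in spark-defaults.conf would do it (a sketch for illustration; our Dockerfile below does not set it, since `pyspark` defaults to local mode anyway):

```
# spark-defaults.conf fragment: run driver and executors in a single JVM,
# using as many worker threads as there are cores
spark.master    local[*]
```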


Enough introduction, let's get our hands dirty:

  1. Dockerfile: It contains the set of instructions used to build our Spark image. You can always get the latest Dockerfile from the github repo here 
FROM python:3.10.9-buster

RUN apt-get -y update && \
    apt-get -y upgrade && \
    apt-get -y install tree
ENV PIPENV_VENV_IN_PROJECT=1

# ENV PIPENV_VENV_IN_PROJECT=1 is important: it causes the resulting virtual
# environment to be created as /app/.venv. Without this the environment gets
# created somewhere surprising, such as /root/.local/share/virtualenvs/app-4PlAip0Q,
# which makes it much harder to write automation scripts later on.

RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir pipenv
RUN pip install --no-cache-dir jupyter
RUN pip install --no-cache-dir py4j
RUN pip install --no-cache-dir findspark

#############################################
# install java and spark and hadoop
# Java is required for scala and scala is required for Spark
############################################

# VERSIONS
ENV SPARK_VERSION=3.2.4 \
HADOOP_VERSION=3.2 \
JAVA_VERSION=11

RUN apt-get update --yes && \
apt-get install --yes --no-install-recommends \
"openjdk-${JAVA_VERSION}-jre-headless" \
ca-certificates-java \
curl && \
apt-get clean && rm -rf /var/lib/apt/lists/*

RUN java --version

# DOWNLOAD AND INSTALL SPARK
# (curl is used here because it was installed above; wget is not in the image)
RUN DOWNLOAD_URL_SPARK="https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
&& curl --fail --location --output apache-spark.tgz "${DOWNLOAD_URL_SPARK}" \
&& mkdir -p /home/spark \
&& tar -xf apache-spark.tgz -C /home/spark --strip-components=1 \
&& rm apache-spark.tgz

# SET SPARK ENV VARIABLES
ENV SPARK_HOME="/home/spark"
ENV PATH="${SPARK_HOME}/bin/:${PATH}"

# Fix Spark installation for Java 11 and Apache Arrow library
# see: https://github.com/apache/spark/pull/27356, https://spark.apache.org/docs/latest/#downloading
RUN cp -p "${SPARK_HOME}/conf/spark-defaults.conf.template" "${SPARK_HOME}/conf/spark-defaults.conf" && \
echo 'spark.driver.extraJavaOptions -Dio.netty.tryReflectionSetAccessible=true' >> "${SPARK_HOME}/conf/spark-defaults.conf" && \
echo 'spark.executor.extraJavaOptions -Dio.netty.tryReflectionSetAccessible=true' >> "${SPARK_HOME}/conf/spark-defaults.conf"

############################################
# create group and user
############################################

ARG UNAME=sam
ARG UID=1000
ARG GID=1000


# (debug) list existing users before adding our own
RUN cat /etc/passwd

# create group
RUN groupadd -g $GID $UNAME

# create a user with userid 1000 and gid 1000
RUN useradd -u $UID -g $GID -m -s /bin/bash $UNAME
# -m creates home directory

# give ownership of /home/sam to $UID:$GID
RUN chown $UID:$GID /home/sam


###########################################
# add sudo
###########################################

RUN apt-get update --yes && \
    apt-get -y install sudo vim

# allow passwordless sudo for $UNAME
RUN echo "$UNAME ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
# (debug) verify the sudoers entry
RUN cat /etc/sudoers

#############################
# spark history server
############################

# ALLOW the Spark history server (mount a local spark_events folder to /home/sam/app/spark_events)

RUN echo 'spark.eventLog.enabled true' >> "${SPARK_HOME}/conf/spark-defaults.conf" && \
echo 'spark.eventLog.dir file:///home/sam/app/spark_events' >> "${SPARK_HOME}/conf/spark-defaults.conf" && \
echo 'spark.history.fs.logDirectory file:///home/sam/app/spark_events' >> "${SPARK_HOME}/conf/spark-defaults.conf"

RUN mkdir /home/spark/logs
RUN chown $UID:$GID /home/spark/logs

###########################################
# change working dir and user
###########################################

USER $UNAME

RUN mkdir -p /home/$UNAME/app
WORKDIR /home/$UNAME/app

# CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=4041", "--no-browser", "--allow-root", "--NotebookApp.token=''" ,"--NotebookApp.password=''" ]
CMD ["sh", "-c", "tail -f /dev/null"]
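The download URL assembled in the Dockerfile is plain string interpolation of the two version variables. A quick Python sketch (illustrative only) shows the URL the build will fetch:

```python
# Mirror the Dockerfile's version variables and URL template
SPARK_VERSION = "3.2.4"
HADOOP_VERSION = "3.2"

download_url = (
    f"https://dlcdn.apache.org/spark/spark-{SPARK_VERSION}/"
    f"spark-{SPARK_VERSION}-bin-hadoop{HADOOP_VERSION}.tgz"
)
print(download_url)
```

If the build fails at the download step, pasting this URL into a browser is a quick way to check whether that Spark/Hadoop combination is still hosted on dlcdn.apache.org (note that older releases tend to move to archive.apache.org).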

To launch the container, follow the instructions below, or see the latest instructions in my github repo here 

First, build the image from the Dockerfile using the ‘docker build’ command:

docker build -t spark-in-docker .

Then run the image using the ‘docker run’ command:


hostfolder="$(pwd)"
dockerfolder="/home/sam/app"
docker run --rm -it \
    -p 4040:4040 \
    -v ${hostfolder}:${dockerfolder} \
    --entrypoint bash spark-in-docker:latest

Now you can launch the Spark shell by running `pyspark` inside the container.

Let's run a simple Spark application to test our setup.


import pyspark.sql.functions as f

# In the pyspark shell, `spark` (a SparkSession) is already defined.
# Read the text file: one row per line, in a single column named 'value'.
# (This assumes a textfile.txt in the working directory.)
df_text_file = spark.read.text("textfile.txt")
df_text_file.show()

# Words per line: split each line on spaces and count the tokens.
df_total_words = df_text_file.withColumn('wordCount', f.size(f.split(f.col('value'), ' ')))
df_total_words.show()

# Word frequencies: explode each line into one row per word, then group and count.
df_word_count = (df_text_file
    .withColumn('word', f.explode(f.split(f.col('value'), ' ')))
    .groupBy('word')
    .count()
    .sort('count', ascending=False))
df_word_count.show()
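If the explode/groupBy pipeline looks opaque, the same word count can be written in plain Python with `collections.Counter` (a sketch for intuition only; Spark computes the same thing, but distributed across partitions):

```python
from collections import Counter

# Stand-in for the rows Spark reads from textfile.txt
lines = [
    "spark makes big data simple",
    "big data needs big tools",
]

# line.split(" ") mirrors f.split(f.col('value'), ' ');
# Counter mirrors groupBy('word').count()
word_count = Counter(word for line in lines for word in line.split(" "))

# most_common() mirrors sort('count', ascending=False)
print(word_count.most_common(3))
```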

You can access the Spark UI at http://localhost:4040 while a Spark application is running.
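A tiny helper (hypothetical, standard library only) can tell you whether the UI is reachable from the host, e.g. to confirm that the `-p 4040:4040` port mapping worked. Keep in mind the UI is only served while a SparkSession is active:

```python
import urllib.request
import urllib.error

def spark_ui_up(url: str = "http://localhost:4040", timeout: float = 2.0) -> bool:
    """Return True if something answers HTTP at the Spark UI address, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False
```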

2. docker-compose: the same image can be built and run with docker-compose. Create a docker-compose.yml file:

version: '3'
services:
  single-node:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - 4040:4040
    volumes:
      - ./app:/home/sam/app

To build the docker image, run the command below:

docker-compose build

To start the container in detached mode, run the `docker-compose up` command as given below:

docker-compose up -d

3. Automation Script to build and launch the container:

#!/bin/bash

build() {
    docker-compose build
}

run() {
    docker-compose up -d
}

build_and_run() {
    build
    run
}

# Check the command-line argument
if [[ $# -eq 0 ]]; then
    echo "Usage: start.sh [build | run | build_and_run]"
    exit 1
fi

# Execute the requested function based on the command-line argument
case $1 in
    "build") build ;;
    "run") run ;;
    "build_and_run") build_and_run ;;
    *) echo "Invalid argument: $1. Usage: start.sh [build | run | build_and_run]" ;;
esac

Source Code:

The github repo with the source code and a readme for this blog is here:

Thank you, and feel free to post your views, queries, and feedback in the comments section.
