Container with Python (Anaconda 3), Apache Spark 1.6.1, PySpark and Jupyter for PyData Berlin 2016.

Using Spark - With PySpark Workshop, at PyData Berlin 2016

This docker image contains:

  • Spark 1.6.1
  • minimal Anaconda3
  • Python 3.5
  • numpy, pandas, jupyter, findspark


Clone the Git repo with the files for the workshop

git clone

This creates a local directory called pyspark-workshop which we'll use below.
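To confirm the clone succeeded, a quick sketch (the directory name follows from the step above):

```shell
# Check that the cloned workshop directory is present.
WORKSHOP_DIR=pyspark-workshop
if [ -d "$WORKSHOP_DIR" ]; then
  echo "found $WORKSHOP_DIR"
else
  echo "missing $WORKSHOP_DIR - run the git clone step first"
fi
```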

Pull the image

docker pull gerhardt/pyspark-workshop
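To check that the image arrived, a small sketch (assumes docker is installed; prints a fallback otherwise):

```shell
# List the local image ID; an empty result means the pull has not completed.
IMAGE_ID=$(docker images -q gerhardt/pyspark-workshop 2>/dev/null || true)
echo "image: ${IMAGE_ID:-not present (or docker unavailable)}"
```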

Run the image

In the first step you cloned the workshop files to a directory called pyspark-workshop. Replace "/home/you/pyspark-workshop" in the following command with the full path of that directory on your machine.

For Windows and Mac users, all in one line:

docker run -d --net host --name pyspark-worker -v /home/you/pyspark-workshop:/home/user/work -e SPARK_MODE=slave -e SPARK_MASTER_ADDRESS=spark://bence:7077 -e JUPYTER_VISIBLE=true gerhardt/pyspark-workshop

For Linux users:

docker run -d --net host --name pyspark-worker -v /home/you/pyspark-workshop:/home/user/work -e SPARK_MODE=slave -e SPARK_MASTER_ADDRESS=spark://bence:7077 gerhardt/pyspark-workshop
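Either way, you can verify that the container actually started; a minimal sketch (assumes docker is on your PATH, and prints a fallback when it is not):

```shell
# Show the container's status; empty output from docker ps means
# the container named pyspark-worker is not running.
STATUS=$(docker ps --filter name=pyspark-worker --format '{{.Status}}' 2>/dev/null || true)
echo "pyspark-worker: ${STATUS:-not running (or docker unavailable)}"
```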

Open Jupyter

On Linux open http://localhost:8888.
You should see the Jupyter start page showing the contents of the pyspark-workshop directory.

On Windows and Mac you have to find the address of the VirtualBox VM.
On the command line invoke

docker-machine ip

Open http://IPADDRESS:8888, using the IPADDRESS printed in the previous step.
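The two cases can be folded into one hedged sketch: if docker-machine is available its IP is used (Windows/Mac), otherwise localhost (as on Linux):

```shell
# Resolve the host for the Jupyter URL: VM IP on Windows/Mac, localhost on Linux.
IP=$(docker-machine ip 2>/dev/null) || IP=localhost
echo "Open http://${IP}:8888"
```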

Run the self check

Open the notebook called "Check Set-up" and run all cells.
If the setup is correct, you should see a message at the bottom confirming it.
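The notebook's check boils down to importing PySpark inside the container. A rough command-line equivalent is sketched below (the container name comes from the run step; the notebook's exact cells may differ):

```shell
# Run a minimal PySpark import inside the running container.
if command -v docker >/dev/null 2>&1; then
  MSG=$(docker exec pyspark-worker python -c "import findspark; findspark.init(); import pyspark; print('PySpark import OK')" 2>/dev/null || echo "container not reachable")
else
  MSG="docker not found on PATH"
fi
echo "$MSG"
```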
