Using Spark with PySpark: Workshop at PyData Berlin 2016
This Docker image contains:
- Spark 1.6.1
- minimal Anaconda3
- numpy, pandas, jupyter, findspark
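findspark is included so that notebooks inside the container can locate the Spark installation without manual PYTHONPATH setup. A minimal sketch of how it is typically used in a notebook cell (the master URL and app name below are illustrative, not taken from the image's configuration):

    import findspark
    findspark.init()   # locates the Spark installation via SPARK_HOME

    import pyspark     # importable only after findspark.init()
    sc = pyspark.SparkContext("local[2]", "findspark-smoke-test")
    print(sc.version)  # should print 1.6.1 inside this image
    sc.stop()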
Clone the Git repo with the files for the workshop
git clone https://gitlab.com/gerhardt.io/pyspark-workshop.git
This creates a local directory called pyspark-workshop which we'll use below.
Pull the image
docker pull gerhardt/pyspark-workshop
Run the image
When you cloned the repo above, the workshop files went into a local directory called pyspark-workshop. In the following command, replace /home/you/pyspark-workshop with the full path of that directory on your machine.
For Windows and Mac users, all in one line:
docker run -d --net host --name pyspark-worker -v /home/you/pyspark-workshop:/home/user/work -e SPARK_MODE=slave -e SPARK_MASTER_ADDRESS=spark://bence:7077 -e JUPYTER_VISIBLE=true gerhardt/pyspark-workshop
For Linux users:
docker run -d --net host --name pyspark-worker -v /home/you/pyspark-workshop:/home/user/work -e SPARK_MODE=slave -e SPARK_MASTER_ADDRESS=spark://bence:7077 gerhardt/pyspark-workshop
On Linux open http://localhost:8888.
You should see the Jupyter start page showing the contents of the pyspark-workshop directory.
On Windows and Mac you first have to find the IP address of the VirtualBox VM.
On the command line invoke (assuming the standard Docker Toolbox set-up, where the VM is named "default"):
docker-machine ip default
Open http://IPADDRESS:8888 with the IPADDRESS from the previous step.
Run the self check
Open the notebook called "Check Set-up" and run all cells.
If everything is set up correctly, you should see a confirmation message at the bottom of the notebook.
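If you want to verify the set-up by hand, a cell along these lines exercises the same pieces. This is only a sketch, not the contents of the "Check Set-up" notebook; the master URL is copied from the docker run command above and may differ in your session:

    import findspark
    findspark.init()

    import pyspark
    sc = pyspark.SparkContext("spark://bence:7077", "setup-check")  # master from the docker run command
    total = sc.parallelize(range(100)).sum()                        # trivial end-to-end job
    print("Spark", sc.version, "computed", total)                   # expect 4950
    sc.stop()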