Short Description
pyspark on docker.
Full Description


This Docker image lets you run Spark on Docker with the following components:

  1. Apache Spark 2.1.1
    • running on Hadoop 2.7.3 and Java openjdk version "1.8.0_92-internal"
  2. Python 3.4.3
  3. Spark's Python interface (pyspark)

If you change the version of pytest, you need to add a git tag to the commit.
For instance, if the Hadoop version is 2.7.3 and the Spark version is 2.1.1, run the following commands.

git tag -a spark-2.1.1-hadoop-2.7.3
git push origin spark-2.1.1-hadoop-2.7.3
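
The tag name in the commands above follows a spark-<version>-hadoop-<version> pattern. As a minimal sketch, a small shell function can assemble it from the two version numbers. Note that make_tag is a hypothetical helper name, not part of this repository, and the -m message in the comments is only illustrative.

```shell
# Hypothetical helper: build the tag name from the Spark and Hadoop versions.
make_tag() {
  spark_version="$1"
  hadoop_version="$2"
  printf 'spark-%s-hadoop-%s\n' "$spark_version" "$hadoop_version"
}

tag="$(make_tag 2.1.1 2.7.3)"
echo "$tag"   # prints spark-2.1.1-hadoop-2.7.3

# Inside the repository, the tag could then be created and pushed with:
#   git tag -a "$tag" -m "Spark 2.1.1 on Hadoop 2.7.3"
#   git push origin "$tag"
```

Keeping the versions in one place avoids typing the tag name twice and mistyping it in the push command.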

How to install

On Mac OS X

1. Install Homebrew

2. Install Docker and launch the Docker daemon

brew install --cask docker  # on older Homebrew versions: brew cask install docker

Launch the application, and make sure it displays "Docker is running".

On other OSes

Follow the installation instructions in the official Docker documentation.

Starting pyspark

On any OS

1. Pull the docker image

docker pull makotonagai/docker-pyspark:latest

2. Start the container

Run the following command to start the container and get a bash prompt:

docker run -it makotonagai/docker-pyspark:latest /bin/bash

3. Start pyspark

./bin/pyspark  # open an interactive python shell with SparkContext as sc

To quit the interpreter, hit <Ctrl> + D.

How to run a cluster of containers with Docker Compose

Example docker-compose.yml files are provided in the example directory.

cd example
docker-compose up  # launch cluster (Ctrl-C to stop)

The SparkUI will be running at http://${YOUR_DOCKER_HOST}:8080 with one
worker listed. To run pyspark, exec into a container:

docker exec -it example_master_1 /bin/bash


The Docker image makotonagai/pyspark:pytest-3.1.2 does not support the Spark history server, but the image makotonagai/pyspark:spark-history-server does.
The source of that image is maintained in the spark-history-server branch.

How to update the makotonagai/pyspark:spark-history-server image

Modify the Dockerfile in the spark-history-server branch.
Then push the branch to GitHub.
Docker Hub builds the branch automatically.

How to run spark-history-server

To view Spark event logs in the Spark history server, put the logs into $(pwd)/logs and run the container with the following command.

docker run --rm -d \
  -p 18080:18080 \
  --volume "$(pwd)/logs":/tmp/spark-events \
  makotonagai/pyspark:spark-history-server \
  bash /tmp/scripts/

Then open http://localhost:18080/ with your browser.
The scripts are included in the image under /tmp/scripts/.
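
The history server only shows applications that wrote event logs, so the logs directory has to be filled by jobs that have event logging enabled. The following is a minimal sketch of the two Spark properties involved (spark.eventLog.enabled and spark.eventLog.dir). Note that event_log_properties is a hypothetical helper name, and the sketch assumes the job runs somewhere that can write to the given absolute path.

```shell
# Hypothetical helper: print Spark properties that enable event logging
# into the given directory (an absolute path is expected).
event_log_properties() {
  log_dir="$1"
  printf 'spark.eventLog.enabled true\n'
  printf 'spark.eventLog.dir file://%s\n' "$log_dir"
}

# Example: print the properties for a logs directory under the current directory.
event_log_properties "$(pwd)/logs"
```

One way to apply the output is to save it to a file and pass that file to spark-submit with --properties-file. Other setups may prefer individual --conf options.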

(OPTIONAL) Building the docker image yourself

You can build this Docker image by running the following command in
the same directory as this README file. The command will be slow (a
few minutes) the first time, since all dependencies need to be fetched and
compiled from source, but the result is then cached. This step should
only be necessary if you modify the Dockerfile.

docker build -t docker-pyspark .


If you are unable to access HDFS from pyspark, try running pyspark with the
--master yarn flag.

If you are unable to access the HTTP SparkUI, verify that the open ports are
redirected from your virtual machine to your host machine. Under VirtualBox,
see the machine's Settings > Network > Port Forwarding.
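
One way to check from the host whether a forwarded port answers is to poll it with curl. The following is a minimal sketch that assumes curl is installed on the host. Note that wait_for_http is a hypothetical helper name, and the URL in the usage note assumes that the SparkUI port 8080 is forwarded to localhost.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: wait_for_http URL [ATTEMPTS]   (ATTEMPTS defaults to 30)
# Returns 0 as soon as the URL responds successfully, 1 otherwise.
wait_for_http() {
  url="$1"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failure, -s: silent, --max-time: cap each try
    if curl -fs --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

For example, wait_for_http "http://localhost:8080/" 10 && echo "SparkUI reachable" tries for roughly ten seconds. If it fails while the container is running, the port forwarding is the likely culprit.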
