Apache Spark container

Docker container for Spark (standalone cluster)

Starting up

This repository contains a set of scripts and configuration files to run an Apache Spark standalone cluster in Docker containers.

To start a master, workers, or a shell (on the same Docker host), invoke the following commands:

# Spawn master
docker run -d --name spark-master dennybaa/spark master
master_ip=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' spark-master)

# Spawn workers
docker run -d --name spark-worker0 dennybaa/spark worker spark://$master_ip:7077
docker run -d --name spark-worker1 dennybaa/spark worker spark://$master_ip:7077

# Spawn shell
docker run --rm -it dennybaa/spark shell --master spark://$master_ip:7077
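The master URL used above can also be derived on demand from the container name. The helper below is a sketch (it is not part of the image) and assumes the master was started as shown and listens on the default port 7077:

```shell
# Helper sketch: resolve a running master container's Spark URL
# from its container name (default port 7077 assumed).
master_url() {
  ip=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' "$1")
  echo "spark://$ip:7077"
}
```

With it, a worker can be pointed at the master without capturing $master_ip by hand, e.g. `docker run -d --name spark-worker0 dennybaa/spark worker "$(master_url spark-master)"`.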

Configuration

There is a set of environment variables and command-line arguments that can be passed to the spark container (daemon). A container can be started with both an environment variable and the corresponding command-line argument; in that case the command-line argument takes priority. Namely, when SPARK_MASTER_WEBUI_PORT=8080 is given and --webui-port 8090 is passed, the master web UI will start on port 8090.
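For example, a master started with both settings (a sketch; the container name and port values are illustrative) ends up with its web UI on the command-line port:

```shell
# Sketch: the environment variable asks for port 8080, but the
# command-line flag takes priority, so the web UI listens on 8090.
spawn_master() {
  docker run -d --name spark-master \
    -e SPARK_MASTER_WEBUI_PORT=8080 \
    dennybaa/spark master --webui-port 8090
}
```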

Container-managed variables: SPARK_MASTER_IP, SPARK_WORKER_IP. These variables override the Spark defaults and bind services inside the container to all addresses (0.0.0.0).

Details about the standalone configuration mode can be found in the documentation. The full list of Spark-related environment variables can also be found in spark-env.sh.template.

Mind that there is no need to pass --properties-file, since a default spark-defaults.conf is already provided. If you want to change the default values, just mount your own file at /spark-defaults.conf in the container.
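For example (a sketch; the local ./spark-defaults.conf path is an assumption), the override could look like this:

```shell
# Sketch: mount a local properties file over the image's
# /spark-defaults.conf so the daemon picks up custom defaults.
spawn_master_with_defaults() {
  docker run -d --name spark-master \
    -v "$PWD/spark-defaults.conf:/spark-defaults.conf" \
    dennybaa/spark master
}
```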

Spark Master

Spark master command line usage:

$ docker run --rm -it dennybaa/spark master --help
Usage: Master [options]

Options:
  -i HOST, --ip HOST     Hostname to listen on (deprecated, please use --host or -h)
  -h HOST, --host HOST   Hostname to listen on
  -p PORT, --port PORT   Port to listen on (default: 7077)
  --webui-port PORT      Port for web UI (default: 8080)
  --properties-file FILE Path to a custom Spark properties file.
                         Default is conf/spark-defaults.conf.

For the list of available master environment options and additional details, please refer to the documentation.

Spark Worker

CLI usage:

$ docker run --rm -it dennybaa/spark worker --help
Usage: Worker [options] <master>

Master must be a URL of the form spark://hostname:port

Options:
  -c CORES, --cores CORES  Number of cores to use
  -m MEM, --memory MEM     Amount of memory to use (e.g. 1000M, 2G)
  -d DIR, --work-dir DIR   Directory to run apps in (default: SPARK_HOME/work)
  -i HOST, --ip HOST       Hostname to listen on (deprecated, please use --host or -h)
  -h HOST, --host HOST     Hostname to listen on
  -p PORT, --port PORT     Port to listen on (default: random)
  --webui-port PORT        Port for web UI (default: 8081)
  --properties-file FILE   Path to a custom Spark properties file.
                           Default is conf/spark-defaults.conf.

For the list of available worker environment options and additional details, please refer to the documentation.
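As an illustration of the options above (a sketch; the core and memory values and the container name are arbitrary), a resource-limited worker could be started like this:

```shell
# Sketch: a worker capped at 2 cores and 2G of memory; $1 is the
# master URL, e.g. spark://$master_ip:7077.
spawn_limited_worker() {
  docker run -d --name spark-worker0 dennybaa/spark worker \
    --cores 2 --memory 2G "$1"
}
```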

Mind that SPARK_WORKER_INSTANCES is not applicable to the container; if you need several workers, start several containers.
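A simple loop over containers gives the same effect (a sketch; the names and count are illustrative):

```shell
# Sketch: start $1 single-worker containers pointed at master URL $2,
# instead of relying on SPARK_WORKER_INSTANCES.
spawn_workers() {
  i=0
  while [ "$i" -lt "$1" ]; do
    docker run -d --name "spark-worker$i" dennybaa/spark worker "$2"
    i=$((i + 1))
  done
}
```

For example, `spawn_workers 3 "spark://$master_ip:7077"` starts spark-worker0 through spark-worker2.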

Other

If you are planning to use the Spark shell, it's worth looking at Zeppelin, which can be used instead of the shell for working with data. It has a pleasant GUI and IPython-like functionality.
