Docker container for Apache Spark forked from SequenceIQ

Apache Spark on Docker

This repository contains a Dockerfile that builds a Docker image with Apache Spark running on YARN, with Hadoop support included. The image was originally based on SequenceIQ's hadoop-docker and has since been migrated to Ubuntu 14.04.4, available on GitHub. The base Hadoop Docker image is also available as a published Docker image.

Pull the image from Docker Repository

docker pull bernieai/docker-spark:latest

Building the image

docker build --rm -t bernieai/docker-spark:latest .

Pre-flight setup

You will have to address some Hadoop prerequisites before this image can be deployed correctly. Designate one "master" node (its container hostname must be master.cluster); it needs to be able to SSH into each slave node, which is best achieved with key-based authentication. The instructions here give some background on doing this manually, but the base docker-hadoop image generates a key for you automatically.

Since a public key is already available on the master node, you will need to copy its contents to the ~/.ssh/authorized_keys file on each of the other machines. Once this is done, the master will be able to log in to them. Remember that if you restart your master node, its keys may be regenerated and you will need to copy them again.
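
The copy step above can be sketched as a small shell helper that appends a public key to authorized_keys only when it is not already there, so it is safe to re-run after the master regenerates its key. This is an illustration rather than something shipped with the image: the function name add_authorized_key and the key file path in the usage line are hypothetical, and the key file is assumed to hold a single-line public key.

```shell
#!/bin/sh
# add_authorized_key KEY_FILE [AUTH_FILE]
# Append the public key stored in KEY_FILE to AUTH_FILE
# (default: ~/.ssh/authorized_keys) unless that exact line is already present.
add_authorized_key() {
    key_file=$1
    auth_file=${2:-"$HOME/.ssh/authorized_keys"}

    # Make sure the target file exists and is private to its owner.
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"

    key=$(cat "$key_file")

    # -x matches whole lines, -F treats the key as a fixed string, -q is quiet.
    if ! grep -qxF -- "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
}
```

On a slave, after saving the master's public key to a file of your choosing, the call would look like add_authorized_key /tmp/master_key.pub (the path is a placeholder).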

This image comes with a default key, which is printed in the container logs (docker logs container_id). You will need this key in order to SSH into the container. It is strongly suggested that you remove the default key and replace it with your own.

You will want to follow these general steps to set up your cluster:

  • Optional: set up a virtual network such as weave
  • Set up all of your slaves before starting the master
  • On one of your slaves, check the logs to obtain your SSH key, and save it where you will need it
  • Start the master container
  • Optional: swap out the default SSH keys with your own
  • Add all slaves to the master's /etc/hosts
  • On the master, start YARN and DFS using $HADOOP_PREFIX/sbin/start-yarn.sh; $HADOOP_PREFIX/sbin/start-dfs.sh
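
The "add all slaves to the master's /etc/hosts" step can be sketched as follows. This is a minimal helper under stated assumptions, not part of the image: the function names host_known and add_host are hypothetical, and the IP addresses in the usage line are placeholders for whatever addresses your containers actually have.

```shell
#!/bin/sh
# host_known NAME HOSTS_FILE
# Succeed if NAME already appears as a hostname on a non-comment line.
host_known() {
    awk -v name="$1" '
        /^[[:space:]]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit found ? 0 : 1 }
    ' "$2" 2>/dev/null
}

# add_host IP NAME [HOSTS_FILE]
# Append "IP NAME" to HOSTS_FILE (default: /etc/hosts) unless NAME is already listed.
add_host() {
    ip=$1
    name=$2
    hosts_file=${3:-/etc/hosts}
    if ! host_known "$name" "$hosts_file"; then
        printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
    fi
}
```

On the master, one call per slave would then look like add_host 10.0.0.11 slave1.cluster. Because existing names are skipped, the calls can be repeated whenever a slave is added.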

Running the image

  • If you are using boot2docker, make sure your VM has more than 2 GB of memory
  • In your /etc/hosts file, define 'master.cluster' so that the sandbox UI on the master node is easier to reach and so that slaves can reach the ResourceManager

When booting up a master node with ResourceManager and NameNode:

docker run -dit -h master.cluster -e SLAVE_SIZE=7 -p 19888:19888 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8033:8033 -p 8040:8040 -p 8042:8042 -p 8088:8088 -p 4040:4040 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 50090:50090 -p 8020:8020 -p 9000:9000 -p 2122:2122 -p 49707:49707 bernieai/docker-spark:latest -d

When booting up a slave/worker:

docker run -dit -h slave1.cluster -p 19888:19888 -p 8030:8030 -p 8031:8031 -p 8032:8032 -p 8033:8033 -p 8040:8040 -p 8042:8042 -p 8088:8088 -p 4040:4040 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 50090:50090 -p 8020:8020 -p 9000:9000 -p 2122:2122 -p 49707:49707 bernieai/docker-spark:latest -d
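
The two commands above differ only in the hostname and the SLAVE_SIZE variable, so the long port list can be generated instead of retyped. The sketch below only prints the command rather than running it; the names PORTS, port_flags, and run_node_cmd are hypothetical helpers, while the port numbers, image name, and flags are copied from the commands above.

```shell
#!/bin/sh
# Ports published by both run commands, in the same order.
PORTS="19888 8030 8031 8032 8033 8040 8042 8088 4040 50010 50020 50070 50075 50090 8020 9000 2122 49707"

# port_flags: print one "-p N:N" flag per port in $PORTS, on a single line.
port_flags() {
    flags=""
    for port in $PORTS; do
        flags="$flags -p $port:$port"
    done
    # Drop the leading space left by the loop.
    printf '%s\n' "${flags# }"
}

# run_node_cmd HOSTNAME [extra docker options...]
# Print (not execute) the docker run command for a node.
run_node_cmd() {
    host=$1
    shift
    extra=""
    if [ "$#" -gt 0 ]; then
        extra="$* "
    fi
    printf 'docker run -dit -h %s %s%s bernieai/docker-spark:latest -d\n' \
        "$host" "$extra" "$(port_flags)"
}
```

For example, run_node_cmd master.cluster -e SLAVE_SIZE=7 prints the master command and run_node_cmd slave1.cluster prints the slave command; pipe the output to sh once it looks right. Since every node publishes the same host ports, this assumes each container runs on its own Docker host.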

Once your nodes are up and running, you can log in via SSH:

ssh -p 2122 -i your_ssh_key root@container_ip

Versions

Hadoop 2.6.0 and Apache Spark v1.6.1 on Ubuntu 14.04.4

Testing

There are two deploy modes that can be used to launch Spark applications on YARN.

YARN-client mode

In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

# run the spark shell
spark-shell \
--master yarn-client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1

# execute the following command, which should return 1000
scala> sc.parallelize(1 to 1000).count()

YARN-cluster mode

In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.

Estimating Pi (yarn-cluster mode):

# execute the following command, which should write a line like "Pi is roughly 3.1418" to the logs
# note: you must pass the --files argument in cluster mode to enable metrics
spark-submit \
--class org.apache.spark.examples.SparkPi \
--files $SPARK_HOME/conf/metrics.properties \
--master yarn-cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
$SPARK_HOME/lib/spark-examples-1.6.1-hadoop2.6.0.jar
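
Because the driver runs inside YARN in cluster mode, the SparkPi result ends up in the application's container logs rather than on your terminal. A minimal sketch of pulling it out, assuming log aggregation is enabled on the cluster so that yarn logs can retrieve the output; the helper name pi_line is hypothetical and the application ID in the usage comment is a placeholder for the ID that spark-submit reports.

```shell
#!/bin/sh
# pi_line: read log text on stdin and print the first line
# containing "Pi is roughly", then stop.
pi_line() {
    grep -m 1 'Pi is roughly'
}

# Usage (replace the placeholder with your application's ID):
#   yarn logs -applicationId <application_id> | pi_line
```

The estimate is computed by random sampling, so the digits after "Pi is roughly" will vary from run to run.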

Estimating Pi (yarn-client mode):

# execute the following command, which should print a line like "Pi is roughly 3.1418" to the screen
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
$SPARK_HOME/lib/spark-examples-1.6.1-hadoop2.6.0.jar