Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications. Consisting of Docker Engine, a portable, lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and automating workflows, Docker enables apps to be quickly assembled from components and eliminates the friction between development, QA, and production environments. As a result, IT can ship faster and run the same app, unchanged, on laptops, data center VMs, and any cloud.
Docker images are the basis of containers. Images are read-only, while containers are writable. Only containers can be executed by the operating system.
| Branch | Base Image | Description |
|---|---|---|
| master | gelog/java:openjdk7 | Spark pre-built for Hadoop |
| spark-for-hadoop | gelog/java:openjdk7 | Spark pre-built for Hadoop (dev branch) |
| spark-from-source | scala:2.10.4 | Spark built from source |
Note: the spark-from-source image currently takes quite a while to build and has a virtual size of 2.3 GB.
The recommended branch for general use is master.
docker run -d -h spark-master --name spark-master gelog/spark:1.1.0-bin-hadoop2.3 \
spark-class org.apache.spark.deploy.master.Master
docker run -d -h spark-worker-01 --name spark-worker-01 --link spark-master:spark-master \
gelog/spark:1.1.0-bin-hadoop2.3 spark-class org.apache.spark.deploy.worker.Worker \
docker pull gelog/spark
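A standalone Spark worker expects the master's URL as the final argument to `org.apache.spark.deploy.worker.Worker`. The following is a minimal sketch, not the image's documented procedure: it assumes the master listens on Spark's default standalone port 7077 and is reachable under the linked hostname `spark-master`. The helper names `master_url` and `worker_cmd` are hypothetical. The script only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Hypothetical helper: build a standalone master URL from a hostname and port.
# Assumption: 7077 is the port in use (Spark's default for a standalone master).
master_url() {
    printf 'spark://%s:%s\n' "$1" "${2:-7077}"
}

# Hypothetical helper: print (not run) the launch command for a worker
# container named $1, linked to the container named spark-master.
worker_cmd() {
    printf 'docker run -d -h %s --name %s --link spark-master:spark-master gelog/spark:1.1.0-bin-hadoop2.3 spark-class org.apache.spark.deploy.worker.Worker %s\n' \
        "$1" "$1" "$(master_url spark-master)"
}

# Print the command for the first worker.
worker_cmd spark-worker-01
```

Calling `worker_cmd spark-worker-02` prints the equivalent command for a second worker. To actually start a container, pipe the printed line to `sh` once it looks right.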