Last pushed: 2 years ago
Apache Spark 1.6.0 cluster Docker image

This image is built on top of a base image, so please read that image's documentation first.

The project's Git repository is available on GitHub.

Build the image

Before you build, please download the following: Scala 2.10 and Apache Spark 1.6.0.

curl -LO
curl -LO

If you'd like to build directly from the Dockerfile, you can build the image with:

sudo docker build -t sfedyakov/spark-160-cluster .

Start cluster

docker-compose scale namenode=1 datanode=2

You may want to change the replication factor to something greater than 1. That's easy: the loop below edits hdfs-site.xml in every running container, changing the value from 1 to 2.

for z in $(docker ps -q) ; do docker exec $z sed -i 's/<value>1/<value>2/' /usr/local/hadoop/etc/hadoop/hdfs-site.xml ; done
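
To see what that substitution does before running it against the containers, here is a minimal sketch that applies the same sed expression to a local sample file. The XML fragment is an assumption for illustration (the real hdfs-site.xml in the image may contain other properties), and the temp file is purely hypothetical.

```shell
# Hypothetical hdfs-site.xml fragment; the file shipped in the image may differ.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

# Same expression as in the loop above, but written to stdout instead of
# editing in place, so the sample file is left untouched.
patched=$(sed 's/<value>1/<value>2/' "$tmp")
printf '%s\n' "$patched"

rm -f "$tmp"
```

Note that the expression rewrites every line containing `<value>1`, including values such as `<value>10`, so it is only safe when the replication setting is the only property whose value starts with 1.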

First, prepare test data

docker exec -it namenode /bin/bash --login
hdfs dfs -mkdir -p /tmp/wnp/input/ 
curl -LO 
hdfs dfs -put pg2600.txt /tmp/wnp/input/ 

Now run Spark shell

spark-shell --master yarn

Count words in War and Peace

val lines = sc.textFile("/tmp/wnp/input/pg2600.txt")
lines.flatMap(l => l.split(" ")).map(w => (w.toLowerCase, 1)).reduceByKey(_+_).sortBy(z => (z._2, z._1)).collect().takeRight(50)
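
As a rough local cross-check of the Spark job, the same logic (split on spaces, lowercase, count, sort ascending by count and then word) can be sketched as a shell pipeline. This is not part of the original project; the function name `count_words` and the inline sample are assumptions for illustration, and counts may differ slightly from Spark's because trailing spaces are tokenized differently.

```shell
# Count words roughly the way the Spark job does: split on single spaces,
# lowercase, count occurrences, then sort ascending by (count, word).
count_words() {
  tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort | uniq -c \
    | awk '{ print $1 "\t" $2 }' | LC_ALL=C sort -k1,1n -k2,2
}

# Tiny inline sample instead of pg2600.txt, so the sketch runs anywhere.
sample='War and Peace
war and war'

# Keep the last two rows, i.e. the two most frequent words.
result=$(printf '%s\n' "$sample" | count_words | tail -n 2)
printf '%s\n' "$result"
```

Against the real file, `count_words < pg2600.txt | tail -n 50` would correspond to the `takeRight(50)` call above.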

