Public Repository

Last pushed: 3 years ago
Short Description
Image for spark-mail tutorial at
Full Description

Running Spark Mail Project on Docker

This repository is based on SequenceIQ's Docker Spark,
with a customized Spark distro and some files we need to run the Spark Mail Tutorial.
The Dockerfile starts from the sequenceiq/hadoop-docker:2.6.0 base image (see Hadoop Docker image).

Obtaining medale/spark-mail-docker from DockerHub

sudo docker pull medale/spark-mail-docker:v1.3.1

Run spark-mail-docker image

Simple run (no shared drive)

  • -P maps the image's exposed ports to host ports (see docker ps -l for the mapping)
  • -i runs in interactive mode
  • -t allocates a tty terminal
  • -h sets the hostname of the container to "sandbox"
  • medale/spark-mail-docker:v1.3.1 image and version of image
  • /etc/ - complete bootstrap
  • bash - then run bash (login as root)

    sudo docker run -P -i -t -h sandbox medale/spark-mail-docker:v1.3.1 /etc/ bash
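Because -P picks the host ports for you, it can be handy to look up a single mapping instead of reading the whole docker ps -l line. The following is a minimal sketch using the docker port subcommand. The function name mapped_port and the DOCKER variable are illustrative names that are not part of this repository, and port 4040 is taken from the EXPOSE line of the Dockerfile shown further down.

```shell
# Docker CLI invocation; set DOCKER=docker if your user does not need sudo.
DOCKER="${DOCKER:-sudo docker}"

# Print the host address and port that Docker mapped to one container port.
# $1: container ID or name, $2: container port (for example 4040).
# (mapped_port is a hypothetical helper name, not part of this repository.)
mapped_port() {
    $DOCKER port "$1" "$2"
}
```

For example, mapped_port bb5cf832bd76 4040 asks Docker which host port forwards to the Spark UI port of that container.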

Mounting a shared drive to the container

  • -v mounts host /opt/rpm1 on container /opt/rpm1 (share files between container and host)

    sudo docker run -v /opt/rpm1:/opt/rpm1 -P -i -t -h sandbox medale/spark-mail-docker:v1.3.1 /etc/ bash
> Starting sshd:                                             [  OK  ]
> Starting namenodes on [sandbox]
> sandbox: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-sandbox.out
> localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-sandbox.out

Image layout

Running the image with the bash command brings you to a shell prompt as root:

> /root
> mailrecord-utils-1.0.0-shaded.jar
hdfs dfs -ls
> Found 2 items
> -rw-r--r--   1 root supergroup  324088129 2015-03-01 22:11 enron.avro
> drwxr-xr-x   - root supergroup          0 2015-01-15 04:05 input
> SPARK_HOME=/usr/local/spark
> JAVA_HOME=/usr/java/default
> YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop
> ...

Running Spark with kryo serialization

(if you are not in /root, cd /root)
> Spark assembly has been built with Hive, including Datanucleus jars on classpath
> ...
> scala>

Analytic 1

import org.apache.spark.rdd._
import com.uebercomputing.mailparser.enronfiles.AvroMessageProcessor
import com.uebercomputing.mailrecord._
import com.uebercomputing.mailrecord.Implicits.mailRecordToMailRecordOps
val args = Array("--avroMailInput", "enron.avro")
val config = CommandLineOptionsParser.getConfigOpt(args).get
val recordsRdd = MailRecordAnalytic.getMailRecordsRdd(sc, config)
val d = recordsRdd.filter(record => record.getFrom == "")
d.count
> resN: Long = 8

Accessing Spark Web UI

Docker assigns an internal IP address to the container we started. To determine
this IP address, we can either run the following on the host machine:

sudo docker ps
> CONTAINER ID        IMAGE                      COMMAND                CREATED             STATUS              PORTS
> bb5cf832bd76        medale/spark-mail-docker:v1.3.1   "/etc/ /   13 minutes ago      Up 13 minutes       0.0.0
sudo docker inspect --format="{{.NetworkSettings.IPAddress}}" bb5cf832bd76
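The two host-side steps can be combined so that the container ID does not have to be copied by hand. This is a sketch only: latest_container, container_ip, and the DOCKER variable are illustrative names, and it assumes the container you want is the most recently created one.

```shell
# Docker CLI invocation; set DOCKER=docker if your user does not need sudo.
DOCKER="${DOCKER:-sudo docker}"

# ID of the most recently created container (the one `docker ps -l` shows).
# (hypothetical helper name)
latest_container() {
    $DOCKER ps -l -q
}

# Internal IP address of the container given as $1.
# (hypothetical helper name)
container_ip() {
    $DOCKER inspect --format '{{.NetworkSettings.IPAddress}}' "$1"
}
```

With both functions defined, container_ip "$(latest_container)" prints the address in one step.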

Or we could run the following inside the container:

> eth0      Link encap:Ethernet  HWaddr 02:42:AC:11:00:12  
          inet addr:  Bcast:  Mask:

Now we can go to the Resource Manager from a local browser on the host:

From there, click on ApplicationMaster (under TrackingUI column). The link goes
to something like:

Creating a host alias to avoid having to replace "sandbox"

Alternatively, on Linux, we can use the HOSTALIASES environment variable to
temporarily map sandbox to the container IP address and then run our browser
with that environment variable to translate sandbox references to the container IP:

On your host, edit host-alias with the container IP address.
For example this host-alias:



export HOSTALIASES=host-alias
# start firefox in background with that environment variable set

In the browser, go to http://sandbox:8088/. Now all http://sandbox... links
on that page should work.
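As a rough sketch of the alias file itself: hostname(7) describes a HOSTALIASES file as lines of two whitespace-separated strings, the alias first and its replacement second. The helper below writes one such line. The name write_host_alias is illustrative, and the address 192.0.2.10 in the usage note is a documentation placeholder, not the real container address. Whether the resolver accepts an IP address as the replacement may depend on your system, so treat this as an assumption to verify.

```shell
# Write a HOSTALIASES file containing one "alias replacement" line.
# $1: file to write, $2: alias (for example sandbox), $3: replacement for the alias.
# (write_host_alias is a hypothetical helper name.)
write_host_alias() {
    printf '%s %s\n' "$2" "$3" > "$1"
}
```

For example, write_host_alias host-alias sandbox 192.0.2.10 creates a file named host-alias in the current directory. Substitute the address reported by docker inspect for the placeholder.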

Image Overview

In addition to Hadoop we have:

Spark 1.3.1 - Pre-built for Hadoop 2.6 and later.
Downloaded and added to spark-mail-docker as spark-1.3.1-bin-hadoop2.6.tgz.
Add spark-1.3.1-bin-hadoop2.6.tgz to .gitignore.

Other files not in this github repo

  • enron-small.avro - an arbitrary subset (however big you want to process) of the Avro version of the Enron emails.
    See Spark Mail for overview of how to
    obtain the emails and convert them to .avro format. Also see Main.scala.

Additional files

  • - put into SPARK_HOME/conf to suppress DEBUG and INFO messages for less clutter
  • - script to start up Spark shell in yarn-client mode with Kryo

Building medale/spark-mail-docker locally

  • must copy in spark-1.3.1-bin-hadoop2.6.tgz (see above)
  • must create enron-small.avro (see above)

Add both files to the docker-spark directory (the same directory as the Dockerfile).

sudo docker build -t medale/spark-mail-docker .
sudo docker images  # lists image IDs (assumed here to be e57ff7c77397)
sudo docker tag e57ff7c77397 medale/spark-mail-docker:v1.3.1
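Since the build fails at the ADD steps when either file is absent, a quick pre-flight check can save a wasted build. The sketch below only tests for the two files named above. The function name check_build_inputs is illustrative and not part of this repository.

```shell
# Report which of the two required build inputs are missing from directory $1
# (default: the current directory). Returns 0 when both are present, 1 otherwise.
# (check_build_inputs is a hypothetical helper name.)
check_build_inputs() {
    dir="${1:-.}"
    missing=0
    for f in spark-1.3.1-bin-hadoop2.6.tgz enron-small.avro; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

One possible use is check_build_inputs . && sudo docker build -t medale/spark-mail-docker:v1.3.1 . which also applies the version tag during the build, since -t accepts a name:tag pair.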

(Optional) Publish to DockerHub

After creating an account on DockerHub,
you can publish your image to a public repo so others can find and pull it
without having to build the image locally:

sudo docker push medale/spark-mail-docker:v1.3.1
sudo docker search medale
sudo docker pull medale/spark-mail-docker:v1.3.1

Other useful Docker commands

# Show available images
sudo docker images
> REPOSITORY                 TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
> medale/spark-mail-docker          v1.3.1              5e4665af6d6e        13 hours ago        2.84 GB
> ...

# Delete a local image
sudo docker rmi training/sinatra

# Delete a stopped instance
sudo docker ps -a -q # show all containers, just id
sudo docker rm <container id>

# Show the most recently created container (including its id)
sudo docker ps -l

# Commit changes to an image
sudo docker commit <container_id> medale/new_image

For background see dockerimages
and Docker builder reference.

Dockerfiles lore

ADD local.jar /some-container-location
If src is a local tar archive in a recognized compression format (identity, gzip, bzip2 or xz) then it is unpacked as a directory

FROM sequenceiq/hadoop-docker:2.6.0

# automatically untars spark-1.3.1 at /usr/local
ADD spark-1.3.1-bin-hadoop2.6.tgz /usr/local/
RUN cd /usr/local && ln -s spark-1.3.1-bin-hadoop2.6 spark
ENV SPARK_HOME /usr/local/spark
ADD /usr/local/spark/conf/

# Upload sample files and jar file
ADD enron-small.avro /root/
ADD mailrecord-utils-1.0.0-shaded.jar /root/
ADD /root/
ADD /root/
RUN chmod +x /root/

# Copy spark libs and enron email to HDFS
RUN $BOOTSTRAP && $HADOOP_PREFIX/bin/hadoop dfsadmin -safemode leave \
    && $HADOOP_PREFIX/bin/hdfs dfs -put $SPARK_HOME/lib /spark \
    && $HADOOP_PREFIX/bin/hdfs dfs -put /root/enron-small.avro /user/root/enron.avro


# Now that enron.avro is in HDFS we don't need it in local
RUN rm /root/enron-small.avro

# update boot script
COPY /etc/
RUN chown root.root /etc/
RUN chmod 700 /etc/

ENTRYPOINT ["/etc/"]

EXPOSE 4040 8080 18080