Last pushed: 3 years ago
Short Description
Apache Spark Demo for Local Machines
Full Description

This container is a demonstration environment for Spark 1.1.1, built on the library/ubuntu:latest base image. The image also provides the following tool versions:

  • Java 1.8.0
  • Scala 2.10.4
  • Python 2.7

Step 1: get data

In this example we will use a text extract of movie quotes from IMDb. On your Docker host machine (or a network storage location), download and unzip the file below into a directory accessible through the /data path, so that the extract ends up at /data/quotes.list.


Step 2: run the container

docker run -i -t -v /data:/data briantwalter/spark-demo:latest /start

After the container initializes, details about the environment variables are printed along with the versions of the tools mentioned above. As part of the container start you are also dropped into the interactive spark-shell and should be at the scala> prompt, waiting for input.


Step 3: run some interactive examples

In interactive mode, the Spark Context is provided to us as sc for use in our statements. We can initialize Resilient Distributed Datasets (RDDs), transform RDDs, and perform actions on them.
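
As a minimal sketch of those three steps, the statements below can be typed at the spark-shell prompt. The names numbers, doubled, and total are hypothetical and the data is made up for illustration; only sc comes from the shell itself.

```scala
// Run inside spark-shell, where `sc` (the SparkContext) already exists.
// `numbers`, `doubled`, and `total` are hypothetical names for illustration.
val numbers = sc.parallelize(Seq(1, 2, 3, 4)) // initialize an RDD from a local collection
val doubled = numbers.map(n => n * 2)         // transformation: describes a new RDD lazily
val total = doubled.reduce(_ + _)             // action: triggers the computation
println(total)
```

Transformations such as map only describe a new RDD; nothing is computed until an action such as reduce or count is called.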

RDD sources can be S3, HDFS, and others, but since we're using a local machine and local data, our first step will be to create an RDD from the text file we have in /data/quotes.list.

val quotesFile = sc.textFile("/data/quotes.list")

From here we can perform some basic actions like counting the number of lines.
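
The original command is not shown here. A likely form, assuming the file from Step 1 is at /data/quotes.list, is the count() action; the name lineCount is hypothetical.

```scala
// Run inside spark-shell, where `sc` already exists.
// Assumes the file from Step 1 is available at /data/quotes.list.
val quotesFile = sc.textFile("/data/quotes.list")
val lineCount = quotesFile.count() // action: returns the number of lines as a Long
println(lineCount)
```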


We can also print out the first line of the file.
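
The original command is not shown here either. One way to do this, again assuming the file is at /data/quotes.list, is the first() action; the name firstLine is hypothetical.

```scala
// Run inside spark-shell, where `sc` already exists.
// Assumes the file from Step 1 is available at /data/quotes.list.
val quotesFile = sc.textFile("/data/quotes.list")
val firstLine = quotesFile.first() // action: returns the first element of the RDD
println(firstLine)
```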


We can also perform some transformations such as filters. In this example we're searching the RDD for lines that contain the pattern shirley.

val shirleyCount = quotesFile.filter(line => line.contains("shirley"))

If we interrogate this value (for example by calling shirleyCount.count()), we see that there are actually no lines that match, even though there are a number of great lines in the movie Airplane! that we know contain this string. It must be a case sensitivity issue. We can modify our filter and add an additional method call to overcome any case issues we might be encountering.

val shirleyCount = quotesFile.filter(line => line.toLowerCase.contains("shirley"))
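
The difference between the two filters can be checked without the data file. The sketch below uses a small in-memory sample; the sample lines and the names sample, exact, and anyCase are made up for illustration and are not taken from quotes.list.

```scala
// Run inside spark-shell, where `sc` already exists.
// The sample lines are made up; they are not taken from quotes.list.
val sample = sc.parallelize(Seq(
  "Don't call me Shirley.",
  "SHIRLEY TEMPLE",
  "no match here"))

// Case-sensitive match: neither "Shirley" nor "SHIRLEY" contains "shirley".
val exact = sample.filter(line => line.contains("shirley")).count()

// Lower-casing each line first makes the match case-insensitive.
val anyCase = sample.filter(line => line.toLowerCase.contains("shirley")).count()

println(s"exact=$exact anyCase=$anyCase")
```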

If we want to keep this result in memory for future reuse, so that later actions do not have to re-read and re-filter the file, we can simply call the cache() method.
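
A minimal sketch of caching, using made-up in-memory data and the hypothetical names sample and matches; on the RDD from this walkthrough the equivalent call would be shirleyCount.cache().

```scala
// Run inside spark-shell, where `sc` already exists.
// `sample` and `matches` are hypothetical names with made-up data.
val sample = sc.parallelize(Seq("Don't call me Shirley.", "no match here"))
val matches = sample.filter(line => line.toLowerCase.contains("shirley"))

matches.cache()                   // marks the RDD for in-memory storage; nothing is computed yet
val firstCount = matches.count()  // the first action computes the RDD and populates the cache
val secondCount = matches.count() // later actions can reuse the cached partitions
println(matches.getStorageLevel.useMemory)
```

Note that cache() itself is lazy: the data is only held in memory after the first action has run over the RDD.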
