This container is a demonstration environment of
Spark 1.1.1 built on the
library/ubuntu:latest base image. The toolkit also provides the following versions:
- Java 1.8.0
- Scala 2.10.4
- Python 2.7
Step 1: get data
In this example we will be using a text extract of movie quotes from IMDb. On your Docker host machine (or a network storage location), download and unzip the file below into a directory accessible through the volume mount used in the next step (here, /data).
Step 2: run the container
docker run -i -t -v /data:/data briantwalter/spark-demo:latest /start
After the container initializes, some details about environment variables will be printed, along with the versions of the tools mentioned above. As part of the container start you will also be dropped into the interactive
spark-shell and should be at a prompt like this, waiting for input:
scala>
Step 3: run some interactive examples
In interactive mode the
sc, or SparkContext, is provided to us for use in the statements we run. We can do things like initialize Resilient Distributed Datasets (RDDs), transform RDDs, and perform actions.
RDD sources can be
HDFS and others, but since we're using a local machine and local data, our first step will be to create an RDD from the text file we have in /data.
val quotesFile = sc.textFile("/data/quotes.list")
From here we can perform some basic actions, like counting the number of lines or printing out the first line of the file.
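As a sketch of those two actions at the spark-shell prompt (the values returned will depend on your copy of the quotes file, so none are shown here):

```scala
// Count the number of lines in the RDD; count() is an action,
// so it triggers computation over the file.
quotesFile.count()

// Return the first line of the file.
quotesFile.first()
```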
We can also perform some transformations such as filters. In this example we're searching the RDD for lines that contain the pattern shirley.
val shirleyCount = quotesFile.filter(line => line.contains("shirley"))
If we interrogate this value (for example by counting its elements) we see that there are actually no lines that match, but there are a number of great lines in the movie Airplane! that we know contain this string. It must be a case-sensitivity issue. We can modify our filter, adding an additional method to overcome any case issues we might be encountering.
val shirleyCount = quotesFile.filter(line => line.toLowerCase.contains("shirley"))
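The difference between the two filters comes down to the case-sensitive matching of String.contains. A minimal plain-Scala sketch (using made-up stand-in lines, not the actual quotes file) shows the same behavior without needing Spark at all, since RDD.filter applies the same predicate per line:

```scala
// Hypothetical sample lines standing in for the quotes file.
val lines = List(
  "Surely you can't be serious.",
  "I am serious... and don't call me Shirley."
)

// Case-sensitive match: "Shirley" does not contain lowercase "shirley".
val exactMatches = lines.filter(line => line.contains("shirley"))
println(exactMatches.size) // prints 0

// Lower-casing each line first makes the match effectively case-insensitive.
val foldedMatches = lines.filter(line => line.toLowerCase.contains("shirley"))
println(foldedMatches.size) // prints 1
```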
If we want to cache this result and store it in memory for future re-use and increased I/O performance, we can simply call the cache() method on the RDD.
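In the spark-shell session this is a single call. Note that cache() is lazy: the RDD is only materialized in memory the next time an action (such as count()) runs against it.

```scala
// Mark the filtered RDD for in-memory caching;
// subsequent actions on it will reuse the cached partitions.
shirleyCount.cache()
```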