What is this?
This image contains a working version of Apache Spark 2.1.0 prebuilt for Hadoop 2.7, so there's no need to install any of the libraries yourself.
The following are also bundled with it.
▪ Apache Hadoop 2.8
▪ JDK 8u121
How do I get Docker?
If you have Homebrew installed (otherwise check out http://brew.sh), just type
brew install docker
If you don't use Homebrew and have not already installed Docker, you can use Docker's install script instead:
wget -qO- https://get.docker.com/ | sh
Get the Docker daemon up
sudo service docker start
Run the container
sudo docker run --name my-spark -i -t ragavan/apache-spark /bin/bash
The previous command takes you into the container's terminal, from where you can launch Spark as follows.
For Spark's REPL and a great beginning to your fun with interactive analysis, type (assuming Spark's bin directory is on the container's PATH)
spark-shell
Once the shell has started, you will get a scala> prompt.
Your first Spark program
I am lying here! Well, this is your first Scala program that doesn't really require Spark.
Now type the following in the prompt; what do you see?
(1 to 10).scanLeft(1)(_ * _)
Congrats! You just wrote a functional program that computes the factorials of all the numbers from 1 to 10 and stores them in a sequence (the leading 1 is the seed value, which is also 0!). Now let's try:
(1 to 100).scanLeft(1)(_ * _)
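Wow, but wait a second, you get negative numbers (and eventually zeros) in the output of a factorial? Something doesn't seem right. The running product is a 32-bit Int, and it silently overflows once the values get past 12!. How could you fix it to work for arbitrarily large numbers without overflowing? Try the following.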
(1 to 100).scanLeft(BigInt(1))(_ * _)
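If you want to see exactly where the Int version goes wrong, here is a small sketch you can paste into the same prompt. The names asInt, asBig and firstWrong are made up for this illustration; the idea is simply to run both versions side by side and find the first position where they disagree.

```scala
// Element i of each result is i! (element 0 is the seed value 1).
val n = 20

// Running product as a 32-bit Int: wraps around silently on overflow.
val asInt = (1 to n).scanLeft(1)(_ * _)

// Same running product as BigInt: grows as needed, never overflows.
val asBig = (1 to n).scanLeft(BigInt(1))(_ * _)

// First index where the Int result no longer matches the exact value.
val firstWrong = asInt.indices.find(i => BigInt(asInt(i)) != asBig(i))

println(asInt(12))   // 479001600, still correct: 12! fits in an Int
println(asInt(13))   // 1932053504, wrong: 13! wrapped around
println(asBig(13))   // 6227020800, the exact value of 13!
println(firstWrong)  // Some(13)
```

Notice that the first wrong value is not even negative, so overflow can go unnoticed for a while. The only change between the two versions is the seed, 1 versus BigInt(1); the element type of the whole scan follows from it.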
Food for thought!
What would it have taken to write this in Java, and how would you have made the same fix? Don't you feel more productive already?
But where's the Spark code?
It's no different from writing Scala code. But wait and watch this space. I can hear you!