Public Repository

Last pushed: 7 months ago
Short Description
Apache Spark 2.1.0 for Hadoop 2.7
Full Description

What is this?

This image contains a working installation of Apache Spark 2.1.0, prebuilt for Hadoop 2.7, so there is no need to install any of the libraries yourself.

I have also bundled the following with it:
▪ Apache Hadoop 2.8
▪ JDK 8u121
▪ sbt
▪ tmux
▪ vim
▪ git
▪ wget

How do I get Docker?

Mac

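If you have Homebrew installed (otherwise check out http://brew.sh), you can install the Docker command-line client as shown here. The client still needs a running Docker engine to talk to, for example Docker Desktop.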
brew install docker

Ubuntu

If you have not already installed Docker, run the following to install it:
wget -qO- https://get.docker.com/ | sh

Get the Docker daemon up

sudo service docker start

What next?

Run the container

sudo docker run --name my-spark -i -t ragavan/apache-spark /bin/bash

The previous command drops you into a shell inside the container, from where you can launch Spark as follows.

Launch Spark

For Spark's REPL, a great starting point for interactive analysis, type
spark-shell

You will get a prompt similar to the one shown below.
scala>

Your first Spark program

I am lying here! Well, this is your first Scala program that doesn't really require Spark.

Now type the following in the prompt; what do you see?
(1 to 10).scanLeft(1)(_ * _)

Congrats! You just wrote a functional program that computes the factorials of the numbers from 1 to 10 and collects them, preceded by the seed value 1, in a sequence. Now let's try:

(1 to 100).scanLeft(1)(_ * _)

Wow, but wait a second: negative numbers in a list of factorials? Something isn't right. The products no longer fit in Scala's 32-bit Int, so they silently overflow. How could you make this work for arbitrarily large numbers? Try the following.

(1 to 100).scanLeft(BigInt(1))(_ * _)
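
To see exactly where the Int version goes wrong, it helps to remember that scanLeft is a fold that keeps every intermediate result. The sketch below can be pasted into the same REPL and compares the two versions side by side; the names asInt, asBig, and firstBad are illustrative and not part of the image.

```scala
// scanLeft starts from the seed and keeps every running product,
// so the result has one more element than the input range.
val asInt = (1 to 20).scanLeft(1)(_ * _)          // Int arithmetic, wraps silently
val asBig = (1 to 20).scanLeft(BigInt(1))(_ * _)  // arbitrary precision

// Index of the first factorial where the Int result no longer matches.
val firstBad = asInt.zip(asBig).indexWhere { case (i, b) => BigInt(i) != b }

println(asBig.take(5))  // the first few values: 0! through 4!
println(firstBad)       // 13, because 13! exceeds Int.MaxValue
```

Since the element at index n holds n!, the result says that 12! is the last factorial an Int can represent.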

Food for thought!

What would it have taken to write this in Java, and how would you have fixed the overflow there? Don't you feel more productive already?
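
For comparison, here is roughly what the imperative, Java-style approach looks like. It is written in Scala so that it can be pasted into the same REPL, but it deliberately uses java.math.BigInteger and java.util.ArrayList the way Java code would. The names factorials, acc, and n are illustrative.

```scala
import java.math.BigInteger

// Java-style version: a mutable accumulator, an explicit loop,
// and BigInteger.multiply in place of the * operator.
val factorials = new java.util.ArrayList[BigInteger]()
var acc = BigInteger.ONE
factorials.add(acc)                          // seed, matching scanLeft's first element
var n = 1
while (n <= 100) {
  acc = acc.multiply(BigInteger.valueOf(n))  // n is widened to Long here
  factorials.add(acc)
  n += 1
}

println(factorials.size)      // 101 entries: the seed plus one per number
println(factorials.get(10))   // 3628800
```

The loop produces the same values as the scanLeft one-liner, at the cost of managing the accumulator and the result list by hand.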

But where's the Spark code?

It's no different from writing ordinary Scala code. But wait and watch this space - I can hear you!
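
In the meantime, here is a minimal sketch of that claim, assuming it is pasted into spark-shell inside the container. It computes 100! with Spark's RDD API instead of a local collection. The application name and the value names numbers and factorial100 are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Inside spark-shell a session already exists, so getOrCreate() reuses it;
// run standalone, this starts a local session instead.
val spark = SparkSession.builder.master("local[*]").appName("factorial-sketch").getOrCreate()

// Distribute the numbers 1 to 100, then multiply them together.
val numbers = spark.sparkContext.parallelize(1 to 100)
val factorial100 = numbers.map(BigInt(_)).reduce(_ * _)

println(factorial100 == (1 to 100).map(BigInt(_)).product)  // true
```

The map and reduce calls read just like their counterparts on a plain Scala collection; the difference is that Spark may run them across several partitions.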

Owner
ragavan
