Adapted from https://bigdatagurus.wordpress.com/2017/03/01/how-to-start-spark-cluster-in-minutes/

We are always in a hurry, and Docker lets us run a lot of software without installing it ourselves, so no time is wasted on setup. After all, what we want to learn is the tool, not its installation. So why not let Docker take care of it?

WHAT DOCKER PROVIDES:

A private network subnet where the Docker containers can talk to each other.
A default gateway for the containers' outbound traffic.
Pre-built images, so you literally need to install nothing yourself.
PREREQUISITES:

Docker and docker-compose (1.9.0 or higher) installed.
A Linux box. This can probably also be done on Windows, but that area is unexplored here.
INSTALL SPARK:

Create a directory at any location; its name will be used as the name of your project.
$ mkdir ScalaCluster ; cd ScalaCluster
Copy the following content to docker-compose.yml
version: "2"

services:
  master:
    image: singularities/spark
    command: start-spark master
    hostname: master
    ports:
      - "6066:6066"
      - "7070:7070"
      - "8080:8080"
      - "50070:50070"
  worker:
    image: singularities/spark
    command: start-spark worker master
    environment:
      SPARK_WORKER_CORES: 1
      SPARK_WORKER_MEMORY: 2g
    links:
      - master

Run the following command to start the containers:
$ docker-compose up -d
Your 2-node cluster is up and running.
You should see something like this:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3f63c7601c93 singularities/spark "start-spark worke..." 10 minutes ago Up 10 minutes 6066/tcp, 7077/tcp, 8020/tcp, 8080-8081/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50070/tcp, 50470/tcp spark_worker_1
843460507f96 singularities/spark "start-spark master" 10 minutes ago Up 10 minutes 0.0.0.0:6066->6066/tcp, 7077/tcp, 0.0.0.0:7070->7070/tcp, 8020/tcp, 8081/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 0.0.0.0:8080->8080/tcp, 19888/tcp, 0.0.0.0:50070->50070/tcp, 50470/tcp spark_master_1
If you wish to expand your cluster to more nodes, you can scale it out with the following command:
$ docker-compose scale worker=2
You should then see an additional worker container, as below:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
31488cf276ee singularities/spark "start-spark worke..." 7 seconds ago Up 5 seconds 6066/tcp, 7077/tcp, 8020/tcp, 8080-8081/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50070/tcp, 50470/tcp spark_worker_2
3f63c7601c93 singularities/spark "start-spark worke..." 10 minutes ago Up 10 minutes 6066/tcp, 7077/tcp, 8020/tcp, 8080-8081/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50070/tcp, 50470/tcp spark_worker_1
843460507f96 singularities/spark "start-spark master" 10 minutes ago Up 10 minutes 0.0.0.0:6066->6066/tcp, 7077/tcp, 0.0.0.0:7070->7070/tcp, 8020/tcp, 8081/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 0.0.0.0:8080->8080/tcp, 19888/tcp, 0.0.0.0:50070->50070/tcp, 50470/tcp spark_master_1

Your 3-node cluster is now up and running.

HOW TO CONNECT TO YOUR CLUSTER (SCALA):

spark-shell is the primary way to connect to your Spark cluster. The services are up and running, but they live inside the Docker network, so you will have to install a Spark client on your local machine and connect with it. Once installed, you connect to the Dockerized Spark cluster by passing the master connection info to spark-shell, or to any Spark application, as sketched below.
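
If you later prefer to connect from a compiled Scala application rather than the shell, the idea is the same: hand the master URL to the session builder. The sketch below is only an illustration, assuming Spark 2.1.0 is on the application's classpath; ClusterSmokeTest and the <master-ip> placeholder are my own names, the latter standing in for the address you obtain with docker inspect in the next step.

import org.apache.spark.sql.SparkSession

// Hypothetical standalone driver; <master-ip> is a placeholder for the
// master container address obtained with docker inspect below.
object ClusterSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("docker-cluster-test")
      .master("spark://<master-ip>:7077")
      .getOrCreate()
    val count = spark.sparkContext.parallelize(1 to 100).count()
    println(s"counted $count elements")
    spark.stop()
  }
}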

To connect to the master, you need to figure out the IP address the master container is running on. You can check it with docker inspect.

$ docker inspect 843460507f96 | grep IPAddress
"SecondaryIPAddresses": null,
"IPAddress": "",
"IPAddress": "172.18.0.2",
Use this IP address to connect with spark-shell. You get spark-shell by installing a Spark client wherever you wish; I did it by simply extracting the tarball (Spark 2.1.0) and using the bundled spark-shell to connect.

wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
tar xvzf spark-2.1.0-bin-hadoop2.7.tgz
Go to the bin directory of the freshly extracted tarball and connect to the Spark cluster using the master IP obtained above:

$ cd spark-2.1.0-bin-hadoop2.7/bin
$ ./spark-shell --master spark://172.18.0.2:7077
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/28 19:57:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/28 19:57:07 WARN Utils: Your hostname, mean-machine resolves to a loopback address: 127.0.0.1; using 10.20.4.35 instead (on interface eth0)
17/02/28 19:57:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/02/28 19:57:28 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/02/28 19:57:28 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/02/28 19:57:30 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.20.4.35:4040
Spark context available as 'sc' (master = spark://172.18.0.2:7077, app id = app-20170228142708-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
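
At this prompt you can run a tiny job to confirm that the shell really drives the Docker workers. The snippet below is only a minimal sketch; it relies on the sc that spark-shell already created, and the numbers and partition count are arbitrary.

// Minimal smoke test, typed at the scala> prompt.
// Four partitions, so the worker containers actually receive tasks.
val rdd = sc.parallelize(1 to 1000, 4)
println(rdd.map(_ * 2).sum())            // prints 1001000.0 if the job ran
println(sc.getExecutorMemoryStatus.size) // block managers: driver + executors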
You are good to go. I hope this helped.
