Spark client docker image


This repository contains a Docker image for running the Apache Spark client.

To run the Spark shell:

docker run -it epahomov/docker-spark:lightweighted /spark/bin/spark-shell
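
If you just want to sanity-check the shell non-interactively, you can also pipe a small Scala expression into it (a minimal sketch; -t is dropped so stdin comes from the pipe):

echo 'sc.parallelize(1 to 100).sum()' | docker run -i epahomov/docker-spark:lightweighted /spark/bin/spark-shell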

To run the Python Spark shell (known as pyspark):

docker run -it epahomov/docker-spark:lightweighted /spark/bin/pyspark

The examples above use the lightweighted version of this image. It is very small, so it downloads quickly, but it is not very flexible. All following examples use the default version.

To run the Spark R shell (SparkR):

docker run -it epahomov/docker-spark /spark/bin/sparkR

To run the Spark SQL shell:

docker run -it epahomov/docker-spark /spark/bin/spark-sql
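
The SQL shell also accepts a query directly on the command line via -e, which is handy for a quick check (a small sketch using the standard spark-sql CLI option):

docker run -it epahomov/docker-spark /spark/bin/spark-sql -e "SELECT 1"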

To run the Spark shell with changed properties, pass them as arguments, like this:

docker run -it epahomov/docker-spark /spark/bin/spark-shell  --master local[4]
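
Any other Spark property can be overridden the same way with --conf key=value, for example:

docker run -it epahomov/docker-spark /spark/bin/spark-shell --master local[4] --conf spark.executor.memory=2g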

To run the Spark shell with a modified spark-defaults.conf:

printf "spark.master local[4] \nspark.executor.cores 4" > spark-defaults.conf
sudo docker run -v $(pwd)/spark-defaults.conf:/spark/conf/spark-defaults.conf -it epahomov/docker-spark /spark/bin/spark-shell

The first line writes the configuration into the file spark-defaults.conf, and the second line mounts this file from the host filesystem into the container and puts it in the conf directory.
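
To verify that the mount works, you can print the file from inside the container before starting the shell (a sketch, assuming the image does not define a custom entrypoint):

sudo docker run -v $(pwd)/spark-defaults.conf:/spark/conf/spark-defaults.conf -it epahomov/docker-spark cat /spark/conf/spark-defaults.conf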

To be able to use the Spark UI, add the -p 4040:4040 argument:

docker run -ti -p 4040:4040 epahomov/docker-spark /spark/bin/spark-shell
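
While the shell is running, the UI is then reachable from the host at http://localhost:4040. A quick check from another terminal (a sketch):

curl -s http://localhost:4040 > /dev/null && echo "Spark UI is up"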

To run a Python script:

echo "import pyspark\nprint(pyspark.SparkContext().parallelize(range(0, 10)).count())" > count.py
docker run -it -p 4040:4040 -v $(pwd)/count.py:/count.py epahomov/docker-spark /spark/bin/spark-submit /count.py
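
Options for spark-submit can be combined with the mounted script in the same way, for example to control parallelism:

docker run -it -v $(pwd)/count.py:/count.py epahomov/docker-spark /spark/bin/spark-submit --master local[2] /count.py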

Hadoop

With this image you can connect to a Hadoop cluster from Spark. All you need to do is set HADOOP_CONF_DIR and mount the directory with the Hadoop configuration files as a volume:

docker run -v $(pwd)/hadoop:/etc/hadoop/conf -e "HADOOP_CONF_DIR=/etc/hadoop/conf" --net=host  -it epahomov/docker-spark /spark/bin/spark-shell --master yarn-client
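
The same approach works for spark-submit. For example, submitting the count.py script from above to YARN (a sketch that reuses the Hadoop config volume from the previous command and uses the non-deprecated --master yarn spelling):

docker run -v $(pwd)/hadoop:/etc/hadoop/conf -v $(pwd)/count.py:/count.py -e "HADOOP_CONF_DIR=/etc/hadoop/conf" --net=host -it epahomov/docker-spark /spark/bin/spark-submit --master yarn /count.py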

Versions

This image is available in the following versions:

  • java_8_spark_2.0.2_hadoop_2.7
  • java_8_spark_2.0.2_hadoop_2.6
  • java_8_spark_2.1.0_hadoop_2.7
  • java_8_spark_2.1.0_hadoop_2.6
  • lightweighted - a lightweight version of this image. It is based on Alpine Linux and uses a downloaded Spark binary rather than a build from source with all possible flags (such as -Pyarn).
  • old-spark - the old functionality for setting up a Spark cluster. Not supported and not recommended.

The master build has version java_8_spark_2.1.0_hadoop_2.7.
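
To use a specific combination, pull the image by tag and run it the same way, for example:

docker pull epahomov/docker-spark:java_8_spark_2.0.2_hadoop_2.6
docker run -it epahomov/docker-spark:java_8_spark_2.0.2_hadoop_2.6 /spark/bin/spark-shell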

Zeppelin

This image is the base image for the Apache Zeppelin image.

Comments (3)
dgreene3p
2 years ago

There's an 'issue' with the Spark code: it doesn't identify hostnames with underscores - if you rename the master to 'sparkmaster', the workers can connect. Not sure why, but it's repeatable via the Scala console (the commands mirror the Spark codebase commands in Utils.scala):

val str = "spark://spark_master:7077"
val uri = new java.net.URI(str);
val host = uri.getHost
host: String = null

val str = "spark://sparkmaster:7077"
val uri = new java.net.URI(str);
val host = uri.getHost
host: String = sparkmaster

shuja
2 years ago

Hi,
I am running it without changing any file, but I get the same exception as mentioned in the comment below. Here are the details of the exception:

6/03/16 12:49:56 INFO Utils: Successfully started service 'sparkWorker' on port 8888.
Exception in thread "main" org.apache.spark.SparkException: Invalid master URL: spark://:
at org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:1981)
at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:879)
at org.apache.spark.deploy.worker.Worker$$anonfun$12.apply(Worker.scala:551)
at org.apache.spark.deploy.worker.Worker$$anonfun$12.apply(Worker.scala:551)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.deploy.worker.Worker$.startSystemAndActor(Worker.scala:551)
at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:529)
at org.apache.spark.deploy.worker.Worker.main(Worker.scala)

guzu92
2 years ago

Hi
Thank you very much for this image !
I've built it by modifying the Dockerfile, simply replacing "..spark-1.3.0.." with "..spark-1.6.0..". The master runs OK, and sparkR runs OK (after installing R).
But I get an "org.apache.spark.SparkException: Invalid master URL: spark://:" error when running the worker and spark-shell, respectively. Below is the output for the worker; can you help? Thanks in advance!

root@c06ef30a0511:/# ./start-worker.sh
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/07 11:25:13 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
16/01/07 11:25:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/07 11:25:14 INFO SecurityManager: Changing view acls to: root
16/01/07 11:25:14 INFO SecurityManager: Changing modify acls to: root
16/01/07 11:25:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/01/07 11:25:14 INFO Utils: Successfully started service 'sparkWorker' on port 8888.
Exception in thread "main" org.apache.spark.SparkException: Invalid master URL: spark://:
at org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2121)
at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47)
at org.apache.spark.deploy.worker.Worker$$anonfun$12.apply(Worker.scala:712)
at org.apache.spark.deploy.worker.Worker$$anonfun$12.apply(Worker.scala:712)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:712)
at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:692)
at org.apache.spark.deploy.worker.Worker.main(Worker.scala)