dcagatay/spark
A debian:buster based Spark container. Use it in a standalone cluster with the accompanying `docker-compose.yml`, or as a base for more complex recipes.
Available tags:

- `latest`, `3.0.1-hadoop3.3.0`
- `3.0.1-hadoop2.10.2`
- `3.0.1-hadoop3.1.4`
- `3.0.1-hadoop3.2.2`
- `2.4.7-hadoop2.10.2`
- `2.4.7-hadoop3.1.4`
- `2.4.7-hadoop3.2.2`
- `2.4.7-hadoop3.2.2-scala2.12`
- `2.4.7-hadoop3.3.0`

You can find all the tags at Docker Hub.
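For example, you can pull one of the listed tags explicitly (the tag below is just one of the combinations above):

```bash
docker pull dcagatay/spark:2.4.7-hadoop3.2.2
```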
The `docker-compose.yml` provides a single-master Spark cluster; you can start it with `docker-compose up -d`. You can find the highly available version at `docker-compose-ha.yml`.
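For instance, assuming you run the command from the repository root where the compose files live, the highly available variant can be started by pointing `docker-compose` at that file:

```bash
docker-compose -f docker-compose-ha.yml up -d
```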
Traditional Spark configuration is possible; you can configure the cluster according to the Spark configuration documentation.
To configure via the `$SPARK_HOME/conf` directory, map the container's `/conf` directory to a local one and put your configuration inside that directory, separately for the master and the workers. Note that if the mapped directory is empty, it is populated with templates and the current configuration for convenience.
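As a minimal sketch of such a mapping with plain `docker run` (the local path and container name are illustrative; with docker-compose you would add an equivalent `volumes:` entry to the master and worker services instead):

```bash
# Illustrative only: map a local directory onto the container's /conf.
# If ./conf/master is empty, it is populated with templates on first start.
mkdir -p ./conf/master
docker run -d --name spark-master \
  -v "$(pwd)/conf/master:/conf" \
  dcagatay/spark
```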
There are a couple of configuration extensions, driven by environment variables, that increase usability (an example follows the list):
- `SPARK_MODE`: (Required for worker) Running mode for the Spark instance. (one of `worker`, `master`; default: `master`)
- `SPARK_MASTER_URI`: (Required for worker) Tells worker nodes which master(s) to connect to on startup. (e.g. `spark://m1:7077,spark://m2:7077`)
- `ZK_HOSTS`: (Optional) For HA configuration, lets you set the ZooKeeper hosts quickly. (e.g. `zk1:2181,zk2:2181,zk3:2181`)

You can access more configuration options from here and here.
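A hypothetical worker started with plain `docker run`, using the variables above (the master URI and the surrounding network setup are assumptions here; the provided compose files set all of this for you):

```bash
# For an HA setup you would additionally pass:
#   -e ZK_HOSTS=zk1:2181,zk2:2181,zk3:2181
docker run -d --name spark-worker \
  -e SPARK_MODE=worker \
  -e SPARK_MASTER_URI=spark://m1:7077 \
  dcagatay/spark
```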
After running the single-master or highly available version, you can run the Spark examples.
docker-compose exec m1 bin/run-example SparkPi 100
You can submit PySpark jobs via the master.
docker-compose exec m1 bash -c 'echo -e "import pyspark\n\nprint(pyspark.SparkContext().parallelize(range(0, 10)).count())" > /tmp/count.py'
docker-compose exec m1 bin/spark-submit /tmp/count.py
You need to consider the following items if you want to run jobs from another machine (a rough sketch follows the list):

- Hostnames used in `docker-compose.yml` must be resolvable from the submitting machine. (e.g. via `/etc/hosts`)
- The `SPARK_PUBLIC_DNS` and `SPARK_LOCAL_HOSTNAME` environment variables, set according to your DNS settings.
- The `job-submit.sh` script.
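One possible shape of such a remote submission from the Docker host, heavily simplified; the hostnames, published ports, and the presence of a local Spark installation are all assumptions, so adjust to your own network and DNS settings:

```bash
# Hypothetical: make the compose hostnames resolvable from this machine.
echo "127.0.0.1 m1 w1" | sudo tee -a /etc/hosts

# Hypothetical: advertise this machine's hostname back to the cluster,
# per your DNS settings.
export SPARK_PUBLIC_DNS="$(hostname -f)"
export SPARK_LOCAL_HOSTNAME="$(hostname -f)"

# Submit from a local Spark installation against the containerized master.
spark-submit --master spark://m1:7077 /tmp/count.py
```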
You can build your own custom images according to your Spark and Hadoop version requirements by just changing the `SPARK_VERSION` and `HADOOP_VERSION` environment variables and running the `docker build -t your-custom-spark .` command.
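A hedged build sketch, assuming the Dockerfile exposes `SPARK_VERSION` and `HADOOP_VERSION` in a way that can be overridden at build time; if it declares them only as plain `ENV`s, edit the Dockerfile directly instead. The versions below are taken from the tag list above:

```bash
docker build \
  --build-arg SPARK_VERSION=2.4.7 \
  --build-arg HADOOP_VERSION=3.2.2 \
  -t your-custom-spark .
```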
License: MIT
Extended from gettyimages/docker-spark