SPARK STANDALONE CLUSTER OUT-OF-BOX
This project launches a scalable Spark 2.0 Standalone cluster based on Docker containers.
- First, either build the Docker image locally or pull it from the public repository:
docker build -t athosgvag/sprk:latest .
docker pull athosgvag/sprk
- Then you'll deploy the cluster:
nohup docker-compose -p sprk up &
- And start running your Spark jobs:
docker exec sprk01 spark-submit pathToYourScript yourArgs
- Example job:
docker exec sprk01 spark-submit /code/python/exercise.py /data/input spark://220.127.116.11:7077
- You may also run Spark commands interactively from Spark's Python shell:
docker exec -it sprk01 pyspark --master spark://18.104.22.168:7077
- When you're done with your jobs, just clean everything up (the counterpart of the up command above):
docker-compose -p sprk down
Put your scripts in ./code and your data in ./data so that they're accessible from inside the containers. Any output data you write to these directories will persist even after the cluster is removed.
The Spark master URL that your scripts should use to start SparkContext is spark://22.214.171.124:7077
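As a rough illustration of how a script might pick up that master URL, here is a minimal PySpark sketch. It assumes, following the argument order of the example job above, that the input path comes first and the master URL second. The parse_args helper, the app name, and the line-count job are hypothetical and are not taken from this project's exercise.py.

```python
import sys


def parse_args(argv):
    """Return (input_path, master_url) from the command-line arguments.

    Assumption: same order as the example job (input path, then master URL).
    """
    if len(argv) != 2:
        raise ValueError("usage: spark-submit script.py <input_path> <master_url>")
    input_path, master_url = argv
    if not master_url.startswith("spark://"):
        raise ValueError("master URL should look like spark://<host>:7077")
    return input_path, master_url


def main(argv):
    # Imported here so the argument handling can be checked without Spark installed.
    from pyspark import SparkConf, SparkContext

    input_path, master_url = parse_args(argv)
    # "example-job" is a hypothetical application name.
    conf = SparkConf().setAppName("example-job").setMaster(master_url)
    sc = SparkContext(conf=conf)
    try:
        # Hypothetical job: count the lines found under the input path.
        print(sc.textFile(input_path).count())
    finally:
        sc.stop()


# Only run the job when both arguments were supplied on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1:])
```

Saved under ./code, a script like this could be submitted the same way as the example job, with docker exec sprk01 spark-submit followed by the script path, the input path, and the master URL.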
Access http://126.96.36.199:8080 from your browser to check on your jobs' and worker nodes' status.
Want to scale out? Copy the block named SLAVE NODE 02 in your docker-compose.yml file and change both the service name and the container name from sprk02 to sprk03, sprk04, and so on. Example:
...
  #SLAVE NODE 03
  sprk03:
    image: athosgvag/sprk
    container_name: sprk03
    volumes:
      - ./code:/code
      - ./data:/data
    command: bash -c "/code/setup/config-side-node.sh -m=188.8.131.52 -t=1200"
    networks:
      - net
    depends_on:
      - sprk01
...
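If you need several extra workers, the copy-and-rename step can be sketched as a small Python helper that prints the blocks for you to paste into docker-compose.yml. The worker_block and worker_blocks functions, the two-space indentation (which assumes the blocks sit under a services: key), and the <master_ip> placeholder are assumptions for illustration. The field values mirror the SLAVE NODE 03 example above, and sprk01 is taken to be the master because the workers depend on it.

```python
# Template mirroring the SLAVE NODE 03 example; indentation is an assumption.
WORKER_TEMPLATE = """\
  #SLAVE NODE {index:02d}
  sprk{index:02d}:
    image: athosgvag/sprk
    container_name: sprk{index:02d}
    volumes:
      - ./code:/code
      - ./data:/data
    command: bash -c "/code/setup/config-side-node.sh -m={master_ip} -t={ttl}"
    networks:
      - net
    depends_on:
      - sprk01
"""


def worker_block(index, master_ip, ttl=1200):
    """Render one worker service block; index 3 yields the service sprk03."""
    if index < 2:
        raise ValueError("sprk01 is the master; workers start at index 2")
    return WORKER_TEMPLATE.format(index=index, master_ip=master_ip, ttl=ttl)


def worker_blocks(first, last, master_ip, ttl=1200):
    """Render blocks for workers first..last (inclusive), separated by blank lines."""
    return "\n".join(worker_block(i, master_ip, ttl) for i in range(first, last + 1))


if __name__ == "__main__":
    # <master_ip> is a placeholder: use the address already set in your compose file.
    print(worker_blocks(3, 5, "<master_ip>"))
```

Generating the text rather than editing by hand keeps the service name and container_name in sync, which is the easiest detail to miss when copying a block.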
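- WARNING: this cluster has a 20-minute time-to-live. You may change this with the -t argument in docker-compose.yml. The value appears to be in seconds rather than milliseconds, since the worker example above uses -t=1200, which corresponds to 20 minutes:
...
    command: bash -c "/code/setup/config-main-node.sh -m=184.108.40.206 -t=<time_to_live_in_seconds>"
...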