Spark, RStudio and Shiny servers in a Docker cluster hosted on Carina
This repository contains the files needed to set up a Docker-based Spark cluster that includes RStudio Server and Shiny Server. The cluster can be hosted in a Docker Swarm environment such as Carina. A toy Shiny application is also included to test the setup.
SparkR is an R package that makes Apache Spark available from the R command line, allowing distributed computations on large datasets. Importantly, Spark's distributed machine learning library, MLlib, can also be used from SparkR. For training purposes, SparkR can be run in "standalone mode," which means using a single node, typically your own computer. In my experience, however, not every program or application developed in standalone mode will work in a fully integrated cluster. To obtain its full potential, SparkR should therefore be deployed on a cluster.
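If you want to try standalone mode before moving to a cluster, the following is a minimal sketch; it assumes a local Spark installation with `SPARK_HOME` pointing at it (the path shown is an assumption, not something shipped with this repository):

```bash
# Launch an R shell with SparkR loaded, running Spark locally on two cores
# SPARK_HOME is assumed to point at an unpacked Spark distribution
export SPARK_HOME=/opt/spark
$SPARK_HOME/bin/sparkR --master "local[2]"
```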
AWS, Google Cloud, Microsoft Azure and other providers offer attractive pricing, but it would be even better if SparkR could be run in a cloud environment for free.
Docker is an open-source project for automatically deploying applications as "containers". Containers are created from images, which bundle a root file system with the execution parameters needed to run as an independent, virtualized operating system. From the Docker website: "The concept is borrowed from Shipping Containers, which define a standard to ship goods globally. Docker defines a standard to ship software".
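As a quick illustration of the image/container relationship (the image name below is just a common public image, not one used by this repository):

```bash
# Pull an image (the template) and start a throwaway container from it
docker pull debian:jessie
docker run --rm debian:jessie echo "hello from an isolated container"
```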
Carina is a Docker environment based on Docker Swarm, and it can be used to deploy an application in docker containers on a cluster. Each Carina cluster is composed of 3 nodes with 4 GB of memory and 12 vCPUs each; every cluster therefore has a total of 12 GB of RAM and 36 vCPUs. Carina offered free accounts at the time this file was written (20/07/2016). For more details, see the Carina website.
Furthermore, RStudio and Shiny servers can be hosted simultaneously in the same cluster to test our SparkR applications and even publish them.
To get started in 15 minutes, follow the instructions below. For a more detailed description, see here.
Sign up for the Carina Beta here.
Create a Carina cluster and scale it up to 3 nodes.
Connect to your Carina cluster as explained here.
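In outline, connecting means pointing your local Docker client at the Carina Swarm. The sketch below follows the Carina documentation of the time; the archive and file names are assumptions, not files shipped in this repository:

```bash
# Download the cluster credentials from the Carina control panel, then:
unzip mycluster.zip -d mycluster-credentials
# docker.env is assumed to export DOCKER_HOST, DOCKER_TLS_VERIFY and DOCKER_CERT_PATH
source mycluster-credentials/docker.env
# The Docker client now talks to the Carina cluster
docker info
```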
If everything runs smoothly, you should see something like this after running the `docker info` command:
```
$ docker info
Containers: 5
 Running: 3
 Paused: 0
 Stopped: 2
Images: 5
Server Version: swarm/1.2.0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 1
 1dba0f72-75bc-4825-a5a0-b2993c535599-n1: 126.96.36.199:42376
  └ Status: Healthy
  └ Containers: 5
  └ Reserved CPUs: 0 / 12
  └ Reserved Memory: 0 B / 4.2 GiB
  └ Labels: com.docker.network.driver.overlay.bind_interface=eth1, executiondriver=, kernelversion=3.18.21-2-rackos, operatingsystem=Debian GNU/Linux 7 (wheezy) (containerized), storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-05-27T19:27:24Z
  └ ServerVersion: 1.11.2
```
Run the following commands:
```bash
## Define a network
docker network create spark_network

## Create a data volume container with a folder to share among the nodes
docker create --net spark_network --name data-share \
  --volume /home/rstudio/share angelsevillacamins/spark-rstudio-shiny

## Deploy the master node
docker run -d --net spark_network --name master \
  -p 8080:8080 -p 8787:8787 -p 80:3838 \
  --volumes-from data-share \
  --restart=always \
  angelsevillacamins/spark-rstudio-shiny /usr/bin/supervisord --configuration=/opt/conf/master.conf

## Change permissions on the shared folder of the data volume
docker exec master chmod a+w /home/rstudio/share

## Deploy the worker01 node
docker run -d --net spark_network --name worker01 \
  --volumes-from data-share \
  --restart=always \
  angelsevillacamins/spark-rstudio-shiny /usr/bin/supervisord --configuration=/opt/conf/worker.conf

## Change permissions on the shared folder of the data volume
docker exec worker01 chmod a+w /home/rstudio/share

## Deploy the worker02 node
docker run -d --net spark_network --name worker02 \
  --volumes-from data-share \
  --restart=always \
  angelsevillacamins/spark-rstudio-shiny /usr/bin/supervisord --configuration=/opt/conf/worker.conf

## Change permissions on the shared folder of the data volume
docker exec worker02 chmod a+w /home/rstudio/share
```
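As an optional sanity check, not part of the original instructions, you can confirm that all three containers came up:

```bash
# master, worker01 and worker02 should all be listed with an "Up" status
docker ps --format '{{.Names}}\t{{.Status}}'
```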
After each docker run command, you should see the ID of the newly created container.
Check the external IP of the master node with a command run against the cluster.
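A minimal sketch of one way to do this, assuming your Docker client is still pointed at the Carina cluster; `docker port` prints the host IP and port for each published container port:

```bash
# Show the host-side IP:port for every port published by the master container
docker port master
```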
Alternatively, go to the Carina Clusters page and press Edit Cluster. The IP should appear in the Containers description of your master node:
```
8787 → 146.20.00.00:8787
8080 → 146.20.00.00:8080
3838 → 146.20.00.00:80
```
Launch your favorite web browser and use the previous addresses, taking into account that:

- port 8787 serves RStudio Server,
- port 8080 serves the Spark master web UI,
- port 80 serves Shiny Server (container port 3838).
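If you prefer to check from the command line, the following sketch uses the placeholder IP from the mapping above; replace it with your own master IP:

```bash
# Each request should return an HTTP status line if the service is up
curl -sI http://146.20.00.00:8787 | head -n 1   # RStudio Server
curl -sI http://146.20.00.00:8080 | head -n 1   # Spark master web UI
curl -sI http://146.20.00.00:80   | head -n 1   # Shiny Server
```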
This is a work in progress. For collaborations or feedback, you can contact me by email at firstname.lastname@example.org.
The files in this repository are licensed under the Apache License, Version 2.0. RStudio Server and Shiny Server are licensed under the AGPL v3.
RStudio and Shiny are trademarks of RStudio, Inc. The use of the trademarked terms RStudio and Shiny and the distribution of the RStudio binaries through the images hosted on hub.docker.com has been granted by explicit permission of RStudio. Please review RStudio's trademark use policy and address inquiries about further distribution or other questions to email@example.com.