Public | Automated Build

Last pushed: 4 months ago
Short Description
Experimental docker container for http://distributed.readthedocs.org with jupyter notebook.
Full Description

Docker provisioning of Python Distributed compute cluster

This repo hosts some sample configuration to set up docker-containerized
environments for interactive cluster computing in Python with Jupyter
notebook
and
distributed possibly in conjunction
with dask and other tools from the PyData and SciPy
ecosystems.

This docker image is meant to be run with container orchestration tools such as
Docker Swarm +
Compose or
Kubernetes, either on premise or on public clouds. The
configuration files in this repo are vendor agnostic and only rely upon Open
Source tools.

The Docker Swarm API is provided as a hosted service by:

The Kubernetes API is provided as a hosted service by:

Alternatively it is possible to install and manage Kubernetes by
your-self
.

Here is the table of contents for this documentation:

DISCLAIMER: the configuration in this repository is not secure. If you want to
use this in production, please make sure to setup an HTTPS reverse proxy instead
of exposing the Jupyter 8888 port and the distributed service ports on a public
IP address and protect the Jupyter notebook access with a password.

If you want to setup a multi-user environment you might also want to extend this
configuration to use Jupyter Hub
instead of accessing a Jupyter notebook server directly. Please note however
that the distributed workers will be run using the same unix user and therefore
this sample configuration cannot be used to implement proper multi-tenancy.

The docker-distributed image

The Dockerfile file in this repo can be used to build a docker image
with all the necessary tools to run our cluster, in particular:

  • conda and pip to install additional tools and libraries,
  • jupyter for the notebook interface accessed from any web browser,
  • dask and distributed,
  • psutil and bokeh (useful for the cluster monitoring web interface).

It is also possible to install additional tools using the conda and pip
command from within a running container.

The master branch of this github repo is synchronized with the
ogrisel/distributed:latest image on the docker registry:

https://hub.docker.com/ogrisel/distributed/

Pull and run the latest version of this container on your local docker engine as
follow:

$ docker pull ogrisel/distributed:latest
latest: Pulling from ogrisel/distributed
[...]
$ docker run -ti --rm ogrisel/distributed bash
root@37dba41caa3c:/work# ls -l examples/
total 56
-rw-rw-r-- 1 basicuser root  1344 May 11 07:44 distributed_joblib_backend.py
-rw-rw-r-- 1 basicuser root 33712 May 11 07:44 sklearn_parameter_search.ipynb
-rw-rw-r-- 1 basicuser root 14407 May 11 07:44 sklearn_parameter_search_joblib.ipynb

Alternatively it is also possible to re-build the ogrisel/distributed image
using the docker build command:

$ git clone https://github.com/ogrisel/docker-distributed
$ cd docker-distributed
$ docker build -t ogrisel/distributed .
[...]

This image will be used to run 3 types of services:

  • the jupyter notebook server,
  • the distributed scheduler service,
  • one distributed worker per-host in the compute cluster.

Those services can be easily orchestrated on public cloud providers with the one
of the following tools.

Setting up a cluster using Kubernetes

kubectl is a client tool to configure
and launch containerized services on a Kubernetes cluster. It can read the yaml
configuration files in the kubernetes/ folder of this repository.

Example setup with Google Container Engine

Register on the Google Cloud Platform, setup a
billing account and create a project with the Google Compute Engine API enabled.

Install the client SDK that includes the
gcloud command line interface or use the
in-browser Google Cloud Shell where gcloud installed.

Ensure that your client SDK is up to date:

$ gcloud components update

At the time of writing, the latest generation of Intel CPU architectures
available on GCE is the Haswell architecture. It is available only in specific
zones
. Using
modern CPU architectures is highly recommended for vector intensive CPU
workloads. In particular it makes it possible to get the most of optimized
linear algebra routines implemented by OpenBLAS and MKL internally used by NumPy
and SciPy for instance. Let us configure the zone used to provision the cluster
and create it:

$ gcloud config set compute/zone europe-west1-d
$ gcloud container clusters create cluster-1 \
    --num-nodes 3 \
    --machine-type n1-highcpu-32 \
    --scopes bigquery,storage-rw \
    --preemptible \
    --no-async
Creating cluster cluster-1...done.
Created [https://container.googleapis.com/v1/projects/ogrisel/zones/europe-west1-d/clusters/cluster-1].
kubeconfig entry generated for cluster-1.
NAME       ZONE            MASTER_VERSION  MASTER_IP        MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
cluster-1  europe-west1-d  1.2.4           130.211.103.197  n1-highcpu-32  1.2.4         3          RUNNING

The --preemptible flag makes it cheaper to run a temporary cluster for a couple
of hours at the risk of seeing it shutdown at any moment. Please refer to the
documentation
for more details.

Other machine types are
available if you would rather trade CPUs for RAM for instance. You can also
grant access to other GCP services via --scopes if you need. See the
documentation
for more details.

We further need to fetch the cluster credentials to get kubectl authorized
to connect to the newly provisioned cluster:

$ gcloud container clusters get-credentials cluster-1
Fetching cluster endpoint and auth data.
kubeconfig entry generated for cluster-1.

We can now deploy the kubernetes configuration onto the cluster:

$ git clone https://github.com/ogrisel/docker-distributed
$ cd docker-distributed
$ kubectl create -f kubernetes/
service "dscheduler" created
service "dscheduler-status" created
replicationcontroller "dscheduler" created
replicationcontroller "dworker" created
service "jupyter-notebook" created
replicationcontroller "jupyter-notebook" created

The containers are running in "pods":

$ kubectl get pods
NAME                     READY     STATUS              RESTARTS   AGE
dscheduler-hebul         0/1       ContainerCreating   0          32s
dworker-2dpr1            0/1       ContainerCreating   0          32s
dworker-gsgja            0/1       ContainerCreating   0          32s
dworker-vm3vg            0/1       ContainerCreating   0          32s
jupyter-notebook-z58dm   0/1       ContainerCreating   0          32s

we can get the logs of a specific pod with kubectl logs:

$ kubectl logs -f dscheduler-hebul
distributed.scheduler - INFO - Scheduler at:       10.115.249.189:8786
distributed.scheduler - INFO -      http at:       10.115.249.189:9786
distributed.scheduler - INFO -  Bokeh UI at:  http://10.115.249.189:8787/status/
distributed.core - INFO - Connection from 10.112.2.3:50873 to Scheduler
distributed.scheduler - INFO - Register 10.112.2.3:59918
distributed.scheduler - INFO - Starting worker compute stream, 10.112.2.3:59918
distributed.core - INFO - Connection from 10.112.0.6:55149 to Scheduler
distributed.scheduler - INFO - Register 10.112.0.6:55103
distributed.scheduler - INFO - Starting worker compute stream, 10.112.0.6:55103
bokeh.command.subcommands.serve - INFO - Check for unused sessions every 50 milliseconds
bokeh.command.subcommands.serve - INFO - Unused sessions last for 1 milliseconds
bokeh.command.subcommands.serve - INFO - Starting Bokeh server on port 8787 with applications at paths ['/status', '/tasks']
distributed.core - INFO - Connection from 10.112.1.1:59452 to Scheduler
distributed.core - INFO - Connection from 10.112.1.1:59453 to Scheduler
distributed.core - INFO - Connection from 10.112.1.4:48952 to Scheduler
distributed.scheduler - INFO - Register 10.112.1.4:54760
distributed.scheduler - INFO - Starting worker compute stream, 10.112.1.4:54760

we can also execute arbitrary commands inside the running containers with
kubectl exec, for instance to open an interactive shell session for debugging
purposes:

$ kubectl exec -ti dscheduler-hebul bash
root@dscheduler-hebul:/work# ls -l examples/
total 56
-rw-r--r-- 1 basicuser root  1344 May 17 11:29 distributed_joblib_backend.py
-rw-r--r-- 1 basicuser root 33712 May 17 11:29 sklearn_parameter_search.ipynb
-rw-r--r-- 1 basicuser root 14407 May 17 11:29 sklearn_parameter_search_joblib.ipynb

Our kubernetes configuration publishes HTTP endpoints with the LoadBalancer
type on external IP addresses (those can take one minute or two to show up):

$ kubectl get services
NAME                CLUSTER-IP       EXTERNAL-IP      PORT(S)             AGE
dscheduler          10.115.249.189   <none>           8786/TCP,9786/TCP   4m
dscheduler-status   10.115.244.201   130.211.50.206   8787/TCP            4m
jupyter-notebook    10.115.254.255   146.148.114.90   80/TCP              4m
kubernetes          10.115.240.1     <none>           443/TCP             10m

This means that is possible to point a browser to:

You can run the example notebooks from the examples/ folder via the Jupyter
interface. In particular note that the distributed scheduler of the cluster
can be reached from any node under the host name dscheduler on port 8786.
You can check by typing the following snippet in a new notebook cell:

>>> from distributed import Client
>>> c = Client('dscheduler:8786')
<Client: scheduler=dscheduler:8786 workers=3 threads=36>

Please refer to the distributed
documentation
to learn how to use the
executor interface to schedule computation on the cluster.

Once you are down with the analysis don't forget to save the results to some
external storage (for instance push your notebooks to some external git
repository). Then you can shutdown the cluster with:

$ gcloud container clusters delete cluster-1

WARNING: deploying kubernetes services with the type=LoadBalancer will cause
GCP to automatically provision dedicated firewall rules and load balancer
instances that are not automatically deleted when you shutdown the GKE cluster.
You have to delete those manually for instance by going to the "Networking" tab
of the Google Cloud Console web interface. Those additional firewall rules and
load balancers are billed on an hourly basis so don't forget to delete them
when you don't need them anymore.

Setting up a cluster using Docker Compose

docker-compose is a client tool to configure
and launch containerized services on a Docker Swarm cluster. It reads the
configuration of the cluster in the docker-compose.yml file of this
repository.

Example setup with Carina

Create an account at https://getcarina.com and follow the carina get started
instructions

to install the docker client and the carina command line tool.

Once you are setup, create a new carina cluster and configure your shell
environment variables so that your docker client can access it:

$ carina create --nodes=3 --wait cluster-1
ClusterName         Flavor              Nodes               AutoScale           Status
cluster-1           container1-4G       3                   false               active
$ eval $(carina env cluster-1)

If you installed the dvm tool, you can make sure to use a version of a the
docker client that matches the version of docker of the carina cluster by
typing:

$ dvm use
Now using Docker 1.10.3

Check that you docker client can access the carina cluster using the docker ps and docker info commands.

Install the docker-compose client:

$ pip install docker-compose

Deploy the Jupyter and distributed services as conigured in the
docker-compose.yml file of this repo:

$ git clone https://github.com/ogrisel/docker-distributed
$ cd docker-distributed
$ docker-compose up -d
Creating network "dockerdistributed_distributed" with driver "overlay"
Pulling dscheduler (ogrisel/distributed:latest)...
2fc95c02-b444-4730-bfb1-bd662e2e044e-n3: Pulling ogrisel/distributed:latest... : downloaded
2fc95c02-b444-4730-bfb1-bd662e2e044e-n1: Pulling ogrisel/distributed:latest... : downloaded
2fc95c02-b444-4730-bfb1-bd662e2e044e-n2: Pulling ogrisel/distributed:latest... : downloaded
Creating dscheduler
Creating dockerdistributed_dworker_1
Creating jupyter

Increase the number of distributed workers to match the number of nodes in the
carina cluster:

$ docker-compose scale dworker=3
Creating and starting dockerdistributed_dworker_2 ... done
Creating and starting dockerdistributed_dworker_3 ... done

Use the docker ps command to find out the public IP address and port of the
public Jupyter notebook interface (on port 8888) and the distributed monitoring
web interface
(on port
8787 possibly on a different node).

As in the kubernetes example above you can run the example notebooks from the
examples/ folder via the Jupyter interface on port 8888.

It is often useful to check that you don't get any error in the logs when
running the computation. You can access the aggregate logs of all the running
services with:

$ docker-compose logs -f
[...]

Sometimes it can also be useful to open a root shell session in the container
that run Jupyter notebook process with docker exec:

$ docker exec -ti jupyter bash
root@aff49b550f0c:/work# ls

bin  examples  miniconda  requirements.txt

When you are done with your computation, upload your results to some external
storage server or to github and don't forget to shutdown the cluster:

$ docker-compose down
$ carina rm cluster-1
$ dvm deactivate
Docker Pull Command
Owner
ogrisel
Source Repository

Comments (0)