docker-ds-toolbox

A one-stop-shop for doing not-so-big data science on whatever environment :)

Why?

The whole idea is that we should be able to throw code at each other without worrying about missing libraries, mismatched versions, and so on.

If everybody uses this image and participates in its maintenance, we will have a super-easy-to-set-up environment common to everyone. The "pip install" storm after every git pull will soon be over! ;)

How to use it?

Install

https://www.docker.com/

Beef up your whale

Start the Docker daemon and go to Preferences. In the Advanced tab, increase the Docker resource limits to something decent. 7 CPUs and 14 GB of RAM work well for me. Restart the Docker daemon when done.

NB: One advantage of running in a Docker container is that Python will simply crash when hitting the memory limit, instead of sending your computer into an uncontrollable swapping spree that renders your machine unusable, shortens your SSD's life, and likely ends in nothing better than a hard reset ;)
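You can reproduce this "fail cleanly instead of swapping" behaviour outside Docker too. The Linux-only sketch below caps the process's address space (much like Docker caps the container's memory) and then tries to allocate far past the cap; the sizes are illustrative:

```python
import resource

# Current virtual memory size of this process (Linux-specific).
page = resource.getpagesize()
with open("/proc/self/statm") as f:
    vsz_now = int(f.read().split()[0]) * page

# Cap the address space at current usage plus ~512 MiB of headroom.
_, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (vsz_now + 512 * 1024**2, hard))

try:
    too_big = bytearray(2 * 1024**3)  # 2 GiB: well past the cap
    hit_limit = False
except MemoryError:
    hit_limit = True  # Python fails cleanly instead of swapping

resource.setrlimit(resource.RLIMIT_AS, (hard, hard))  # restore the limit
print("MemoryError raised:", hit_limit)
```

Inside the container the same thing happens when the allocation pushes past the Docker memory limit, with no rlimit fiddling needed.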

Download the latest version of the image

docker pull combient/docker-ds-toolbox:latest

Run the container - Jupyter

Go to the folder containing your project and run

docker run -ti -p 8888:8888 -v "`pwd`":/data combient/docker-ds-toolbox:latest

This will have exactly the same result as running the jupyter notebook command in the same place, except that this time it will instantiate a Docker container and use the Python distributions and libraries from the image.

One more caveat with the image in its current state is that it won't automagically open your web browser. Instead, a URL will be displayed in the terminal. If you use iTerm, you can simply do a "cmd + click" on it to open the Jupyter file browser.

Important remark: when running the line above, the container's access to the host is limited to the directory from which the command was run (bind-mounted at /data). You can replace pwd with any path of your choice --- but it is recommended to use a more limited path than "/" or "/home". The container will have full access to everything in the subfolders... and who knows what kind of code you plan to run in it ;)

If you are using Spark and want to access the Spark UI, you need to publish extra ports. Please use

docker run -ti -p 8888:8888 -p 4040-4049:4040-4049 -v "`pwd`":/data combient/docker-ds-toolbox:latest

The first SparkSession will have its UI on port 4040, so we need to bind that one to the host. If you want to run more than one session (not recommended) or start a second one before the first is properly closed (it will happen!), Spark will try ports above 4040. In the line above, Docker binds 10 consecutive ports to the host. Feel free to increase the range if that is not sufficient!
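For the curious, Spark's fallback is documented to be sequential: it tries spark.ui.port, then the next port up, for at most spark.port.maxRetries attempts (16 by default). A toy sketch of that probing logic with plain sockets:

```python
import socket

def first_free_port(start: int, max_retries: int = 16) -> int:
    """Mimic Spark's port fallback: try `start`, then start+1, ...
    spark.port.maxRetries defaults to 16."""
    for port in range(start, start + max_retries + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port  # bind succeeded, so the port is free
            except OSError:
                continue  # port taken, try the next one
    raise RuntimeError("no free port found")

print(first_free_port(4040))
```

This is why publishing 4040-4049 covers a handful of leftover sessions: each orphaned UI occupies one port, and the next session simply lands one higher.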

Please also refer to the Templates folder for an example of how to properly configure your SparkSession for this image.
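The Templates folder has the project's own version; as a rough sketch (all names and values below are illustrative, not the image's actual template), a session tuned for a containerized setup might look like:

```python
# Illustrative SparkSession setup for a container like this one.
# Assumes pyspark is installed, as it is in the image.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("toolbox-example")
    .master("local[*]")                    # use all CPUs granted to the container
    .config("spark.driver.memory", "8g")   # stay below the Docker memory limit
    .config("spark.ui.port", "4040")       # must be one of the published ports
    .getOrCreate()
)
```

Calling spark.stop() when you are done frees port 4040 for the next session.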

Run the container - RStudio

If you prefer RStudio to Jupyter, simply change the tag of the Docker image to "RStudio", i.e. run the container as

docker run -ti -p 8888:8888 -v "`pwd`":/data combient/docker-ds-toolbox:RStudio

Under the hood, this is only one extra layer on top of the standard 'latest' image. This means it takes fairly little space on your SSD if you already have the other one, and every change made to the 'latest' image is propagated here (although a click on 'Rebuild' on Docker Hub might be needed - I don't know yet).

This means that if you want an extra R package in RStudio, you need to add it to the R package list for 'latest', which is counter-intuitive but keeps both toolboxes in sync :)

NB: The login method for RStudio is currently a little suboptimal --- you will need to copy-paste the password from the terminal instead of simply clicking a link as for Jupyter. Let's call it a work in progress :)

Run the container - Batch jobs

Run a Python script as a batch job. Remember that you have to give the path as it looks inside the container, not on your local disk!

docker run -ti -p 8888:8888 -v "`pwd`":/data combient/docker-ds-toolbox:latest /usr/bin/python3 /data/relative/path/to/my/file.py

docker run -ti -p 8888:8888 -v "`pwd`":/data combient/docker-ds-toolbox:latest /usr/bin/python2 /data/relative/path/to/my/file.py
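The path translation is mechanical: whatever sits under the mounted directory on the host appears under /data in the container. A small helper sketch (container_path is a hypothetical name, not part of the toolbox):

```python
import os.path

def container_path(host_path: str, mount_src: str, mount_dst: str = "/data") -> str:
    """Translate a host path into the path seen inside the container,
    given that mount_src is bind-mounted at mount_dst."""
    rel = os.path.relpath(os.path.abspath(host_path), os.path.abspath(mount_src))
    if rel.startswith(".."):
        raise ValueError(f"{host_path} is outside the mounted directory {mount_src}")
    return os.path.join(mount_dst, rel)

# e.g. with the project mounted from /home/me/project:
print(container_path("/home/me/project/scripts/train.py", "/home/me/project"))
# -> /data/scripts/train.py
```

Anything outside the mounted directory simply does not exist inside the container, which is why the helper refuses such paths.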

Random tricks

  • To see the resource utilization in Docker, run docker stats (instead of top)
  • If things go wrong, kill your container by running docker ps followed by docker kill container_id (instead of ps aux and kill process_id)
  • To open a terminal session in your container, run docker exec -i -t container_id bash. It's a standard Debian running in there, and you can do anything you would do on a Debian machine, including apt-get install or pip install. Remember that, by design, your container and all changes made this way will disappear forever as soon as the Jupyter server inside the container is shut down.

Adding packages to a running container

For testing purposes or one-off runs

Python packages

From inside Jupyter, create a cell

!pip install my_package
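The same thing works from a plain code cell without the ! shortcut. Going through sys.executable guarantees the package lands in the interpreter the kernel actually runs on, which matters in an image shipping both Python 2 and 3. The snippet only queries pip; the install line is commented out, and my_package is a placeholder:

```python
import subprocess
import sys

# Verify that pip is available for this exact interpreter...
out = subprocess.check_output([sys.executable, "-m", "pip", "--version"], text=True)
print(out.strip())

# ...then install the same way (my_package is a placeholder):
# subprocess.check_call([sys.executable, "-m", "pip", "install", "my_package"])
```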

System packages

Inside Jupyter, create a cell

%%sh
apt-get update
apt-get install -y my_package

How to modify this image

  • git clone this repository
  • Switch to the dev branch
  • Make your changes to the build file or helper scripts
  • Run docker build . and make sure it works on your machine
  • Commit and push your changes to GitHub

Docker Hub will then automatically rebuild the image - this can take up to 20-30 min...

  • Get your fresh image with docker pull combient/docker-ds-toolbox:dev
  • Make a test run with docker run -ti -p 8888:8888 -v "`pwd`":/data combient/docker-ds-toolbox:dev

When you are completely sure that the new image works as it should, commit your changes to the testing branch and start using it. If no one has complained after a few days, make a pull request from testing to master; I'll have a look at it and decide in a completely arbitrary manner whether your changes are worth spreading to the whole of Combient :)
