Docker data science toolbox


A one-stop shop for doing not-so-big data science in any environment :)

Why?

The whole idea is that we should be able to throw code at each other without worrying about missing libraries, mismatched versions, and so on.

If everybody uses this image and participates in its maintenance, we will have an easy-to-set-up environment common to everyone. The "pip install" storm after a git pull will soon be over! ;)

How to use it?


Beef up your whale

Start the Docker daemon and go to Preferences. In the Advanced tab, increase the Docker resource limits to something decent. 7 CPUs and 14 GB of RAM work well for me. Restart the Docker daemon when done.

NB: One advantage of running in a Docker container is that Python will simply crash when it hits the memory limit, instead of sending your computer into an uncontrollable swapping spree that renders your machine unusable, shortens your SSD's life and likely ends in nothing better than a hard reset ;)
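Besides the global limit in the preferences, a single container can also be capped when it is started. The sketch below assumes the standard --memory and --cpus flags of docker run. The helper name toolbox_run_limited, the DRY_RUN switch and the 8g / 4 CPU defaults are illustrative choices, not something shipped with this image.

```shell
# Hypothetical helper (not part of the image): start the toolbox with
# per-container limits. --memory and --cpus are standard docker run flags.
toolbox_run_limited() {
    dir="$1"           # project folder to mount at /data
    mem="${2:-8g}"     # example default, adjust to your machine
    cpus="${3:-4}"     # example default, adjust to your machine
    set -- docker run -ti --memory="$mem" --cpus="$cpus" \
        -p 8888:8888 -v "$dir":/data combient/docker-ds-toolbox:latest
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"      # only print the command
    else
        "$@"           # actually start the container
    fi
}
```

Running DRY_RUN=1 toolbox_run_limited "$PWD" 4g 2 only prints the command, which is handy for checking the values before starting anything heavy.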

Download the latest version of the image

docker pull combient/docker-ds-toolbox:latest

Run the container - Jupyter

Go to the folder containing your project and run

docker run -ti -p 8888:8888 -v "`pwd`":/data combient/docker-ds-toolbox:latest

This has the same result as running the jupyter notebook command in the same place, except that it instantiates a Docker container and uses the Python distribution and libraries from the image.

One caveat with the image in its current state is that it won't automagically open your web browser. Instead, a URL will be displayed in the terminal. If you use iTerm, you can simply "cmd + click" on it to open the Jupyter file browser.
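If the URL has scrolled out of view, it can usually be fished out of the container logs again. This is only a sketch: the helper name toolbox_url is made up, and it assumes that the Jupyter server logs a line containing http://...token=..., which may differ between Jupyter versions. The container ID comes from docker ps.

```shell
# Hypothetical helper: print the last tokenised Jupyter URL found in the
# logs of the given container. Assumes the server logged a line of the
# form http://<host>:<port>/?token=<token>.
toolbox_url() {
    docker logs "$1" 2>&1 | grep -o 'http://[^ ]*token=[^ ]*' | tail -n 1
}
```

Usage: toolbox_url container_id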

Important remark: with the line above, the container only sees the directory from which the command was run, mounted at /data. You can replace `pwd` with any path of your choice, but it is recommended to use a more limited path than "/" or "/home". The container will have full access to everything in the subfolders... and who knows what kind of code you plan to run in it ;)
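One way to make that recommendation hard to forget is a small wrapper that refuses overly broad mounts. This is a sketch, not part of the image: the names is_safe_mount and toolbox_run_in are made up, and treating "$HOME" as too broad is an extra assumption on top of the "/" and "/home" mentioned above.

```shell
# Hypothetical guard: succeed only if the folder is narrower than
# "/", "/home" or your whole home directory.
is_safe_mount() {
    case "$1" in
        /|/home|/home/|"$HOME"|"$HOME"/) return 1 ;;
        *) return 0 ;;
    esac
}

# Start the toolbox on the given folder (default: current directory).
toolbox_run_in() {
    dir="${1:-$PWD}"
    if ! is_safe_mount "$dir"; then
        echo "refusing to mount $dir: choose a narrower folder" >&2
        return 1
    fi
    docker run -ti -p 8888:8888 -v "$dir":/data combient/docker-ds-toolbox:latest
}
```

Usage: toolbox_run_in ~/projects/my-analysis (the path is just an example).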

Run the container - RStudio

If you prefer RStudio to Jupyter, simply change the tag of the Docker image to "RStudio", i.e. run the container as

docker run -ti -p 8888:8888 -v "`pwd`":/data combient/docker-ds-toolbox:RStudio

Under the hood, this is only one extra layer on top of the standard 'latest' image. This means that it takes fairly little space on your SSD if you already have the other one, and every change made to the 'latest' image is propagated here (although a click on 'rebuild' on Docker Hub might be needed, I don't know yet).

This also means that if you want an extra R package in RStudio, you need to add it to the R package list of the 'latest' image, which is counter-intuitive but keeps both toolboxes in sync :)

NB: The login method for RStudio is currently a little suboptimal: you will need to copy-paste the password from the terminal instead of simply clicking a link as for Jupyter. Let's call it work in progress :)

Random tricks

  • To see the resource utilisation in Docker, run docker stats (instead of top)
  • If things go wrong, kill your container by running docker ps followed by docker kill container_id (instead of ps -aux and kill process_id)
  • To open a terminal session in your container, run docker exec -i -t container_id bash. It's a standard Debian system running in there and you can do anything you would do on a Debian machine, including apt-get install or pip install. Remember that, by design, your container and all changes made in this way will disappear forever as soon as the Jupyter server inside the container is shut down.
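The tricks above can be wrapped so that you never have to copy a container ID by hand. This is a minimal sketch with made-up function names. It relies on docker ps -q with the ancestor filter to find containers started from the toolbox image, and it assumes you run the 'latest' tag.

```shell
IMAGE="combient/docker-ds-toolbox:latest"

# IDs of running containers started from the toolbox image.
toolbox_ids() {
    docker ps -q --filter "ancestor=$IMAGE"
}

# Kill every running toolbox container (docker kill accepts several IDs).
toolbox_kill() {
    ids=$(toolbox_ids)
    [ -n "$ids" ] || { echo "no running toolbox container" >&2; return 1; }
    docker kill $ids   # unquoted on purpose: one argument per ID
}

# Open a bash session in the first running toolbox container.
toolbox_shell() {
    id=$(toolbox_ids | head -n 1)
    [ -n "$id" ] || { echo "no running toolbox container" >&2; return 1; }
    docker exec -i -t "$id" bash
}
```

Note that toolbox_kill stops all running toolbox containers, not just one, so use plain docker kill container_id if you run several and only want to stop one.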

How to modify this image

  • git clone this repository
  • Switch to the dev branch
  • Make changes to the build file or helper script
  • Run docker build . and make sure it works on your machine
  • Commit and push your changes to GitHub
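For the "make sure it works on your machine" step, it helps to give the local build a name so you can start it right away. A rough sketch, to be run from the root of the checkout on the dev branch: the tag ds-toolbox:local and the function name are arbitrary choices, picked so that the test build does not shadow the images pulled from Docker Hub.

```shell
# Hypothetical local tag for test builds.
LOCAL_TAG="ds-toolbox:local"

# Build the image from the current checkout, then start it on a folder
# (default: current directory). Stops if the build fails.
toolbox_build_and_try() {
    dir="${1:-$PWD}"
    docker build -t "$LOCAL_TAG" . || return 1
    docker run -ti -p 8888:8888 -v "$dir":/data "$LOCAL_TAG"
}
```

Usage: toolbox_build_and_try ~/projects/my-analysis (again, just an example path).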

Docker Hub will then automatically rebuild the image. This can take up to 20-30 minutes...

  • Get your fresh image with docker pull combient/docker-ds-toolbox:dev
  • Make a test run with docker run -ti -p 8888:8888 -v "`pwd`":/data combient/docker-ds-toolbox:dev

When you are completely sure that the new image works as it should, commit your changes to the testing branch and start using it. If no one has complained after a few days, make a pull request from testing to master. I'll have a look at it and decide in a completely arbitrary manner if your changes are worth spreading to the whole of Combient :)
