evenco/datascience-base


Base image for data science projects


docker-datascience

NOTE: Since the Dockerfile currently keeps a static reference to the latest version of Java, this project will not build if a new version of the JDK is released under our feet. Until we have a better solution, consult the upstream JDK downloads page to find the latest version and URL parameters.

Data science-friendly Docker container

Goals:

  • Has all of the powerful interactive goodies that data scientists love.
  • Can run all production Even Python code.

Adding a new Python package

⚠️ All new external dependencies should be vetted by the security team. ⚠️

Our Python requirements are managed by Pipenv, which provides a way to get repeatable installations of Python dependencies. Pipenv is both fairly new in the Python ecosystem and fairly opinionated, but for our particular use case it works well.

This way we don't have to manually manage a huge tree of dependencies, but we get all the advantages of completely pinned requirements which will never change under our feet.

Process Overview

Here is an overview of the process for installing, testing, and merging a new Python package; more detailed instructions follow this section, and a condensed shell sketch appears after the list:

  1. Enter a production-like docker container (i.e. run ./dev_local.sh)
  2. Install the package following the steps below (shortcut: pipenv install package)
  3. Check that Pipfile.lock reflects only the changes you want to happen.
  4. Build docker-datascience on dockerhub by pushing a branch containing your changes to evenco/docker-datascience
  5. Test your changes in evenco/even-server by changing the FROM evenco/datascience-base line in the Dockerfiles in evenco/even-server and pushing a branch to GitHub. CircleCI will then test this branch. Link to the successful tests in your PR to this repo.
  6. Get your PR reviewed and merged.
  7. Update our Domino environment to match the changes you made to this repo.
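
Condensed into shell, the happy path looks roughly like this (the package and branch names are placeholders):

```sh
./dev_local.sh                    # 1. enter a production-like container
pipenv install mypackage          # 2. install the new package
git diff Pipfile.lock             # 3. review exactly what changed
git checkout -b add-mypackage     # 4. push a branch so Dockerhub builds it
git commit -am "Add mypackage"
git push -u origin add-mypackage
# 5-7: point even-server's Dockerfiles at the new tag, link the passing
# CI run in your PR, merge, then update the Domino environment.
```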

1. Installing the Package

You should complete these steps inside a Docker container that mimics the production Even environment. Performing the same steps locally should be fairly reliable, but the container is the safer choice. You can launch an appropriate container using ./dev_local.sh.

If you mess up and want to start over, you can remove the virtualenv managed by pipenv using pipenv --rm.

Follow this procedure to add a new package and update all dependencies:

  1. Decide whether this is a dev requirement.

    • If it will be used by production systems, it is a regular requirement.
    • If it will only be used as a tool in Python notebooks, in Domino, or in local scripts, it is a dev requirement.
    • The distinction is currently only organizational. All requirements will be installed anyway.
  2. Add the requirement. The easiest way to do this is to use pipenv install requirement. Use pipenv install --dev requirement if it is a dev requirement (see the sketch after this list).

    • Do not pin the requirement unless necessary.
    • If the requirement must be pinned, justify it with a comment in the Pipfile after you run e.g. pipenv install requirement==0.0.0.
  3. Edge case: also add any required 'optional' dependencies of your requirement. These cannot be added automatically by the subsequent steps. For example:

    • fs does not explicitly depend on fs-s3fs but you need both to open S3 buckets.
    • pandas optimizes certain functions if numexpr is installed.
    • dask only runs its diagnostic dashboard if bokeh is installed.
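
A minimal sketch of step 2 and the edge cases above, assuming a hypothetical package name:

```sh
pipenv install mypackage          # regular (production) requirement
pipenv install --dev mypackage    # dev-only requirement (notebooks, Domino, local scripts)
pipenv install mypackage==1.2.3   # pinned; add a justifying comment to the Pipfile
pipenv install fs fs-s3fs         # include required 'optional' companions explicitly
pipenv --rm                       # blow away the virtualenv and start over
```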

2. Understanding Changes

If this succeeds, many requirements are likely to change. Check the changes to Pipfile.lock and see whether any dependencies moved that you don't want to upgrade. If so, add them to the Pipfile, pinned to their previous versions, with a comment. Ideally, allow requirements that move by minor or bugfix updates to do so, and test them against CI. However, if you think a critical library might break something in production, it is safer to pin it in the Pipfile.

pipenv graph can show you why specific dependencies ended up where they did. If any changes are undesirable, revert those parts of Pipfile.lock.
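
For example, to trace and selectively undo unwanted changes:

```sh
pipenv graph                      # show the full dependency tree
pipenv graph --reverse            # show which packages pull in a given dependency
git diff Pipfile.lock             # review every version that moved
git checkout -p -- Pipfile.lock   # interactively revert unwanted hunks
```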

3. Building the Docker Image

Any changes need to be tested to ensure everything still works.

3a. Build docker-datascience remotely

This is a particularly useful method if you're working somewhere with a slow internet connection.

  1. Commit and push your changes
  2. Dockerhub will build the new image with a tag specific to the branch. Wait for this build to complete.
  3. Edit Dockerfile.local in even-server to point to evenco/datascience-base:your_tag-name.
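
One way to make that edit from the shell (GNU sed shown; on macOS, use sed -i ''; the tag name is a placeholder):

```sh
sed -i 's|^FROM evenco/datascience-base.*|FROM evenco/datascience-base:your_tag-name|' Dockerfile.local
```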
3b. Build docker-datascience locally

Run ./build_local.sh. This tags the new image as evenco/datascience-base so that subsequent builds in even-server use your local image.
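
You can confirm the tag is in place with the standard docker CLI:

```sh
docker images evenco/datascience-base   # the fresh local build should be listed as latest
```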

4. Testing

Unless you're very sure you won't need to (e.g. you just added a library to be used in notebooks), you should check that the server and scholar will still work. There's no exhaustive process for this, but there are two options:

4a. Testing on CI

After Dockerhub builds your docker-datascience image, make a branch of even-server where all Dockerfile references to evenco/datascience-base point to evenco/datascience-base:your_tag-name. Push this branch to GitHub, and CircleCI will build even-server using the new base image and run all of our tests on it.
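
A rough sketch of preparing that branch, assuming GNU sed and placeholder names:

```sh
cd even-server
git checkout -b test-new-base
grep -rl --include='Dockerfile*' 'FROM evenco/datascience-base' . |
  xargs sed -i 's|FROM evenco/datascience-base.*|FROM evenco/datascience-base:your_tag-name|'
git commit -am "Test against datascience-base:your_tag-name"
git push -u origin test-new-base
```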

4b. Testing locally

Go with this option if there is a chance that CI alone won't catch every error and extra manual testing is needed.

First, locally rebuild all downstream images that will be affected by the changes (e.g. dc build scholar scholar.common scholar-notebook.dev scholar.test). You may need to build with --no-cache and stop/remove previous containers. Then work through the following checks; a consolidated sketch appears after the list:

  • Run the even-server tests, e.g. run-tests scholar, run-tests api and so on.
  • Start the API (dc up api), start even-client in the iOS simulator, make an account, check scholar's logs (dc logs -f scholar) to make sure there are no errors.
  • If requirements related to dask or distributed changed, check that ETL tools still work.
    • This can be done in a notebook, or by installing and testing the updated requirements in Domino.
  • Test new data science libraries in a local notebook (dc up scholar.notebook).
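
Consolidated from the steps above (service names follow the examples given; adjust to your setup):

```sh
dc build --no-cache scholar scholar.common scholar-notebook.dev scholar.test
run-tests scholar && run-tests api    # even-server test suites
dc up api                             # then exercise even-client in the iOS simulator
dc logs -f scholar                    # watch for errors while you click around
dc up scholar.notebook                # try new libraries in a local notebook
```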

5. Review and Merge your PR

PRs to this repo should be reviewed both by the data team and by the security team (see #help-security on Slack).

6. Keeping the Even Base environment in Domino in sync

Do not forget to do this.

The Domino Even Base environment must be updated to pick up changes to this repo. Even Base is pinned to a specific commit hash of docker-datascience, so all that has to be updated in its Dockerfile is the hash.

  1. Get the commit ID: an easy way to do this is in GitHub.
    • navigate to the commit in GitHub (e.g. master, or perhaps a branch if you are testing changes)
    • press the y key
    • copy the hash out of your address bar
  2. Update Even Base: Click "Edit Dockerfile" in Domino. Under the Python dependencies section, replace the hash in the git checkout command (see the sketch after this list).
  3. Check the 'extra packages': Later on in the Dockerfile there is a step where we install extra 'domino-only' packages. Take some time to check whether we still need all of them, or whether any of them are now being installed in datascience-base.
  4. Build the environment: Add a descriptive revision summary e.g. "updated to latest docker-datascience" or "latest docker-datascience with new package X" and click the Build button.
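
As a sketch only (the exact Dockerfile contents will differ, and the hash below is a placeholder), the update amounts to swapping one hash; you can also resolve the commit ID from the command line:

```sh
# Resolve the commit ID locally instead of via the GitHub UI:
git rev-parse origin/master       # or the branch you are testing
# The pinned step in the Domino Dockerfile looks roughly like this;
# swap the placeholder hash for the ID you copied:
git checkout 0123456789abcdef0123456789abcdef01234567
```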

Deployment

After the PR is merged into master and the image has been built and published, any new production/staging release of even-server will use the new image; specifically, it will use the most recent image tagged as 'latest'.

Don't merge code that relies on these changes into even-server master until the image has been built and published. When even-server is next deployed to staging, keep an eye on it to make sure nothing breaks before deploying to production.

To use the new image locally, run even-server's ./scripts/install.

Other tools

Use the script compare-requirement.py to compare two requirements files: it shows version differences between shared packages, as well as packages present in only one file.
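
A hypothetical invocation (check the script itself for its exact interface; the file names are placeholders):

```sh
python compare-requirement.py requirements-old.txt requirements-new.txt
```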

Docker Pull Command

docker pull evenco/datascience-base