evenco/datascience-base
Base image for data science projects
NOTE: The Dockerfile currently keeps a static reference to the latest version of Java, so this project will stop building whenever a new version of the JDK is released. Until we have a better solution, go here to find the latest version and URL parameters.
Data science-friendly Docker container
Goals:
⚠️ All new external dependencies should be vetted by the security team. ⚠️
Our Python requirements are managed by pipenv, which provides a way to get repeatable installations of Python dependencies. Pipenv is both fairly new in the Python ecosystem and fairly opinionated, but it works well for our particular use case.
This way we don't have to manually manage a huge tree of dependencies, but we still get all the advantages of completely pinned requirements, which will never change under our feet.
Here is an overview of the process for installing, testing, and merging a new Python package. More detailed instructions follow this section:
1. Launch a development container (`./dev_local.sh`).
2. Install the package (`pipenv install package`).
3. Check that `Pipfile.lock` reflects only the changes you want to happen.
4. Push your changes to a branch of `evenco/docker-datascience`.
5. Test against `evenco/even-server` by changing the `FROM evenco/datascience-base` line in the Dockerfiles in `evenco/even-server` and pushing a branch to GitHub. CircleCI will then test this branch. Link to the successful tests in your PR to this repo.

You should complete these steps inside a Docker container which mimics the production Even environment. Performing the same steps locally should be fairly reliable, but working in a container is recommended. You can launch an appropriate container using `./dev_local.sh`.
If you mess up and want to start over, you can remove the virtualenv managed by pipenv using `pipenv --rm`.
Follow this procedure to add a new package and update all dependencies:
Decide whether this is a dev requirement.
Add the requirement. The easiest way to do this is to use `pipenv install requirement`. Use `pipenv install --dev requirement` if it is a dev requirement. To install a specific version, use `pipenv install requirement==0.0.0`.
Edge case: also add any required 'optional' dependencies of your requirement. These cannot be added automatically by the subsequent steps. For example:
- `fs` does not explicitly depend on `fs-s3fs`, but you need both to open S3 buckets.
- `pandas` optimizes certain functions if `numexpr` is installed.
- `dask` only runs its diagnostic dashboard if `bokeh` is installed.

If this succeeds, many requirements are likely to change. Check the changes to `Pipfile.lock` and see if there are dependencies which you don't want to upgrade. If so, add them to the `Pipfile`, pinned to their previous version, with a comment. Ideally, you will allow requirements which move by minor or bugfix updates to update, and test them against CI. However, if you think that a critical library might break something in production, it is safer to pin it in the `Pipfile`.
`pipenv graph` can show you why specific dependencies are added in the places they landed. If changes are undesirable, revert those parts of the `Pipfile.lock`.
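One way to review what changed in `Pipfile.lock` is to compare the pinned versions before and after the update. The sketch below is a minimal illustration and not part of this repo: the function names are hypothetical, and it assumes the usual lock file layout (JSON with `default` and `develop` sections that map package names to entries carrying a `version` string such as `==1.2.3`). The versions in the usage example are made up.

```python
import json


def locked_versions(lock_text, section="default"):
    """Map package name -> pinned version for one section of a Pipfile.lock."""
    data = json.loads(lock_text)
    return {
        # Entries without a "version" key (e.g. VCS installs) map to "".
        name: entry.get("version", "").lstrip("=")
        for name, entry in data.get(section, {}).items()
    }


def changed_packages(old_lock_text, new_lock_text, section="default"):
    """Return {name: (old_version, new_version)} for every pin that differs.

    A package that was added or removed shows up with None on one side.
    """
    old = locked_versions(old_lock_text, section)
    new = locked_versions(new_lock_text, section)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


if __name__ == "__main__":
    # Illustrative lock file contents only; real files carry hashes and metadata.
    before = '{"default": {"pandas": {"version": "==0.23.4"}, "numpy": {"version": "==1.15.0"}}}'
    after = (
        '{"default": {"pandas": {"version": "==0.24.0"}, '
        '"numpy": {"version": "==1.15.0"}, "numexpr": {"version": "==2.6.9"}}}'
    )
    for name, (old_version, new_version) in changed_packages(before, after).items():
        print(f"{name}: {old_version} -> {new_version}")
```

In practice the two inputs could be the committed and the working-tree copies of `Pipfile.lock`; pass `section="develop"` to inspect dev requirements.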
Any changes need to be tested to ensure everything still works.
This is a particularly useful method if you're working somewhere with a slow internet connection.
evenco/datascience-base:your_tag-name
Run `./build_local.sh`. This tags the new image as `evenco/datascience-base` so that subsequent builds in even-server use your local image.
Unless you're very sure you won't need to (e.g. you just added a library to be used in notebooks), you should check that the server and scholar will still work. There's no exhaustive process for this, but there are two options:
After Docker Hub builds your docker-datascience branch, make a branch of even-server where all Dockerfile references to `evenco/datascience-base` point to `evenco/datascience-base:your_tag-name`. Push this branch to GitHub, and CircleCI will build even-server using the new base image and run all of our tests on it.
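Repointing every `FROM evenco/datascience-base` line by hand is easy to get wrong when several Dockerfiles are involved. The following is a minimal sketch of how the rewrite could be automated, not a script that exists in either repo: the function name is hypothetical, and it assumes the references are written as uppercase `FROM` instructions with an optional `:tag` and an optional trailing `AS stage`.

```python
import re

BASE_IMAGE = "evenco/datascience-base"

# Matches "FROM evenco/datascience-base" with an optional ":tag". The lookahead
# keeps similarly named images (e.g. "...-base-gpu") from matching.
_FROM_RE = re.compile(
    r"^(\s*FROM\s+)" + re.escape(BASE_IMAGE) + r"(?::\S+)?(?=\s|$)",
    flags=re.MULTILINE,
)


def retag_base_image(dockerfile_text, tag):
    """Return the Dockerfile text with every base image reference set to `tag`."""
    return _FROM_RE.sub(
        lambda match: f"{match.group(1)}{BASE_IMAGE}:{tag}",
        dockerfile_text,
    )


if __name__ == "__main__":
    example = (
        "FROM evenco/datascience-base\n"
        "RUN echo build\n"
        "FROM evenco/datascience-base:latest AS test\n"
    )
    print(retag_base_image(example, "your_tag-name"))
```

Applying the function to each Dockerfile in the even-server checkout and committing the result on a branch gives CircleCI the tagged base image to test against. Remember to revert the change before merging.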
Go with this option if there is a chance that tests in CI may not be sufficient to catch errors, thus requiring extra manual testing.
First, locally rebuild all downstream images which will be affected by the changes (e.g. `dc build scholar scholar.common scholar-notebook.dev scholar.test`). You may need to build with `--no-cache`, and `stop`/`rm` previous containers. Then:
- Run the tests: `run-tests scholar`, `run-tests api` and so on.
- Start the API (`dc up api`), start even-client in the iOS simulator, make an account, and check scholar's logs (`dc logs -f scholar`) to make sure there are no errors.
- If `dask` or `distributed` changed, check that ETL tools still work.
- Start the notebook service (`dc up scholar.notebook`).

PRs to this repo should be reviewed both by the data team and by the security team (see #help-security on Slack).
Do not forget to do this.
The Domino Even Base environment must be updated to pick up changes to this repo. Even Base is pinned to a specific commit hash of docker-datascience, so the only thing that has to be updated in its Dockerfile is the hash.
`y` key
`git checkout` command.

After the PR is merged into master and has been built and published, any new production/staging releases of even-server will use the new image - specifically, they will use the most recent image tagged as 'latest'.
Don't merge code that relies on these changes into even-server master until the image has been built and published. When even-server is next deployed to staging, keep an eye on it to make sure nothing breaks before deploying to production.
To use the new image locally, run even-server's `./scripts/install`.
Use the script `compare-requirement.py` to compare two requirements files, showing version differences between shared packages, and packages present in only one file.
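The script's own interface is not documented here, so the following is only a rough sketch of the kind of comparison it describes, not the contents of `compare-requirement.py`. The function names are hypothetical, and the sketch assumes both files pin packages as `name==version`, one per line.

```python
def parse_requirements(text):
    """Map package name -> pinned version for lines of the form 'name==version'."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blanks, options such as -r/-e, and unpinned entries
        name, rest = line.split("==", 1)
        # Drop environment markers ("; python_version ...") and trailing options.
        tokens = rest.split(";", 1)[0].split()
        if tokens:
            pins[name.strip().lower()] = tokens[0]
    return pins


def compare_requirements(left_text, right_text):
    """Report version differences for shared packages and packages unique to a side."""
    left = parse_requirements(left_text)
    right = parse_requirements(right_text)
    shared = set(left) & set(right)
    return {
        "changed": {
            name: (left[name], right[name])
            for name in sorted(shared)
            if left[name] != right[name]
        },
        "only_left": sorted(set(left) - set(right)),
        "only_right": sorted(set(right) - set(left)),
    }


if __name__ == "__main__":
    # Made-up file contents for illustration.
    old = "pandas==0.23.4\nnumpy==1.15.0  # pinned\nbokeh==1.0.0\n"
    new = "pandas==0.24.0\nnumpy==1.15.0\nnumexpr==2.6.9 ; python_version >= '3'\n"
    print(compare_requirements(old, new))
```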