Python tool to run docker containers in sequence, for reproducible computational analysis pipelines.


Installing and running via Docker is recommended.

Docker Image

  1. Pull the docker-pipeline image (optional; docker run will pull the image automatically)

     docker pull dukegcb/docker-pipeline

From Source

  1. Clone this GitHub repository

     git clone

  2. Install dependencies with pip

     cd docker-pipeline
     pip install -r requirements.txt

Write a Pipeline

Pipelines are defined by YAML files. They specify which Docker images to run and which files/parameters to provide to the containers at runtime. Each step must specify an image, either by name (e.g. dleehr/filesize) or by image ID (e.g. 10baa1fc6046).
If the image accepts environment variables, specify these as infiles, outfiles, and parameters. More on that below.

    name: Total Size
    steps:
      - name: Get size of file 1
        image: dleehr/filesize
        infiles:
          CONT_INPUT_FILE: /data/raw/file1
        outfiles:
          CONT_OUTPUT_FILE: /data/step1/size
      - name: Get size of file 2
        image: dleehr/filesize
        infiles:
          CONT_INPUT_FILE: /data/raw/file2
        outfiles:
          CONT_OUTPUT_FILE: /data/step2/size
      - name: Add sizes of files
        image: dleehr/add
        infiles:
          CONT_INPUT_FILE1: /data/step1/size
          CONT_INPUT_FILE2: /data/step2/size
        outfiles:
          CONT_OUTPUT_FILE: /data/step3/total_size

Paths to files should be specified as absolute paths on the Docker host. See Files and Volumes for more details.
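As a quick illustration of the absolute-path rule (a plain-Python sketch, not part of docker-pipeline), os.path.isabs distinguishes valid host paths from relative ones:

```python
import os.path

# Absolute host paths are required; a relative path would be ambiguous
# once docker-pipeline mounts it into a container.
for path in ("/data/raw/file1", "data/raw/file1"):
    print(path, "->", "absolute" if os.path.isabs(path) else "NOT absolute")
```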



Run a Pipeline

Note: You do not need to clone this repository to run docker-pipeline. You only need a working Docker installation; the docker-pipeline image will be pulled automatically.

docker run \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /path/to/total_size.yaml:/pipeline.yaml:ro \
  dukegcb/docker-pipeline /pipeline.yaml

Since docker-pipeline creates and starts Docker containers, it must have access to /var/run/docker.sock on the host. It must also have access to the pipeline YAML file, which is mounted as /pipeline.yaml in the example above.

Alternatively, if you have cloned the repository and installed its dependencies, run the pipeline directly:

python /path/to/total_size.yaml

Advanced Usage

docker-pipeline extends YAML with custom tags, allowing simple file-name operations and access to variables specified at runtime. This is handy for file names, which make little sense to hard-code in a config file.

For example, the file names in the above pipeline can be replaced with runtime variables:


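A sketch of that substitution for the first step; the exact tag syntax is illustrative, with FILE1 supplied on the command line:

```yaml
- name: Get size of file 1
  image: dleehr/filesize
  infiles:
    CONT_INPUT_FILE: !var FILE1   # filled in at runtime
  outfiles:
    CONT_OUTPUT_FILE: /data/step1/size
```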
And this pipeline can be run with:

python total_size.yaml \
  FILE1=/data/raw/file1 \
  FILE2=/data/raw/file2
The following tags are available:

  • !var: replace a variable specified on the command-line
  • !join: concatenate multiple values into a single string. Useful for building up file paths
  • !change_ext: change the extension of a filename
  • !base: get the base file name (remove parent directories) from a path

In some cases, tags can be chained together.
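As a rough illustration (plain Python, not docker-pipeline's actual implementation), the four tags correspond to simple string and path operations, and chaining them is just function composition:

```python
import os

def var(variables, name):
    # !var: look up a value supplied on the command line (e.g. FILE1=/data/raw/file1)
    return variables[name]

def join(*parts):
    # !join: concatenate several values into one string
    return "".join(str(p) for p in parts)

def change_ext(path, new_ext):
    # !change_ext: replace the file extension
    root, _ = os.path.splitext(path)
    return root + "." + new_ext

def base(path):
    # !base: strip parent directories from a path
    return os.path.basename(path)

variables = {"FILE1": "/data/raw/sample.fastq"}
# Chained: take FILE1, keep only its base name, swap the extension,
# and prepend an output directory.
out = join("/data/step1/", change_ext(base(var(variables, "FILE1")), "size"))
print(out)  # /data/step1/sample.size
```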

Connecting to Docker

docker-pipeline uses docker-py to communicate with Docker. It has been tested with Boot2Docker on OS X, provided $(boot2docker shellinit) has been executed. It also works when connecting to a local Docker host.

Files and Volumes

Without explicit access provided at runtime, docker containers cannot access filesystems or paths outside the container. docker-pipeline handles this transparently, allowing you to reference files named in your pipeline, with some added safety features.

First, each container only gets volume access to the directories specified in its infiles and outfiles; docker-pipeline mounts only the innermost subdirectory of each path. Keep in mind that containers get root access to these directories, so do NOT place the data you analyze in / or other sensitive directories.

Second, volumes created for infiles are mounted read-only, and volumes for outfiles are mounted read-write. This prevents containers from modifying their step's source data. Of course, one step's outfile may be another step's infile; the file is writable only in the step where it is an outfile. Ideally, raw data is only ever passed as an infile, so it remains protected.
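The mounting rules above can be sketched as follows (a hypothetical helper, not docker-pipeline's real API), using docker-py-style bind dictionaries:

```python
import os

def volume_binds(infiles, outfiles):
    """Map each file's innermost directory to a volume bind:
    read-only for infiles, read-write for outfiles."""
    binds = {}
    for path in infiles.values():
        d = os.path.dirname(path)
        binds[d] = {"bind": d, "mode": "ro"}
    for path in outfiles.values():
        d = os.path.dirname(path)
        # If a directory holds both an infile and an outfile, it must be writable.
        binds[d] = {"bind": d, "mode": "rw"}
    return binds

binds = volume_binds(
    infiles={"CONT_INPUT_FILE": "/data/raw/file1"},
    outfiles={"CONT_OUTPUT_FILE": "/data/step1/size"},
)
print(binds)
# {'/data/raw': {'bind': '/data/raw', 'mode': 'ro'},
#  '/data/step1': {'bind': '/data/step1', 'mode': 'rw'}}
```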
