infrahelpers/dpp

By infrahelpers


Data Engineering / Data Processing Pipelines (DPP)

Container images focusing on Data Processing Pipelines (DPP)

Overview

This project produces OCI (Docker-compliant) images that provide ready-to-use environments for Data Processing Pipelines (DPP), deployable on the Modern Data Stack (MDS) on private or public clouds (e.g., AWS, Azure, GCP).

These images are based on AWS-supported Corretto.

These OCI images are aimed at deploying Data Engineering applications, typically Data Processing Pipelines (DPP), on the Modern Data Stack (MDS).

The author of this repository also maintains general purpose cloud Python OCI images in a dedicated GitHub repository and Docker Hub space.

Thanks to Docker multi-stage builds, a single Docker specification file can define two images: one for everyday data engineering work, and another to deploy the corresponding applications to production environments.
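
As an illustration of the multi-stage approach, the sketch below writes a hypothetical three-stage Dockerfile (a shared base, a development image and a production image) to a temporary directory. The stage names (`base`, `dev`, `prod`), the `amazoncorretto:8` base tag, the installed packages and the `app/` layout are assumptions made for the example; they are not the specification files shipped in this repository.

```shell
#!/bin/sh
# Sketch only: stage names, base tag, packages and paths are illustrative assumptions.
workdir="$(mktemp -d)"

cat > "$workdir/Dockerfile" <<'EOF'
# Shared base: JDK (from Corretto) plus a Python interpreter
FROM amazoncorretto:8 AS base
RUN yum -y install python3 python3-pip && yum clean all
WORKDIR /app
COPY requirements.txt .
RUN python3 -m pip install -r requirements.txt

# Everyday data engineering image: the base plus development tools
FROM base AS dev
RUN yum -y install git && yum clean all

# Production image: the base plus the application code only
FROM base AS prod
COPY app/ /app/
CMD ["python3", "-m", "app"]
EOF

echo "Dockerfile written to $workdir/Dockerfile"

# Each image can then be built separately with --target, for example:
#   docker build --target dev  -t my-dpp:dev  "$workdir"
#   docker build --target prod -t my-dpp:prod "$workdir"
```

Keeping both targets in one file means the development and production images share the same base layers, so they cannot drift apart on JDK or Python versions.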

The Docker images of this repository simply add various utilities so that they work out of the box with cloud vendors (e.g., the Azure and AWS command-line utilities) and cloud-native tools (e.g., Pachyderm), on top of the native images maintained by the AWS-supported Corretto project. They also add specific Python versions.

In the OCI image, Python packages are installed with the pip utility. For testing purposes, outside of the container, Python virtual environments may be set up with pyenv and pipenv, as detailed in the dedicated procedure on the Python induction notebook sub-project.

Any additional Python module may be installed either:

  • With pip and a requirements.txt dependency specification file:
$ python3 -m pip install -r requirements.txt
  • In a dedicated virtual environment, controlled by pipenv through local Pipfile (and potentially Pipfile.lock) files, which should be versioned:
$ pipenv --rm; pipenv install; pipenv install --dev

On the other hand, the OCI images install those modules globally.
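
To check which of the two situations applies in a given shell, one can ask the interpreter whether it is running inside a virtual environment. The helper below is a small sketch; the function name `python_env_kind` is hypothetical, and it relies on the standard comparison of `sys.prefix` with `sys.base_prefix`.

```shell
#!/bin/sh
# Hypothetical helper: reports whether python3 runs in a virtual environment
# ("virtualenv") or uses the global installation ("global").
python_env_kind() {
  if ! command -v python3 >/dev/null 2>&1; then
    echo "no-python"
    return 1
  fi
  # In a virtual environment, sys.prefix differs from sys.base_prefix.
  python3 -c 'import sys; print("virtualenv" if sys.prefix != sys.base_prefix else "global")'
}

python_env_kind || true
```

Inside the OCI images, this would be expected to report a global installation, whereas under `pipenv shell` or `pipenv run` it should report a virtual environment.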

The Docker images of this repository are intended to run any Data Engineering application or Data Processing Pipeline (DPP).

See also

Simple use

  • Download the Docker image:
$ docker pull infrahelpers/dpp:py311
  • Launch a Spark application:
$ docker run -it infrahelpers/dpp:py311
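
A common variation is to mount the current directory into the container and run a single command there. The wrapper below is a sketch under assumptions: the function name `dpp_run`, the `/work` mount point and the `DOCKER` override variable are all hypothetical, and `python3` is assumed to be on the image's `PATH`.

```shell
#!/bin/sh
# Hypothetical wrapper around `docker run` for the py311 image.
# DOCKER can be overridden (e.g., DOCKER=podman, or DOCKER=echo for a dry run).
dpp_run() {
  "${DOCKER:-docker}" run --rm \
    -v "$PWD":/work -w /work \
    infrahelpers/dpp:py311 "$@"
}

# Example (not executed here):
#   dpp_run python3 --version
```

The `--rm` flag removes the container once the command exits, which avoids accumulating stopped containers during day-to-day use.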

Build your own container image

  • Clone the Git repository:
$ mkdir -p ~/dev/infra && cd ~/dev/infra
$ git clone https://github.com/data-engineering-helpers/dpp.git
$ cd dpp
  • Build the OCI images (here with Docker, but any other tool may be used):
    • Amazon Linux 2 for Elastic MapReduce (EMR) 6 and Databricks, with JDK 8 and a single Python installation whose version can be chosen more freely:
$ docker build -t infrahelpers/cloud-python:pyspark-py311 pyspark-py311
    • Amazon Linux 2 for Elastic MapReduce (EMR) 6 and Databricks, with JDK 11 and a single Python installation whose version can be chosen more freely:
$ docker build -t infrahelpers/cloud-python:pyspark-py311-jdk11 pyspark-py311-jdk11
  • In addition to what Docker Hub builds, the CI/CD (GitHub Actions) pipeline also builds the infrahelpers/dpp images, from the pyspark-coretto-8-emr-dbs-universal-python/ directory, for two CPU architectures, namely the classical AMD64 and the newer ARM64.

  • (Optional) Push the newly built images to Docker Hub. That step is usually not needed, as the images are automatically built every time there is a change on GitHub:
$ docker login
$ docker push infrahelpers/dpp:py311
  • Choose which image should be the latest, tag it and upload it to Docker Hub:
$ docker tag infrahelpers/dpp:py311 infrahelpers/dpp:latest
$ docker push infrahelpers/dpp:latest
  • Shut down the running Docker container:
$ docker ps
CONTAINER ID IMAGE                    COMMAND                   CREATED        STATUS        PORTS                  NAMES
7b69efc9dc9a de/dpp                   "/bin/sh -c 'python …"    48 seconds ago Up 47 seconds 0.0.0.0:9000->8050/tcp vigilant_merkle
$ docker kill vigilant_merkle
vigilant_merkle
$ docker ps
CONTAINER ID IMAGE                    COMMAND                   CREATED        STATUS        PORTS                  NAMES
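
The multi-architecture build mentioned above can be approximated locally with Docker Buildx. This is a sketch, not the project's actual GitHub Actions workflow: the function name `dpp_buildx`, the `DOCKER` override variable and the choice of the `py311` tag are assumptions; only the directory name and the two target architectures come from the text above.

```shell
#!/bin/sh
# Hypothetical helper: build one image for both AMD64 and ARM64 with Buildx.
# Extra arguments (e.g., --push) are passed through to `docker buildx build`.
dpp_buildx() {
  "${DOCKER:-docker}" buildx build \
    --platform linux/amd64,linux/arm64 \
    -t infrahelpers/dpp:py311 \
    "$@" \
    pyspark-coretto-8-emr-dbs-universal-python/
}

# Example (not executed here), run from the root of the cloned repository:
#   dpp_buildx --push
```

Building for a non-native architecture generally requires an emulation layer such as QEMU or a remote builder, so such builds can be noticeably slower than native ones.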
