infrahelpers/dpp
Data Engineering / Data Processing Pipelines (DPP)
This project produces OCI (Docker-compliant) images, which provide ready-to-use environments for Data Processing Pipelines (DPP), to be deployed on a Modern Data Stack (MDS), be it on private or public clouds (e.g., AWS, Azure, GCP).
These images are based on the AWS-supported Corretto images.
These OCI images are aimed at deploying Data Engineering applications, typically Data Processing Pipelines (DPP), on a Modern Data Stack (MDS).
The author of this repository also maintains general-purpose cloud Python OCI images in a dedicated GitHub repository and Docker Hub space.
Thanks to Docker multi-stage builds, one can easily have, in the same Docker specification file, two images: one for everyday data engineering work, and the other to deploy the corresponding applications onto production environments.
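As an illustration, a multi-stage Docker specification may look as follows. This is a minimal sketch, assuming an Amazon Linux 2 based Corretto image (where Python 3.7 is the system Python); the stage names, the /opt/python prefix and the app/pipeline.py entry point are hypothetical, not taken from this repository:

# Stage 1 - development image: full toolchain for everyday work
FROM amazoncorretto:11 AS dev
RUN yum -y install python3 python3-pip
COPY requirements.txt /tmp/requirements.txt
RUN python3 -mpip install --prefix=/opt/python -r /tmp/requirements.txt

# Stage 2 - production image: only the run-time artifacts are kept
FROM amazoncorretto:11 AS prod
COPY --from=dev /opt/python /opt/python
# The site-packages path depends on the system Python version (3.7 here)
ENV PYTHONPATH=/opt/python/lib/python3.7/site-packages
COPY app/ /app/
CMD ["python3", "/app/pipeline.py"]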
The Docker images of this repository simply add, on top of the native images maintained by the AWS-supported Corretto project, various utilities to make them work out of the box with cloud vendors (e.g., the Azure and AWS command-line utilities) and with cloud-native tools (e.g., Pachyderm). They also add specific Python versions.
In the OCI images, Python packages are installed by the pip utility.
For testing purposes, outside of the container, Python virtual environments may be set up thanks to Pyenv and pipenv, as detailed in the dedicated procedure on the Python induction notebook sub-project.
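A typical local setup may look as follows (a sketch; the Python version is just an example):
$ pyenv install 3.11.4 && pyenv local 3.11.4
$ python3 -mpip install -U pipenv
$ pipenv install --dev
$ pipenv shell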
Any additional Python module may be installed either:
- with pip and some requirements.txt dependency specification file:
$ python3 -mpip install -r requirements.txt
- with pipenv, through local Pipfile (and potentially Pipfile.lock) files, which should be versioned:
$ pipenv --rm; pipenv install; pipenv install --dev
The OCI images, on the other hand, install those modules globally (system-wide within the image).
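To add extra modules on top of a published image, without rebuilding it from this repository, one may derive a child image. A minimal sketch (the pandas module and the mydpp tag are arbitrary examples):
$ cat > Dockerfile <<'EOF'
FROM infrahelpers/dpp:py311
RUN python3 -mpip install --no-cache-dir pandas
EOF
$ docker build -t mydpp:py311 .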
The Docker images of this repository are intended to run any Data Engineering application / Data Processing Pipeline (DPP).
$ docker pull infrahelpers/dpp:py311
$ docker run -it infrahelpers/dpp:py311
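For instance, a local pipeline script may be executed inside the container by mounting the current directory (a sketch; my_pipeline.py is a hypothetical script name):
$ docker run --rm -v $PWD:/work -w /work infrahelpers/dpp:py311 python3 my_pipeline.py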
$ mkdir -p ~/dev/infra && cd ~/dev/infra
$ git clone https://github.com/data-engineering-helpers/dpp.git
$ cd dpp
$ docker build -t infrahelpers/dpp:pyspark-py311 pyspark-py311
$ docker build -t infrahelpers/dpp:pyspark-py311-jdk11 pyspark-py311-jdk11
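To quickly check a freshly built image, one may run a simple command in it (a smoke test, with the tags matching the build commands above; the commands simply print the Python and Java versions):
$ docker run --rm infrahelpers/dpp:pyspark-py311 python3 -V
$ docker run --rm infrahelpers/dpp:pyspark-py311-jdk11 java -version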
In addition to what the Docker Hub builds, the CI/CD (GitHub Actions) pipeline also builds the infrahelpers/dpp images, from the pyspark-coretto-8-emr-dbs-universal-python/ directory, on two CPU architectures, namely the classical AMD64 and the newer ARM64.
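To reproduce such a multi-architecture build locally, one may rely on Docker Buildx (a sketch; the multiarch builder name and the multiarch-test tag are placeholders, and --push assumes a prior docker login):
$ docker buildx create --name multiarch --use
$ docker buildx build --platform linux/amd64,linux/arm64 -t infrahelpers/dpp:multiarch-test --push pyspark-coretto-8-emr-dbs-universal-python/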
(Optional) Push the newly built images to Docker Hub. That step is usually not needed, as the images are automatically built every time there is a change on GitHub.
$ docker login
$ docker push infrahelpers/dpp:py311
$ docker tag infrahelpers/dpp:py311 infrahelpers/dpp:latest
$ docker push infrahelpers/dpp:latest
A running container may then be listed and shut down as follows (the container name below comes from a sample session):
$ docker ps
CONTAINER ID   IMAGE    COMMAND                  CREATED          STATUS          PORTS                    NAMES
7b69efc9dc9a   de/dpp   "/bin/sh -c 'python …"   48 seconds ago   Up 47 seconds   0.0.0.0:9000->8050/tcp   vigilant_merkle
$ docker kill vigilant_merkle
vigilant_merkle
$ docker ps
CONTAINER ID   IMAGE    COMMAND                  CREATED          STATUS          PORTS                    NAMES