Docker image for Hadoop
Why did I build this image?
I needed a Hadoop image that:
- Is built to run with Java 8
- Allows the Hadoop configuration to be customized at runtime
- Can be used to run either a Hadoop server or a Hadoop client
There are already some Docker images for Hadoop, but I could not find one with the features I needed, so I decided to build this image.
Which features does this Docker image provide?
- Hadoop installed in pseudo-distributed mode
- Options that can be customized at runtime through environment variables:
  - NameNode hostname. Default is localhost
  - mapreduce.framework.name. Default is yarn
  - mapreduce.map.memory.mb. Default is 512 MB
  - yarn.app.mapreduce.am.resource.mb. Default is 512 MB
  - yarn.resourcemanager.hostname. Default is 0.0.0.0
  - yarn.nodemanager.delete.debug-delay-sec. Default is 600
  - yarn.scheduler.minimum-allocation-mb. Default is 32 MB
  - yarn.scheduler.maximum-allocation-mb. Default is 1024 MB
  - yarn.nodemanager.resource.memory-mb. Default is 2048 MB
  - yarn.nodemanager.vmem-check-enabled. Default is false
- Built following best practices for reducing image size (about 900 MB right now, which is much smaller than some other Hadoop Docker images)
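As a rough sketch of how the runtime options might be set for a single container, the function below passes two of them with `docker run -e`. The variable names `HD_NAMENODE_HOST` and `YARN_RESOURCEMANAGER_HOSTNAME` are taken from the Docker Compose example later in this README. The function name `start_hadoop_server`, its hostname argument, and the use of `--hostname` are assumptions made for illustration, and the variable names for the remaining options are not listed here, so check the image's source for them.

```shell
# Hypothetical helper: start a Hadoop server container and override two
# runtime options through environment variables.
# $1 is the hostname to use for the NameNode and the ResourceManager.
start_hadoop_server() {
  # --hostname sets the container's own hostname to the same value
  # (an assumption, so that the name resolves inside the container).
  docker run -d --name hadoop --hostname "$1" \
    -e HD_NAMENODE_HOST="$1" \
    -e YARN_RESOURCEMANAGER_HOSTNAME="$1" \
    binhnv/docker-hadoop
}
```

Calling `start_hadoop_server hadoop` would start the container with both hostnames set to `hadoop` instead of their defaults.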
How to use
Server and client in one container
    # Start a Hadoop server in a named container
    docker run -d --name hadoop binhnv/docker-hadoop
    # Open a shell in the Hadoop container
    docker exec -it hadoop bash
    # Create a directory
    /usr/local/hadoop/bin/hadoop fs -mkdir /test
    # List the root directory
    /usr/local/hadoop/bin/hadoop fs -ls /
Server and client in separate containers
Create a Dockerfile for the Hadoop client like this:
    FROM binhnv/docker-hadoop
    CMD dhcmd config && tail -f /dev/null
Create a Docker Compose configuration file like this:
    version: "2"
    services:
      hadoops:
        image: binhnv/docker-hadoop
      hadoopc:
        build: .
        environment:
          HD_NAMENODE_HOST: "hadoops"
          YARN_RESOURCEMANAGER_HOSTNAME: "hadoops"
Bring up the stack:
    docker-compose up -d --build
Now you can log in to the client container and run any Hadoop command there.
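One way to run commands in the client container without opening a shell first is sketched below. It assumes the stack was started with the Compose file above, so the client service is named `hadoopc`, and that the Hadoop binary lives at `/usr/local/hadoop/bin/hadoop` as in the single-container example. The wrapper name `hadoop_client` is hypothetical.

```shell
# Hypothetical wrapper: run a Hadoop command inside the client service.
# "hadoopc" is the client service name from the Compose file; the binary
# path matches the one used in the single-container example.
hadoop_client() {
  docker-compose exec hadoopc /usr/local/hadoop/bin/hadoop "$@"
}
```

With this in place, `hadoop_client fs -mkdir /test` followed by `hadoop_client fs -ls /` would repeat the single-container example from the client side. For an interactive session, `docker-compose exec hadoopc bash` opens a shell in the same container.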