This is a fork of ingensi/docker-hadoop-cdh-pseudo.

A basic Cloudera pseudo cluster running HDFS and YARN with Spark support.

This image is intended for development and continuous integration of Hadoop/Spark jobs, not for critical or production environments.

Quick Start

Launch a Hadoop cluster. Setting the hostname for your cluster is important for communication between linked containers.

docker run -d -P --hostname hadoop --name hadoop galloplabs/hadoop-cdh-pseudo
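
To follow the startup output while the services come up (an optional check, assuming the init script writes its progress to stdout):

docker logs -f hadoop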

The container takes a few minutes to come up, but once it's ready you should be able to connect to HDFS.

docker run -i -t --rm --link hadoop:hadoop galloplabs/hadoop-cdh-pseudo hdfs dfs -ls hdfs://hadoop:8020/
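
As an additional smoke test, you can write to HDFS as well (the target path is illustrative; it assumes the Cloudera init script created /tmp with open permissions):

docker run -i -t --rm --link hadoop:hadoop galloplabs/hadoop-cdh-pseudo hdfs dfs -mkdir hdfs://hadoop:8020/tmp/smoke-test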

Run MapReduce jobs with YARN. Note that we need to use our script, which replaces key variables in the Hadoop configs based on the HADOOP_HOST environment variable.

docker run -i -t --rm --link hadoop:hadoop -e HADOOP_HOST=hadoop galloplabs/hadoop-cdh-pseudo /root/ yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 16 100
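
You can also run commands directly inside the running cluster container with docker exec, which skips the config rewriting because the in-container configs already point at the cluster hostname. For example, to list applications and find an application ID for the next step:

docker exec -i -t hadoop yarn application -list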

View the logs of a YARN job.

docker run -i -t --rm --link hadoop:hadoop -e HADOOP_HOST=hadoop galloplabs/hadoop-cdh-pseudo /root/ yarn logs -applicationId application_1425487566328_0003

Run Spark jobs with YARN.

docker run -i -t --link hadoop:hadoop -e HADOOP_HOST=hadoop galloplabs/hadoop-cdh-pseudo /root/ spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 /usr/lib/spark/lib/spark-examples*.jar 100
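
Note that in yarn-cluster mode the driver runs inside YARN, so the job output (the computed value of Pi) lands in the YARN application logs rather than on your console. Retrieve it with yarn logs as shown above, substituting the application ID printed by spark-submit:

docker exec -i -t hadoop yarn logs -applicationId <application_id>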

Environment Variables

At startup, an init script is launched in the container. It uses global variables defined in /root/ to run the startup procedure. To enable or disable some initialization steps, you can mount a file that defines your own global variables. Here is the list of all defined global variables:

  • FORMAT_HDFS: Format the HDFS namenode, default value is true.
  • INITIALIZE_HDFS: Run Cloudera's HDFS init script, default value is true.
  • HADOOP_HOST: The hostname of the linked Hadoop server when running commands from a container. See: files/
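
For example, to keep an existing HDFS across restarts instead of formatting it on every start (setting these as container environment variables, as noted in the volume list below):

docker run -d -P --hostname hadoop --name hadoop -e FORMAT_HDFS=false -e INITIALIZE_HDFS=false galloplabs/hadoop-cdh-pseudo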

Volumes

  • /var/lib/hadoop-hdfs/cache - HDFS data path. You can prevent formatting of HDFS with the FORMAT_HDFS=false and INITIALIZE_HDFS=false environment variables.
  • /var/log/hadoop-hdfs - HDFS log path.
  • /var/log/hadoop-yarn - YARN log path.
  • /tmp/hadoop_conf - Any files located at this path will be copied to /etc/hadoop/conf forcefully on startup; use this to override the default Hadoop configs. Note this only happens with the default launch command (see files/
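
For example, to persist HDFS data on the host and inject a custom configuration on startup (the host-side paths are illustrative):

docker run -d -P --hostname hadoop --name hadoop -v /srv/hadoop/hdfs:/var/lib/hadoop-hdfs/cache -v /srv/hadoop/conf:/tmp/hadoop_conf -e FORMAT_HDFS=false -e INITIALIZE_HDFS=false galloplabs/hadoop-cdh-pseudo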


Ports

All Hadoop ports are exposed by default.

  • HDFS datanode
    • 50010 (TCP): dfs.datanode.address (DataNode data transfer port)
    • 1004 secure (TCP): dfs.datanode.address
    • 50075 (TCP): dfs.datanode.http.address
    • 1006 secure (TCP): dfs.datanode.http.address
    • 50020 (TCP): dfs.datanode.ipc.address
  • HDFS namenode
    • 8020 (TCP): fs.default.name / fs.defaultFS
    • 50070 (TCP): dfs.http.address / dfs.namenode.http-address
    • 50470 secure (TCP): dfs.https.address / dfs.namenode.https-address
  • YARN resourcemanager
    • 8032 (TCP): yarn.resourcemanager.address
    • 8030 (TCP): yarn.resourcemanager.scheduler.address
    • 8031 (TCP): yarn.resourcemanager.resource-tracker.address
    • 8033 (TCP): yarn.resourcemanager.admin.address
    • 8088 (TCP): yarn.resourcemanager.webapp.address
  • YARN nodemanager
    • 8040 (TCP): yarn.nodemanager.localizer.address
    • 8042 (TCP): yarn.nodemanager.webapp.address
    • 8041 (TCP): yarn.nodemanager.address
  • MAPREDUCE historyserver
    • 10020 (TCP): mapreduce.jobhistory.address
    • 19888 (TCP): mapreduce.jobhistory.webapp.address
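
Because the examples above start the container with -P, each exposed port is published to a random port on the Docker host. Look up a mapping with docker port, shown here for the namenode web UI:

docker port hadoop 50070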

Oracle license

This container includes an Oracle JDK. By using this container, you accept the Oracle Binary Code License Agreement for Java SE available here:
