cloudera/clusterdock
Single-host, multi-node deployment of CDH 5.8 and Cloudera Manager 5.8
To enable a multi-node cluster deployment on a single Docker host (as requested by CDH users for testing and self-learning), we have created a CDH topology for Apache HBase’s clusterdock framework, a simple, Python-based library designed to orchestrate multi-node cluster deployments on a single host.
Unlike tools such as Docker Compose, which are great at managing microservice architectures, clusterdock orchestrates multiple containers to act more like traditional hosts. In this paradigm, a four-node Apache Hadoop cluster uses four containers. Inside Cloudera, we’ve found it to be a great tool for testing and prototyping (but it is neither intended nor supported for production use).
To begin, install Docker on your host. Older versions of Docker lack the embedded DNS server and correct reverse hostname lookup required by Cloudera Manager, so ensure you’re running Docker 1.11.0 or newer. Also, keep in mind that the host you use to run your CDH cluster must meet the same resource requirements as a normal multi-node deployment. Therefore, we recommend at least 16GB of free RAM for a two-node cluster and at least 24GB of free RAM for a four-node cluster.
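The version requirement above can be checked before going further. The following is a minimal bash sketch, not part of clusterdock: `version_ge` is a hypothetical helper that compares the first three numeric components of two version strings, and the check assumes the `docker` CLI is on your PATH and can reach the daemon.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version $1 >= version $2.
# Only the first three dot-separated numeric components are compared;
# suffixes such as "-ce" are ignored.
version_ge() {
  local -a have want
  local i h w
  IFS=. read -r -a have <<< "$1"
  IFS=. read -r -a want <<< "$2"
  for i in 0 1 2; do
    h=${have[i]:-0}; w=${want[i]:-0}
    h=${h%%[!0-9]*}; w=${w%%[!0-9]*}       # drop any non-numeric suffix
    h=$((10#${h:-0})); w=$((10#${w:-0}))   # force base 10 (handles "03")
    if (( h > w )); then return 0; fi
    if (( h < w )); then return 1; fi
  done
  return 0
}

if command -v docker >/dev/null 2>&1; then
  # Ask the daemon for its version; empty if the daemon is unreachable.
  server_version=$(docker version --format '{{.Server.Version}}' 2>/dev/null || true)
  if version_ge "$server_version" 1.11.0; then
    echo "Docker ${server_version} meets the 1.11.0 minimum."
  else
    echo "Docker '${server_version:-unknown}' is older than 1.11.0, or the daemon is unreachable." >&2
  fi
else
  echo "docker was not found on PATH." >&2
fi
```

This only covers the Docker version; free RAM still needs to be checked separately against the figures above.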
For ease of use and portability, clusterdock itself is packaged in a Docker image, and its binaries are executed by running containers from this image and specifying an action. This can be done by sourcing the clusterdock.sh helper script and then calling the script of interest with the clusterdock_run command. As is always a good idea when executing code from the internet, examine the script to convince yourself of its safety, and then run:
source /dev/stdin <<< "$(curl -sL http://tiny.cloudera.com/clusterdock.sh)"
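If you would rather read the helper script before anything is executed, one option is to save it to a file first. This is a minimal bash sketch under that assumption: `fetch_clusterdock_helper` is a hypothetical function name, and the default destination `./clusterdock.sh` is chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of clusterdock): download the helper script
# to a local file so it can be reviewed before it is sourced.
fetch_clusterdock_helper() {
  local url=${1:-http://tiny.cloudera.com/clusterdock.sh}
  local dest=${2:-./clusterdock.sh}
  # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
  curl -fsSL "$url" -o "$dest" && echo "Saved to ${dest}; review it before sourcing."
}
```

After calling `fetch_clusterdock_helper`, you could read the file with `less ./clusterdock.sh` and, once satisfied, load it with `source ./clusterdock.sh`, which should have the same effect as the one-line command above.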
With everything ready to go, let’s get started!
Starting a cluster with clusterdock takes advantage of an abstraction known as a topology; in short, a basic set of steps needed to coordinate pre-built Docker images into a functioning multi-container cluster. If all you’d like is a two-node cluster (with default options being used for everything else), simply type:
clusterdock_run ./bin/start_cluster cdh
When this is run, clusterdock will start two containers from images stored on Docker Hub. As they contain a full Cloudera Manager/CDH deployment, downloading the images the first time may take upwards of five minutes, but this is a one-time cost as the images are then cached locally by Docker. As the cluster starts, clusterdock manages communication between containers through Docker’s bridge networking driver and also updates your host’s /etc/hosts file to make it easier to connect to your container cluster.
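To see what was added to /etc/hosts, you can search it for the cluster's network name. The sketch below is an illustration only: `hosts_entries_for` is a hypothetical helper, and it assumes the default network name is `cluster` (as the `node-1.cluster` example in this document suggests) and that the entries are ordinary hosts-file lines containing each node's fully qualified name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print hosts-file lines that mention names ending in
# the given network name. The second argument defaults to /etc/hosts.
hosts_entries_for() {
  local network=$1 hosts_file=${2:-/etc/hosts}
  grep -F ".${network}" "$hosts_file"
}

# Assumption: a default two-node cluster uses the network name "cluster".
hosts_entries_for cluster || echo "No .cluster entries found in /etc/hosts." >&2
```

Running `docker ps` alongside this lists the containers that are currently up.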
Once the cluster is running and the health of your CDH services is validated, you can access the cluster through the Cloudera Manager UI (the address and port number are shown at the end of the startup process). You can also SSH directly to nodes of your cluster using the clusterdock_ssh function, where the argument is the fully qualified domain name of the node. For example, running:
clusterdock_ssh node-1.cluster
drops us into a shell without having to deal with setting up SSH keys on the host:
Warning: Permanently added 'node-1.cluster,192.168.124.2' (RSA) to the list of known hosts.
Last login: Mon Jul 25 11:11:36 2016 from 192.168.124.1
[root@node-1 ~]#
clusterdock supports a number of options that can provide for a more interesting testing environment. We provide a few examples in the sections below, but full usage instructions can be seen by including --help in the invocation of the start_cluster script:
clusterdock_run ./bin/start_cluster --help
or the cdh topology itself:
clusterdock_run ./bin/start_cluster cdh --help
If your machine has the available resources, clusterdock allows you to start n-node clusters in which one node acts as the CM server (and has the majority of CDH service roles assigned to it) and the remaining n-1 nodes act as secondaries with most CDH slave services assigned to them. As an example, to create a four-node CDH cluster in which the containers are named node-1.testing, node-2.testing, node-3.testing, and node-4.testing:
clusterdock_run ./bin/start_cluster -n testing cdh --primary-node=node-1 --secondary-nodes='node-{2..4}'
In this case, the clusterdock CDH topology takes advantage of Cloudera Manager’s host template functionality to apply the roles on node-2 to node-3 and node-4 as well. That is, with only two images, clusterdock allows for arbitrarily sized cluster deployments. (Again, this is a full cluster running on a single host with a single host’s worth of resources. Be careful!)
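The container names in the four-node example follow from the network name and the node names. Here is a minimal bash sketch of that naming, assuming (as the example suggests) that each fully qualified name is the node name joined to the value passed with -n. `expected_fqdns` is a hypothetical helper for illustration, not part of clusterdock. In the real command the `node-{2..4}` pattern is quoted so that your shell passes it through unexpanded; in the sketch it is left unquoted so that bash expands it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<node>.<network>" for each node name given.
expected_fqdns() {
  local network=$1; shift
  local node
  for node in "$@"; do
    printf '%s.%s\n' "$node" "$network"
  done
}

# Unquoted, bash expands node-{2..4} to node-2 node-3 node-4.
expected_fqdns testing node-1 node-{2..4}
# prints node-1.testing, node-2.testing, node-3.testing, node-4.testing (one per line)
```

Any of the printed names can then be used as the argument to clusterdock_ssh.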
The clusterdock CDH topology allows you to provide a list of the service types to include in your cluster. This functionality uses the --include-service-types option and removes any service type from Cloudera Manager not included in the list. For example, to create a two-node cluster with only HDFS, Apache ZooKeeper, Apache HBase, and YARN present:
clusterdock_run ./bin/start_cluster cdh --include-service-types=HDFS,ZOOKEEPER,HBASE,YARN
Similarly, an --exclude-service-types option can be used to explicitly leave out services. To create a four-node cluster (machine-1.mycluster, machine-2.mycluster, machine-3.mycluster, machine-4.mycluster) without Impala present:
clusterdock_run ./bin/start_cluster -n mycluster cdh --primary-node=machine-1 --secondary-nodes='machine-{2..4}' --exclude-service-types=IMPALA
For a full list of service types, refer to the Cloudera Manager documentation.
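When the list of service types grows, it can be easier to keep it in a bash array and join it with commas. The sketch below is one way to do that, shown as a dry run: it only prints the command rather than running it, and the variable names `services` and `service_list` are chosen for illustration.

```shell
#!/usr/bin/env bash
# Service types to keep, taken from the include example above.
services=(HDFS ZOOKEEPER HBASE YARN)

# Join the array with commas; IFS is changed only inside the subshell.
service_list=$(IFS=,; echo "${services[*]}")

# Dry run: print the command instead of executing it.
echo clusterdock_run ./bin/start_cluster cdh "--include-service-types=${service_list}"
```

Removing the leading `echo` would run the command, provided the clusterdock.sh helper has been sourced in the current shell. The same joined list works with --exclude-service-types.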
While the single-node approach was good for learning and ramp-up, the new Cloudera QuickStart for Docker is also excellent for testing and development. It provides an easy way to prototype new ideas and use cases, as well as to try out new functionality and the latest Cloudera releases. (Just remember, it is neither intended nor supported for production use.)
Lastly, we’d love to know what you think. Please post any and all feedback in our Community Forum; we’d like to hear both positive and constructive suggestions for future improvements.
See Cloudera's documentation and Cloudera's website for other information, including the license agreement associated with this image.
docker pull cloudera/clusterdock