Public Repository

Last pushed: 10 months ago
Short Description
Single-host, multi-node deployment of CDH 5.8 and Cloudera Manager 5.8
Full Description

Background

To enable a multi-node cluster deployment on the same Docker host (as requested by CDH users for testing and self-learning), we have created a CDH topology for Apache HBase’s clusterdock framework, a simple, Python-based library designed to orchestrate multi-node cluster deployments on a single host.

Unlike existing tools such as Docker Compose, which are great at managing microservice architectures, clusterdock orchestrates multiple containers to act more like traditional hosts. In this paradigm, a four-node Apache Hadoop cluster uses four containers. Inside Cloudera, we’ve found it to be a great tool for testing and prototyping (but it is neither intended nor supported for production use).

Getting Started

To begin, install Docker on your host. Older versions of Docker lack the embedded DNS server and correct reverse hostname lookup required by Cloudera Manager, so ensure you’re running Docker 1.11.0 or newer. Also, keep in mind that the host you use to run your CDH cluster must meet the same resource requirements as a normal multi-node deployment. Therefore, we recommend at least 16GB of free RAM for a two-node cluster and at least 24GB of free RAM for a four-node cluster.
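The two prerequisites above (Docker 1.11.0 or newer, and enough free RAM) can be checked before starting. The following is a minimal sketch, not part of clusterdock: the helper names version_ge and available_ram_gb are hypothetical, and the RAM helper assumes a Linux host whose /proc/meminfo reports MemAvailable.

```shell
# Succeeds when dotted version $1 is greater than or equal to $2.
# Compares the first three dot-separated fields numerically.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" \
        | sort -t. -k1,1n -k2,2n -k3,3n \
        | head -n 1)
    [ "$lowest" = "$2" ]
}

# Prints the available memory in whole GB (assumes Linux /proc/meminfo).
available_ram_gb() {
    awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Possible usage (requires a running Docker daemon):
#   version_ge "$(docker version --format '{{.Server.Version}}')" 1.11.0 \
#       || echo "Docker 1.11.0 or newer is required" >&2
#   [ "$(available_ram_gb)" -ge 16 ] \
#       || echo "Less than 16GB available for a two-node cluster" >&2
```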

For ease of use and portability, clusterdock itself is packaged in a Docker image, and its binaries are executed by running containers from this image and specifying an action. This can be done by sourcing the clusterdock.sh helper script and then calling the script of interest with the clusterdock_run command. As is always a good idea when executing code from the internet, examine the script to convince yourself of its safety, and then run:

source /dev/stdin <<< "$(curl -sL http://tiny.cloudera.com/clusterdock.sh)"
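If you would rather review the helper script before sourcing it, one option is to save it to a file first. This is only a sketch: fetch_for_review is a hypothetical helper name, and the local file name is arbitrary.

```shell
# Download a script to a local file so it can be read before it is sourced.
# Prints the line count of the saved file as a quick sanity check.
fetch_for_review() {
    url=$1
    dest=$2
    curl -fsSL "$url" -o "$dest" || return 1
    wc -l "$dest"
}

# Possible usage:
#   fetch_for_review http://tiny.cloudera.com/clusterdock.sh ./clusterdock.sh
#   less ./clusterdock.sh       # read it
#   source ./clusterdock.sh     # then load it into the current shell
```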

With everything ready to go, let’s get started!

Basic Usage

Starting a cluster with clusterdock takes advantage of an abstraction known as a topology; in short, a basic set of steps needed to coordinate pre-built Docker images into a functioning multi-container cluster. If all you’d like is a two-node cluster (with default options being used for everything else), simply type:

clusterdock_run ./bin/start_cluster cdh

When this is run, clusterdock will start two containers from images stored on Docker Hub. As they contain a full Cloudera Manager/CDH deployment, downloading the images the first time may take upwards of five minutes, but this is a one-time cost as the images are then cached locally by Docker. As the cluster starts, clusterdock manages communication between containers through Docker’s bridge networking driver and also updates your host’s /etc/hosts file to make it easier to connect to your container cluster.
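To see what was added to your hosts file for a cluster, a small filter like the one below may help. It is a sketch under assumptions: cluster_hosts is a hypothetical helper, and it assumes the entries end in the cluster network name (cluster by default), as the node names in this document do.

```shell
# Print "IP hostname" for every hosts-file entry ending in ".<network>".
# $1: cluster network name; $2: hosts file (defaults to /etc/hosts).
cluster_hosts() {
    network=$1
    hosts_file=${2:-/etc/hosts}
    awk -v suffix=".$network" '
        !/^[[:space:]]*#/ {
            for (i = 2; i <= NF; i++)
                if (substr($i, length($i) - length(suffix) + 1) == suffix)
                    print $1, $i
        }' "$hosts_file"
}

# Possible usage:
#   cluster_hosts cluster
#   docker network inspect cluster   # details of the Docker network itself
```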

Once the cluster is running and the health of your CDH services is validated, you can access the cluster through the Cloudera Manager UI (the address and port number are shown at the end of the startup process). You can also SSH directly to nodes of your cluster using the clusterdock_ssh function where the argument is the fully qualified domain name of the node. For example, running:

clusterdock_ssh node-1.cluster

drops us into a shell without having to deal with setting up SSH keys on the host:

Warning: Permanently added 'node-1.cluster,192.168.124.2' (RSA) to the list of known hosts.
Last login: Mon Jul 25 11:11:36 2016 from 192.168.124.1
[root@node-1 ~]#
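Before opening the Cloudera Manager UI mentioned above, it can be useful to confirm that its port is accepting connections. The sketch below is not clusterdock code: wait_for_port is a hypothetical helper that relies on bash's /dev/tcp feature and the coreutils timeout command, and the host and port should be whatever the startup process printed.

```shell
# Poll until a TCP port accepts connections or the timeout elapses.
# $1: host; $2: port; $3: timeout in seconds (default 180).
# The elapsed count is approximate, since each attempt may itself take up to a second.
wait_for_port() {
    host=$1
    port=$2
    timeout_sec=${3:-180}
    elapsed=0
    while [ "$elapsed" -lt "$timeout_sec" ]; do
        # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT.
        if timeout 1 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "Timed out after $timeout_sec seconds waiting for $host:$port" >&2
    return 1
}

# Possible usage (substitute the address and port shown at startup):
#   wait_for_port node-1.cluster 7180 && echo "Cloudera Manager is reachable"
```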

Advanced Usage

clusterdock supports a number of options that can provide for a more interesting testing environment. We provide a few examples in the sections below, but full usage instructions can be seen by including --help in the invocation of the start_cluster script:

clusterdock_run ./bin/start_cluster --help

or the cdh topology itself:

clusterdock_run ./bin/start_cluster cdh --help

Larger Cluster Deployments

If your machine has the available resources, clusterdock allows you to start n-node clusters in which one node acts as the CM server (and has the majority of CDH service roles assigned to it) and the remaining n-1 nodes act as secondaries with most CDH slave services assigned to them. As an example, to create a four-node CDH cluster in which the containers are named node-1.testing, node-2.testing, node-3.testing, and node-4.testing:

clusterdock_run ./bin/start_cluster -n testing cdh --primary-node=node-1 --secondary-nodes='node-{2..4}'

In this case, the clusterdock CDH topology takes advantage of Cloudera Manager’s host template functionality to distribute the roles on node-2 to node-3 and node-4. That is, with only two images, clusterdock allows for arbitrarily-sized cluster deployments. (Again, this is a full cluster running on a single host with a single host’s worth of resources. Be careful!)
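The container names in the example above follow a simple pattern: node name, a dot, then the network name given with -n. The sketch below only illustrates that naming so you know which hosts to expect; cluster_fqdns is a hypothetical helper and does not call clusterdock.

```shell
# Print the fully qualified names <prefix>-<i>.<network> for i in first..last.
# $1: network name; $2: node name prefix; $3: first index; $4: last index.
cluster_fqdns() {
    network=$1
    prefix=$2
    first=$3
    last=$4
    i=$first
    while [ "$i" -le "$last" ]; do
        printf '%s-%s.%s\n' "$prefix" "$i" "$network"
        i=$((i + 1))
    done
}

# Possible usage: list the four hosts of the "testing" cluster.
#   cluster_fqdns testing node 1 4
```

Each printed name can then be passed to clusterdock_ssh as described under Basic Usage.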

Specifying Services to Include (or Exclude)

The clusterdock CDH topology allows you to provide a list of the service types to include in your cluster. This functionality uses the --include-service-types option and removes any service type from Cloudera Manager not included in the list. For example, to create a two-node cluster with only HDFS, Apache ZooKeeper, Apache HBase, and YARN present:

clusterdock_run ./bin/start_cluster cdh --include-service-types=HDFS,ZOOKEEPER,HBASE,YARN

Similarly, an --exclude-service-types option can be used to explicitly leave out services. To create a four-node cluster (machine-1.mycluster, machine-2.mycluster, machine-3.mycluster, machine-4.mycluster) without Impala present:

clusterdock_run ./bin/start_cluster -n mycluster cdh --primary-node=machine-1 --secondary-nodes='machine-{2..4}' --exclude-service-types=IMPALA

For a full list of service types, refer to the Cloudera Manager documentation.
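Both service-type options above take a comma-separated list. If you keep the service names in a script, a small joiner avoids hand-editing the list. This is a sketch only: join_by_comma is a hypothetical helper, and the service names are the ones used in the example above.

```shell
# Join all arguments with commas, e.g. "HDFS ZOOKEEPER" becomes "HDFS,ZOOKEEPER".
join_by_comma() {
    # Inside the subshell, "$*" joins arguments with the first character of IFS.
    (IFS=,; printf '%s\n' "$*")
}

# Possible usage:
#   services=$(join_by_comma HDFS ZOOKEEPER HBASE YARN)
#   clusterdock_run ./bin/start_cluster cdh --include-service-types="$services"
```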

Getting Help

While the single-node approach was good for learning and ramp-up, the new Cloudera QuickStart for Docker is also excellent for test and development. It provides an easy way to prototype new ideas and use cases, as well as try out new functionality and the latest Cloudera releases. (Just remember, it’s neither intended nor supported for production use.)

Lastly, we’d love to know what you think. Please post any and all feedback in our Community Forum; we’d like to hear both positive and constructive suggestions for future improvements.

See Cloudera's documentation and Cloudera's website for other information, including the license agreement associated with this image.

Owner
cloudera

Comments (18)
tilakacharya
3 months ago

Any plans to update the images to the latest version of CDH?

dockerk8s007
8 months ago

Getting error:
[root@seals1 ~]# docker --version
Docker version 1.12.5, build 047e51b/1.12.5
[root@seals1 ~]# docker pull cloudera/clusterdock
Using default tag: latest
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
[root@seals1 ~]# sudo systemctl start docker
[root@seals1 ~]# docker pull cloudera/clusterdock
Using default tag: latest
Trying to pull repository docker.io/cloudera/clusterdock ...
latest: Pulling from docker.io/cloudera/clusterdock
046d0f015c61: Pull complete
174ef1e4e314: Pull complete
3ae21ba03e3d: Pull complete
6beebbaa4384: Pull complete
Digest: sha256:87324b26e30aec51aea3d5a2e493d477e954b9f44a3c84e9d50269ff1c189a77
[root@seals1 ~]# source /dev/stdin <<< "$(curl -sL http://tiny.cloudera.com/clusterdock.sh)"
[root@seals1 ~]#
[root@seals1 ~]# clusterdock_run ./bin/start_cluster cdh
INFO:clusterdock.topologies.cdh.actions:Pulling image docker.io/cloudera/clusterdock:cdh580_cm581_primary-node. This might take a little while...
Trying to pull repository docker.io/cloudera/clusterdock ...
cdh580_cm581_primary-node: Pulling from docker.io/cloudera/clusterdock
3eaa9b70c44a: Pull complete
99ba8e23f310: Pull complete
c9c08e9a0d03: Pull complete
7434a9a99daa: Pull complete
d52d9baa0ee6: Pull complete
00ca224ba661: Pull complete
Digest: sha256:9feffbfc5573262a6efbbb0a969efde890e63ced8a4ab3c9982f4f0dc607e429
INFO:clusterdock.topologies.cdh.actions:Pulling image docker.io/cloudera/clusterdock:cdh580_cm581_secondary-node. This might take a little while...
Trying to pull repository docker.io/cloudera/clusterdock ...
cdh580_cm581_secondary-node: Pulling from docker.io/cloudera/clusterdock
3eaa9b70c44a: Already exists
99ba8e23f310: Already exists
c9c08e9a0d03: Already exists
7434a9a99daa: Already exists
d52d9baa0ee6: Already exists
f70deff0592f: Pull complete
Digest: sha256:251778378b362adff4e93b99d423848216e4823965dabd1bd4c41dbb4c79afcf
INFO:clusterdock.cluster:Network (cluster) not present, creating it...
INFO:clusterdock.cluster:Successfully setup network (name: cluster).
INFO:clusterdock.cluster:Successfully started node-2.cluster (IP address: 192.168.123.3).
INFO:clusterdock.cluster:Successfully started node-1.cluster (IP address: 192.168.123.2).
INFO:clusterdock.cluster:Started cluster in 8.79 seconds.
INFO:clusterdock.topologies.cdh.actions:Changing server_host to node-1.cluster in /etc/cloudera-scm-agent/config.ini...
INFO:clusterdock.topologies.cdh.actions:Restarting CM agents...
cloudera-scm-agent is already stopped
cloudera-scm-agent is already stopped
Starting cloudera-scm-agent: bash: /var/log/cloudera-scm-agent/cloudera-scm-agent.out: No such file or directory
[FAILED]
Starting cloudera-scm-agent: bash: /var/log/cloudera-scm-agent/cloudera-scm-agent.out: No such file or directory
[FAILED]

Fatal error: run() received nonzero return code 1 while executing!

Requested: service cloudera-scm-agent restart
Executed: /bin/bash -l -c "service cloudera-scm-agent restart"

Aborting.

Fatal error: run() received nonzero return code 1 while executing!

Requested: service cloudera-scm-agent restart
Executed: /bin/bash -l -c "service cloudera-scm-agent restart"

Aborting.

Fatal error: One or more hosts failed while executing task '_task'

Aborting.
INFO:clusterdock.topologies.cdh.actions:Waiting for Cloudera Manager server to come online...
Traceback (most recent call last):
  File "./bin/start_cluster", line 70, in <module>
    main()
  File "./bin/start_cluster", line 63, in main
    actions.start(args)
  File "/root/clusterdock/clusterdock/topologies/cdh/actions.py", line 108, in start
    CM_SERVER_PORT, timeout_sec=180)
  File "/root/clusterdock/clusterdock/utils.py", line 52, in wait_for_port_open
    timeout_sec, address, port
Exception: Timed out after 180 seconds waiting for 192.168.123.2:7180 to be open.

medhavib
9 months ago

The clusterdock.sh script has a few issues that I discovered when trying to SSH into the containers: 'sudo' is missing from two commands, on line #141 (docker ps) and line #142 (docker inspect). Without the two sudos, SSH did not work for me. Please incorporate those fixes.

daviddaedalus
a year ago

You don't, though, or at least that's my understanding of Docker. The point is to expose the ports and let Docker and the host work out the mapping when you fire up the container.

dimaspivak
a year ago

I think the issue with exposing too many ports is the need to then say which host ports they're being directed to. As it stands, you can access any ports on any container as long as you're on the same host running those containers. If you're not running your application from that host, a SOCKS 5 tunnel is one easy way to expose more ports without making the clusterdock topology's API get pretty gross.

seachange
a year ago

Hi dimaspivak,

I got your latest change adding the Hue port, thanks.
What I would like is an option to expose all web console ports (for example, Spark); exposing each HTTP port (for example, the HDFS web UI) should be optional.

Currently, when clusterdock invokes docker-py, it uses
EXPOSE + publish-all mode.
Would it be possible to use host_configs.port_bindings?

1. As a first step, consider making "Cluster::ports" in cluster.py configurable via the entry bash script.

2. Consider adding host_configs.port_bindings to the Cluster class, also configurable via the entry bash script.

dimaspivak
a year ago

Hey seachange,

Please try again. I've just pushed a fix to expose the Hue port on the host.

seachange
a year ago

Currently, only CM port 7180 is forwarded to 12001.

seachange
a year ago

Hi,

How do I add other port forwarding? For example,
allowing the Hue UI port (8888) to be accessed from outside.

In the current output of
clusterdock_run ./bin/start_cluster cdh --help

I cannot find anything that allows me to add additional
port forwarding.

dimaspivak
a year ago

Hi tigarchen,

How much RAM have you made available to the Linux VM running Docker on your Mac? Please open a thread at the community forum.