Hadoop is a famous software to achieve reliable, scalable and distributed computing.
However, setting an Hadoop cluster always troubles users by its configurations and networking complexity.
Moreover, in the research community, there is no easy way to quickly reproduce a running Hadoop cluster, which can help researchers to compare different existing approaches.
This project uses Docker to package an Hadoop cluster and all its dependencies as Docker images.
Users can use Docker containers to quickly assemble and provision an Hadoop infrastructure on demand.
We also provide some acknowledged benchmarks and self-adaptive scenarios atop of it in order to deliver a shared benchmark suite.
This project only depends on Docker and Bash script.
Before starting, please ensure that the below tools have been well installed.
The links following the software is the official tutorial or commands of installation.
docker-engine (version >= 1.9.1)
docker-machine (version >= 0.5.6)
$ curl -L https://github.com/docker/machine/releases/download/v0.5.6/docker-machine_linux-amd64 >/usr/local/bin/docker-machine && \ chmod +x /usr/local/bin/docker-machine
$ curl -L https://github.com/docker/machine/releases/download/v0.5.6/docker-machine_darwin-amd64 >/usr/local/bin/docker-machine && \ chmod +x /usr/local/bin/docker-machine
windows (using git bash)
$ if [[ ! -d "$HOME/bin" ]]; then mkdir -p "$HOME/bin"; fi && \ curl -L https://github.com/docker/machine/releases/download/v0.5.6/docker-machine_windows-amd64.exe > "$HOME/bin/docker-machine.exe" && \ chmod +x "$HOME/bin/docker-machine.exe"
bash (version >= 3)
This project can be cloned by the command:
$ git clone https://github.com/Spirals-Team/hadoop-benchmark.git
The directory contains several key components:
- It is the main bash of this project. The details can be found by running command:
$ ./cluster.sh --help
- With the default environment value, this script will create an example Hadoop cluster, which is composed of three nodes:
- Consul K/V store
- Hadoop control node
- Hadoop compute node 1
(Consul K/V store is used by docker to create an overlay network which help docker containers in different hosts to connect to each other.)
- Consul K/V store
- This directory contains base images. The built images should package the compiled Hadoop code, prepared configuration files and all required dependencies.
- This directory provides 4 benchmarks.
- The bundled Hadoop examples
- The bundled Spark examples
- SWIM - default 50 jobs
- This directory provides the source code of an alternative Hadoop images. Besides the basic Hadoop environment, a self-adaptive approach is also packaged in these images.
4. Getting Started Guide
This project aims to help users to quickly deploy a running Hadoop cluster.
It provides a set of commands to simplify and accelerate this deployment.
This guide illustrates how to execute this project.
Some example results are also provided to help users follow this guide.
Following this guide, users will create local Hadoop cluster, execute a benchmark, check the results, even rerun the benchmark with self-adaptive scenario (described in the following section).
The guide process contains the following phases:
- create cluster
- start hadoop
- run benchmark (experiment)
- check results
Furthermore, users can also create different Hadoop cluster to compare the performance difference between configurations (or scenario), following the further guide in next section:
- update Hadoop configuration / change to self-adaptive scenario
- restart hadoop
The guide process can also be visualized in the below image:
local_cluster file and modify the desired number of nodes, the
The default deployment is on local machine using ORACLE VirtualBox. One Hadoop instance requires about 2GB of RAM so be rather conservative to how many nodes can fit to your machine.
A number of additional virtualization environments are supported.
The list, including all their settings, can be found at the docker-machine website.
To use a different driver, simply update the
DRIVER variable export all require properties based on the driver requirements.
4.1 Creating a Cluster
$ CONFIG=local_cluster ./cluster.sh create-cluster
By modifying 'NUM_COMPUTE_NODES', users can create a different scale of Hadoop cluster after running the above command.
By default, this command will create the consul K/V store node firstly.
And then, a consul K/V store container will be launched in this node.
After the consul K/V store is ready, two nodes will be created as Hadoop control node and compute node 1.
The command will create an overlay network for all docker containers in this cluster at the end.
Users can use the following command to check the status of the nodes created:
$ CONFIG=local_cluster ./cluster.sh status-cluster
If the Hadoop cluster has been successfully created, the status-create would show some results like
4.2 Starting Hadoop
Once the step one finished, users can start hadoop cluster by another command:
$ CONFIG=local_cluster ./cluster.sh start-hadoop
This command will create a Hadoop control container in Hadoop control node.
This container shall run ResourceManager, NameNode, SecondaryNamenode and JobHistoryServer, which are master components of Hadoop.
When the Hadoop control container is running, a Hadoop compute container will be launched in each Hadoop Compute node.
This container supports NodeManager and Datanode, which are slave agents of Hadoop.
When the Hadoop cluster has been successfully started, there should be some results shown in the terminal:
PS: In order to be able to run the benchmarks, please do not forget to execute the below command reminded in the above result.
$ eval $(docker-machine env --swarm local-hadoop-controller)
The architecture of Hadoop cluster in local machine can be illustrated in the following schema:
4.3 Running Benchmarks
After step 2 successes, users can execute different benchmarks in the running Hadoop cluster.
4.3.1 Quick Test with Hadoop Bundled Examples
Users can quickly test the Hadoop cluster with the bundled example Hadoop commands.
$ ./benchmarks/hadoop-mapreduce-examples/run.sh pi 2 2
At the end of this command, if the terminal exposes some informations like below image, that means this Hadoop MapReduce application has been successfully treated. Users can obtain the details of the application in this terminal report.
4.3.2 Run HiBench
Users can also run HiBench on the Hadoop cluster, which is a famous hadoop benchmark provided by Intel. The launch command is like:
(Warning: each HiBench command will generates a lot of data (e.g. terasort input data is 1TB). So HiBench is not suitable for a local machine.)
4.3.3 Run SWIM
In this project, a SWIM example workloads is also provided.
This concurrent scenario contains 50 concurrent MapReduce jobs.
Users can launch this benchmark by the below command:
At the end of the test, all the job logs is stored in the directory "workGenLogs" in current directory.
(Warning: SWIM will launch many MapReduce applications running in parallel. So SWIM is also not suitable for a local machine.)
5. Self-Balancing Scenario
Before this scenario, please ensure the Hadoop cluster has been stopped.
Users can stop the Hadoop cluster by command:
$ CONFIG=local_cluster ./cluster.sh destroy-hadoop
In Self-balacing Scenario, besides a running Hadooop cluster, a self-adaptive approach is also running in Hadoop control node.
This approach automatically balances the job-parallelism and job-throughput based on the memory utilization of the whole Hadoop cluster.
To start the self-balancing scenario, all the commands to be used are similar to those in Section 4.
local_cluster configuration file should be replaced with the
The commands should be like:
$ CONFIG=scenarios/self-balancing-example/local_cluster ./cluster.sh start-hadoop
6. OPTION: Deployment in Grid5000
This project can also be used in Grid5000.
In Grid5000, this project uses generic driver to create the cluster with the existing hosts.
After the project is cloned in frontend, users should use
OAR commands to reserve hosts firstly and then install OS by
PS: because a docker overlay network requires kernel (version >= 3.1.6), we suggest users using the environment
jessie-x64-base to install
Debian Jessie in the hosts.
When users have finished the above step, there are several existing running hosts reserved.
Then, secondly, several parameters in
g5k_cluster file should be reconfigured.
Users should use their proper host-ips to replace the examples in some parameters like:
docker-machine can successfully create cluster, SSH private key file should be correctly set in
When the two steps has been finished, users can create a cluster and start hadoop in Grid5000 with the commands presented in section 4.
But the configuration file should be replaced with
The commands should be like:
$ CONFIG=g5k_cluster ./cluster.sh create-cluster
PS: in the frontend of Grid5000, there are no docker and docker-machine installed. So users should install the two tools in the user home. And add the user home to
Path environment parameter.