Last pushed: 4 months ago
Short Description
An image for launching a Spark and Zeppelin sandbox
Full Description

Zeppelin on Docker

This project helps you experiment with Spark in Python, R, and Scala through Zeppelin, a web-based notebook GUI.
For demonstrations, please refer to the official Zeppelin site.

Versions

  • Spark: 2.0
  • Zeppelin: 0.7.1
  • Python: 2.7
  • R: 3.3.1
  • Scala: 2.11

Install Docker

Follow the instructions on the Docker getting started page for your OS (Ubuntu, CentOS, Debian).

Start Application

Zeppelin and Spark run inside the Docker container. To access the web GUI and import data, you have to specify port forwarding and a volume attachment in the docker run command.

Check Available Ports

Find two available ports on your host machine for accessing Zeppelin and the Spark UI from outside the container.
On a Linux console, use the command sudo netstat -nlp to list the ports that are currently being listened on, for example:

root@ubuntu:~# sudo netstat -nlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1186/sshd                
tcp6       0      0 :::22                   :::*                    LISTEN      1186/sshd       
Active UNIX domain sockets (only servers)

This means port 22 is occupied by the process named sshd, so we should choose integers (1-65535) other than 22 for Zeppelin and the Spark UI.
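As an alternative to reading the netstat output by eye, the check can be scripted. The following is a minimal sketch, assuming Python 3 is available on the host; the helper name is_port_free is hypothetical and not part of this image. It simply tries to bind a TCP socket to the candidate port and reports whether that succeeds.

```python
import socket


def is_port_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now.

    Hypothetical helper for illustration. Ports below 1024 usually need
    root privileges, so they are reported as not free for a normal user.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Address already in use, or permission denied.
            return False
        return True


# Candidate host ports taken from the example in this guide.
for candidate in (32770, 32771):
    status = "free" if is_port_free(candidate) else "in use"
    print(candidate, status)
```

A port reported as free here can still be taken by another process before the container starts, so treat the result as a hint rather than a guarantee.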

Choose the Binding Ports on Host

Continuing with the previous example, choose two integers between 1 and 65535 other than 22, for example 32770 and 32771.

Specify the Volume Attachment

To keep your notebooks and data persistent, choose a directory on your host machine to store them in.
For example, /usr/zeppelin_dir.
Note that the directory does not need to exist beforehand; Docker will create it for you if it does not exist.

Docker Run Command

Putting the above examples together, we use the following command to run Zeppelin in a Docker container:

root@ubuntu:~# sudo docker run -itd \
> -p 32771:8080 \
> -p 32770:4040 \
> -v /usr/zeppelin_dir:/workspace \
> robinlin/zeppelin \
> /bin/bash

Or as a one-line command:

sudo docker run -itd -p 32771:8080 -p 32770:4040 -v /usr/zeppelin_dir:/workspace robinlin/zeppelin /bin/bash
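If you prefer to assemble the command from the values chosen in the earlier steps, a small script can do it. This is only a sketch, assuming Python 3.8 or later on the host; build_docker_run is a hypothetical helper name. It maps the first host port to the container's port 8080 (Zeppelin), the second to 4040 (Spark UI), and the host directory to /workspace, then prints the resulting command instead of running it.

```python
import shlex


def build_docker_run(zeppelin_port, spark_ui_port, host_dir,
                     image="robinlin/zeppelin"):
    """Assemble the docker run argument list described in this guide.

    Hypothetical helper for illustration; it only builds the list.
    """
    return [
        "sudo", "docker", "run", "-itd",
        "-p", "{}:8080".format(zeppelin_port),   # host port -> Zeppelin web GUI
        "-p", "{}:4040".format(spark_ui_port),   # host port -> Spark UI
        "-v", "{}:/workspace".format(host_dir),  # host directory -> /workspace
        image,
        "/bin/bash",
    ]


# Example values: the two ports and the directory chosen earlier in this guide.
command = build_docker_run(32771, 32770, "/usr/zeppelin_dir")
print(shlex.join(command))
```

To actually start the container from the script, the list could be passed to subprocess.run(command, check=True).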

Access Zeppelin from Browser

In the above example, the Zeppelin service is bound to port 32771, while 32770 is for the Spark UI.
Open your browser and enter a URL such as http://hostmachine.example.tw:32771 in the address bar; you will see the Zeppelin welcome page.

Note: If your host is a virtual machine in a cloud such as GCE or AWS EC2, make sure your firewall rules allow TCP traffic on the chosen ports, 32771 and 32770 in this example.
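When the welcome page does not load, it helps to know whether the port is reachable at all. The sketch below, assuming Python 3 on the machine you browse from, sends a plain HTTP GET and reports whether any HTTP response came back; is_reachable is a hypothetical helper name, and the check cannot tell a firewall block apart from a container that is not running.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=5.0):
    """Return True if an HTTP GET to url receives any HTTP response.

    Hypothetical helper for illustration.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a blocked port.
        return False
```

For the example in this guide, you would call is_reachable("http://hostmachine.example.tw:32771") with your own host name in place of the placeholder.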

Import and Load Data

As of Zeppelin 0.6.1, uploading and downloading data through the web GUI is not supported (ref). You can only make data available by placing it in the volume attached from the host.
The following steps show how to load and save your data. Continuing from the examples above, we assume the attached volume is /usr/zeppelin_dir.

  1. Data Import: Copy your data to /usr/zeppelin_dir on the host, e.g. cp user_data.csv /usr/zeppelin_dir
  2. Load Data: In your Zeppelin notebook, use the path /workspace/user_data.csv to read the file, e.g. user_data = sc.textFile('/workspace/user_data.csv').
  3. Save Data: In your Zeppelin notebook, save your data under the directory /workspace, e.g. user_data2.saveAsTextFile('/workspace/user_data2.csv'). Spark writes the output as a directory of part files, which you can then find on the host at /usr/zeppelin_dir/user_data2.csv.
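The steps above rely on the fact that the same file has two names: one on the host and one inside the container. The following sketch, assuming Python 3 on the host and the /usr/zeppelin_dir to /workspace mapping from the example command, translates between the two; the function names are hypothetical and exist only to illustrate the mapping.

```python
from pathlib import PurePosixPath

# Both sides of the -v option in the example command (assumed values).
HOST_DIR = PurePosixPath("/usr/zeppelin_dir")
CONTAINER_DIR = PurePosixPath("/workspace")


def to_container_path(host_path):
    """Translate a host path under HOST_DIR to the path seen in the notebook."""
    # relative_to raises ValueError if host_path is outside HOST_DIR.
    relative = PurePosixPath(host_path).relative_to(HOST_DIR)
    return str(CONTAINER_DIR / relative)


def to_host_path(container_path):
    """Translate a path under CONTAINER_DIR back to its location on the host."""
    relative = PurePosixPath(container_path).relative_to(CONTAINER_DIR)
    return str(HOST_DIR / relative)


print(to_container_path("/usr/zeppelin_dir/user_data.csv"))  # /workspace/user_data.csv
print(to_host_path("/workspace/user_data2.csv"))  # /usr/zeppelin_dir/user_data2.csv
```

Files outside the attached directory are not visible inside the container, which is why the helpers raise ValueError for such paths.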

Update to Latest

Pull the latest image with the command:

sudo docker pull robinlin/zeppelin

Examples

For more examples, please refer to my GitHub.

Owner
robinlin
