Public Repository

Last pushed: 3 months ago
Short Description
A convenient docker container to run Jupyter notebooks with Spark on Yarn.
Full Description

The source code can be found on GitHub at: https://github.com/francoisjehl/docker-jupyter-spark-on-yarn

Build it

Make sure you have docker installed.
Then build the image from within the folder.

docker build -t fjehl/docker-jupyter-spark-on-yarn .

Once built, you can run it, once you provide a set of configurations:

You need to supply 3 external volumes for it to run

  • Hadoop configuration files such as yarn-site.xml to be mounted in the host as /etc/hadoop/conf/*
  • krb5.conf to be mounted in the guest as /etc/krb5.conf
  • hive-site.xml to be mounted in the guest as /usr/local/spark/conf/hive-site.xml

The container also needs your Kerberos credentials as environment variables

  • KRB_LOGIN should be your fully qualified Kerberos login (j.doe@REALM.COM)
  • KRB_PASSWORD for your password

Finally, to be able to run without --net=host, we require your IP to be exposed to Spark so that the driver is reachable from yarn

  • DOCKER_HOST_IP should be hence set to the host IP

Then you can run it as follows:

docker run -ti \
  -v <path_containing_hadoop_conf>:/etc/hadoop \
  -v <krb5_conf_file>:/etc/krb5.conf \
  -v <hive_site_xml_file>:/usr/local/spark/conf/hive-site.xml \
  -e DOCKER_HOST_IP=<ip> \
  -e KRB_LOGIN=<user> \
  -e KRB_PASSWORD=<password> \
  fjehl/docker-jupyter-spark-on-yarn

No worries about credential expiration: the docker instance embeds k5start and keeps your ticket active.

Use it

Once the container is started, it will display a link in the console. Click on it, and enter Jupyter.
Start a R notebook.
You can test all the features (read from hive, train model, write to HDFS, with a script like the following).

# Connect to Yarn
library(sparklyr)
library(dplyr)
require(DBI)
sc <- spark_connect(master = "yarn-client")


# Get data from Hive
dbGetQuery(sc,'USE fjehl')
mtcars_df <- tbl(sc,"mtcars")

# Train a linreg model
mt_cars_partitions <- mtcars_df %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

linreg_model <- mt_cars_partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))

# Graph the results
library(plotly)
result <- select(sdf_predict(linreg_model, mt_cars_partitions$test), wt, cyl, mpg, prediction)
p <- plot_ly(collect(result), x = ~wt, y = ~cyl, z = ~mpg) %>%
  add_markers(name='actual') %>%
  add_markers(z=~prediction, name='prediction') %>%
  layout(scene = list(xaxis = list(title = 'Weight'),
                     yaxis = list(title = 'Cylinder'),
                     zaxis = list(title = 'Gross horsepower')))
embed_notebook(p)

# Write the results in HDFS
spark_write_csv(result, header=FALSE, path="hdfs://root/user/f.jehl/spark_results")
dbGetQuery(sc,"CREATE EXTERNAL TABLE IF NOT EXISTS fjehl.results(
  wt double,
  cyl int,
  mpg double,
  prediction double)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION
  'hdfs://root/user/f.jehl/spark_results'")
dbGetQuery(sc,"SELECT * FROM fjehl.results")
Docker Pull Command
Owner
fjehl

Comments (0)