Public Repository

Last pushed: a year ago
Short Description
This is a basic Spark and Cassandra environment.
Full Description

Spark and Cassandra are the components of this distributed system; Docker makes the system much easier to deploy.

Docker Pull Command
docker pull mobilefzb/ubuntu-spark-cassandra-docker
Owner
mobilefzb

Comments (11)
mobilefzb
a year ago

Installed numpy and scipy in the environment.

mobilefzb
a year ago

# How to use spark-shell to connect to a Spark & Cassandra cluster

  1. Do not start spark-shell in localhost mode and create a SparkContext in the shell, because spark-shell creates a default SparkContext object when it starts (the default name is sc). Instead, pass the configuration to spark-shell through its arguments and let it register with the Spark master as an application (use --master to choose which Spark master to connect to). The following is a demo that starts spark-shell with the configuration and the Cassandra connector package:
    ./bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.0-M2-s_2.10 --master spark://10.10.120.197:7077 --conf spark.cassandra.connection.host=10.16.170.14 --conf spark.driver.allowMultipleContexts=true --conf spark.driver.host=10.10.120.173
  2. A demo of how to use the default SparkContext (named sc) to connect to Cassandra (it counts the records in the table test.kv):
    // import the connector's implicits, which add cassandraTable to SparkContext
    import com.datastax.spark.connector._
    // build an RDD backed by keyspace "test", table "kv"
    val rdd = sc.cassandraTable("test","kv")
    // count runs the job; print the number of rows
    println(rdd.count)
mobilefzb
a year ago

Use this image to start a Spark-Cassandra development environment:

start Cassandra

docker run -d --net="host" --env=CASSANDRA_START_FLAG=1 --env=MAX_HEAP_SIZE=2G --env=HEAP_NEWSIZE=400M -v $PWD/cassandra_data:/home/big_data/apache-cassandra-3.4/data -v $PWD/cassandra_tmp:/var/tmp mobilefzb/ubuntu-spark-cassandra-docker /bin/bash /home/big_data/system_bootstrap.sh

start spark master

docker run -d --net="host" --env=SPARK_MASTER_START_FLAG=1 --env=SPARK_MASTER_IP=$(ip addr show eth0 | grep -P -o '(?<=inet )(\d+\.){3}\d+') --env=SPARK_MASTER_OPTS=-Dspark.deploy.defaultCores=1 -v $PWD/spark_tmp:/home/big_data/spark-1.6.0/logs mobilefzb/ubuntu-spark-cassandra-docker /bin/bash /home/big_data/system_bootstrap.sh

start spark worker

docker run -d --net="host" --env=SPARK_WORKER_START_FLAG=1 --env=SPARK_WORKER_CORES=1 --env=SPARK_WORKER_MEMORY=2g --env=SPARK_M_URL=spark://$(ip addr show eth0 | grep -P -o '(?<=inet )(\d+\.){3}\d+'):7077 -v $PWD/spark_tmp:/var/tmp mobilefzb/ubuntu-spark-cassandra-docker /bin/bash /home/big_data/system_bootstrap.sh
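The master and worker commands embed an `ip addr | grep` pipeline (inside `$(...)` command substitution) to discover the host's eth0 IPv4 address. As a standalone sketch of that extraction (the function name is an illustration, not part of the image):

```shell
# Hypothetical helper: read `ip addr show` style output on stdin and
# print the first IPv4 address that follows an "inet " marker.
extract_ipv4() {
  grep -P -o '(?<=inet )(\d+\.){3}\d+' | head -n 1
}
```

For example, `ip addr show eth0 | extract_ipv4` would print something like 10.10.120.197. Note that `grep -P` requires GNU grep built with PCRE support.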

Then you can use spark-shell or spark-submit to execute your demo. ^_^
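As a sketch of that last step (the class and jar names below are illustrative assumptions, not files shipped with the image), a packaged job could be submitted to the same master with the same connector package:

```shell
# Hypothetical spark-submit invocation against the cluster started above;
# demo.CassandraCount and the jar path are placeholders for your own build.
./bin/spark-submit \
  --packages datastax:spark-cassandra-connector:1.6.0-M2-s_2.10 \
  --master spark://10.10.120.197:7077 \
  --conf spark.cassandra.connection.host=10.16.170.14 \
  --class demo.CassandraCount \
  target/scala-2.10/demo_2.10-0.1.jar
```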

mobilefzb
a year ago

sbt was installed in the image. ^_^

mobilefzb
a year ago

Python 3 and Python 2 have been integrated into the image. ^_^

mobilefzb
a year ago

The next step is to install Python in the image. ^_^

mobilefzb
a year ago

When docker is run with the '-d' argument, the command must not exit by itself if you want it to run as a daemon process. Many of us run a script to start the docker environment; in that case, please add a loop at the end of the script or run your program in the foreground.
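The "add a loop at the end" pattern can be sketched as a small keep-alive function (an illustration of the idea, not the image's actual bootstrap code; the optional bound and interval arguments exist only so the sketch can be exercised):

```shell
# keep_alive: block the script so a `docker run -d` container's main
# process never exits. With no arguments it loops forever; passing a
# positive count bounds the loop and prints how many iterations ran.
keep_alive() {
  max="${1:-0}"        # 0 = loop forever (real daemon-style use)
  interval="${2:-60}"  # seconds to sleep per iteration
  count=0
  while :; do
    sleep "$interval"
    count=$((count + 1))
    if [ "$max" -gt 0 ] && [ "$count" -ge "$max" ]; then
      break
    fi
  done
  echo "$count"
}
```

A startup script that backgrounds its services would end with a bare `keep_alive`; the alternative is to `exec` the main program in the foreground.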

mobilefzb
a year ago

docker run -d --net="host" --env=CASSANDRA_START_FLAG=1 -v $PWD/cassandra_data:/home/big_data/apache-cassandra-3.4/data mobilefzb/ubuntu-spark-cassandra-docker /bin/bash /home/big_data/system_bootstrap.sh

mobilefzb
a year ago

As of today, the bootstrap script can optionally start Cassandra and Spark based on environment parameters. The script checks these three parameters to determine whether each component should be started. However, there is one more thing to do next: the Cassandra data folder should not live inside the container; we need to mount a folder from outside Docker to keep the image unchangeable.
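The flag checks described above can be sketched as follows (a guess at the shape of system_bootstrap.sh, not its actual contents; the real start commands are stubbed out as comments):

```shell
# Sketch: each *_START_FLAG environment variable gates one component,
# so a single image can launch any combination of the three.
bootstrap() {
  started=""
  if [ "$CASSANDRA_START_FLAG" = "1" ]; then
    started="$started cassandra"      # e.g. apache-cassandra-3.4/bin/cassandra
  fi
  if [ "$SPARK_MASTER_START_FLAG" = "1" ]; then
    started="$started spark-master"   # e.g. spark-1.6.0/sbin/start-master.sh
  fi
  if [ "$SPARK_WORKER_START_FLAG" = "1" ]; then
    started="$started spark-worker"   # e.g. spark-1.6.0/sbin/start-slave.sh
  fi
  echo "started:$started"
}
```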

mobilefzb
a year ago

Added Cassandra 3.4 to the Docker image. The next step is to write a bootstrap script that starts Cassandra, the Spark master, and the Spark worker. All of them will be optional, so you can use one container to start a basic development environment or to start a cluster.