docker-spark-mongo

Docker image with Apache Spark integrated with MongoDB via PySpark and the Mongo Hadoop connector.

PySpark + Mongo Hadoop:

  • Ubuntu 16.04
  • Apache Spark 1.6.1
  • Mongo Hadoop 1.5.1
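
For orientation, here is a rough sketch of how an image with these pieces might be assembled. This is not the repository's actual Dockerfile; the base packages, download URLs, and install paths are assumptions.

# Hypothetical sketch only -- not this repository's actual Dockerfile
FROM ubuntu:16.04

# Spark needs a JRE; OpenJDK 8 and Python 2.7 are assumptions here
RUN apt-get update && apt-get install -y openjdk-8-jre-headless python python-pip curl

# Apache Spark 1.6.1, prebuilt for Hadoop 2.6 (an assumption)
RUN curl -sL https://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz \
    | tar -xz -C /usr/local \
 && ln -s /usr/local/spark-1.6.1-bin-hadoop2.6 /usr/local/spark
ENV PATH /usr/local/spark/bin:$PATH

# Mongo Hadoop 1.5.1 connector jar from Maven Central
RUN curl -sL -o /usr/local/lib/mongo-hadoop-spark-1.5.1.jar \
    https://repo1.maven.org/maven2/org/mongodb/mongo-hadoop/mongo-hadoop-spark/1.5.1/mongo-hadoop-spark-1.5.1.jar

# pymongo from PyPI; pymongo_spark ships with the mongo-hadoop source tree
RUN pip install pymongo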

To see how it works, first run a MongoDB instance:

docker run -d --name mongo mongo:3.2

Then run this image, linking it to the mongo container:

docker run -i -t --link mongo:mongo josemyd/docker-spark-mongo /bin/bash

Finally, run pyspark as follows:

$ pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
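
${JARS} and ${SPARK_DRIVER_EXTRA_CLASSPATH} are expected to come pre-set inside the image and point pyspark at the Mongo Hadoop connector. Purely as an illustration (the paths below are assumptions, not the image's real layout), they might look like this:

# Hypothetical values -- run `env` inside the container for the real ones
export JARS=/usr/local/lib/mongo-hadoop-spark-1.5.1.jar
export SPARK_DRIVER_EXTRA_CLASSPATH=/usr/local/lib/mongo-hadoop-spark-1.5.1.jar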

Then check that it works:

import pymongo
import pymongo_spark

# 'mongo' resolves to the linked container started above
mongo_url = 'mongodb://mongo:27017/'

# Insert two test documents with plain pymongo
client = pymongo.MongoClient(mongo_url)
client.foo.bar.insert_many([
    {"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
client.close()

# activate() patches pyspark classes so RDDs can read from and write to MongoDB
pymongo_spark.activate()

# Read the documents back through the Mongo Hadoop connector
rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
    .map(lambda doc: (doc.get('x'), doc.get('y'))))
rdd.collect()

## [(1.0, -1.0), (0.0, 4.0)]
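
pymongo_spark also patches RDDs with a save method, so results can be written back to MongoDB from the same session. A minimal sketch, where the output collection foo.results and the transformation are illustrative, not part of the image:

# Continuing in the same pyspark session
squared = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
    .map(lambda doc: {'x': doc['x'], 'y_squared': doc['y'] ** 2}))

# saveToMongoDB() is added to RDDs by pymongo_spark.activate()
squared.saveToMongoDB('{0}foo.results'.format(mongo_url))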

Forked from zero323, with reference to Stack Overflow. Fixed some problems with environment variables and Java.

Reference:
mongo-hadoop: https://github.com/mongodb/mongo-hadoop
