
Last pushed: 5 months ago
Short Description
Global news munging using GDELT data, Apache Spark and D3
Full Description

Container 1: Spark master (Spark 2.1)
Container 2: Spark worker
Container 3: spark-notebook
Container 4: revealjs/D3

This docker-compose file assembles a Spark data munging environment that can be run from a laptop. The notebook "10_munge" connects to the Spark master, which distributes the workload to the Spark worker. Additional workers can be added. Data is processed using dataframes and Spark SQL.

The revealjs webserver (grunt) in the D3 container can be used to visualize results. This part remains incomplete: a forthcoming container release will include an example of how to generate a JSON result set to feed into D3.

The data is near-realtime global news (about 15 minutes of latency):
http://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/
CSV files are transferred to the worker container and imported into a dataframe. The notebook "10_munge" contains the Scala commands to do this for one file. More to come.
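
As a rough illustration of the import step (this is a sketch, not the contents of "10_munge"), the following Scala reads one file into a dataframe and queries it with Spark SQL. The master URL is taken from the compose file below. The file name is hypothetical. The sketch also assumes the file is visible at the same path to both the driver and the worker, and that the file is tab-delimited with no header row, as GDELT 2.0 exports are despite their .CSV extension.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: the master is reachable at the hostname and port
// declared in the compose file (master:7077).
val spark = SparkSession.builder()
  .appName("gdelt-munge-sketch")
  .master("spark://master:7077")
  .getOrCreate()

// Hypothetical file name: substitute a real GDELT export placed in /data.
val path = "/data/gdelt_sample.export.CSV"

// Read the file as tab-separated text without a header row.
// Columns are named _c0, _c1, ... and typed as strings by default.
val events = spark.read
  .option("sep", "\t")
  .option("header", "false")
  .csv(path)

// Register the dataframe so it can be queried with Spark SQL.
events.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS n FROM events").show()
```

If the notebook already provides a Spark session, getOrCreate() returns that session and the master setting above is not applied.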

The D3 visualization (not yet completed) will demonstrate how to use Spark SQL to visualize a complex crosstab of tabular data as a force-directed graph of news item topics, ranked by tone (very negative to very positive) across various selected news organizations (BBC, Fox, NBC, NYT etc). It will depict the trend in tone for various actors (WHITEHOUSE, SYRIA, GOOGLE etc) as news items are published by each organization.
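
One possible shape for such a crosstab, sketched in Scala under loud assumptions: the column names (actor, source, tone) are hypothetical stand-ins and not real GDELT field names, and the rows are made up purely to show the layout. The sketch runs in local mode so it does not depend on the cluster. It is not the planned visualization code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder()
  .appName("tone-crosstab-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up sample rows (actor, news source, tone); not real GDELT data.
val items = Seq(
  ("SYRIA", "BBC", -4.0),
  ("SYRIA", "NYT", -2.0),
  ("GOOGLE", "BBC", 1.0),
  ("GOOGLE", "NYT", 3.0)
).toDF("actor", "source", "tone")

// Crosstab: one row per actor, one column per news source,
// each cell holding the average tone for that pair.
val crosstab = items
  .groupBy("actor")
  .pivot("source")
  .agg(avg("tone"))
  .orderBy("actor")

crosstab.show()

// Each row serialized as a JSON string, a form that a D3 page could load.
crosstab.toJSON.collect().foreach(println)
```

A force-directed graph would still need these rows reshaped into nodes and links on the D3 side. That step is left out here.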

Spark Notebook: http://localhost:9001/
Spark Master: http://localhost:8080/

Docker Compose:
version: "2"
services:
  master:
    volumes:
      - .:/data
    image: singularities/spark
    command: start-spark master
    hostname: master
    ports:
      - "6066:6066"
      - "7077:7077"
      - "7070:7070"
      - "8080:8080"
      - "50070:50070"
  worker:
    volumes:
      - .:/data
    image: singularities/spark
    command: start-spark worker master
    hostname: worker_1
    environment:
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 3g
    links:
      - master
  notebook:
    image: markteehan/spark-notebook-gdelt:8
    hostname: notebook
    depends_on:
      - master
    ports:
      - "9000:9000"
      - "9001:9001"
      - "4040:4040"
      - "4041:4041"
      - "4042:4042"
      - "4043:4043"
      - "4044:4044"
      - "4045:4045"
    links:
      - master
      - worker
  d3:
    volumes:
      - .:/data
      - .:/revealjs/talks
    image: gamsd/revealjs
    hostname: d3
    links:
      - master
      - worker
      - notebook
    ports:
      - "8000:8000"

Visualization:
https://www.dropbox.com/s/6zizik7m6syhbsx/ezgif.com-gif-maker.mp4?dl=0

Owner
markteehan