Public | Automated Build

Last pushed: 2 years ago
Short Description
Creates a jsonwikipedia: a JSON-per-line version of a Wikipedia XML dump
Full Description


Project built around :

This Docker image converts an uncompressed Wikipedia XML dump into a JSON-per-line file (one JSON record per line). It uses Spark to parallelise the process, so the image explicitly sets the minimum RAM to 20G; it won't start if that much memory is unavailable.
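Since the container refuses to start without enough memory, it can help to check the host first. The sketch below is only an illustration: it assumes a Linux host (it reads /proc/meminfo) and treats "20G" as 20 GiB. The function names are made up for this example, and how the image itself enforces the limit is not shown here.

```shell
#!/bin/sh
# Assumption: "20G" is read as 20 GiB, expressed in kB as /proc/meminfo does.
REQUIRED_KB=$((20 * 1024 * 1024))

# Print the MemTotal value (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo, which exists on Linux hosts only.
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Succeed when the given amount of memory (in kB) meets the requirement.
has_enough_memory() {
    [ "$1" -ge "$REQUIRED_KB" ]
}

# Example use (left commented out so the file can be sourced safely):
# has_enough_memory "$(mem_total_kb)" || echo "less than 20G of RAM" >&2
```

Note that MemTotal is usually a little below the installed RAM, so a machine with exactly 20 GiB fitted may fail this check.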

The image:

  • Checks out json-wikipedia's master branch
  • Builds the jar

Arguments:

  • -input: path to the uncompressed Wikipedia XML dump
  • -output: path where the jsonwikipedia file will be generated


docker run -v /tmp:/tmp -v /mnt:/mnt -i -t idio/jsonwikipedia -input /mnt/enwiki-20150602-pages-articles.xml -output /mnt/enwiki.json -lang en -action export-parallel
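Once the run finishes, a quick look at the output can confirm it has the expected JSON-per-line shape. The helpers below are a rough sketch with hypothetical names. They assume each record is a JSON object on its own line, and they only check the first and last character of each line without parsing the JSON.

```shell
#!/bin/sh
# Count the records in a JSON-per-line file (one record per line).
# A final line without a trailing newline is not counted by wc -l.
count_records() {
    wc -l < "$1" | tr -d ' '
}

# Crude sanity check: succeed only if the file is non-empty and every
# line starts with "{" and ends with "}". This does not parse the JSON.
looks_like_json_lines() {
    [ -s "$1" ] && ! grep -qv '^{.*}$' "$1"
}

# Example use (left commented out so the file can be sourced safely):
# looks_like_json_lines /mnt/enwiki.json && count_records /mnt/enwiki.json
```

On a full dump the grep pass reads the whole file, so it may take a while.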


Your host's /tmp folder, which the command above mounts into the container, should have enough free disk space.
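The free space under /tmp can be checked before starting. This is a minimal sketch with made-up function names; the amount actually needed is not stated here, so any threshold passed to has_free_kb is your own estimate.

```shell
#!/bin/sh
# Print the free space, in kB, on the filesystem that holds the given path.
# Uses df's POSIX output format (-P) with 1024-byte units (-k); the fourth
# column of the second line is the available space.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed when the path has at least the given number of kB free.
has_free_kb() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Example use with a hypothetical 50 GiB threshold (commented out):
# has_free_kb /tmp $((50 * 1024 * 1024)) || echo "low space in /tmp" >&2
```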
