Public Repository

Last pushed: 2 years ago
Short Description
Extract a random sample from a Wikipedia XML dump
Full Description

If you need to extract samples of different sizes from a Wikipedia XML dump but don't want to deal with XML parsing, this little tool is just what you need. Pass it the path to the dump followed by one or more sample sizes:

docker run -i -v ~/folder-with-xml-dump/:/work idio/wikistats-split-wiki-dump /work/enwiki-20150602-pages-articles.xml 10 20 40

This will produce the following files in ~/folder-with-xml-dump/, one per requested sample size:

enwiki-20150602-pages-articles.xml.sample-10
enwiki-20150602-pages-articles.xml.sample-20
enwiki-20150602-pages-articles.xml.sample-40
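
For readers curious how pages can be sampled from a dump without a full XML parser, here is a minimal Python sketch. It is not the code this image runs, and it does not claim to reproduce the image's output format. It assumes that each article sits between a `<page>` line and a `</page>` line (as in standard Wikipedia dumps), that the sample size is a number of pages, and that reservoir sampling is an acceptable way to pick them. The function name `sample_pages` and the tiny in-memory dump are hypothetical and exist only for illustration.

```python
import io
import random


def sample_pages(lines, k, seed=None):
    """Return up to k randomly chosen <page>...</page> blocks from an iterable of lines.

    Uses reservoir sampling, so the dump is read once and only k pages
    are held in memory at any time.
    """
    rng = random.Random(seed)
    reservoir = []   # holds at most k page blocks
    seen = 0         # number of complete pages read so far
    current = None   # lines of the page being read, or None when outside a page

    for line in lines:
        stripped = line.strip()
        if current is None:
            if not stripped.startswith("<page>"):
                continue  # skip header, siteinfo, and footer lines
            current = []
        current.append(line)
        if stripped.endswith("</page>"):
            seen += 1
            page = "".join(current)
            current = None
            if len(reservoir) < k:
                reservoir.append(page)
            else:
                # Keep the new page with probability k / seen
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = page
    return reservoir


# Illustration on a tiny made-up dump (assumption: real dumps follow the same line layout)
demo_dump = "<mediawiki>\n" + "".join(
    f"  <page>\n    <title>Article {i}</title>\n  </page>\n" for i in range(5)
) + "</mediawiki>\n"

for page in sample_pages(io.StringIO(demo_dump), 2, seed=42):
    print(page, end="")
```

To try this on a real dump, an open file object can be passed in place of the `io.StringIO` wrapper, since the function only iterates over lines.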
Docker Pull Command
docker pull idio/wikistats-split-wiki-dump
Owner
idio