
Last pushed: 10 months ago
Short Description
Place your data in folders, one per label. Mount those folders into this container. Train. Win.
Full Description


Build mahout from source in a docker container


After logging into docker:

docker run -it -v <path_to_corpus>:/data/corpus:ro borromeotlhs/docker-mahout /bin/bash
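
A minimal sketch of what the host side might look like before running the command above. It assumes a hypothetical corpus under `./corpus` with two made-up labels, `spam` and `ham`; those names and the sample files are illustrative only and do not come from this image. The path is resolved to an absolute one because `docker run -v` treats a host path that is not absolute as a volume name rather than a bind mount. The script only prints the resulting command; it does not start the container.

```shell
#!/bin/sh
# Hypothetical corpus location and labels; substitute your own.
CORPUS_DIR="${CORPUS_DIR:-./corpus}"

# One sub-directory per label, one plain-text document per file.
mkdir -p "$CORPUS_DIR/spam" "$CORPUS_DIR/ham"
printf 'win a free prize now\n'     > "$CORPUS_DIR/spam/doc1.txt"
printf 'meeting moved to tuesday\n' > "$CORPUS_DIR/ham/doc1.txt"

# Resolve to an absolute path so docker performs a bind mount.
ABS_CORPUS="$(cd "$CORPUS_DIR" && pwd)"

# Report how many documents each label directory holds.
for label_dir in "$ABS_CORPUS"/*/; do
  count=$(find "$label_dir" -type f | wc -l | tr -d ' ')
  echo "$(basename "$label_dir"): $count document(s)"
done

# Print (not run) the docker command with the path filled in.
DOCKER_CMD="docker run -it -v $ABS_CORPUS:/data/corpus:ro borromeotlhs/docker-mahout /bin/bash"
echo "$DOCKER_CMD"
```

Copying the printed line into a terminal should drop you into the container with the corpus mounted read-only at /data/corpus.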

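Assuming you've done the above, and that your corpus is segmented into self-labeled directories (one directory per label), you can train a Naive Bayes classifier on it. As written, the commands below train the standard variant; mahout's trainnb and testnb take a '-c' flag to train and test the Complementary Naive Bayes variant instead. Convert the corpus to sequence files, then to TF-IDF vectors, split those into training and test sets, and train: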

$ /usr/local/mahout/bin/mahout seqdirectory \
-i /data/corpus \
-o /data/corpus-seq \
-xm sequential

$ /usr/local/mahout/bin/mahout seq2sparse \
-i /data/corpus-seq \
-o /data/corpus-vectors \
-wt tfidf \
-ng 3 \
-n 2 \
--maxDFPercent 85

$ /usr/local/mahout/bin/mahout split \
-i /data/corpus-vectors/tfidf-vectors \
--trainingOutput /data/corpus-train-vectors \
--testOutput /data/corpus-test-vectors \
--randomSelectionPct 40 \
--overwrite --sequenceFiles -xm sequential

$ /usr/local/mahout/bin/mahout trainnb \
-i /data/corpus-train-vectors \
-o /data/model \
-li /data/labelindex

(The above command line stores the labels, via the '-li' option, in /data/labelindex. The '-el' option, which tells mahout to extract labels from the input, is not included in the command as shown; whether it is needed or accepted depends on the mahout version.
You could, alternatively, use the '-l' option to provide your own csv of labels to use on the input.)

The trained model and label index can then be tested with:

$ /usr/local/mahout/bin/mahout testnb \
-i /data/corpus-test-vectors \
-m /data/model \
-l /data/labelindex \
-o /data/corpus-testing
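
The five steps above can be chained into one script. What follows is a sketch only, reusing the commands and paths from this page unchanged. The names `run` and `train_and_test` and the `DRY_RUN` variable are hypothetical, introduced here for illustration; they are not part of mahout or of this image. With `DRY_RUN=1` each command is printed instead of executed, which is a cheap way to check the flags before committing to a long run.

```shell
#!/bin/sh
# Sketch: chains the commands from this page.
# MAHOUT defaults to the binary path used above.
MAHOUT="${MAHOUT:-/usr/local/mahout/bin/mahout}"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the five steps; stops at the first failure.
train_and_test() {
  run "$MAHOUT" seqdirectory -i /data/corpus -o /data/corpus-seq -xm sequential &&
  run "$MAHOUT" seq2sparse -i /data/corpus-seq -o /data/corpus-vectors \
      -wt tfidf -ng 3 -n 2 --maxDFPercent 85 &&
  run "$MAHOUT" split -i /data/corpus-vectors/tfidf-vectors \
      --trainingOutput /data/corpus-train-vectors \
      --testOutput /data/corpus-test-vectors \
      --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential &&
  run "$MAHOUT" trainnb -i /data/corpus-train-vectors -o /data/model \
      -li /data/labelindex &&
  run "$MAHOUT" testnb -i /data/corpus-test-vectors -m /data/model \
      -l /data/labelindex -o /data/corpus-testing
}
```

Inside the container, after sourcing the script, `DRY_RUN=1 train_and_test` previews the commands and a plain `train_and_test` runs them.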


You tell me ;)


Comments (2)
9 months ago

now updated to:
mahout:0.12.2 and hadoop 2.7.3

3 years ago

No longer built from source, mahout v0.9 and hadoop 2.2.0 tarballs are downloaded and runnable in this image.