Get up and running quickly with Nutch on Docker.
What is Nutch?
Apache Nutch is a highly extensible and scalable open source web crawler software project.
Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.
The current configuration of this image consists of the following components:
- Nutch 1.x
You may need to alias docker to "docker --tls" if you see errors such as:
2015/04/07 09:19:56 Post http://192.168.59.103:2376/v1.14/containers/create?name=NutchContainer: malformed HTTP response "\x15\x03\x01\x00\x02\x02\x16"
The easiest way to do this:
alias docker="docker --tls"
Build from files in this directory:
boot2docker up
$(boot2docker shellinit | grep export)
docker build -t apache/nutch .
Start up an image and attach to it:
docker run -t -i -d --name nutchcontainer apache/nutch /bin/bash
docker attach --sig-proxy=false nutchcontainer
Nutch is located in ~/nutch and is almost ready to run.
You will need to set seed URLs and update the configuration with your crawler's Agent Name.
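The seed-URL step above can be sketched as follows. This is a minimal sketch under assumptions: inside the container the commands would run under ~/nutch (here they use the current directory), the seed URL is just an example, and http.agent.name is the standard Nutch property for the crawler's Agent Name.

```shell
# Minimal pre-crawl setup (sketch; run from ~/nutch inside the container).
# Create a seed list with one example URL.
mkdir -p urls
echo "https://nutch.apache.org/" > urls/seed.txt

# The Agent Name goes into conf/nutch-site.xml via the http.agent.name
# property, e.g.:
#   <property>
#     <name>http.agent.name</name>
#     <value>MyCrawler</value>
#   </property>
cat urls/seed.txt
```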
For additional "getting started" information, check out the Nutch Tutorial.
All of the recent builds fail with errors; the last successful build with Nutch 2.x is from a year ago.
With that release I have problems stopping the container: after a docker stop, it cannot be started again, because it repeatedly prompts:
Re-format filesystem in Storage Directory /home/hduser/data/hadoop/nn ? (Y or N)
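That prompt suggests HDFS is trying to re-format the namenode storage directory on each start. One hypothetical workaround (untested assumption, not from the image's docs) is to persist that directory in a Docker volume so the formatted filesystem survives restarts; the path below is taken from the error message, and the image and container names are the ones used earlier in this README.

```shell
# Hypothetical workaround: keep the Hadoop data directory in a named
# volume so the namenode is not re-formatted on every container start.
docker run -t -i -d \
  --name nutchcontainer \
  -v nutch-hadoop-data:/home/hduser/data/hadoop \
  apache/nutch /bin/bash
```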
I have an issue with the pluggable indexing, notably the elastic indexer - is the Apache Jira space the best place to raise it as a concern?
I have entered the Elasticsearch host/cluster/index details in all three of the following config files, but none seem to be taken into account:
- nutch/conf/nutch-default.xml
- nutch/conf/nutch-site.xml
- nutch/conf/elasticsearch.conf
An IO error is encountered when attempting to use the elastic indexer plugin, which appears to be due to the missing Elasticsearch config:
root@660cd713153a:~/nutch# bin/nutch index crawl/crawldb -all
The input path at -all is not a segment... skipping
Indexer: starting at 2017-05-23 15:04:02
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticIndexWriter
	elastic.cluster : elastic prefix cluster
	elastic.host : hostname
	elastic.port : port
	elastic.index : elastic index command
	elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
	elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
Indexer: java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
	at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
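For what it's worth, the property names the ElasticIndexWriter lists in the log above can be set in nutch/conf/nutch-site.xml. A sketch follows; the property names come from the log, but all of the values are placeholders to be replaced with your own cluster details, and whether this fixes the job failure is not confirmed.

```xml
<!-- Sketch: ElasticIndexWriter settings for nutch/conf/nutch-site.xml.
     Property names are from the indexer log; values are placeholders. -->
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
</property>
<property>
  <name>elastic.index</name>
  <value>nutch</value>
</property>
```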
I think the docs need housekeeping: the referenced image tag doesn't exist.
core@ElasticSearch ~ $ docker pull nutch/nutch_with_hbase_hadoop
Using default tag: latest
Pulling repository docker.io/nutch/nutch_with_hbase_hadoop
Error: image nutch/nutch_with_hbase_hadoop:latest not found
Thank you, Apache!