What is Nutch?
Apache Nutch is a highly extensible and scalable open source web crawler.
Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.
The current configuration of this image consists of the following components:
- Apache Hadoop 2.5.1
- Apache HBase 0.98.8-hadoop2
- Apache Nutch 2.X HEAD (this ensures you are always running off the bleeding edge)
1. Install Docker.
2a. Download automated build from public hub registry
docker pull nutch/nutch_with_hbase_hadoop
2b. Build from files in this directory:
boot2docker up
$(boot2docker shellinit)
docker build -t <new name for image> .
Start the image and enter a shell. The first command starts the container in the background and captures its container id (view its logs with docker logs $IMAGE_PID); the second opens a shell inside it.
IMAGE_PID=$(docker run -d -t nutch_with_hbase_hadoop)
docker exec -i -t $IMAGE_PID bash
Nutch is located in /opt/nutch/ and is almost ready to run.
Review the configuration in /opt/nutch/conf/, then you can start crawling.
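One setting worth checking before the first crawl: Nutch refuses to fetch until http.agent.name is set. A minimal sketch of an override in /opt/nutch/conf/nutch-site.xml (the crawler name below is just an example value):

```xml
<?xml version="1.0"?>
<!-- /opt/nutch/conf/nutch-site.xml: overrides nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Any descriptive crawler name; "MyNutchCrawler" is a placeholder. -->
    <value>MyNutchCrawler</value>
  </property>
</configuration>
```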
echo 'http://nutch.apache.org' > seed.txt
/opt/nutch/bin/nutch inject seed.txt
/opt/nutch/bin/nutch generate -topN 10   # this will return a batchId
/opt/nutch/bin/nutch fetch <batchId>
/opt/nutch/bin/nutch parse <batchId>
/opt/nutch/bin/nutch updatedb <batchId>
[...]
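The generate/fetch/parse/updatedb cycle above can be wrapped in a small helper. This is only a sketch: crawl_round is my own name, the NUTCH path assumes this image's layout, and the batch-id parsing assumes generate prints the id on a line containing "batch id" (the exact log wording may differ between Nutch versions).

```shell
#!/bin/bash
# Sketch: run one round of the Nutch 2.x crawl cycle.
# NUTCH points at the nutch launcher; override it for testing.
NUTCH="${NUTCH:-/opt/nutch/bin/nutch}"

crawl_round() {
  local batch_id
  # Take the last field of the "batch id" log line as the new batch's id.
  batch_id=$("$NUTCH" generate -topN 10 | awk '/batch id/ {id = $NF} END {print id}')
  "$NUTCH" fetch "$batch_id"
  "$NUTCH" parse "$batch_id"
  "$NUTCH" updatedb "$batch_id"
}
```

Calling crawl_round in a loop then gives a multi-round crawl without copying batch ids by hand.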
I have seen that all the recent builds fail with errors. The last successful build with Nutch 2.X is from a year ago.
With that release I have problems stopping the container. When I stop it with docker stop, I cannot start it again, since it repeatedly asks:
Re-format filesystem in Storage Directory /home/hduser/data/hadoop/nn ? (Y or N)
I have an issue with the pluggable indexing, notably the elastic indexer. Is the Apache JIRA space the best place to raise it as a concern?
I have entered the elastic host/cluster/index details in all three of the following config files, but none of them seem to be taken into account:
nutch/conf/nutch-default.xml
nutch/conf/nutch-site.xml
nutch/conf/elasticsearch.conf
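For what it's worth, the indexer plugin reads its settings from the Nutch configuration (nutch-site.xml overriding nutch-default.xml), not from elasticsearch.conf, as far as I can tell. A sketch of the relevant properties for nutch-site.xml, using the property names the indexer itself prints; all values below are placeholders for your own host, port, cluster name, and index name:

```xml
<!-- Example additions to nutch/conf/nutch-site.xml; values are placeholders. -->
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
</property>
<property>
  <name>elastic.index</name>
  <value>nutch</value>
</property>
```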
An IO error is encountered when attempting to use the elastic indexer plugin, which would appear to be due to the missing elastic config.
root@660cd713153a:~/nutch# bin/nutch index crawl/crawldb -all
The input path at -all is not a segment... skipping
Indexer: starting at 2017-05-23 15:04:02
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticIndexWriter
	elastic.cluster : elastic prefix cluster
	elastic.host : hostname
	elastic.port : port
	elastic.index : elastic index command
	elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
	elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
Indexer: java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
	at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
I think the docs need housekeeping: the documented image tag doesn't exist.
core@ElasticSearch ~ $ docker pull nutch/nutch_with_hbase_hadoop
Using default tag: latest
Pulling repository docker.io/nutch/nutch_with_hbase_hadoop
Error: image nutch/nutch_with_hbase_hadoop:latest not found
Thank you, Apache!