What is Nutch?
Apache Nutch is a highly extensible and scalable open source web crawler software project.
Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster
Current configuration of this image consists of components:
- Apache Hadoop 2.5.1
- Apache HBase 0.98.8-hadoop2
- Apache Nutch 2.X HEAD (this will ensure that you are always running off of bleeding edge)
- Install Docker.
2a. Download automated build from public hub registry
docker pull nutch/nutch_with_hbase_hadoop
2b. Build from files in this directory:
$(boot2docker shellinit) docker build -t <new name for image> .
boot2docker up $(boot2docker shellinit)
Start an image and enter shell. First command will start image and will print on stdout standard logs.
IMAGE_PID=$(docker run -i -t nutch_with_hbase_hadoop) docker exec -i -t $IMAGE_PID bash
Nutch is located in /opt/nutch/ and is almost ready to run.
Review configuration in /opt/nutch/conf/ and you can start crawling.
echo 'http://nutch.apache.org' > seed.txt /opt/nutch/bin/nutch inject seed.txt /opt/nutch/bin/nutch generate -topN 10 -- this will return batchId /opt/nutch/bin/nutch fetch <batchId> /opt/nutch/bin/nutch parse <batchId> /opt/nutch/bin/nutch updatedb <batchId> [...]