Last pushed: 3 months ago
Short Description
Apache Nutch
Full Description

Nutch Dockerfile

Get up and running quickly with Nutch on Docker.

What is Nutch?

Apache Nutch is a highly extensible and scalable open-source web crawler.

Nutch can run on a single machine, but it gains much of its strength from running on a Hadoop cluster.

Docker Image

This image currently consists of the following components:

  • Nutch 1.x

Base Image

Tips

You may need to alias docker to "docker --tls" if you see errors such as:

2015/04/07 09:19:56 Post http://192.168.59.103:2376/v1.14/containers/create?name=NutchContainer: malformed HTTP response "\x15\x03\x01\x00\x02\x02\x16"

The easiest way to do this is to define the alias in your shell:

  1. alias docker="docker --tls"
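
An alias defined this way lasts only for the current shell session. As a minimal sketch, assuming bash with ~/.bashrc as its startup file (the PROFILE variable is a hypothetical name used only for this example), the alias can be made persistent like this:

```shell
# Assumption: bash reads ~/.bashrc at startup; set PROFILE for other shells.
PROFILE="${PROFILE:-$HOME/.bashrc}"
ALIAS_LINE='alias docker="docker --tls"'

# Make sure the file exists, then append the alias only if it is not
# already there, so running this more than once is harmless.
touch "$PROFILE"
grep -qxF "$ALIAS_LINE" "$PROFILE" || printf '%s\n' "$ALIAS_LINE" >> "$PROFILE"
```

Open a new shell, or source the file, for the alias to take effect.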

Installation

  1. Install Docker.

  2. Build from files in this directory:

    $(boot2docker shellinit | grep export)
    docker build -t apache/nutch .

Usage

Start Docker

boot2docker up
$(boot2docker shellinit | grep export)

Start a container from the image and attach to it

docker run -t -i -d --name nutchcontainer apache/nutch /bin/bash
docker attach --sig-proxy=false nutchcontainer

Nutch is located in ~/nutch and is almost ready to run.
You will need to set seed URLs and update the configuration with your crawler's agent name.
For additional "getting started" information, check out the Nutch Tutorial.
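
The two remaining steps can be sketched as follows. This is only an illustration that follows the conventions of the Nutch 1.x tutorial (a seed list in urls/seed.txt and the http.agent.name property in conf/nutch-site.xml). The seed URL and the agent name MyNutchCrawler are placeholders, and NUTCH_HOME is a hypothetical variable used here to point at ~/nutch.

```shell
#!/bin/sh
set -eu

# Assumption: Nutch lives in ~/nutch, as stated above.
NUTCH_HOME="${NUTCH_HOME:-$HOME/nutch}"
SITE="$NUTCH_HOME/conf/nutch-site.xml"
mkdir -p "$NUTCH_HOME/urls" "$NUTCH_HOME/conf"

# Seed list: one URL per line. The URL below is only an example.
printf '%s\n' 'http://nutch.apache.org/' > "$NUTCH_HOME/urls/seed.txt"

# Keep a copy of any existing site configuration before replacing it.
if [ -f "$SITE" ]; then
    cp "$SITE" "$SITE.bak"
fi

# Minimal site configuration that sets the crawler's agent name.
# MyNutchCrawler is a placeholder; choose a name that identifies your crawler.
cat > "$SITE" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>
EOF
```

Note that this replaces conf/nutch-site.xml outright. If the file already holds other settings, add the property block to it by hand instead.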

Owner
apache

Comments (4)
jlacasta
6 days ago

I have seen that all the recent builds are not completed due to errors. The last correct one with Nutch 2.X is from one year ago.
In that release I have problems stopping the container. When I stop it with docker stop, I cannot start it again since it continually shows:
Re-format filesystem in Storage Directory /home/hduser/data/hadoop/nn ? (Y or N)

jonjack
3 months ago

I have an issue with the pluggable indexing, notably the elastic indexer - is the Apache Jira space the best place to raise it as a concern?

I have entered the elastic host/cluster/index details in all 3 of the following config files but none seem to be taken into account:-

nutch/conf/nutch-default.xml
nutch/conf/nutch-site.xml
nutch/conf/elasticsearch.conf

IO Error encountered when attempting to use the elastic indexer plugin, which would appear to be due to missing elastic config.

root@660cd713153a:~/nutch# bin/nutch index crawl/crawldb -all
The input path at -all is not a segment... skipping
Indexer: starting at 2017-05-23 15:04:02
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticIndexWriter
    elastic.cluster : elastic prefix cluster
    elastic.host : hostname
    elastic.port : port
    elastic.index : elastic index command
    elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
    elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)

jonjack
3 months ago

I think the docs need housekeeping - tag doesn't exist.

core@ElasticSearch ~ $ docker pull nutch/nutch_with_hbase_hadoop
Using default tag: latest
Pulling repository docker.io/nutch/nutch_with_hbase_hadoop
Error: image nutch/nutch_with_hbase_hadoop:latest not found

azote
8 months ago

Thank you Apache !