
Last pushed: 5 months ago

Nutch Dockerfile

This directory contains a Dockerfile that builds a Docker image of Nutch 2.X.

What is Nutch?

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.

Docker Image

The current configuration of this image consists of the following components:

  • Apache Hadoop 2.5.1
  • Apache HBase 0.98.8-hadoop2
  • Apache Nutch 2.X HEAD (so you are always running the bleeding edge)

Base Image


1. Install Docker.

2a. Download the automated build from the public Docker Hub registry:

docker pull nutch/nutch_with_hbase_hadoop

2b. Or build the image from the files in this directory:

$(boot2docker shellinit)
docker build -t <new name for image> .


Start Docker:

boot2docker up
$(boot2docker shellinit)

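Start a container from the image and open a shell in it. The first command starts the container in the background and stores its ID; the standard logs that the image prints on stdout can then be read with docker logs. The second command opens a shell inside the running container.

# -d runs the container in the background and prints its ID
CONTAINER_ID=$(docker run -d -i -t nutch_with_hbase_hadoop)
# open an interactive shell in the running container
docker exec -i -t "$CONTAINER_ID" bash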
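
The two steps above can also be wrapped in small helper functions. This is a minimal sketch and not part of the image: the function names, the default container name nutch, and the DOCKER and IMAGE variables are assumptions made for illustration. It relies on docker run -d, which starts the container in the background and prints its ID.

```shell
# Hypothetical helpers; DOCKER can be overridden, for example for a dry run.
DOCKER=${DOCKER:-docker}
# Image name as used above; adjust it if you tagged your build differently.
IMAGE=${IMAGE:-nutch_with_hbase_hadoop}

# Start a detached container (named "nutch" by default) and print its ID.
start_nutch() {
  "$DOCKER" run -d -i -t --name "${1:-nutch}" "$IMAGE"
}

# Open an interactive shell in a running container, given its ID or name.
enter_nutch() {
  "$DOCKER" exec -i -t "${1:-nutch}" bash
}

# Follow the logs that the container writes to stdout.
nutch_logs() {
  "$DOCKER" logs -f "${1:-nutch}"
}
```

Calling start_nutch and then enter_nutch should give the same shell as the commands above, while nutch_logs shows the startup output in a second terminal.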

Nutch is located in /opt/nutch/ and is almost ready to run.
Review the configuration in /opt/nutch/conf/ and you can start crawling.
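
Two properties that a Nutch 2.X setup normally needs in nutch-site.xml are http.agent.name (the name the crawler announces to web servers) and storage.data.store.class (the Gora storage backend). Whether this image already sets them is not stated here, so the following is only a sketch of how the review could be partly automated. The function names are hypothetical, and the default file path assumes the configuration directory mentioned above.

```shell
# Hypothetical helper: succeeds if file $1 contains a <name>$2</name> entry.
has_property() {
  grep -qF "<name>$2</name>" "$1"
}

# Print every expected property that is missing from the config file.
# Returns 0 if all are present, 1 otherwise.
check_nutch_conf() {
  conf_file=${1:-/opt/nutch/conf/nutch-site.xml}
  missing=0
  for prop in http.agent.name storage.data.store.class; do
    if ! has_property "$conf_file" "$prop"; then
      echo "missing: $prop"
      missing=1
    fi
  done
  return "$missing"
}
```

The check only looks for the property names; it does not validate their values, so the file should still be read by hand.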

echo '<seed URL>' > seed.txt
/opt/nutch/bin/nutch inject seed.txt
/opt/nutch/bin/nutch generate -topN 10   # this will return the batchId
/opt/nutch/bin/nutch fetch <batchId>
/opt/nutch/bin/nutch parse <batchId>
/opt/nutch/bin/nutch updatedb <batchId>
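
The generate, fetch, parse and updatedb steps above form one crawl round that can be repeated. Below is a minimal sketch of a helper for that loop. It assumes that generate accepts a -batchId option, as the crawl script shipped with recent Nutch 2.X releases appears to use; if your build does not, read the batch ID from the generate output instead. The function names and the NUTCH variable are hypothetical.

```shell
# Path from the text above; override NUTCH to try the helper with a stub.
NUTCH=${NUTCH:-/opt/nutch/bin/nutch}

# Build a batch ID from the current time and the shell's process ID.
new_batch_id() {
  echo "$(date +%s)-$$"
}

# Run one generate/fetch/parse/updatedb round; $1 is the topN value.
# ASSUMPTION: "generate" accepts -batchId <id>.
crawl_round() {
  top_n=${1:-10}
  batch_id=$(new_batch_id)
  "$NUTCH" generate -topN "$top_n" -batchId "$batch_id" &&
  "$NUTCH" fetch "$batch_id" &&
  "$NUTCH" parse "$batch_id" &&
  "$NUTCH" updatedb "$batch_id"
}
```

After the seed list has been injected once, crawl_round 10 can be called repeatedly; each call stops at the first step that fails.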


For more information on Nutch 2.X, please see the tutorials and the Nutch 2.X wiki space.


Comments (4)
2 months ago

I have seen that none of the recent builds completed, due to errors. The last successful one with Nutch 2.X is from one year ago.
In that release I have problems stopping the container. When I stop it with docker stop, I cannot start it again since it continually shows:
Re-format filesystem in Storage Directory /home/hduser/data/hadoop/nn ? (Y or N)

5 months ago

I have an issue with the pluggable indexing, notably the elastic indexer - is the Apache Jira space the best place to raise it as a concern?

I have entered the elastic host/cluster/index details in all 3 of the following config files but none seem to be taken into account:-


IO Error encountered when attempting to use the elastic indexer plugin, which would appear to be due to missing elastic config.

root@660cd713153a:~/nutch# bin/nutch index crawl/crawldb -all
The input path at -all is not a segment... skipping
Indexer: starting at 2017-05-23 15:04:02
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
    elastic.cluster : elastic prefix cluster : hostname
    elastic.port : port
    elastic.index : elastic index command : elastic bulk index doc counts. (default 250)
    elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

Indexer: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(
    at org.apache.nutch.indexer.IndexingJob.index(
    at org.apache.nutch.indexer.IndexingJob.main(
5 months ago

I think the docs need housekeeping - tag doesn't exist.

core@ElasticSearch ~ $ docker pull nutch/nutch_with_hbase_hadoop
Using default tag: latest
Pulling repository
Error: image nutch/nutch_with_hbase_hadoop:latest not found

10 months ago

Thank you, Apache!