Apache Nutch 2 Docker Image (with HBase as storage layer)
Apache Nutch is a highly extensible and scalable open source web crawler. HBase is the storage layer in which Nutch keeps the links it has to crawl.
Build the image from the directory that contains the Dockerfile:
docker build --rm=true -t cogfor/nutch:2.3.1 .
Edit conf/nutch-site.xml to define the crawler and configure its storage.
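As an illustration of what that file can contain, here is a minimal sketch that writes a conf/nutch-site.xml for HBase-backed storage. It assumes the standard Nutch 2.x property names http.agent.name and storage.data.store.class; the agent name MyNutchCrawler is a hypothetical placeholder, and the file shipped with this image may already set these and other properties. Note that the command overwrites any existing conf/nutch-site.xml.

```shell
# Sketch only: create a minimal Nutch 2.x site configuration.
# "MyNutchCrawler" is a made-up agent name; replace it with your own.
mkdir -p conf
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Name the crawler announces to the sites it fetches. -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
  <!-- Store crawl data in HBase through Apache Gora. -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>
</configuration>
EOF
```

Pointing storage.data.store.class at a different Gora store class is the usual way to switch backends, though the matching Gora module also has to be available to Nutch.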
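You may want to mount your local conf directory as a volume. Docker needs an absolute host path for this; a bare name such as conf is treated as a named volume rather than a local directory:
docker run -it -v "$(pwd)/conf":/nutch-source/conf -v /data cogfor/nutch:2.3.1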
Although we use Apache HBase as the default storage, you are not limited to it. This image can be used independently of the bundled standalone HBase install.