A Docker container providing a web harvester for use with Social Feed Manager.

A wrapper around Heritrix for harvesting web content as part of Social Feed Manager.

As of SFM 1.12, sfm-elk is deprecated.


git clone
cd sfm-web-harvester
pip install -r requirements/requirements.txt

Note that requirements/requirements.txt references the latest release of sfm-utils.
If you are doing development on the interaction between sfm-utils and sfm-web-harvester,
use requirements/dev.txt. This uses a local copy of sfm-utils (../sfm-utils)
in editable mode.

Running as a service

The web harvester acts on harvest start messages received from a queue. To run it as a service:

python service <mq host> <mq username> <mq password> <heritrix url> <heritrix username> <heritrix password> <contact url>
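
Because the service takes seven positional arguments, it is easy to pass them in the wrong order. The following is a minimal sketch that assembles the arguments from a dictionary of settings. The helper name `build_service_args`, the setting keys, and the example values are all hypothetical; only the argument order is taken from the usage line above.

```python
# Order of the positional arguments after `service`, copied from the usage line.
SERVICE_ARG_ORDER = [
    "mq_host",
    "mq_username",
    "mq_password",
    "heritrix_url",
    "heritrix_username",
    "heritrix_password",
    "contact_url",
]


def build_service_args(settings):
    """Return `service` followed by its arguments in the documented order.

    Hypothetical helper: raises ValueError if any setting is missing or empty.
    """
    missing = [key for key in SERVICE_ARG_ORDER if not settings.get(key)]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    return ["service"] + [settings[key] for key in SERVICE_ARG_ORDER]


# Placeholder values for illustration only.
example_settings = {
    "mq_host": "mq.example.org",
    "mq_username": "mq_user",
    "mq_password": "mq_secret",
    "heritrix_url": "https://heritrix.example.org:8443",
    "heritrix_username": "heritrix_user",
    "heritrix_password": "heritrix_secret",
    "contact_url": "http://library.example.org/contact",
}

service_args = build_service_args(example_settings)
print(" ".join(service_args))
```

The resulting list can be appended to the `python` invocation shown above, for example when launching the service from another script.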

Processing harvest start files

The web harvester can also process harvest start files. A harvest start file has the same format as a harvest start message. To run:

python seed <path to file> <heritrix url> <heritrix username> <heritrix password> <contact url>
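
As a rough illustration, a harvest start file could be produced from Python as shown below. This sketch assumes the file is a JSON document; the field names and values are illustrative assumptions and not a definitive schema, so check them against the messaging specification. The helper `write_harvest_start_file` is hypothetical.

```python
import json
import os
import tempfile

# Illustrative harvest start message. Every field name and value here is an
# assumption for the example; the messaging specification is authoritative.
harvest_start = {
    "id": "example:1",
    "type": "web",
    "path": "/tmp/example-collection",
    "seeds": [
        {"token": "http://example.org/"},
    ],
}


def write_harvest_start_file(message, directory):
    """Write the message as JSON and return the path of the new file."""
    path = os.path.join(directory, "harvest_start.json")
    with open(path, "w") as handle:
        json.dump(message, handle, indent=2)
    return path


# Write to a scratch directory and read the file back to confirm it parses.
harvest_file = write_harvest_start_file(harvest_start, tempfile.mkdtemp())
with open(harvest_file) as handle:
    loaded = json.load(handle)
print(harvest_file)
```

The printed path is what would be passed as `<path to file>` in the command above.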

Integration tests (inside docker containers)

  1. Install Docker and Docker-Compose.
  2. Start up the containers.

     docker-compose -f docker/dev.docker-compose.yml up -d
  3. Run the tests.

     docker exec docker_sfmwebharvester_1 python -m unittest discover
  4. Shut down the containers.

     docker-compose -f docker/dev.docker-compose.yml kill
     docker-compose -f docker/dev.docker-compose.yml rm -v --force
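
Steps 2 through 4 can be scripted so that the containers are removed even when the tests fail. The following is a minimal sketch that uses only the commands listed above. The function name `run_integration_tests` and its `runner` parameter are hypothetical, and the container name is assumed to match the one in step 3.

```python
import subprocess

# Common prefix for the docker-compose commands listed in the steps above.
COMPOSE = ["docker-compose", "-f", "docker/dev.docker-compose.yml"]

# Step 3: run the unit tests inside the harvester container.
TEST_COMMAND = [
    "docker", "exec", "docker_sfmwebharvester_1",
    "python", "-m", "unittest", "discover",
]


def run_integration_tests(runner=subprocess.call):
    """Start the containers, run the tests, and always tear down.

    `runner` takes an argument list and returns an exit code. It defaults to
    subprocess.call and can be replaced with a stub for a dry run.
    Returns the exit code of the test command.
    """
    runner(COMPOSE + ["up", "-d"])
    try:
        return runner(TEST_COMMAND)
    finally:
        # Step 4: stop and remove the containers, even if the tests failed.
        runner(COMPOSE + ["kill"])
        runner(COMPOSE + ["rm", "-v", "--force"])
```

Calling `run_integration_tests()` with no arguments runs the real commands and requires Docker and Docker-Compose to be installed. Passing a stub as `runner` shows the command sequence without starting any containers.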

Harvest start messages

See the messaging specification for how to construct a harvest start message.
