A wrapper around Heritrix for harvesting web content as part of Social Feed Manager.
As of SFM 1.12, sfm-elk is deprecated.
git clone https://github.com/gwu-libraries/sfm-web-harvester.git
cd sfm-web-harvester
pip install -r requirements/requirements.txt
requirements/requirements.txt references the latest release of sfm-utils.
If you are doing development on the interaction between sfm-utils and sfm-web-harvester,
use requirements/dev.txt instead. This uses a local copy of sfm-utils in editable mode.
Running as a service
Web harvester will act on harvest start messages received from a queue. To run as a service:
python web_harvester.py service <mq host> <mq username> <mq password> <heritrix url> <heritrix username> <heritrix password> <contact url>
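The positional arguments above can also be assembled programmatically, for example when launching the harvester from a supervisor script. The following is a minimal sketch and not part of sfm-web-harvester: `build_service_command` is a hypothetical helper, and the host name, credentials, and URLs are placeholder values.

```python
import sys


def build_service_command(mq_host, mq_username, mq_password,
                          heritrix_url, heritrix_username, heritrix_password,
                          contact_url, python=sys.executable):
    """Return the argv list for running web_harvester.py in service mode.

    Hypothetical helper: it only mirrors the positional argument order
    documented above and does not validate the values.
    """
    return [
        python, "web_harvester.py", "service",
        mq_host, mq_username, mq_password,
        heritrix_url, heritrix_username, heritrix_password,
        contact_url,
    ]


# Placeholder values (assumptions); substitute your own.
command = build_service_command(
    "mq", "sfm_user", "password",
    "https://heritrix:8443", "sfm_user", "password",
    "http://example.org/contact",
)
# To actually start the service: subprocess.run(command, check=True)
```

Passing the arguments as a list, rather than one shell string, avoids quoting problems if a password contains spaces or shell metacharacters.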
Process harvest start files
Web harvester can process harvest start files. The format of a harvest start file is the same as a harvest start message. To run:
python web_harvester.py seed <path to file> <heritrix url> <heritrix username> <heritrix password> <contact url>
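A harvest start file could be produced with a few lines of Python. This is a sketch under stated assumptions: it assumes the file is JSON, and the field names and values shown (`id`, `type`, `path`, `seeds`, `collection_set`, `collection`) are illustrative only and should be checked against the messaging specification. `write_harvest_start_file` is a hypothetical helper.

```python
import json
import os
import tempfile


def write_harvest_start_file(directory, message):
    """Serialize a harvest start message to a JSON file and return its path."""
    filepath = os.path.join(directory, "harvest_start.json")
    with open(filepath, "w") as f:
        json.dump(message, f, indent=2)
    return filepath


# Illustrative message only: field names and values are assumptions,
# not copied from the messaging specification.
message = {
    "id": "test:1",
    "type": "web",
    "path": "/sfm-data/collection_set/test_collection/test_1",
    "seeds": [{"token": "http://example.org/"}],
    "collection_set": {"id": "test_collection_set"},
    "collection": {"id": "test_collection"},
}

seed_file = write_harvest_start_file(tempfile.mkdtemp(), message)
# seed_file can then be passed as <path to file> to the seed command.
```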
Integration tests (inside docker containers)
- Install Docker and Docker-Compose.
- Start up the containers.
  docker-compose -f docker/dev.docker-compose.yml up -d
- Run the tests.
  docker exec docker_sfmwebharvester_1 python -m unittest discover
- Shut down the containers.
  docker-compose -f docker/dev.docker-compose.yml kill
  docker-compose -f docker/dev.docker-compose.yml rm -v --force
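The steps above can be scripted so that the containers are always torn down, even when the tests fail. The following is a minimal sketch, not a script shipped with the project: `integration_test_commands` and `run_integration_tests` are hypothetical names, and the commands are the ones listed above.

```python
import subprocess

COMPOSE_FILE = "docker/dev.docker-compose.yml"
CONTAINER = "docker_sfmwebharvester_1"


def integration_test_commands(compose_file=COMPOSE_FILE, container=CONTAINER):
    """Return the documented commands as argv lists, keyed by step."""
    compose = ["docker-compose", "-f", compose_file]
    return {
        "up": compose + ["up", "-d"],
        "test": ["docker", "exec", container,
                 "python", "-m", "unittest", "discover"],
        "kill": compose + ["kill"],
        "rm": compose + ["rm", "-v", "--force"],
    }


def run_integration_tests(run=subprocess.call):
    """Start the containers, run the tests, and always clean up.

    `run` takes an argv list and returns an exit code; it defaults to
    subprocess.call and can be replaced for dry runs. Returns the exit
    code of the failing step, or of the test run.
    """
    commands = integration_test_commands()
    try:
        status = run(commands["up"])
        if status != 0:
            return status
        return run(commands["test"])
    finally:
        # Teardown runs whether or not the earlier steps succeeded.
        run(commands["kill"])
        run(commands["rm"])


# Usage (requires Docker and Docker-Compose):
#     exit_code = run_integration_tests()
```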
Harvest start messages
See the messaging specification for how to construct a harvest start message.