ukwa/warcprox
Run warcprox inside Docker
This project should allow proxy-based web archiving to be used on large-scale crawls by scaling it out behind a proxying load balancer. The load balancer attempts to route based on the URL, so that the same URLs are always routed to the same warcprox instance. This ensures deduplication works as expected without having to share state between the archiving proxies.
`hdr(host)`, `uri`, etc. (but not in TCP mode).

To experiment with scaling out, first clean out any existing machines:
$ docker-compose rm
Then define how many warcprox instances you want and ask for them to be configured:
$ docker-compose scale warcprox=3
Then, when you run
$ docker-compose up
the system will start up and configure an HAProxy instance that balances the load across all the warcprox instances. The provided configuration divides the load using `hdr(host)`, which sends all requests relating to a particular host to the same warcprox instance. This ensures that URL-based de-duplication can work effectively. Further experimentation with the load balancing parameters is recommended.
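To make the host-based routing idea concrete, here is a minimal Python sketch of the principle. It is an illustration only, not HAProxy's actual hashing algorithm: the function name `pick_backend`, the SHA-1 hash, and the backend names and port are all assumptions made for the example. The point is simply that the choice depends on the host alone, so every URL on a given host maps to the same instance.

```python
import hashlib
from urllib.parse import urlsplit


def pick_backend(url, backends):
    """Choose a backend from the URL's host only (hypothetical helper).

    Because the path and query are ignored, all URLs on one host are
    sent to the same backend, which is what keeps per-instance
    de-duplication effective.
    """
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical instance names, e.g. after scaling to three warcprox containers.
backends = ["warcprox_1:8000", "warcprox_2:8000", "warcprox_3:8000"]

for url in ("http://example.org/", "http://example.org/about", "http://example.com/"):
    print(url, "->", pick_backend(url, backends))
```

Note that with a simple modulo mapping like this one, changing the number of instances reassigns most hosts to different instances, which is one reason the load balancing parameters are worth experimenting with.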
[Memento-Datetime](https://github.com/mementoweb/timegate/wiki/HTTP-Response-Headers) to the response. Use that to indicate that the archiving should have worked, and then pass it along to another queue for checking later on. Alternatively (in case of collisions etc.), use a time-based UUID or similar as the `WARC-Record-ID` and add it in a separate `Warcprox-WARC-Record-ID:` header. That record ID can then be tracked, although this will require a new index rather than leveraging the CDX.

`tinycdxserver`.

$ docker pull ukwa/warcprox
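The time-based record ID idea can be sketched in a few lines of Python. This is only an assumption-laden illustration: the helper names `new_record_id` and `tracking_header` are hypothetical, and the `Warcprox-WARC-Record-ID` header is the one proposed in the text above, not something warcprox is known to emit or accept. The `<urn:uuid:...>` form follows the common convention for WARC record IDs.

```python
import uuid


def new_record_id():
    """Return a record ID built from a time-based (version 1) UUID.

    Hypothetical helper: formats the UUID in the <urn:uuid:...> style
    commonly used for WARC-Record-ID values.
    """
    return "<urn:uuid:{}>".format(uuid.uuid1())


def tracking_header(record_id):
    """Build the header proposed in the text as a name/value mapping."""
    return {"Warcprox-WARC-Record-ID": record_id}


record_id = new_record_id()
print(tracking_header(record_id))
```

A consumer could then store the ID alongside the requested URL and look it up later in whatever index is built for record IDs.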