Keeps track of our web archiving collection, manages indexing and calculates statistics.
Usable for basic collection management but some functionality is unfinished or unpolished.
Bamboo stores a catalogue of crawls, statistics and indexing progress in a H2 or MySQL database which can be configured
via the following environment variables.
BAMBOO_DB_URL=jdbc:h2:mem:bamboo BAMBOO_DB_USER=bamboo BAMBOO_DB_PASSWORD=bamboo
The following environment variables tell Bamboo where to look for Heritrix jobs.
Bamboo has optional, limited integration with the National Library of Australia's legacy PANDAS 3 collection management
system. It can be enabled by building with
mvn -Ppandas and setting the following environment variables:
PANDAS_DB_URL= PANDAS_DB_USER= PANDAS_DB_PASSWORD=
Watching directories for changed WARCs
Bamboo can be configured to watch a directory for newly created or updated WARCs that follow the *.warc.gz.open
convention used by Heritrix, warcprox and other tools.
Bamboo will watch for newly created .warc.gz.open files, index them as new records are added and then move the WARC
into crawl 42's archive directory when the file is renamed to .warc.gz.