Short Description
ArchiveSpark facilitates access to Web Archives and enables easy data extraction and derivation.
Full Description


ArchiveSpark is a framework that facilitates access to Web Archives and enables easy data extraction and derivation. This repository makes ArchiveSpark available as a Docker image, so it can run as a Docker container without a Hadoop cluster and without installing and configuring its various dependencies. To get started, run the following command (assuming Docker is installed on the machine):

$ docker run --rm -it -p 8888:8888 ibnesayeed/archivespark

Then open a browser and access the Notebook at http://localhost:8888/. There should be an example notebook (example.ipynb) available to play with the included example CDX and WARC files.
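
If host port 8888 is already in use, the same image can be published on a different host port. The sketch below only prints the adjusted command and the matching Notebook URL. The port 9999 and the variable names are illustrative assumptions; the container-side port stays 8888, as in the command above.

```shell
#!/bin/sh
# Assumption: 9999 is a free port on the host; pick any free port.
HOST_PORT=9999

# -p HOST:CONTAINER maps the chosen host port to the Notebook port (8888) in the container.
CMD="docker run --rm -it -p ${HOST_PORT}:8888 ibnesayeed/archivespark"
URL="http://localhost:${HOST_PORT}/"

echo "Run:  $CMD"
echo "Open: $URL"
```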

To run the container with CDX and WARC files other than the included examples, mount the corresponding directories in the container at run time. Additionally, to make the notebooks available outside of the container, mount a notebook directory of the host machine with write permissions. The following command illustrates the usage:

$ docker run --rm -it -p 8888:8888 -v /path/to/cdx:/cdx -v /path/to/warc:/warc -v /path/to/notebooks:/notebooks ibnesayeed/archivespark
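
Bind mounts passed with `-v` need absolute host paths; Docker treats a host part that does not start with `/` as a named volume. The following is a minimal sketch of a wrapper that resolves three host directories to absolute paths and prints the resulting command. The function name `archivespark_cmd` and the directory names in the usage line are hypothetical. The notebook directory is created in advance so that it belongs to the current user, since Docker typically creates a missing bind-mount directory as root.

```shell
#!/bin/sh
# Hypothetical helper (not part of the image): prints a docker run command
# for the given CDX, WARC, and notebook directories on the host.
archivespark_cmd() {
    cdx_dir=$1
    warc_dir=$2
    nb_dir=$3

    # Create the notebook directory up front so it is writable by the current user.
    mkdir -p "$nb_dir" || return 1

    # Resolve each directory to an absolute path; fail if one does not exist.
    cdx_abs=$(cd "$cdx_dir" && pwd) || return 1
    warc_abs=$(cd "$warc_dir" && pwd) || return 1
    nb_abs=$(cd "$nb_dir" && pwd) || return 1

    # Mount targets (/cdx, /warc, /notebooks) are the ones used in the command above.
    printf 'docker run --rm -it -p 8888:8888 -v "%s:/cdx" -v "%s:/warc" -v "%s:/notebooks" ibnesayeed/archivespark\n' \
        "$cdx_abs" "$warc_abs" "$nb_abs"
}
```

For example, `archivespark_cmd ./cdx ./warc ./notebooks` prints a command that can be reviewed and then pasted into the shell.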