This image runs the import part of Samuel Lai's Stackdump project:
Given a site directory (a directory containing the extracted XML files of a site from the Stack Exchange data dumps) and information about the site (given as -e parameters), the container calls the import_site script of the Stackdump project, which indexes and stores the site data.
Running the container
Note: this is a very ad-hoc, unoptimized image (at the moment). It works, but many parameters must be supplied to the docker run command.
In order for this container to work correctly with the other stackdump containers (solr and webserver), it needs a volume mount from an outside data folder to /opt/stackdump/data (in the container). The outside data folder must be available for (and used by) the other two containers as well.
For this image, a second mount is required - the site directory (i.e. the directory containing the .xml files of the site) should be mounted at /var/stackdump-site (in the container).
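Since the import depends on the .xml files being present in the mounted site directory, a quick check on the host before starting the container may save a failed run. The following is only a minimal sketch: the function name has_xml_files is hypothetical and not part of Stackdump or this image, and it only tests that at least one .xml file exists, not that the dump is complete.

```shell
#!/bin/sh
# Hypothetical helper (not part of Stackdump): succeed only if the given
# directory exists and holds at least one .xml file.
has_xml_files() {
    dir=$1
    [ -d "$dir" ] || return 1
    for f in "$dir"/*.xml; do
        # If the glob matched nothing, "$f" is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}
```

It could be used as `has_xml_files /home/user/my-site || echo "no .xml files found" >&2` before mounting that directory at /var/stackdump-site.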
Additionally, the following parameters must be passed as -e values to the docker run command:
- SITE_NAME: e.g. "android.stackexchange.com"
- SITE_DESCRIPTION: the description of the site. Note that this parameter and the DUMP_DATE parameter don't work well with spaces, sadly. Example: "android_stack_exchange"
- DUMP_DATE: the date of the data's creation (available at the dump site). Again, can't use spaces at the moment. Example: "March2017"
- SITE_KEY: an identifier for the site. Example: "android"
- BASE_URL: self-explanatory. Example: "android.stackexchange.com"
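Because SITE_DESCRIPTION and DUMP_DATE don't work well with spaces, it may help to check the values before passing them to docker run. The sketch below is one possible pre-flight check. Both function names are hypothetical, and replacing spaces with underscores is just one workaround, following the style of the "android_stack_exchange" example above.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeed only if the value contains no spaces.
no_spaces() {
    case $1 in
        *' '*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical workaround: replace every space with an underscore.
underscore_spaces() {
    printf '%s\n' "$1" | tr ' ' '_'
}
```

For example, `underscore_spaces "March 2017"` prints March_2017, which `no_spaces` then accepts.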
As far as hardware goes, the container needs a system with a fairly large amount of RAM, as the import process is costly (and can take time). At least 6GB of RAM is recommended.
The good news is that this container can run while the solr and webserver containers are running. The import may take a long while for large sites.
(Example times: on the March 2017 dump, importing beer.stackexchange.com took about 2 minutes on my machine, while importing askubuntu.com took about 40 minutes.)
Finally, all 3 containers must run on the same host and share the same network to be able to interact (i.e. --net=host must be supplied to the docker run command).
docker run -it --name stackdump-importer --net=host \
  -v /home/user/data_stackdump:/opt/stackdump/data \
  -v /home/user/my-site:/var/stackdump-site \
  -e SITE_NAME="android.stackexchange.com" \
  -e SITE_DESCRIPTION="android_stack_exchange" \
  -e SITE_KEY="android" \
  -e BASE_URL="android.stackexchange.com" \
  -e DUMP_DATE="February2017" \
  udidoron/stackdump-importer:latest
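To shorten the command, the five -e values could be kept in a file and passed with docker's --env-file option instead. This is only a sketch: the file name importer.env and the function write_env_file are hypothetical, the values are the ones from the example above, and this has not been verified against this particular image.

```shell
#!/bin/sh
# Sketch: write the importer settings to an env file, one VAR=value per line.
# The values are written without surrounding quotes, on the assumption that
# docker run --env-file passes each value through as-is.
write_env_file() {
    out=$1
    cat > "$out" <<EOF
SITE_NAME=android.stackexchange.com
SITE_DESCRIPTION=android_stack_exchange
SITE_KEY=android
BASE_URL=android.stackexchange.com
DUMP_DATE=February2017
EOF
}

# Possible usage (assumes the same mounts as in the example above):
#   write_env_file importer.env
#   docker run -it --name stackdump-importer --net=host \
#     -v /home/user/data_stackdump:/opt/stackdump/data \
#     -v /home/user/my-site:/var/stackdump-site \
#     --env-file importer.env udidoron/stackdump-importer:latest
```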
All credit for the original Stackdump project goes to Samuel Lai. The Solr project belongs to the Apache Software Foundation.