Public | Automated Build

Last pushed: 5 months ago

WebIndexer

Web Indexing with ElasticSearch

How it works

For each configured language, the indexer downloads the file

$HOST_PROTO://$HOST/$LANGUAGE/$WEBINDEX_TXT

(e.g. https://www.domain.com/en/webindex.txt)

which includes all URIs under the domain that need to be indexed.

Example webindex.txt file:

/en/article/one
/en/article/two
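
The download step above can be sketched as follows. This is a minimal illustration in Python, not the indexer's actual code: the function names webindex_url and parse_webindex are hypothetical, and skipping blank lines is an assumption.

```python
def webindex_url(host_proto, host, language, webindex_txt="webindex.txt"):
    """Build the URL of the URI list for one language (hypothetical helper)."""
    return f"{host_proto}://{host}/{language}/{webindex_txt}"


def parse_webindex(text):
    """Return one URI per non-empty line (skipping blank lines is an assumption)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(webindex_url("https", "www.domain.com", "en"))
    print(parse_webindex("/en/article/one\n/en/article/two\n"))
```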

The indexer then downloads each listed page from HOST and extracts its title (from the HTML <title> tag) and its content.

Content is extracted from the HTML element with id="body", with the following exception:

  • If the page URI contains /help/, content is extracted from the element with id="gr-help-content"

The title is post-processed to remove the product title.
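
The extraction rules above could look roughly like this. It is a sketch using only Python's standard html.parser module, not the indexer's real implementation. The names PageExtractor, content_id_for and extract are hypothetical, joining text fragments with single spaces is an assumption, and the product-title removal is left out because its exact rule is not described here.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PageExtractor(HTMLParser):
    """Collect <title> text and the text inside the element with a given id."""

    def __init__(self, content_id):
        super().__init__()
        self.content_id = content_id
        self.in_title = False
        self.depth = 0  # greater than 0 while inside the content element
        self.title_parts = []
        self.content_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.content_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)
        elif self.depth > 0:
            self.content_parts.append(data)


def content_id_for(uri):
    """Pick the element id to read content from, based on the page URI."""
    return "gr-help-content" if "/help/" in uri else "body"


def extract(uri, html):
    """Return (title, content) for one page; whitespace is collapsed."""
    parser = PageExtractor(content_id_for(uri))
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    content = " ".join(" ".join(parser.content_parts).split())
    return title, content
```

Calling extract("/en/help/topic", html) reads from the gr-help-content element, while any URI without /help/ reads from the element with id="body".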

Indexing creates one index per language, named $HOST-$LANGUAGE (e.g. www.domain.com-en).
Documents are named after their URI (e.g. /en/article/one results in document en-article-one).
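
The naming scheme can be illustrated with two small helpers. These are hypothetical functions written to match the examples above; how the indexer treats trailing slashes or other special characters is not documented, so stripping leading and trailing slashes is an assumption.

```python
def index_name(host, language):
    """Name of the Elasticsearch index for one host and language."""
    return f"{host}-{language}"


def document_id(uri):
    """Turn a URI path into a document name (slash handling is an assumption)."""
    return uri.strip("/").replace("/", "-")


if __name__ == "__main__":
    print(index_name("www.domain.com", "en"))
    print(document_id("/en/article/one"))
```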

Environment Variables

  • ELASTIC_URL HTTP(S) URL to reach Elasticsearch server (default: http://elasticsearch:9200)
  • HOST Host name to index (explanation above)
  • HOST_PROTO Host protocol (default: https)
  • LANGUAGES Comma-separated list of languages to index (explanation above; currently supported, each with its matching Elasticsearch analyzer: en, de, fr)
  • WEBINDEX_TXT Name of downloadable index containing one URI per line (explanation above, default: webindex.txt)
  • INDEX_DOCUMENT Document type (default: page)
  • TEST_QUERY Search query to run against each index after indexing; required for ALIVE_URL to be triggered
  • TEST_MIN_HITS Minimum number of hits TEST_QUERY must return for each index (default: 100)
  • ALIVE_URL URL to poll after successful run (optional)
  • ALIVE_USERNAME Basic Auth user name for ALIVE_URL
  • ALIVE_PASSWORD Basic Auth password for ALIVE_URL
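
As an illustration of how these variables and their documented defaults fit together, here is a hedged Python sketch. The load_config function and its dictionary keys are hypothetical. Treating HOST as required and an unset LANGUAGES as an empty list are assumptions, since no defaults are documented for them.

```python
import os


def load_config(env=None):
    """Read the indexer settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    languages = [
        lang.strip()
        for lang in env.get("LANGUAGES", "").split(",")
        if lang.strip()
    ]
    return {
        "elastic_url": env.get("ELASTIC_URL", "http://elasticsearch:9200"),
        "host": env["HOST"],  # assumed required: no default is documented
        "host_proto": env.get("HOST_PROTO", "https"),
        "languages": languages,
        "webindex_txt": env.get("WEBINDEX_TXT", "webindex.txt"),
        "index_document": env.get("INDEX_DOCUMENT", "page"),
        "test_query": env.get("TEST_QUERY"),
        "test_min_hits": int(env.get("TEST_MIN_HITS", "100")),
        "alive_url": env.get("ALIVE_URL"),
        "alive_username": env.get("ALIVE_USERNAME"),
        "alive_password": env.get("ALIVE_PASSWORD"),
    }
```

Passing a plain dictionary instead of os.environ, for example load_config({"HOST": "www.domain.com", "LANGUAGES": "en,de"}), makes the defaults easy to inspect without touching the real environment.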
Owner
ograhl