infrahelpers/optd-qa

By infrahelpers

Updated about 3 years ago

Quality Assurance (QA) for Open Travel Data (OPTD)

Quality Assurance (QA) for OpenTravelData (OPTD)

Overview

This repository features scripts to check the quality of the data files produced by the Open Travel Data (OPTD) project.

Though it is not there yet, this project should eventually produce a Quality Assurance (QA) dashboard, much like the one from Geonames.

Hopefully, that dashboard will also be powered by container images generated from this repository.

Travis CI builds partially cover the tests: https://travis-ci.com/opentraveldata/quality-assurance

Most of the scripts generate CSV data files, which can then be loaded into databases (classical relational database management systems (RDBMS) such as PostgreSQL, or search engines such as ElasticSearch (ES)), or served through standard Web applications. For historical reasons, some scripts may still generate JSON structures on the standard output. In the future, JSON should be used only for metadata, not for the data itself.

The CSV reports are published (thanks to Travis CI) to an OPTD-operated ElasticSearch (ES) cluster. The full details on how to set up that ES cluster, on Proxmox LXC containers, are given in a dedicated ElasticSearch tutorial.

For convenience, most of the ES examples are demonstrated both on a local single-node installation (e.g., on a laptop) and on the above-mentioned cluster.
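
As an illustration of how a CSV report could be turned into documents for ES, here is a minimal sketch that converts `^`-separated rows (the separator seen in the report samples further down this page) into the newline-delimited JSON body expected by the ES bulk API. The function name `to_bulk_lines` and the index name `optd-qa-example` are hypothetical. The publication actually performed by the CI relies on ingest pipelines and may well work differently.

```python
import csv
import io
import json


def to_bulk_lines(csv_text, index_name):
    """Convert '^'-separated CSV text into an ES bulk API body (NDJSON)."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter="^")
    lines = []
    for row in reader:
        # Action line: ask ES to index the following line as a document
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the CSV row itself, keyed by the header names
        lines.append(json.dumps(row, ensure_ascii=False))
    # The bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"


# Header and one row copied from the report sample shown on this page
sample = (
    "unlc_code^geo_id^fclass^fcode^geo_lat^geo_lon^iso31662_code^iso31662_name\n"
    "AUREN^2155718^P^PPLX^-38.03333^145.3^VIC^Victoria\n"
)
bulk_body = to_bulk_lines(sample, "optd-qa-example")  # hypothetical index name
print(bulk_body, end="")
```

All the values stay strings in this sketch. Converting coordinates or identifiers to numbers would be the job of an index mapping or of an ingest pipeline.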

See also

ElasticSearch (ES)

Ingest processors

Quick starter

Through a pre-built Docker image

  • Retrieve the Docker image:
$ docker pull infrahelpers/optd-qa:latest
  • Launch the Docker-powered scripts:
$ docker run --rm -it infrahelpers/optd-qa:latest bash
[build@8ce25cc20a10 opentraveldata-qa (master)] make checkers
[build@8ce25cc20a10 opentraveldata-qa (master)] exit

Installation

With a manually built Docker image

Through a locally cloned Git repository (without Docker)

$ mkdir -p ~/dev/geo
$ git clone https://github.com/opentraveldata/quality-assurance.git ~/dev/geo/opentraveldata-qa
$ pushd ~/dev/geo/opentraveldata-qa
$ ./mkLocalDir.sh
$ popd

On the local environment (without Docker)

As detailed in the online guide on how to set up a Python virtual environment, Pyenv and pipenv should be installed, and Python 3.9 should be installed through Pyenv. All the Python scripts are then run through pipenv.

Pyenv and pipenv
  • As a summary of the above-mentioned how-to (these steps need to be done only once):
$ if [ ! -d ${HOME}/.pyenv ]; then pushd ${HOME} && git clone https://github.com/pyenv/pyenv.git $HOME/.pyenv && popd; else pushd ${HOME}/.pyenv && git pull && popd; fi
$ export PYENV_ROOT="${HOME}/.pyenv"; export PATH="${PYENV_ROOT}/bin:${PYENV_ROOT}/shims:${PATH}"; if command -v pyenv 1>/dev/null 2>&1; then eval "$(pyenv init -)"; fi
$ pyenv install 3.9.1 && pyenv global 3.9.1 && pip install -U pip pipenv && pyenv global system
$ pushd ~/dev/geo/opentraveldata-qa
$ pipenv install
$ popd
  • To update the Python dependencies:
$ pushd ~/dev/geo/opentraveldata-qa
$ pipenv update
$ git add Pipfile.lock
$ pipenv lock -r > requirements.txt
$ git add requirements.txt
$ git commit -m "[Python] Upgraded the Python dependencies"
$ git push
$ popd
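
Before launching the checkers, it may help to verify that the prerequisites described above are in place. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and it only checks the version of the running Python interpreter and the presence of `pipenv` on the `PATH`.

```python
import shutil
import sys


def check_environment(min_version=(3, 9), tools=("pipenv",)):
    """Return a list of human-readable problems; empty when all is fine."""
    problems = []
    # Compare only (major, minor) of the interpreter running this script
    if sys.version_info[:2] < tuple(min_version):
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in min_version)
        problems.append(f"Python {wanted}+ expected, found {found}")
    for tool in tools:
        # shutil.which() returns None when the executable is not on the PATH
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on the PATH")
    return problems


for problem in check_environment():
    print(problem)
```

An empty output means that both checks passed. It does not guarantee that the dependencies listed in the Pipfile have been installed.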

Launch the Python checkers

  • Use the Makefile to launch all the checkers (previous content may first be removed, for instance if it was generated on another day):
$ rm -f to_be_checked/* && rm -f results/*
$ make
  • Use pipenv to launch specific Python scripts. For instance:
$ pipenv run python checkers/check-por-cmp-optd-unlc.py
$ pipenv run python checkers/check-por-geo-id-in-optd.py
  • Or use a convenient shortcut provided by the Makefile approach:
$ make results/optd-qa-por-optd-not-in-unlc.csv
pipenv run python checkers/check-por-cmp-optd-unlc.py && \
	wc -l results/optd-qa-por-unlc-not-in-optd.csv results/optd-qa-por-optd-not-in-unlc.csv && head -3 results/optd-qa-por-unlc-not-in-optd.csv results/optd-qa-por-optd-not-in-unlc.csv
   10324 results/optd-qa-por-unlc-not-in-optd.csv
     124 results/optd-qa-por-optd-not-in-unlc.csv
   10448 total
==> results/optd-qa-por-unlc-not-in-optd.csv <==
por_code^unlc_iata_code^unlc_ctry_code^unlc_state_code^unlc_short_code^unlc_name_utf8^unlc_name_ascii^unlc_coord_lat^unlc_coord_lon^unlc_change_code^unlc_status^unlc_is_port^unlc_is_rail^unlc_is_road^unlc_is_apt^unlc_is_postoff^unlc_is_icd^unlc_is_fxtpt^unlc_is_brdxing^unlc_is_unkwn
ADFMO^^AD^^FMO^La Farga de Moles^La Farga de Moles^^^^RQ^0^0^1^0^0^0^0^1^0
AEABU^^AE^^ABU^Abu al Bukhoosh^Abu al Bukhoosh^25.29^53.08^^RL^1^0^0^0^0^0^0^0^0

==> results/optd-qa-por-optd-not-in-unlc.csv <==
unlc_code^geo_id^fclass^fcode^geo_lat^geo_lon^iso31662_code^iso31662_name
AROBE^3430340^P^PPLA2^-27.48706^-55.11994^N^Misiones
AUREN^2155718^P^PPLX^-38.03333^145.3^VIC^Victoria

$ make results/optd-qa-por-best-not-in-geo.csv 
pipenv run python checkers/check-por-geo-id-in-optd.py && \
	wc -l results/optd-qa-por-best-not-in-geo.csv results/optd-qa-por-best-incst-code.csv results/optd-qa-por-dup-geo-id.csv results/optd-qa-por-cmp-geo-id.csv && head -3 results/optd-qa-por-best-not-in-geo.csv results/optd-qa-por-best-incst-code.csv results/optd-qa-por-dup-geo-id.csv results/optd-qa-por-cmp-geo-id.csv
     616 results/optd-qa-por-best-not-in-geo.csv
       1 results/optd-qa-por-best-incst-code.csv
       1 results/optd-qa-por-dup-geo-id.csv
       1 results/optd-qa-por-cmp-geo-id.csv
     619 total
...
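
The `wc -l` and `head -3` summaries above can also be reproduced in Python, which is handy when the reports have to be post-processed. The sketch below is an illustration only: the function name `summarize_report` is hypothetical, and the `^` separator is taken from the samples above. Note that `wc -l` counts the header line too, so a report showing a count of 1 presumably contains only its header.

```python
import csv
import io


def summarize_report(lines, delimiter="^", preview=2):
    """Return (header, number of data rows, first `preview` data rows)."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader, [])  # empty list when the report is empty
    rows = list(reader)
    return header, len(rows), rows[:preview]


# In practice, `lines` would be an open file, for instance:
#   with open("results/optd-qa-por-optd-not-in-unlc.csv", newline="") as f:
#       summarize_report(f)
# Here, the sample rows shown above are used instead.
sample = io.StringIO(
    "unlc_code^geo_id^fclass^fcode^geo_lat^geo_lon^iso31662_code^iso31662_name\n"
    "AROBE^3430340^P^PPLA2^-27.48706^-55.11994^N^Misiones\n"
    "AUREN^2155718^P^PPLX^-38.03333^145.3^VIC^Victoria\n"
)
header, count, first_rows = summarize_report(sample)
print(count, header[0], [row[0] for row in first_rows])
```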

Elasticsearch

Reset the read-write property of indices
  • Local installation:
$ curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    66  100    21  100    45     82    175 --:--:-- --:--:-- --:--:--   257
{
  "acknowledged": true
}
  • Remote installation:
$ ssh root@tiproxy8 -f -L9400:10.30.2.191:9200 sleep 5; curl -XPUT -H "Content-Type: application/json" http://localhost:9400/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    66  100    21  100    45     82    175 --:--:-- --:--:-- --:--:--   257
{
  "acknowledged": true
}
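
The same request can be prepared from Python with the standard library only. The sketch below builds the request without sending it; the function name `build_unblock_request` is hypothetical, and port 9400 assumes that the SSH tunnel shown above is already open.

```python
import json
import urllib.request


def build_unblock_request(port=9200, host="localhost"):
    """Build (without sending) the PUT request clearing the read-only block."""
    # None is serialized as JSON null, which resets the setting
    settings = {"index.blocks.read_only_allow_delete": None}
    return urllib.request.Request(
        url=f"http://{host}:{port}/_all/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


local_request = build_unblock_request()            # local single-node installation
tunnel_request = build_unblock_request(port=9400)  # through the SSH tunnel
print(local_request.get_method(), local_request.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(local_request)
```

Keeping the construction separate from the sending makes it possible to inspect the URL and the body before touching a live cluster.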
Simplified pipeline and index
$ curl -XPOST "http://localhost:9200/_ingest/pipeline/_simulate" -H "Content-Type: application/json" --data "@elastic/optd-qa-pipeline-simulation-por-optd-geo-diff.json"|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1435  100   496  100   939  62000   114k --:--:-- --:--:-- --:--:--  175k
{
  "docs": [
    {
      "doc": {
        "_index": "subway_info",
        "_type": "_doc",
        "_id": "AVvJZVQEBr2flFKzrrkr",
        "_source": {
          "iata_code": "DOH",
          "optd_coord": {
            "lon": "51.565056",
            "lat": "25.261125"
          },
          "distance": "4.368154282573759",
          "weighted_distance": "20197.72392862065",
          "location_type": "C",
          "geoname_id": "290030",
          "country_code": "QA",
          "page_rank": "0.4622857726179021",
          "geo_coord": {
            "lon": "51.53096",
            "lat": "25.28545"
          },
          "adm1_code": "01",
          "timestamp": "2020-03-20T15:12:23.000+01:00"
        },
        "_ingest": {
          "timestamp": "2020-03-20T23:26:02.29742Z"
        }
      }
    }
  ]
}
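
The `distance` value in the simulated document above appears consistent with a great-circle distance, in kilometers, between `optd_coord` and `geo_coord`. The sketch below checks that with the haversine formula. This is an assumption about how the field is computed, not something taken from the checker code, and the Earth radius of 6371 km is the conventional mean value.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


# Values copied from the simulated document above (stored as strings in ES)
source = {
    "optd_coord": {"lon": "51.565056", "lat": "25.261125"},
    "geo_coord": {"lon": "51.53096", "lat": "25.28545"},
    "distance": "4.368154282573759",
}
computed = haversine_km(
    float(source["optd_coord"]["lat"]), float(source["optd_coord"]["lon"]),
    float(source["geo_coord"]["lat"]), float(source["geo_coord"]["lon"]),
)
# Compare the recomputed value with the one reported in the document
print(round(computed, 3), "vs", source["distance"])
```

Small differences are to be expected, since the Earth radius and the exact formula used by the checker are not known here.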
$ ssh root@tiproxy8 -f -L9400:10.30.2.191:9200 sleep 5; curl -XPOST "http://localhost:9400/_ingest/pipeline/_simulate" -H "Content-Type: application/json" --data "@elastic/optd-qa-pipeline-simulation-por-optd-geo-diff.json"|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1435  100   496  100   939  62000   114k --:--:-- --:--:-- --:--:--  175k
{
  ...
}
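
When the output of a simulation has to be processed further rather than just displayed with jq, the documents can be extracted from the response structure shown above. The following is a minimal sketch: the function name `extract_sources` is hypothetical, the sample response is a trimmed copy of the output above, and entries without a `doc` key (for instance documents for which the pipeline failed) are simply skipped.

```python
import json


def extract_sources(simulate_response):
    """Return the `_source` of every document of a _simulate response."""
    return [
        entry["doc"]["_source"]
        for entry in simulate_response.get("docs", [])
        if "doc" in entry  # skip entries reporting an error instead of a doc
    ]


# Trimmed copy of the simulation output shown above
response_text = """
{"docs": [{"doc": {"_index": "subway_info", "_id": "AVvJZVQEBr2flFKzrrkr",
  "_source": {"iata_code": "DOH", "geoname_id": "290030",
              "distance": "4.368154282573759"}}}]}
"""
sources = extract_sources(json.loads(response_text))
for source in sources:
    # Values are strings in the response, hence the explicit conversion
    print(source["iata_code"], float(source["distance"]))
```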
POR full index and pipeline
$ curl -XPOST "http://localhost:9200/_ingest/pipeline/_simulate" -H "Content-Type: application/json" --data "@elastic/optd-qa-pipeline-simulation-por-optd-full.json"|jq

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15053  100  8583  100  6470   128k  99538 --:--:-- --:--:-- --:--:--  226k
{
  "docs": [
    {
      "doc": {
        "_index": "optd-qa-por-full",
        "_type": "optd_qa_por_full",
        "_id": "AAA1",
        "_source": {
          "continent_name": "Oceania",
          "reporting_reason": "AAA-C-6947726 not found in optd",
          "geoname_id": "6947726",
          "adm2_name_ascii": "",
          "moddate": "2012-04-29",
          "adm1_code": "",
          "asciiname": "Anaa Airport",
          "city_name_list": "Anaa",
          "fcode": "AIRP",
          "adm2_code": "",
          "wiki_link": "https://en.wikipedia.org/wiki/Anaa_Airport",
          "unlc_list": "",
          "population": "0",
          "icao_code": "NTGA",
          "date_until": "",
          "country_code": "PF",
          "alt_name_section": "ru|Анаа|=wkdt|Q1430785|",
          "name": "Anaa Airport",
          "uic_list": "",
          "date_from": "",
          "iata_code": "AAA",
          "distance": "",
          "timezone": "Pacific/Tahiti",
          "is_geonames": "Y",
          "dst_offset": "-10.0",
          "coord": {
            "lon": "-145.509956",
            "lat": "-17.352606"
          },
          "adm4_code": "",
          "ccy_code": "XPF",
          "cc2": "",
          "country_name": "French Polynesia",
          "wac": "823",
          "gtopo30": "8",
          "adm1_name_ascii": "",
          "timestamp": "2020-03-29T15:12:23.000+02:00",
          "elevation": "",
          "fclass": "S",
          "faa_code": "",
          "envelope_id": "",
          "weighted_distance": "",
          "tvl_por_list": "",
          "adm2_name_utf": "",
          "location_type": "A",
          "geo_coord": {
            "lon": "-145.51229",
            "lat": "-17.34908"
          },
          "page_rank": "0.013618936015262465",
          "adm1_name_utf": "",
          "city_detail_list": "AAA|4034700|Anaa|Anaa",
          "city_code_list": "AAA",
          "wac_name": "French Polynesia",
          "adm3_code": "",
          "iso31662": "",
          "comment": "",
          "gmt_offset": "-10.0",
          "raw_offset": "-10.0"
        },
        "_ingest": {
          "timestamp": "2020-03-29T21:34:41.308529Z"
        }
      }
    },
    {
      "doc": {
        "_index": "optd-qa-por-full",
        "_type": "optd_qa_por_full",
        "_id": "BVD",
        "_source": {
          "continent_name": "North America",
          "reporting_reason": "",
          "geoname_id": "0",
          "adm2_name_ascii": "",
          "moddate": "-1",
          "adm1_code": "",
          "asciiname": "Beaver Inlet AK US Sea Port",
          "city_name_list": "Beaver Inlet AK US Sea Port",
          "fcode": "AIRP",
          "adm2_code": "",
          "wiki_link": "",
          "unlc_list": "",
          "population": "",
          "icao_code": "",
          "date_until": "",
          "country_code": "US",
          "alt_name_section": "",
          "name": "Beaver Inlet AK US Sea Port",
          "uic_list": "",
          "date_from": "",
          "iata_code": "BVD",
          "distance": "",
          "timezone": "America/Anchorage",
          "is_geonames": "N",
          "dst_offset": "",
          "coord": {
            "lon": "-147.4",
            "lat": "66.36"
          },
          "adm4_code": "",
          "ccy_code": "USD",
          "cc2": "",
          "country_name": "United States",
          "wac": "1",
          "gtopo30": "",
          "adm1_name_ascii": "",
          "timestamp": "2020-03-29T15:12:23.000+02:00",
          "elevation": "",
          "fclass": "S",
          "faa_code": "",
          "envelope_id": "",
          "weighted_distance": "",
          "tvl_por_list": "BVD",
          "adm2_name_utf": "",
          "location_type": "CA",
          "geo_coord": {
            "lon": "",
            "lat": ""
          },
          "page_rank": "",
          "adm1_name_utf": "",
          "city_detail_list": "BVD|0|Beaver Inlet AK US Sea Port|Beaver Inlet AK US Sea Port",
          "city_code_list": "BVD",
          "wac_name": "Alaska",
          "adm3_code": "",
          "iso31662": "AK",
          "comment": "",
          "gmt_offset": "",
          "raw_offset": ""
        },
        "_ingest": {
          "timestamp": "2020-03-29T21:45:00.548234Z"
        }
      }
    },
	...
    {
      "doc": {
        "_index": "optd-qa-por-full",
        "_type": "optd_qa_por_full",
        "_id": "BSL",
        "_source": {
          "continent_name": "Europe",
          "reporting_reason": "",
          "geoname_id": "6299466",
          "adm2_name_ascii": "Haut-Rhin",
          "moddate": "2020-03-15",
          "adm1_code": "44",
          "asciiname": "EuroAirport Basel-Mulhouse-Freiburg",
          "city_name_list": "Basel",
          "fcode": "AIRP",
          "adm2_code": "68",
          "wiki_link": "https://en.wikipedia.org/wiki/EuroAirport_Basel_Mulhouse_Freiburg",
          "unlc_list": "CHBSL|=FRMLH|",
          "population": "0",
          "icao_code": "LFSB",
          "date_until": "",
          "country_code": "FR",
          "alt_name_section": "es|Aeropuerto de Basilea-Mulhouse-Friburgo|=de|Flughafen Basel-Mülhausen|=it|Aeroporto di Basilea-Mulhouse-Friburgo|=ca|Aeroport de Basilea-Mulhouse-Friburg|=en|EuroAirport Basel–Mulhouse–Freiburg|p=fr|Aéroport de Bâle-Mulhouse-Fribourg|=wuu|巴塞尔-米卢斯-弗赖堡欧洲机场|=ru|Международный аэропорт Базель-Мюлуз-Фрайбург|=ja|ユーロエアポート|=fa|فرودگاه بازل-مولوز-فرایبورگ اروپا|",
          "name": "EuroAirport Basel–Mulhouse–Freiburg",
          "uic_list": "",
          "date_from": "",
          "iata_code": "BSL",
          "distance": "",
          "timezone": "Europe/Paris",
          "is_geonames": "Y",
          "dst_offset": "2.0",
          "coord": {
            "lon": "7.52991",
            "lat": "47.58958"
          },
          "adm4_code": "68135",
          "ccy_code": "EUR",
          "cc2": "",
          "country_name": "France",
          "wac": "427",
          "gtopo30": "263",
          "adm1_name_ascii": "Grand Est",
          "timestamp": "2020-03-29T15:12:23.000+02:00",
          "elevation": "269",
          "fclass": "S",
          "faa_code": "",
          "envelope_id": "",
          "weighted_distance": "",
          "tvl_por_list": "",
          "adm2_name_utf": "Haut-Rhin",
          "location_type": "A",
          "geo_coord": {
            "lon": "",
            "lat": ""
          },
          "page_rank": "0.09830056026668005",
          "adm1_name_utf": "Grand Est",
          "city_detail_list": "BSL|2661604|Basel|Basel",
          "city_code_list": "BSL",
          "wac_name": "France",
          "adm3_code": "684",
          "iso31662": "GES",
          "comment": "",
          "gmt_offset": "1.0",
          "raw_offset": "1.0"
        },
        "_ingest": {
          "timestamp": "2020-03-29T21:34:41.308588Z"
        }
      }
    },
    {
      "doc": {
        "_index": "optd-qa-por-full",
        "_type": "optd_qa_por_full",
        "_id": "MLH",
        "_source": {
          "continent_name": "Europe",
          "reporting_reason": "",
          "geoname_id": "6299466",
          "adm2_name_ascii": "Haut-Rhin",
          "moddate": "2020-03-15",
          "adm1_code": "44",
          "asciiname": "EuroAirport Basel-Mulhouse-Freiburg",
          "city_name_list": "Mulhouse",
          "fcode": "AIRP",
          "adm2_code": "68",
          "wiki_link": "https://en.wikipedia.org/wiki/EuroAirport_Basel_Mulhouse_Freiburg",
          "unlc_list": "CHBSL|=FRMLH|",
          "population": "0",
          "icao_code": "LFSB",
          "date_until": "",
          "country_code": "FR",
          "alt_name_section": "es|Aeropuerto de Basilea-Mulhouse-Friburgo|=de|Flughafen Basel-Mülhausen|=it|Aeroporto di Basilea-Mulhouse-Friburgo|=ca|Aeroport de Basilea-Mulhouse-Friburg|=en|EuroAirport Basel–Mulhouse–Freiburg|p=fr|Aéroport de Bâle-Mulhouse-Fribourg|=wuu|巴塞尔-米卢斯-弗赖堡欧洲机场|=ru|Международный аэропорт Базель-Мюлуз-Фрайбург|=ja|ユーロエアポート|=fa|فرودگاه بازل-مولوز-فرایبورگ اروپا|",
          "name": "EuroAirport Basel–Mulhouse–Freiburg",
          "uic_list": "",
          "date_from": "",
          "iata_code": "MLH",
          "distance": "",
          "timezone": "Europe/Paris",
          "is_geonames": "Y",
          "dst_offset": "2.0",
          "coord": {
            "lon": "7.52991",
            "lat": "47.58958"
          },
          "adm4_code": "68135",
          "ccy_code": "EUR",
          "cc2": "",
          "country_name": "France",
          "wac": "427",
          "gtopo30": "263",
          "adm1_name_ascii": "Grand Est",
          "timestamp": "2020-03-29T15:12:23.000+02:00",
          "elevation": "269",
          "fclass": "S",
          "faa_code": "",
          "envelope_id": "",
          "weighted_distance": "",
          "tvl_por_list": "",
          "adm2_name_utf": "Haut-Rhin",
          "location_type": "A",
          "geo_coord": {
            "lon": "",
            "lat": ""
          },
          "page_rank": "0.013945526285525285",
          "adm1_name_utf": "Grand Est",
          "city_detail_list": "MLH|2991214|Mulhouse|Mulhouse",
          "city_code_list": "MLH",
          "wac_name": "France",
          "adm3_code": "684",
          "iso31662": "GES",
          "comment": "",
          "gmt_offset": "1.0",
          "raw_offset": "1.0"
        },
        "_ingest": {
   
