infrahelpers/optd-qa
Quality Assurance (QA) for Open Travel Data (OPTD)
This repository features scripts to check the quality of the data files produced by the Open Travel Data (OPTD) project.
Though it is not there yet, this project should eventually produce a Quality Assurance (QA) dashboard, much like the one from Geonames.
And, hopefully, that dashboard will be powered by container images generated from this repository as well.
Travis CI builds partially cover the tests; see https://travis-ci.com/opentraveldata/quality-assurance
Most of the scripts generate CSV data files, which can then be uploaded to databases (classical relational database management systems (RDBMS) such as PostgreSQL, or search engines such as ElasticSearch (ES)), or served through standard Web applications. For historical reasons, some scripts may still generate JSON structures on the standard output. In the future, JSON should be used only for metadata, not for the data itself.
The CSV reports are published (thanks to Travis CI) to an OPTD-operated ElasticSearch (ES) cluster. The full details on how to set up that ES cluster, on Proxmox LXC containers, are given in a dedicated elasticsearch tutorial.
For convenience, most of the ES examples are demonstrated both on a local single-node installation (e.g., on a laptop) and on the above-mentioned cluster.
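As an illustration of how a CSV report could end up in ES, here is a minimal Python sketch that turns a caret-separated report into a payload for the ES bulk API (one action line followed by one document line, newline-delimited). It is not necessarily how this repository publishes its reports: the function name `csv_to_bulk_ndjson`, the index name and the shortened sample header are assumptions made for the example.

```python
import csv
import io
import json


def csv_to_bulk_ndjson(csv_text, index_name, delimiter="^"):
    """Turn a delimiter-separated report into an ES bulk (NDJSON) payload."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    lines = []
    for row in reader:
        # The bulk API expects an action line, then the document itself
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(row, ensure_ascii=False))
    # A bulk payload has to be terminated by a newline
    return ("\n".join(lines) + "\n") if lines else ""


# Shortened sample (only the first four columns of a report), for illustration
sample = (
    "unlc_code^geo_id^fclass^fcode\n"
    "AROBE^3430340^P^PPLA2\n"
    "AUREN^2155718^P^PPLX\n"
)

# The index name is hypothetical
payload = csv_to_bulk_ndjson(sample, "optd-qa-por-optd-not-in-unlc")
print(payload, end="")
```

The resulting text could then be sent to the `_bulk` endpoint with the `application/x-ndjson` content type. All the values are kept as strings here; typing them would be the job of an index mapping or of an ingest pipeline.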
$ docker pull infrahelpers/optd-qa:latest
$ docker run --rm -it infrahelpers/optd-qa:latest bash
[build@8ce25cc20a10 opentraveldata-qa (master)] make checkers
[build@8ce25cc20a10 opentraveldata-qa (master)] exit
$ mkdir -p ~/dev/geo
$ git clone https://github.com/opentraveldata/quality-assurance.git ~/dev/geo/opentraveldata-qa
$ pushd ~/dev/geo/opentraveldata-qa
$ ./mkLocalDir.sh
$ popd
As detailed in the online guide on how to set up a Python virtual environment, Pyenv and pipenv should be installed, and Python 3.9 installed thanks to Pyenv. Then all the Python scripts will be run thanks to pipenv.
Install Pyenv, Python 3.9 and pipenv:
$ if [ ! -d ${HOME}/.pyenv ]; then pushd ${HOME} && git clone https://github.com/pyenv/pyenv.git $HOME/.pyenv && popd; else pushd ${HOME}/.pyenv && git pull && popd; fi
$ export PYENV_ROOT="${HOME}/.pyenv"; export PATH="${PYENV_ROOT}/bin:${PYENV_ROOT}/shims:${PATH}"; if command -v pyenv 1>/dev/null 2>&1; then eval "$(pyenv init -)"; fi
$ pyenv install 3.9.1 && pyenv global 3.9.1 && pip install -U pip pipenv && pyenv global system
$ pushd ~/dev/geo/opentraveldata-qa
$ pipenv install
$ popd
$ pushd ~/dev/geo/opentraveldata-qa
$ pipenv update
$ git add Pipfile.lock
$ pipenv lock -r > requirements.txt
$ git add requirements.txt
$ git commit -m "[Python] Upgraded the Python dependencies"
$ git push
$ popd
ci-scripts/ directory of OPTD, as the requirements.txt file there needs to be upgraded accordingly (cloned from this repository)
Use the Makefile to launch all the checkers (previous content may first be removed, for instance if it was generated on another day):
$ rm -f to_be_checked/* && rm -f results/*
$ make
Use pipenv to launch specific Python scripts. For instance:
$ pipenv run python checkers/check-por-cmp-optd-unlc.py
$ pipenv run python checkers/check-por-geo-id-in-optd.py
With the Makefile approach:
$ make results/optd-qa-por-optd-not-in-unlc.csv
pipenv run python checkers/check-por-cmp-optd-unlc.py && \
wc -l results/optd-qa-por-unlc-not-in-optd.csv results/optd-qa-por-optd-not-in-unlc.csv && head -3 results/optd-qa-por-unlc-not-in-optd.csv results/optd-qa-por-optd-not-in-unlc.csv
10324 results/optd-qa-por-unlc-not-in-optd.csv
124 results/optd-qa-por-optd-not-in-unlc.csv
10448 total
==> results/optd-qa-por-unlc-not-in-optd.csv <==
por_code^unlc_iata_code^unlc_ctry_code^unlc_state_code^unlc_short_code^unlc_name_utf8^unlc_name_ascii^unlc_coord_lat^unlc_coord_lon^unlc_change_code^unlc_status^unlc_is_port^unlc_is_rail^unlc_is_road^unlc_is_apt^unlc_is_postoff^unlc_is_icd^unlc_is_fxtpt^unlc_is_brdxing^unlc_is_unkwn
ADFMO^^AD^^FMO^La Farga de Moles^La Farga de Moles^^^^RQ^0^0^1^0^0^0^0^1^0
AEABU^^AE^^ABU^Abu al Bukhoosh^Abu al Bukhoosh^25.29^53.08^^RL^1^0^0^0^0^0^0^0^0
==> results/optd-qa-por-optd-not-in-unlc.csv <==
unlc_code^geo_id^fclass^fcode^geo_lat^geo_lon^iso31662_code^iso31662_name
AROBE^3430340^P^PPLA2^-27.48706^-55.11994^N^Misiones
AUREN^2155718^P^PPLX^-38.03333^145.3^VIC^Victoria
$ make results/optd-qa-por-best-not-in-geo.csv
pipenv run python checkers/check-por-geo-id-in-optd.py && \
wc -l results/optd-qa-por-best-not-in-geo.csv results/optd-qa-por-best-incst-code.csv results/optd-qa-por-dup-geo-id.csv results/optd-qa-por-cmp-geo-id.csv && head -3 results/optd-qa-por-best-not-in-geo.csv results/optd-qa-por-best-incst-code.csv results/optd-qa-por-dup-geo-id.csv results/optd-qa-por-cmp-geo-id.csv
616 results/optd-qa-por-best-not-in-geo.csv
1 results/optd-qa-por-best-incst-code.csv
1 results/optd-qa-por-dup-geo-id.csv
1 results/optd-qa-por-cmp-geo-id.csv
619 total
...
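The reports shown above are caret-separated (`^`) files with a header line, so a file counting one line in the `wc -l` output presumably holds the header only. A minimal Python sketch to load such reports and to count their data rows could look as follows; the helper names `load_report` and `summarize`, as well as the demo file names, are assumptions of this example and are not part of the repository.

```python
import csv
import tempfile
from pathlib import Path


def load_report(path, delimiter="^"):
    """Read a delimiter-separated report into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def summarize(results_dir):
    """Map every CSV report of a directory to its number of data rows."""
    reports = sorted(Path(results_dir).glob("*.csv"))
    return {report.name: len(load_report(report)) for report in reports}


# Demo on a temporary directory with two made-up reports
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "with-issues.csv").write_text(
        "unlc_code^geo_id\nAROBE^3430340\n", encoding="utf-8"
    )
    (Path(tmp) / "header-only.csv").write_text(
        "unlc_code^geo_id\n", encoding="utf-8"
    )
    summary = summarize(tmp)

print(summary)
```

On a real run, `summarize("results")` would give the same information as `wc -l`, minus one header line per file.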
$ curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'|jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 66 100 21 100 45 82 175 --:--:-- --:--:-- --:--:-- 257
{
"acknowledged": true
}
$ ssh root@tiproxy8 -f -L9400:10.30.2.191:9200 sleep 5; curl -XPUT -H "Content-Type: application/json" http://localhost:9400/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'|jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 66 100 21 100 45 82 175 --:--:-- --:--:-- --:--:-- 257
{
"acknowledged": true
}
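The same settings request can be prepared from Python with the standard library only. This is a sketch under assumptions: the helper name `build_unblock_request` is made up for the example, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_unblock_request(base_url):
    """Build the PUT request resetting the read-only block on all indices."""
    # None is serialized as JSON null, which resets the setting
    body = json.dumps({"index.blocks.read_only_allow_delete": None})
    return urllib.request.Request(
        url=f"{base_url}/_all/_settings",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_unblock_request("http://localhost:9200")
print(request.get_method(), request.full_url)

# To actually send it to a running ES node:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

For the remote cluster, the base URL would be `http://localhost:9400` once the SSH tunnel shown above is open.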
$ curl -XPOST "http://localhost:9200/_ingest/pipeline/_simulate" -H "Content-Type: application/json" --data "@elastic/optd-qa-pipeline-simulation-por-optd-geo-diff.json"|jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1435 100 496 100 939 62000 114k --:--:-- --:--:-- --:--:-- 175k
{
"docs": [
{
"doc": {
"_index": "subway_info",
"_type": "_doc",
"_id": "AVvJZVQEBr2flFKzrrkr",
"_source": {
"iata_code": "DOH",
"optd_coord": {
"lon": "51.565056",
"lat": "25.261125"
},
"distance": "4.368154282573759",
"weighted_distance": "20197.72392862065",
"location_type": "C",
"geoname_id": "290030",
"country_code": "QA",
"page_rank": "0.4622857726179021",
"geo_coord": {
"lon": "51.53096",
"lat": "25.28545"
},
"adm1_code": "01",
"timestamp": "2020-03-20T15:12:23.000+01:00"
},
"_ingest": {
"timestamp": "2020-03-20T23:26:02.29742Z"
}
}
}
]
}
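The `distance` field of that document can be cross-checked from the two coordinate pairs. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; both choices are assumptions of this example, as the formula actually used by the checker is not stated here. The coordinates are copied from the simulated document above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance, in km, between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Coordinates of DOH, as found in the simulated document
source = {
    "optd_coord": {"lon": "51.565056", "lat": "25.261125"},
    "geo_coord": {"lon": "51.53096", "lat": "25.28545"},
}

distance = haversine_km(
    float(source["optd_coord"]["lat"]),
    float(source["optd_coord"]["lon"]),
    float(source["geo_coord"]["lat"]),
    float(source["geo_coord"]["lon"]),
)
print(f"{distance:.3f} km")
```

The result should land close to the reported `distance` of about 4.368, which suggests that the field is expressed in kilometers; small differences may come from the Earth model used.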
$ ssh root@tiproxy8 -f -L9400:10.30.2.191:9200 sleep 5; curl -XPOST "http://localhost:9400/_ingest/pipeline/_simulate" -H "Content-Type: application/json" --data "@elastic/optd-qa-pipeline-simulation-por-optd-geo-diff.json"|jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1435 100 496 100 939 62000 114k --:--:-- --:--:-- --:--:-- 175k
{
...
}
$ curl -XPOST "http://localhost:9200/_ingest/pipeline/_simulate" -H "Content-Type: application/json" --data "@elastic/optd-qa-pipeline-simulation-por-optd-full.json"|jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 15053 100 8583 100 6470 128k 99538 --:--:-- --:--:-- --:--:-- 226k
{
"docs": [
{
"doc": {
"_index": "optd-qa-por-full",
"_type": "optd_qa_por_full",
"_id": "AAA1",
"_source": {
"continent_name": "Oceania",
"reporting_reason": "AAA-C-6947726 not found in optd",
"geoname_id": "6947726",
"adm2_name_ascii": "",
"moddate": "2012-04-29",
"adm1_code": "",
"asciiname": "Anaa Airport",
"city_name_list": "Anaa",
"fcode": "AIRP",
"adm2_code": "",
"wiki_link": "https://en.wikipedia.org/wiki/Anaa_Airport",
"unlc_list": "",
"population": "0",
"icao_code": "NTGA",
"date_until": "",
"country_code": "PF",
"alt_name_section": "ru|Анаа|=wkdt|Q1430785|",
"name": "Anaa Airport",
"uic_list": "",
"date_from": "",
"iata_code": "AAA",
"distance": "",
"timezone": "Pacific/Tahiti",
"is_geonames": "Y",
"dst_offset": "-10.0",
"coord": {
"lon": "-145.509956",
"lat": "-17.352606"
},
"adm4_code": "",
"ccy_code": "XPF",
"cc2": "",
"country_name": "French Polynesia",
"wac": "823",
"gtopo30": "8",
"adm1_name_ascii": "",
"timestamp": "2020-03-29T15:12:23.000+02:00",
"elevation": "",
"fclass": "S",
"faa_code": "",
"envelope_id": "",
"weighted_distance": "",
"tvl_por_list": "",
"adm2_name_utf": "",
"location_type": "A",
"geo_coord": {
"lon": "-145.51229",
"lat": "-17.34908"
},
"page_rank": "0.013618936015262465",
"adm1_name_utf": "",
"city_detail_list": "AAA|4034700|Anaa|Anaa",
"city_code_list": "AAA",
"wac_name": "French Polynesia",
"adm3_code": "",
"iso31662": "",
"comment": "",
"gmt_offset": "-10.0",
"raw_offset": "-10.0"
},
"_ingest": {
"timestamp": "2020-03-29T21:34:41.308529Z"
}
}
},
{
"doc": {
"_index": "optd-qa-por-full",
"_type": "optd_qa_por_full",
"_id": "BVD",
"_source": {
"continent_name": "North America",
"reporting_reason": "",
"geoname_id": "0",
"adm2_name_ascii": "",
"moddate": "-1",
"adm1_code": "",
"asciiname": "Beaver Inlet AK US Sea Port",
"city_name_list": "Beaver Inlet AK US Sea Port",
"fcode": "AIRP",
"adm2_code": "",
"wiki_link": "",
"unlc_list": "",
"population": "",
"icao_code": "",
"date_until": "",
"country_code": "US",
"alt_name_section": "",
"name": "Beaver Inlet AK US Sea Port",
"uic_list": "",
"date_from": "",
"iata_code": "BVD",
"distance": "",
"timezone": "America/Anchorage",
"is_geonames": "N",
"dst_offset": "",
"coord": {
"lon": "-147.4",
"lat": "66.36"
},
"adm4_code": "",
"ccy_code": "USD",
"cc2": "",
"country_name": "United States",
"wac": "1",
"gtopo30": "",
"adm1_name_ascii": "",
"timestamp": "2020-03-29T15:12:23.000+02:00",
"elevation": "",
"fclass": "S",
"faa_code": "",
"envelope_id": "",
"weighted_distance": "",
"tvl_por_list": "BVD",
"adm2_name_utf": "",
"location_type": "CA",
"geo_coord": {
"lon": "",
"lat": ""
},
"page_rank": "",
"adm1_name_utf": "",
"city_detail_list": "BVD|0|Beaver Inlet AK US Sea Port|Beaver Inlet AK US Sea Port",
"city_code_list": "BVD",
"wac_name": "Alaska",
"adm3_code": "",
"iso31662": "AK",
"comment": "",
"gmt_offset": "",
"raw_offset": ""
},
"_ingest": {
"timestamp": "2020-03-29T21:45:00.548234Z"
}
}
},
...
{
"doc": {
"_index": "optd-qa-por-full",
"_type": "optd_qa_por_full",
"_id": "BSL",
"_source": {
"continent_name": "Europe",
"reporting_reason": "",
"geoname_id": "6299466",
"adm2_name_ascii": "Haut-Rhin",
"moddate": "2020-03-15",
"adm1_code": "44",
"asciiname": "EuroAirport Basel-Mulhouse-Freiburg",
"city_name_list": "Basel",
"fcode": "AIRP",
"adm2_code": "68",
"wiki_link": "https://en.wikipedia.org/wiki/EuroAirport_Basel_Mulhouse_Freiburg",
"unlc_list": "CHBSL|=FRMLH|",
"population": "0",
"icao_code": "LFSB",
"date_until": "",
"country_code": "FR",
"alt_name_section": "es|Aeropuerto de Basilea-Mulhouse-Friburgo|=de|Flughafen Basel-Mülhausen|=it|Aeroporto di Basilea-Mulhouse-Friburgo|=ca|Aeroport de Basilea-Mulhouse-Friburg|=en|EuroAirport Basel–Mulhouse–Freiburg|p=fr|Aéroport de Bâle-Mulhouse-Fribourg|=wuu|巴塞尔-米卢斯-弗赖堡欧洲机场|=ru|Международный аэропорт Базель-Мюлуз-Фрайбург|=ja|ユーロエアポート|=fa|فرودگاه بازل-مولوز-فرایبورگ اروپا|",
"name": "EuroAirport Basel–Mulhouse–Freiburg",
"uic_list": "",
"date_from": "",
"iata_code": "BSL",
"distance": "",
"timezone": "Europe/Paris",
"is_geonames": "Y",
"dst_offset": "2.0",
"coord": {
"lon": "7.52991",
"lat": "47.58958"
},
"adm4_code": "68135",
"ccy_code": "EUR",
"cc2": "",
"country_name": "France",
"wac": "427",
"gtopo30": "263",
"adm1_name_ascii": "Grand Est",
"timestamp": "2020-03-29T15:12:23.000+02:00",
"elevation": "269",
"fclass": "S",
"faa_code": "",
"envelope_id": "",
"weighted_distance": "",
"tvl_por_list": "",
"adm2_name_utf": "Haut-Rhin",
"location_type": "A",
"geo_coord": {
"lon": "",
"lat": ""
},
"page_rank": "0.09830056026668005",
"adm1_name_utf": "Grand Est",
"city_detail_list": "BSL|2661604|Basel|Basel",
"city_code_list": "BSL",
"wac_name": "France",
"adm3_code": "684",
"iso31662": "GES",
"comment": "",
"gmt_offset": "1.0",
"raw_offset": "1.0"
},
"_ingest": {
"timestamp": "2020-03-29T21:34:41.308588Z"
}
}
},
{
"doc": {
"_index": "optd-qa-por-full",
"_type": "optd_qa_por_full",
"_id": "MLH",
"_source": {
"continent_name": "Europe",
"reporting_reason": "",
"geoname_id": "6299466",
"adm2_name_ascii": "Haut-Rhin",
"moddate": "2020-03-15",
"adm1_code": "44",
"asciiname": "EuroAirport Basel-Mulhouse-Freiburg",
"city_name_list": "Mulhouse",
"fcode": "AIRP",
"adm2_code": "68",
"wiki_link": "https://en.wikipedia.org/wiki/EuroAirport_Basel_Mulhouse_Freiburg",
"unlc_list": "CHBSL|=FRMLH|",
"population": "0",
"icao_code": "LFSB",
"date_until": "",
"country_code": "FR",
"alt_name_section": "es|Aeropuerto de Basilea-Mulhouse-Friburgo|=de|Flughafen Basel-Mülhausen|=it|Aeroporto di Basilea-Mulhouse-Friburgo|=ca|Aeroport de Basilea-Mulhouse-Friburg|=en|EuroAirport Basel–Mulhouse–Freiburg|p=fr|Aéroport de Bâle-Mulhouse-Fribourg|=wuu|巴塞尔-米卢斯-弗赖堡欧洲机场|=ru|Международный аэропорт Базель-Мюлуз-Фрайбург|=ja|ユーロエアポート|=fa|فرودگاه بازل-مولوز-فرایبورگ اروپا|",
"name": "EuroAirport Basel–Mulhouse–Freiburg",
"uic_list": "",
"date_from": "",
"iata_code": "MLH",
"distance": "",
"timezone": "Europe/Paris",
"is_geonames": "Y",
"dst_offset": "2.0",
"coord": {
"lon": "7.52991",
"lat": "47.58958"
},
"adm4_code": "68135",
"ccy_code": "EUR",
"cc2": "",
"country_name": "France",
"wac": "427",
"gtopo30": "263",
"adm1_name_ascii": "Grand Est",
"timestamp": "2020-03-29T15:12:23.000+02:00",
"elevation": "269",
"fclass": "S",
"faa_code": "",
"envelope_id": "",
"weighted_distance": "",
"tvl_por_list": "",
"adm2_name_utf": "Haut-Rhin",
"location_type": "A",
"geo_coord": {
"lon": "",
"lat": ""
},
"page_rank": "0.013945526285525285",
"adm1_name_utf": "Grand Est",
"city_detail_list": "MLH|2991214|Mulhouse|Mulhouse",
"city_code_list": "MLH",
"wac_name": "France",
"adm3_code": "684",
"iso31662": "GES",
"comment": "",
"gmt_offset": "1.0",
"raw_offset": "1.0"
},
"_ingest": {