Public | Automated Build

Last pushed: 10 days ago
Short Description
A DCAT-AP validator for the Flemish Open Data Portal.
Full Description

#+TITLE: VODAP-DCATAP-VALIDATION

  • Introduction

The [[https://www.w3.org/TR/vocab-dcat/][Data Catalog Vocabulary]] (DCAT) is an RDF schema targeting the
exchanging of data catalogs (e.g. for harvesting, etc.). It defines
the common meta-data relationships and properties to describe data
source references or indeed a set of references. [[https://joinup.ec.europa.eu/asset/dcat_application_profile/asset_release/dcat-ap-v11][DCAT-AP]] is a
specialisation of DCAT targetting European public sector data portals.

Making such meta-data catalogs available as RDF is typically an
incremental operation, for which some form of validator will greatly
improve the speed and usefullness of the created catalog (reducing the
turn around time). The vodap_validator is a variant of the
[[https://github.com/SEMICeu/dcat-ap_validator][dcat-ap_validator]] tool which allows this validation to be performed
on demand and which will produce a report based on executing sparql
queries on the RDF data source.

The main entry point to the system is a HTML form, which will accepts
a URL reference to an RDF catalog. This will be accessible via the
link, when using the vagrant environment:

http://localhost:80/vodap_validator/

The catalog to be downloaded must be a single publically accessible
document containing RDF in XML or NTriple format. Submitting the form
triggers a downloading of the referenced catalog, execution of the
validation rules and the generation of a HTML report.
** System Outline

To reduce the deployment costs and issues, docker and docker-compose
have been used to co-ordinate the deployment of the webservice and the
associated virtuoso service. Virtuoso is used as the temporary work
RDF store against which the validation rules are executed.

The 'webservice' docker is the main component and this contains an
apache webserver with all the required scripts and tools (i.e. bash,
etc.) required to download the entered URL, validate the catalog and
create the report (which is based on using org-mode as the markup
language before converting it to HTML).

As this is a web-based service, the speed of the catalog downloading
will impact greatly on the performance of the overall system (as will
the number of errors and warnings which have to be gathered and
reported on).

A vagrant description has also been provided for development purposes
which will give a provision a complete machine on which to run the
docker components (using virtualbox on a windows machine). Configuring
the service to be acessible via the internet requires changes in the
Apache configuration rules.

** Report Sections

As indicated above when the form is submitted, the contents of the
data source (pointed to using the URL) will be downloaded and inserted
into the RDF store (virtuoso). Against this store (using a specific
timestamp graph), each of the SPARQL validation rules will be executed and the
resulting details recorded in a CSV file (all the rules have comment
output).

Once the CSV file containing the results available, it is passed to
the report generation script to construct a report based the
information available (the initial format is org-mode which is then
converted to HTML using emacs in batch mode). The HTML report is a
variant of the 'readthedocs' form which is created using an org-mode
file. Each section in the report relates to a specific error or
warning (the number of errors in each section is restricted though to
the first 50).

  • Deployment
    ** New machine setup using vagrant

To simplify development and to test the deployment, [[http:://vagrant.com][Vagrant]] is used to
create and provision an up-to-date ubuntu/debian Linux virtual
machine.

Assuming [[https://www.vagrantup.com]["Vagrant"]] is installed the following is all that should be
needed is:

#+BEGIN_EXAMPLE
vagrant up

#+END_EXAMPLE

This will start up virtualbox, install the expected Ubuntu image, and
then installs the first the necessary system tools: docker,
git,... defined in bootstrap.sh. The application dependencies are
collected in the subsequent script:

#+BEGIN_EXAMPLE
application_bootstrap.sh.

#+END_EXAMPLE

** On an existing machine

On an existing UBUNTU system the setup starts from the script:

#+BEGIN_EXAMPLE
application_bootstrap.sh

#+END_EXAMPLE

By running this (as sudo) the necessary application dependencies are
installed (requires internet access) and setup.

  • Execution
    [Needs cleaning up]
    The commands are initiated through the make command.

    • make startupvirtuoso : starts a local virtuoso
    • make loadCatalog : loads the catalog on disk into virtuoso
    • make vodapreport : creates the report in html based on the content in the virtuoso and the activated rules

Commands to manage the catalog state

- make rmCatalog: removes the latest catalog aggregated file from disk 
- make cleanCatalog: removes the latest catalog from disk
- make createCatalog: create the latest catalog (from the existing downloads, if present)

Commands to manage the set of validation rules

- make VODAPrules     : use the rules for VODAP
- make ISAVODAPrules  : use the rules from ISA adapted to VODAP (SPARQL) case
- make ISArules       : use the rules from ISA (as is)
  • As Webservice
    * Building the webservice
    The reconsitory contains a number of docker-compose
    .yml files. The first is the
    production environment, but the -dev.yml is one which overrides serveral
    environment settings within the vagrant environment (to make it feel like
    browsing the target production environment).

The first task here, is to make a development and production
docker-compose.

The creation of a new webservice to test locally is done as follows:

+ Firstly ensure that the application_bootstrap.sh has been run.
+ (re)build the service using the following

#+BEGIN_EXAMPLE

docker-compose -f docker-compose.yml build

#+END_EXAMPLE

+ stopping and starting the previous build
#+BEGIN_EXAMPLE

docker-compose -f docker-compose.yml down
docker-compose -f docker-compose.yml up -d

#+END_EXAMPLE

** deploying the webservice
[TODO]: make a development and production docker-compose

Deploying the ready made build is as easy as the following

#+BEGIN_EXAMPLE
wget https://raw.githubusercontent.com/tenforce/vodap-dcatap-validation/master/docker-compose.yml
#+END_EXAMPLE

This docker-compose file contains the VODAP default settings 

#+BEGIN_EXAMPLE
docker-compose -f docker-compose.yml up -d
#+END_EXAMPLE

To extend the environment there is a 'docker-compose-dev.yml'
file which will override some of the options used in the 
main 'docker-compose.yml' file. To use the file:

#+BEGIN_EXAMPLE
docker-compose -f docker-compose.yml -f docker-compose-dev.yml <command>
#+END_EXAMPLE

Note: The docker-compose command 'extends' has problems with the
'links' so cannot be used to simplify the above.

** Startup of the webservice
The vodap_validator can be deployed as a webservice using the
docker-compose file. All that should be needed the first time is:

#+BEGIN_EXAMPLE
cd /vagrant # if using the vagrant machine
docker-compose up -d

#+END_EXAMPLE

Once started, browsing to http://localhost/vodap_validator should
results in a simple webform being visible. The form expects the URL of
the catalog in DCAT-AP form serialized in RDF to be validated. It
supports serializations in ntriples, turtle and RDF/XML.

Once validated the, webservice will forward then the user to a
timestamped directory containing validation report:

The validation consists of 2 levels:

  • RDF serialization compliance: is the provided file RDF compliant.
    Typical issues are incorrect URI identifiers containing for example
    spaces.
  • DCAT-AP validation: is the provided file DCAT-AP(VO) compliant.
  • note
    if the download size of the catalog is too large increase this value
    MaxDataSourceSize = 20971520 . Controls the max size that can be sponged. Default is 20 MB.

  • Acknowledgements

Docker Pull Command
Owner
bertvannuffelen

Comments (0)