The project experiments with ways to generate data processing pipelines.
The aim is to generate some re-usable building blocks that can be piped
together into more functional pipelines. Their prime initial use is as executors
for the Squonk Computational Notebook (http://squonk.it) though it is expected
that they will have uses in other environments.
As well as being executable directly they can also be executed in Docker
containers (separately or as a single pipeline). Additionally they can be
executed using Nextflow (http://nextflow.io) to allow running large jobs
on HPC-like environments.
Currently it has some Python scripts using RDKit (http://rdkit.org) to provide
basic cheminformatics and comp chem functionality, though other tools will
be coming soon, including some from the Java ecosystem.
- See here for more info on the RDKit components.
- See here for more info on running these in Nextflow.
Note: this is experimental, subject to change, and there are no guarantees that things work as expected!
That said, it has already proved to be highly useful in the Squonk Computational Notebook. If you are interested, let us know and join the fun.
The code is licensed under the Apache 2.0 license.
In Jan 2018 some of the core functionality from this repository was broken out into the pipeline-utils repository. This included utility Python modules, as well as the creation of a test framework that makes it easier to create and test new modules. This change also makes it easier to create additional pipeline-like projects. See the Readme in the pipeline-utils repo for more details.
Each component should be small but useful. Try to split complex tasks into
reusable steps. Think how the same steps could be used in other workflows.
Allow parts of one component to be used in another component where appropriate
but avoid overuse. For example, see the use of functions in rdkit/conformers.py
to generate conformers in o3dAlign.py.
Consistent approach to how components function, regarding:
- Use as simple command line tools that can be piped together
- Inputs and outputs either as files or using STDIN and STDOUT
- Any info/logging written to STDERR to keep STDOUT free for output
- Consistent approach to command line arguments across components
Generally use consistent coding styles e.g. PEP8 for Python.
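
As an illustration of these conventions, here is a minimal sketch of a component that can sit in a pipe. It is not code from this repository: the function names `log` and `pass_through` are hypothetical, and the only thing it does is copy SD file records through while counting them. Data goes to the supplied writer (STDOUT in normal use) and the log message goes to STDERR.

```python
import sys


def log(message):
    """Write diagnostics to STDERR so that STDOUT carries only data."""
    print(message, file=sys.stderr)


def pass_through(lines, write):
    """Copy SD file lines unchanged and count the records seen.

    lines: any iterable of text lines (e.g. sys.stdin)
    write: a callable that accepts one line (e.g. sys.stdout.write)
    """
    count = 0
    for line in lines:
        write(line)
        # In an SD file each record is terminated by a line containing $$$$
        if line.strip() == "$$$$":
            count += 1
    log("Records passed through: %d" % count)
    return count


# Typical use inside a script that is part of a pipe:
#     pass_through(sys.stdin, sys.stdout.write)
```

Because the log line goes to STDERR, the output of one such script can be piped straight into the next without the data stream being polluted.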
Input and output formats
We aim to provide consistent input and output formats to allow results to be
passed between different implementations. Currently all implementations handle
chemical structures so SD file would typically be used as the lowest common
denominator interchange format, but implementations should also try to support
Squonk's JSON based Dataset formats, which potentially allow richer representations
and can be used to describe data other than chemical structures.
The utils.py module provides helper methods to handle IO.
In addition implementations are encouraged to support "thin" output formats
where this is appropriate. A "thin" representation is a minimal representation
containing only what is new or changed, and can significantly reduce the bandwidth
used and avoid the need for the consumer to interpret values it does not
need to understand. It is not always appropriate to support thin format output
(e.g. when the structure is changed by the process).
In the case of SDF, thin format involves using an empty molecule for the molecule
block and including all properties that were present in the input or were generated
by the process (the empty molecule is used so that the SDF syntax remains valid).
In the case of Squonk JSON output the thin output would be of type BasicObject
(e.g. containing no structure information) and include all properties that
were present in the input or were generated by the process.
Implicit in this is that some identifier (usually a SD file property, or
the JSON UUID property) that is present in the input is included in the output so
that the full results can be "reassembled" by the consumer of the output.
The input would typically only contain additional information that is required
for execution of the process e.g. the structure.
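
To make the thin SDF idea concrete, the sketch below builds one thin record in plain Python: an empty V2000 molecule block followed by the properties. This is an illustration only, not the project's own writer; the function name `thin_sdf_record` and the property names in the usage example are hypothetical.

```python
# An empty V2000 molecule block: three header lines (all blank here),
# a counts line declaring zero atoms and zero bonds, and the terminator.
EMPTY_MOLBLOCK = (
    "\n"
    "\n"
    "\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
)


def thin_sdf_record(properties):
    """Return one SD file record with an empty molecule and the given properties.

    properties: dict of property name -> value. It should include the
    identifier that lets the consumer reassemble the full results.
    """
    parts = [EMPTY_MOLBLOCK]
    for name, value in properties.items():
        # Each data item is a header line, the value, then a blank line
        parts.append(">  <%s>\n%s\n\n" % (name, value))
    parts.append("$$$$\n")
    return "".join(parts)
```

For example, `thin_sdf_record({"uuid": "abc", "score": 1.5})` gives a record that carries the identifier and the new value but no structure.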
For consistency implementations should try to honor these command line
switches relating to input and output:
-i and --input: For specifying the location of the single input. If not specified
then STDIN should be used. File names ending with .gz should be interpreted as
gzipped files. Input on STDIN should not be gzipped.
-if and --informat: For specifying the input format where it cannot be inferred
from the file name (e.g. when using STDIN). Values would be sdf or json.
-o and --output: For specifying the base name of the outputs (there could be multiple
output files, each using the same base name but with a different file extension).
If not specified then STDOUT should be used. Output file names ending with
.gz should be compressed using gzip. Output on STDOUT would not be gzipped.
-of and --outformat: For specifying the output format where it cannot be inferred
from the file name (e.g. when using STDOUT). Values would be sdf or json.
--meta: Write additional metadata and metrics (mostly relevant to Squonk's
JSON format - see below). Default is not to write.
--thin: Write output in thin format (only present where this makes sense).
Default is not to use thin format.
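
The switches above could be wired up with argparse roughly as follows. This is a sketch under stated assumptions rather than the helper code in this project: `build_parser`, `infer_format` and `open_input` are hypothetical names, and the error handling is only one possible choice.

```python
import argparse
import gzip
import sys


def build_parser():
    """Create a parser with the common input/output switches."""
    parser = argparse.ArgumentParser(description="Example pipeline component")
    parser.add_argument("-i", "--input", help="input file (STDIN if omitted)")
    parser.add_argument("-if", "--informat", choices=["sdf", "json"],
                        help="input format, needed when it cannot be inferred")
    parser.add_argument("-o", "--output", help="base name for outputs (STDOUT if omitted)")
    parser.add_argument("-of", "--outformat", choices=["sdf", "json"],
                        help="output format, needed when it cannot be inferred")
    parser.add_argument("--meta", action="store_true", help="write metadata and metrics")
    parser.add_argument("--thin", action="store_true", help="write thin output")
    return parser


def infer_format(filename, explicit=None):
    """Return 'sdf' or 'json', preferring an explicitly specified format."""
    if explicit:
        return explicit
    if filename is None:
        raise ValueError("format must be specified when using STDIN or STDOUT")
    # Ignore a trailing .gz when looking at the extension
    name = filename[:-3] if filename.endswith(".gz") else filename
    for fmt in ("sdf", "json"):
        if name.endswith("." + fmt):
            return fmt
    raise ValueError("cannot infer format from " + filename)


def open_input(filename):
    """Open the input as text: STDIN if no file, gzip if the name ends with .gz."""
    if filename is None:
        return sys.stdin
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt")
    return open(filename, "r")
```

With this sketch, `build_parser().parse_args(["-i", "mols.sdf.gz", "--thin"])` followed by `infer_format(args.input, args.informat)` resolves the input format to sdf without needing the -if switch.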
The JSON format for input and output makes heavy use of UUIDs that uniquely
identify each structure. Generally speaking, if the structure is not changed
(e.g. properties are just being added to input structures) then the existing
UUID should be retained so that UUIDs in the output match those from the input.
However if new structures are being generated (e.g. in reaction enumeration
or conformer generation) then new UUIDs MUST be generated as there is no longer
a straight relationship between the input and output structures. Instead you
probably want to store the UUID of the source structure(s) as a field(s) in
the output to allow correlation of the outputs to the inputs (e.g. for conformer
generation, output the source molecule UUID as a field so that each conformer
identifies which source molecule it was derived from).
When not using JSON format the need to handle UUIDs does not necessarily apply
(though if there is a field named 'uuid' in the input it will be respected accordingly).
To accommodate this you are recommended to ALSO specify the input molecule number
(1 based index) as an output field independent of whether UUIDs are being handled
as a "poor man's" approach to correlating the outputs to the inputs.
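
The sketch below illustrates this bookkeeping for a process that generates several new structures per input. It is a simplified illustration: records are plain dicts rather than Squonk's actual JSON objects, and the function name `expand_with_provenance` and the field names `SourceMolUUID`, `SourceMolNum` and `ChildNum` are hypothetical.

```python
import uuid


def expand_with_provenance(records, n_children=3):
    """Yield n_children derived records for each input record.

    Every derived record gets a fresh UUID, plus fields that point back
    to the source: its UUID (if it had one) and its 1-based position.
    """
    for index, source in enumerate(records, start=1):
        for child_num in range(1, n_children + 1):
            yield {
                "uuid": str(uuid.uuid4()),  # new structure, so new UUID
                "values": {
                    "SourceMolUUID": source.get("uuid"),
                    "SourceMolNum": index,
                    "ChildNum": child_num,
                },
            }
```

The input molecule number is written even when a source UUID is available, so a consumer of SD file output can still correlate results with inputs.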
When a service filters molecules, special attention is needed to ensure
that the molecules are output in the same order as the input (obviously skipping
structures that are filtered out). When using multi-threaded execution this is
especially important as results will usually not come back in exactly the same
order as the input. The service descriptor (.dsd.json) file also needs special
care. For instance, take a look at the "thinDescriptors" section of
src/pipelines/rdkit/screen.dsd.json.
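
One way to keep filtered output in input order when work is spread across threads is to rely on `Executor.map` from Python's concurrent.futures, which returns results in the order of its inputs regardless of which task finishes first. The sketch below is an assumption about how this could be done, not the approach taken by the components in this repository; `filter_in_order` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_in_order(records, predicate, workers=4):
    """Return the records that pass predicate, preserving input order."""
    records = list(records)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order even if tasks complete
        # out of order, so the flags line up with the records
        flags = list(pool.map(predicate, records))
    return [rec for rec, keep in zip(records, flags) if keep]
```

Collecting a keep/skip flag per record, rather than letting each thread write its own output, is what keeps the ordering deterministic.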
To provide feedback about what happened you are strongly recommended to generate
a metrics output file (e.g. output_metrics.txt). The contents of this file are
fairly simple, each line having a
key=value
syntax. Keys beginning and ending with __ (2 underscores) have magical meaning.
All other keys are treated as metrics that are recorded against that execution.
The current magical values that are recognised are:
- __InputCount__: The total count of records (structures) that are processed
- __OutputCount__: The count of output records
- __ErrorCount__: The number of errors encountered
Here is a typical metrics file:
__InputCount__=360
__OutputCount__=22
PLI=360
It defines the input and output counts and specifies that 360 PLI 'units'
should be recorded as being consumed during execution.
The purpose of the metrics is primarily to be able to charge for utilisation, but
even if not charging (which is often the case) then it is still good practice
to record the utilisation.
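
A metrics file like the one shown above could be produced with a small helper along these lines. This is a sketch, not the project's own utility: `format_metrics` and `write_metrics` are hypothetical names, and deriving the file name as the output base name plus `_metrics.txt` is an assumption based on the output_metrics.txt example.

```python
def format_metrics(input_count, output_count, error_count=None, **metrics):
    """Return the metrics file content, one key=value pair per line."""
    lines = [
        "__InputCount__=%d" % input_count,
        "__OutputCount__=%d" % output_count,
    ]
    if error_count is not None:
        lines.append("__ErrorCount__=%d" % error_count)
    # Any other keys are ordinary metrics recorded against the execution
    for key, value in metrics.items():
        lines.append("%s=%s" % (key, value))
    return "\n".join(lines) + "\n"


def write_metrics(base_name, text):
    """Write the metrics next to the other outputs (naming is an assumption)."""
    with open(base_name + "_metrics.txt", "w") as handle:
        handle.write(text)
```

Calling `format_metrics(360, 22, PLI=360)` reproduces the typical file shown above.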
Squonk's JSON format requires additional metadata to allow proper handling
of the JSON. Writing detailed metadata is optional, but recommended. If
not present then Squonk will use a minimal representation of metadata, but
it's recommended to provide this directly so that additional information can be provided.
At the very minimum Squonk needs to know the type of dataset (e.g. MoleculeObject
or BasicObject), but this should be handled for you automatically if you use
the utils.default_open_output* methods. Better though to also specify metadata for
the field types when you do this. See e.g. conformers.py for an example of
how to do this.
Any questions contact: