COOKIES! Om nom nom nom...
- Data retrievers can be set up to pull information into the system.
- The information is aggregated in a knowledge base, grouped by its relation to a distinct entity.
- When information becomes known about an entity, a production rule system is run, using rules that may have arbitrarily
complex preconditions and that can trigger arbitrarily complex productions.
- Information about data objects can be easily enriched if it is determined that not enough information is known about
the object to process it.
- DSL free.
- Python 3.5+.
- Simple to add production rules and methods of gathering more information on-the-fly.
- Available as a Docker image.
Less documentation, more example
If you do not want to read about how the Cookie Monster system works and just want to look at an example of it in
action, please see the HGI Cookie Monster setup.
For better or for worse, naming within some parts of the system is Sesame Street themed...
- The collection of all information known about a particular data object is referred to as a "Cookie".
- The subsystem that stores a collection of Cookies is referred to as a "CookieJar".
- The HTTP API is referred to as "Elmo".
The system is called "Cookie Monster" because its behaviour is similar to that of the
Cookie Monster character in Sesame Street: it
shovels in all of the cookies, but only a few get digested/mashed into the hand puppet, with the rest falling back out.
At a minimum, a Cookie Monster installation comprises a CookieJar that can store Cookies. It is essentially a
knowledge base that stores unstructured JSON data and a limited amount of associated metadata. Each Cookie in the jar
holds the identifier of the data object to which it relates. A Cookie may also contain a number of "enrichments",
each of which holds information about the data object, along with details about where and when this information was
gathered.
A CookieJar implementation (named
BiscuitTin), which uses a CouchDB database, is supplied. It can be set up with:
```python
cookie_jar = BiscuitTin(couchdb_host, couchdb_database_name)
```
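To make the data model concrete, the following is an illustrative sketch of a Cookie and its enrichments. These dataclasses are hypothetical stand-ins, not the library's actual classes; the real `Cookie` and `Enrichment` types come from the `cookiemonster` package.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List

@dataclass
class Enrichment:
    """Illustrative stand-in for the library's Enrichment model."""
    source: str               # where the information came from
    timestamp: datetime       # when the information was gathered
    metadata: Dict[str, Any]  # the information itself (unstructured)

@dataclass
class Cookie:
    """Illustrative stand-in: a data object's identifier plus everything
    known about it, accumulated as a list of enrichments."""
    identifier: str
    enrichments: List[Enrichment] = field(default_factory=list)

cookie = Cookie("/my_study/sample.cram")
cookie.enrichments.append(
    Enrichment("irods_update", datetime(2016, 1, 1), {"checksum": "abc123"})
)
```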
A Cookie Monster installation can be set up with a Processor Manager, which uses Processors to examine Cookies after they
have been enriched. Processors essentially implement a production rule system, in which predefined rules are evaluated in
order of priority. If a rule's precondition is matched, its action is triggered, which may be an arbitrary set of
instructions. The action method's return value indicates whether any further rules should be processed
with the Cookie. If no rule matches, or no matched rule indicates that further processing is unnecessary, the
Processor will check whether the Cookie can be enriched further using an Enrichment Loader and put any extra information into
the knowledge base.
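The evaluation loop described above can be sketched in plain Python. This is an illustrative stand-in, not the actual Processor implementation; the real `Rule` and `Cookie` types come from the library.

```python
from typing import Callable, List, NamedTuple

class Rule(NamedTuple):
    """Illustrative stand-in for the library's Rule type."""
    matches: Callable[[dict], bool]  # precondition
    action: Callable[[dict], bool]   # returns True if further rules should run
    priority: int

def evaluate_rules(cookie: dict, rules: List[Rule]) -> bool:
    """Evaluate rules in priority order; return True if any rule matched.
    When no rule matches, the caller would instead see whether the Cookie
    can be enriched further via an Enrichment Loader."""
    matched = False
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if rule.matches(cookie):
            matched = True
            if not rule.action(cookie):  # action vetoed further processing
                break
    return matched
```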
A simple implementation of a Processor Manager (named
BasicProcessorManager) is supplied. It can be constructed as follows:
```python
processor_manager = BasicProcessorManager(number_of_processors, cookie_jar, rules_source, enrichment_loader_source)
```
It can then be set up to process Cookies as they are enriched in the CookieJar.
Rules have matching criteria (a precondition) against which Cookies are compared to determine whether any action should be
taken. If matched, the rule's action is executed, which can be an arbitrary set of commands. The action method then
returns whether further processing of the Cookie is required. The order in which rules are evaluated is determined by
their priority.
Changing rules on-the-fly
If a RuleSource is being used by your
ProcessorManager to obtain the rules that are evaluated by Processor
instances, it is possible to dynamically change the rules used by the Cookie Monster for future jobs (jobs already
running will continue to use the set of rules that they had when they were started).
The following example illustrates how a rule is defined and registered. If appropriate, the code can be inserted into an
existing rule file. Alternatively, it can be added to a new file in the rules directory, with a name matching the format
*rule.py. Rule files can be put into subdirectories. If the Python module does not compile (e.g. it
contains invalid syntax or uses a Python library that has not been installed), the module will be ignored.
```python
from cookiemonster.models import Cookie, Rule
from hgicommon.mixable import Priority
from hgicommon.data_source import register

MY_RULE_IDENTIFIER = "my_rule"

def _matches(cookie: Cookie, context: Context) -> bool:
    return "my_study" in cookie.path

def _action(cookie: Cookie, context: Context) -> bool:
    # <Interesting actions>
    return whether_any_more_rules_should_be_processed

_priority = Priority.MAX_PRIORITY
_rule = Rule(_matches, _action, MY_RULE_IDENTIFIER, _priority)
register(_rule)
```
To delete a pre-existing rule, delete the file containing it or remove the relevant call to
register. To modify a
rule, simply change its code; Cookie Monster will pick up the change when the file is saved.
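The dynamic discovery described above could be sketched roughly as follows. This is an illustrative stand-in for the actual Cookie Monster loader: it imports every file matching `*rule.py` under a directory (including subdirectories) and skips any module that fails to compile or import.

```python
import importlib.util
import pathlib

def load_rule_modules(rules_directory: str):
    """Import every *rule.py file under rules_directory, ignoring modules
    that fail to compile or import (illustrative sketch only)."""
    modules = []
    for path in pathlib.Path(rules_directory).rglob("*rule.py"):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except Exception:
            continue  # broken modules are ignored, as described above
        modules.append(module)
    return modules
```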
Please see the [rules used in the HGI Cookie Monster setup]
If all the rules have been evaluated and none of them has indicated in its action that no further processing of the Cookie
is required, Cookie "enrichment loaders" can be used to load more information about a Cookie.
Changing enrichment loaders on-the-fly
Similarly to rules, the enrichment loaders can be changed during execution. Files containing enrichment
loaders must have a name matching the format:
```python
from cookiemonster import EnrichmentLoader, Cookie, Enrichment
from hgicommon.mixable import Priority
from hgicommon.data_source import register

MY_ENRICHMENT_IDENTIFIER = "my_enrichment"

def _can_enrich(cookie: Cookie, context: Context) -> bool:
    # Only enrich if this source has not already provided an enrichment
    return "my_data_source" not in [enrichment.source for enrichment in cookie.enrichments]

def _load_enrichment(cookie: Cookie, context: Context) -> Enrichment:
    return my_data_source.load_more_information_about(cookie.path)

_priority = Priority.MAX_PRIORITY
_enrichment_loader = EnrichmentLoader(_can_enrich, _load_enrichment, MY_ENRICHMENT_IDENTIFIER, _priority)
register(_enrichment_loader)
```
Please see the [enrichment loaders used in the HGI Cookie Monster setup]
A Cookie Monster installation may use data retrievers, which get updates about data objects; these updates can be used to
enrich related Cookies in the CookieJar (a Cookie will be created if no previous information is known).
A retriever that periodically gets information about updates made to entities in an iRODS database
is shipped with the system. In order to use it, the specific queries defined in
resources/specific-queries must be installed on your iRODS server and a version of
baton above 0.16.3 must be installed. It can be
set up as follows:
```python
update_mapper = BatonUpdateMapper(baton_binaries_location)
database_connector = SQLAlchemyDatabaseConnector(retrieval_log_database)
retrieval_log_mapper = SQLAlchemyRetrievalLogMapper(database_connector)
retrieval_manager = PeriodicRetrievalManager(retrieval_period, update_mapper, retrieval_log_mapper)
```
It can then be linked to a CookieJar:
```python
executor = ThreadPoolExecutor(max_workers=NUMBER_OF_THREADS)

def put_updates_in_cookie_jar(update_collection: UpdateCollection):
    for update in update_collection:
        enrichment = Enrichment("irods_update", datetime.now(), update.metadata)
        executor.submit(timed_enrichment, update.target, enrichment)

retrieval_manager.add_listener(put_updates_in_cookie_jar)
```
A JSON-based HTTP API is provided to expose certain functionality as an
outwardly facing interface, on a configurable port. Currently, the
following endpoints are defined:
- `GET`: Get the current status details of the "to process" queue,
returning a JSON object with the following members:
- `POST`: Mark a file as requiring reprocessing, which will immediately
return it (if necessary) to the "to process" queue. This method
expects a JSON request body consisting of an object with a
member; the same is returned.
- `GET`: Get a file and its enrichments from the metadata repository, by
its identifier. (Note that the identifier must be percent-encoded. If
it begins with a slash, the query string form of this endpoint
must be used.)
- `DELETE`: Delete a file and its enrichments from the metadata
repository, by its identifier. (Note that the identifier must be
percent-encoded. If it begins with a slash, the query string form
of this endpoint must be used.)
- `GET`: Retrieve the runtime state of all current threads, for debugging purposes.
Note that all requests must include
application/json in their
Accept header.
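As a sketch, a request against the API could be constructed as follows. The base URL and the `/queue` path are hypothetical placeholders, since the actual port is configurable and the routes depend on the deployment.

```python
import urllib.request

# Hypothetical base URL: the real port is configurable, so adjust to match
# your deployment.
base_url = "http://localhost:5000"

request = urllib.request.Request(
    base_url + "/queue",  # hypothetical path for the queue-status endpoint
    headers={"Accept": "application/json"},  # the API expects JSON requests
    method="GET",
)
# urllib.request.urlopen(request) would perform the call; the response body
# could then be parsed with json.loads().
```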
How to develop
To run the tests, use
./scripts/run-tests.sh from the project's root directory. This script will
install all requirements for running the tests. Some tests use Docker, so a Docker
daemon must be running on the test machine, with the environment variables needed to connect to it (e.g. DOCKER_HOST) set.