
CMD2RDF system architecture

This page describes (a first sketch of) a system architecture for CMD2RDF. The aim of this architecture is to keep the RDF(S) representation of CMDI up to date with the growing CLARIN joint metadata domain and to offer stable services on top of that. The following figure provides a high-level overview:

The OAI harvester regularly collects all CMDI records from the CLARIN joint metadata domain and stores them on the catalog.clarin.eu server's filesystem for other tools, e.g., the VLO importer or CMD2RDF, to pick up.

CMD2RDF

In the initial stages, sample sets of CMD records have been converted to RDF, which revealed several bottlenecks/problems:

  • the conversion of a record needs access to its profile; when this access isn't cached, the Component Registry gets overloaded;
  • using shell scripts means that, when Saxon is used, JVM startup time and XSLT compilation costs are incurred for each conversion.

These costs can be lessened by using the SMC scripts for caching profiles and running the XSLT. Still, it is expected that more performance can be gained (and will be needed, given the growth in the number of harvested CMD records) by developing a multi-threaded, special-purpose Java tool. The CMDIValidator tool can be used as a basis/inspiration for efficiently processing multiple CMD records at the same time.
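
A very rough sketch of what the multi-threaded core of such a tool could look like. The directory layout, thread count and the convertRecord method are hypothetical; in the real tool they would be driven by the configuration described further below.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class RecordConverter {

    // Number of worker threads; in the real tool this would come from the config.
    private static final int THREADS = 4;

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        Path recordDir = Paths.get(args[0]); // directory with the harvested CMD records

        try (Stream<Path> records = Files.walk(recordDir)) {
            records.filter(p -> p.toString().endsWith(".xml"))
                   .forEach(p -> pool.submit(() -> convertRecord(p)));
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    // Hypothetical per-record conversion: in the real tool this would run the
    // configured chain of actions (XSLT transformation, enrichment, upload).
    static void convertRecord(Path record) {
        System.out.println("converting " + record);
    }
}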

Caching

The CMD2RDF tool should support caching. Since, next to conversion, enrichment of the RDF with links into the linked data cloud is also envisioned, this caching mechanism should be general, i.e., able to cache arbitrary HTTP (GET?) requests.

The cache could have a simple API and should be persistent, i.e., on a subsequent run of CMD2RDF the same CMD profiles, components and possibly other web resources should still be available. For web resources, HTTP caching directives might have to be honoured and might indicate that the cached resource should be refreshed for this run. Public CMD profiles and components are in principle static. However, from time to time Component Registry administrators do get requests to update, for example, a Concept Link. So the cache might decide to refresh a cached public profile or component once in a while.
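
A minimal sketch of what such a simple, persistent cache API could look like (the interface and method names are hypothetical):

import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

/**
 * Hypothetical cache API: resources are fetched via HTTP GET on a cache miss
 * and stored on disk, so they survive subsequent CMD2RDF runs.
 */
public interface ResourceCache {

    /** Return the cached resource, fetching and storing it first if necessary. */
    InputStream get(URI uri) throws IOException;

    /** Force a re-fetch, e.g., when HTTP caching directives or the refresh
        policy for public profiles/components indicate the copy may be stale. */
    InputStream refresh(URI uri) throws IOException;
}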

Some, if not all, of the conversion and enrichment takes place with the XSLT 2.0 Saxon processor. To keep the XSLT 2.0 stylesheets generic, they fetch the CMD profile, component or other web resource using the document() or doc() function. Saxon allows one to register a URIResolver, which can interact with the generic CMD2RDF cache.
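
A sketch of such a resolver, assuming the hypothetical ResourceCache interface above; it could be registered on the transformer with the standard JAXP setURIResolver call:

import java.net.URI;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

/**
 * Serves document()/doc() calls from the CMD2RDF cache instead of
 * fetching the profile, component or other web resource over HTTP each time.
 */
public class CachingResolver implements URIResolver {

    private final ResourceCache cache; // hypothetical cache API sketched above

    public CachingResolver(ResourceCache cache) {
        this.cache = cache;
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        try {
            URI uri = (base == null || base.isEmpty())
                    ? URI.create(href)
                    : URI.create(base).resolve(href);
            return new StreamSource(cache.get(uri), uri.toString());
        } catch (Exception e) {
            throw new TransformerException("Could not resolve " + href, e);
        }
    }
}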

The cache of CMD profiles needed by the CMD records also indicates which profiles (and the components they use) should be converted to RDFS. The cache should thus be queryable, i.e., return a list of cached URLs starting with http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/.
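
This queryability could be a simple extension of the hypothetical cache API sketched above, for example:

import java.util.List;

/** Hypothetical extension of the cache API sketched above. */
public interface QueryableResourceCache extends ResourceCache {

    /** Return all cached URLs that start with the given prefix, e.g., all
        cached Component Registry profile URLs. */
    List<String> listCachedUrls(String prefix);
}

Listing all cached URLs with the prefix http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/ would then yield exactly the profiles that have to be converted to RDFS.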

There is an implementation of a cache within the old prototype of MDService2. It was adapted specifically to the needs of that service (especially to its REST interface, i.e., request parameters), but maybe it can be used as a basis/inspiration.

Workflow

Conversion and enrichment might involve several steps of different types or purposes:

  1. XSLT 2.0 transformation
  2. enrichment by some Java code
  3. ...

The actions CMD2RDF takes for a record or a profile/component could be configurable. A first sketch:

<CMD2RDF>
  <config>
    <!-- some general config:
      - path to the harvested records
      - base-uri of the component registry
      - nr of threads
      - output directory
      - Virtuoso or even a general endpoint that supports the Graph Store HTTP protocol (http://www.w3.org/TR/sparql11-http-rdf-update/)
      - ...
    -->
  </config>
  <record>
    <action name="transform">
      <stylesheet href="CMDRecord2RDF.xsl">
        <!-- pass-on the registry defined in the config section above --> 
        <with-param name="registry" select="$registry"/>
      </stylesheet>
    </action>
    <action name="java">
      <class name="eu.clarin.cmd2rdf.findOrganisationNames"/>
    </action>
    <!-- ... other actions ... -->
  </record>
  <profile>
    <action name="transform">
      <stylesheet href="Component2RDF.xsl">
        <!-- pass-on the out directory defined in the config section above --> 
        <with-param name="out" select="$out"/>
      </stylesheet>
    </action>
  </profile>
  <component>
    <action name="transform">
      <stylesheet href="Component2RDF.xsl">
        <!-- pass-on the out directory defined in the config section above --> 
        <with-param name="out" select="$out"/>
      </stylesheet>
    </action>
  </component>
</CMD2RDF>

So the CMD2RDF tool would have three main phases:

  1. Process the CMD records (in multiple threads)
    1. for each record process the actions sequentially, where the output of each action is the input of the next (see the sketch after this list)
    2. the input of the first action should be a CMD record from the OAI harvester result set
    3. the result of the last action should be RDF XML to be uploaded to the Graph Store
  2. Process the profiles (as used by the records) (in multiple threads)
    1. for each profile process the actions sequentially
    2. the input of the first action is a CMD profile from the CMD2RDF cache
    3. the result of the last action should be RDF XML to be uploaded to the Graph Store
  3. Process the components (as used by the profiles) (in multiple threads)
    1. for each component process the actions sequentially
    2. the input of the first action is a CMD component from the CMD2RDF cache
    3. the result of the last action should be RDF XML to be uploaded to the Graph Store
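
As a rough sketch of the per-item chain used in all three phases, phase 1 could push a single record through its configured actions like this (Action is a hypothetical common interface, sketched further below):

import java.util.List;

/**
 * Phase 1 for a single record: run the configured actions sequentially,
 * feeding the output of each action into the next.
 */
public class RecordPipeline {

    private final List<Action> actions; // as configured in the <record> section

    public RecordPipeline(List<Action> actions) {
        this.actions = actions;
    }

    /** input: a CMD record from the OAI harvester result set;
        result: RDF XML to be uploaded to the Graph Store. */
    public byte[] process(byte[] cmdRecord) throws Exception {
        byte[] current = cmdRecord;
        for (Action action : actions) {
            current = action.execute(current);
        }
        return current;
    }
}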

Note: for phase 3, phase 2 could output the components it uses, which might complicate the caching, as until now the actions would only request resources from the cache, not store them. Alternatively, phase 2 could simply request the used components from the Component Registry and thus trigger caching of them. This puts some extra load on the Component Registry, but might keep the CMD2RDF caching API simpler.

All action types would implement a common interface so that they can be loaded using the Java Reflection API (although as a start the action types can be hardcoded). Next to the output of the previous action (or the initial CMD resource), they get the XML snippet for the action from the configuration. Using this information they can configure themselves, e.g., load the XSLT stylesheet or Java class (the latter would require reflection), and execute.
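
A minimal sketch of such a common interface (the names and the byte[] payload type are assumptions):

import org.w3c.dom.Element;

/**
 * Hypothetical common interface for all action types (transform, java, ...).
 * Each action is configured from its XML snippet in the CMD2RDF configuration
 * and then applied to the output of the previous action.
 */
public interface Action {

    /** Configure the action from its <action> element, e.g., compile the
        referenced XSLT stylesheet or load the named Java class. */
    void configure(Element actionConfig) throws Exception;

    /** Apply the action to the output of the previous action (or the initial
        CMD record/profile/component) and return its own output. */
    byte[] execute(byte[] input) throws Exception;
}

Instantiating an action type by class name could then be done with plain reflection, e.g., (Action) Class.forName(className).newInstance(), but hardcoding a small map from action names to classes is a perfectly fine start.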

For inspiration: the OAI harvester allows one to configure actions on harvested records, e.g., to convert OLAC to CMDI.

TODO: the harvest isn't incremental, but CMD2RDF could be. It could handle only the delta with respect to a previous run, i.e., convert and enrich new CMD records, convert and enrich updated CMD records, remove graphs for deleted CMD records, and do the same for profiles and components. MD5 checksums might help to determine whether CMD records have changed ...
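
A small sketch of such a checksum-based change check; where and how the checksums of the previous run are stored is left open here.

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

/**
 * A record only needs to be reconverted (and its graph replaced) when its
 * checksum differs from the one recorded in a previous CMD2RDF run.
 */
public class ChangeDetector {

    public static String md5(Path record) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(record));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static boolean hasChanged(Path record, String previousChecksum) throws Exception {
        return !md5(record).equals(previousChecksum);
    }
}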

Enrichment - Mapping Field Values to Semantic Entities

The biggest challenge, and at the same time the most rewarding/useful step of the process, is the resolution of literal values to resource/entity references (from other/external vocabularies). This is essentially the "linking" step in "Linked Open Data".

We can identify 5 steps:

  1. identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task, though see the @clavas:vocabulary attribute introduced in CMD 1.2)
  2. extract distinct < data category, value > pairs from the metadata records
  3. lookup the individual literal values in given reference data (as indicated by the data category) to retrieve candidate entities, concepts
  4. assess the reliability of the match
  5. generate new RDF triples with entity identifiers as object properties

Note: it is important to perform the lookup on aggregated data (distinct < data category, value > pairs); looking up every occurrence per record would introduce a huge inefficiency.
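
A sketch of steps 2 and 3 under that constraint: the distinct pairs are collected first, and each distinct pair is looked up only once. The VocabularyService interface is a placeholder for, e.g., a CLAVAS client.

import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class EntityLookup {

    /** Hypothetical vocabulary service client (e.g., CLAVAS). */
    public interface VocabularyService {
        List<URI> lookup(String dataCategory, String value);
    }

    /** data category -> distinct literal values seen for it */
    private final Map<String, Set<String>> pairs = new HashMap<String, Set<String>>();

    /** Step 2: called while scanning the metadata records. */
    public void collect(String dataCategory, String value) {
        Set<String> values = pairs.get(dataCategory);
        if (values == null) {
            values = new HashSet<String>();
            pairs.put(dataCategory, values);
        }
        values.add(value);
    }

    /** Step 3: look up each distinct pair once and keep the candidate entities. */
    public Map<String, List<URI>> resolve(String dataCategory, VocabularyService vocab) {
        Map<String, List<URI>> candidates = new HashMap<String, List<URI>>();
        for (String value : pairs.get(dataCategory)) {
            candidates.put(value, vocab.lookup(dataCategory, value));
        }
        return candidates;
    }
}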

The following figure sketches the processing workflow for resolving literal values to entities using an external Vocabulary Service (CLAVAS?); it originates from the SMC4LRT thesis by Durco:

Sketch of the processing workflow for resolving literal values to entities using an external Vocabulary Service (CLAVAS?)

Modelling issues

In the RDF modelling of the CMD data, we foresee two predicates for every CMD-element, one holding the literal value (e.g. cmd:hasPerson.OrganisationElementValue) and one for the corresponding resolved entity (cmd:hasPerson.OrganisationElementEntity). The triple with literal value is generated during the basic CMDRecord2RDF transformation. The entity predicate has to be generated in a separate action, essentially during the lookup step.

Example of encoding a person's affiliation:

_:org                  a   cmd:Person.Organisation ;
   cmd:hasPerson.OrganisationElementValue     "MPI"^^xs:string ;
   cmd:hasPerson.OrganisationElementEntity    <http://www.mpi.nl/> .
<http://www.mpi.nl/>   a   cmd:OrganisationElementEntity .

CMD-RDF

Instead of exposing Virtuoso directly, CMD-RDF is the main access point to the Virtuoso triple store. This servlet should provide or support the following interfaces:

  • a SPARQL endpoint
  • a REST API
  • if needed a browser UI, but a generic UI might be based on the SPARQL endpoint
  • an admin interface to see how CMD2RDF is behaving

SPARQL endpoint

The endpoint should in principle follow the query operation part of the SPARQL 1.1 protocol. This might be limited by what Virtuoso actually supports, but the servlet layer might be able to amend some shortcomings. To prevent burdening the server too heavily, the SPARQL endpoint should be limited in the following ways:

  • only queries should be allowed, i.e., no updates
  • queries should be limited in their running time, e.g., a minute (?)
  • only a limited number of concurrent queries are allowed

All or some of these limits might already be supported by Virtuoso.
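
A sketch of how the concurrency limit could be enforced in the servlet layer, assuming a plain servlet filter placed in front of the SPARQL endpoint; the limit of 10 is an arbitrary example, and the query-time limit and read-only restriction would be handled similarly or left to Virtuoso.

import java.io.IOException;
import java.util.concurrent.Semaphore;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

/** Caps the number of SPARQL queries handled concurrently in front of Virtuoso. */
public class QueryLimitFilter implements Filter {

    // Maximum number of concurrent queries (hypothetical value; could come from config).
    private final Semaphore slots = new Semaphore(10);

    public void init(FilterConfig config) throws ServletException { }

    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        if (!slots.tryAcquire()) {
            ((HttpServletResponse) resp).sendError(503, "Too many concurrent queries");
            return;
        }
        try {
            chain.doFilter(req, resp); // forward to the actual SPARQL endpoint
        } finally {
            slots.release();
        }
    }

    public void destroy() { }
}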

REST API

Each CMD record corresponds to a single graph which has a unique URI, e.g., https://catalog.clarin.eu/cmd/rdf/TLA/MPI123. It should become possible to access these graphs by simply resolving these URIs. This can be a simple wrapper around Virtuoso's support for the SPARQL 1.1 Graph Store HTTP Protocol, where again only GET requests should be supported.
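
A rough sketch of such a wrapper, assuming the triple store's Graph Store endpoint is reachable at a configured URL (the one below is a placeholder) and that graphs are addressed indirectly via the graph query parameter, as the protocol specifies:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLEncoder;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Read-only REST wrapper: a GET for a graph URI such as
 * https://catalog.clarin.eu/cmd/rdf/TLA/MPI123 is forwarded to the triple
 * store's SPARQL 1.1 Graph Store endpoint.
 */
public class GraphServlet extends HttpServlet {

    // Placeholder for the configured Graph Store endpoint of the triple store.
    private static final String GRAPH_STORE = "http://localhost:8890/sparql-graph-crud";

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Reconstruct the graph URI from the request and use indirect graph
        // identification (?graph=...) as defined by the Graph Store protocol.
        String graphUri = req.getRequestURL().toString();
        URL url = new URL(GRAPH_STORE + "?graph=" + URLEncoder.encode(graphUri, "UTF-8"));

        resp.setContentType("application/rdf+xml");
        try (InputStream in = url.openStream(); OutputStream out = resp.getOutputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}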

Additionally, it should become possible to retrieve sets of graphs, e.g., https://catalog.clarin.eu/cmd/rdf/TLA to get all the graphs of TLA CMD records. The result could be one big TriG document or, probably more convenient, a ZIP file containing the RDF representations of the records.

Browser

Preferably expose a generic browser, e.g., Elda.

Admin

Administrators should be able to see the status of CMD2RDF, e.g., how many graphs are stored in Virtuoso, access statistics, conversion statistics, process info.

Access to the admin console can be restricted to a configured set of Shibboleth principals.

Questions, Issues, Discussion

  • Where does the normalization of values come in?
    Naturally, the lookup step is a good place to get a normalized value, as the vocabularies used for lookup mostly maintain all the alternative labels for a given entity. So one would get the entity reference and the normalized value in one go.