Version 4 (modified by Menzo Windhouwer, 10 years ago)

CMD2RDF system architecture

This page describes a first sketch of a system architecture for CMD2RDF. The aim of this architecture is to keep the RDF(S) representation of CMDI up to date with the growing CLARIN joint metadata domain and to offer stable services on top of that. The following figure provides a high-level overview:

The OAI harvester regularly collects all CMDI records from the CLARIN joint metadata domain and stores them on the filesystem of the catalog.clarin.eu server, where other tools, e.g., the VLO importer or CMD2RDF, can pick them up.

CMD2RDF

In the initial stages, sample sets of CMD records were converted to RDF, which revealed several bottlenecks/problems:

  • the conversion of a record needs access to the profile; when this access isn't cached, the Component Registry gets overloaded;
  • using shell scripts means, when Saxon is used, paying JVM startup time and XSLT compilation costs for each conversion.

These costs can be lessened by using SMC scripts for caching profiles and running the XSLT. Still, it is expected that we can gain more performance, and will need to because of the growth in the number of harvested CMD records, by developing a multi-threaded special-purpose Java tool. The CMDIValidator tool can be used as a basis/inspiration for efficiently processing multiple CMD records at the same time.
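
A minimal sketch of the multi-threaded approach, assuming plain JAXP: the stylesheet is compiled once into a `Templates` object (which JAXP requires to be thread-safe) and each task creates its own `Transformer`. The class and method names are hypothetical. With Saxon on the classpath the same JAXP calls are expected to select it, but that is an assumption to verify for the Saxon version in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: one JVM, one compiled stylesheet, many records.
class ParallelTransformSketch {

    // Compile the stylesheet once, so compilation costs are not paid per record.
    static Templates compile(String xslt) throws Exception {
        return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));
    }

    // Transform all records on a fixed-size thread pool; results keep the input order.
    static List<String> transformAll(Templates templates, List<String> records, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String record : records) {
                futures.add(pool.submit(() -> {
                    // A Transformer is not thread-safe, so each task creates its own.
                    Transformer transformer = templates.newTransformer();
                    StringWriter out = new StringWriter();
                    transformer.transform(
                            new StreamSource(new StringReader(record)),
                            new StreamResult(out));
                    return out.toString();
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real tool the records would be read from the harvester's output directory rather than passed as strings, and the number of threads would come from the configuration.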

Caching

The CMD2RDF tool should support caching. Since, next to conversion, enrichment of the RDF with links into the linked data cloud is also envisioned, this caching mechanism should be general, i.e., able to cache arbitrary HTTP (GET?) requests.

The cache could have a simple API and should be persistent, i.e., on a subsequent run of CMD2RDF the same CMD profiles, components and possibly other web resources should be available. For web resources, HTTP caching directives might be obeyed, and these might indicate that the cached resource should be refreshed for this run. Public CMD profiles and components are in principle static. However, from time to time Component Registry administrators do get requests to update, for example, a Concept Link. So the cache might decide to refresh a cached public profile or component once in a while.
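
A minimal sketch of such a persistent cache, under stated assumptions: entries are stored as files named after a SHA-256 hash of the URL, a fixed maximum age stands in for HTTP caching directives, and the actual HTTP GET is passed in as a function so the sketch stays testable. All names are hypothetical, and the class is not thread-safe as written.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Duration;
import java.time.Instant;
import java.util.function.Function;

// Hypothetical sketch of a disk-backed cache with a single-method API.
class PersistentCacheSketch {
    private final Path dir;                          // where cached bodies are kept between runs
    private final Duration maxAge;                   // assumed refresh policy, not an HTTP directive
    private final Function<String, byte[]> fetcher;  // performs the real request on a miss

    PersistentCacheSketch(Path dir, Duration maxAge, Function<String, byte[]> fetcher) {
        this.dir = dir;
        this.maxAge = maxAge;
        this.fetcher = fetcher;
    }

    // Return the cached body if it is fresh enough, otherwise fetch and store it.
    byte[] get(String url) {
        try {
            Files.createDirectories(dir);
            Path file = dir.resolve(hash(url));
            if (Files.exists(file)) {
                Instant modified = Files.getLastModifiedTime(file).toInstant();
                if (modified.plus(maxAge).isAfter(Instant.now())) {
                    return Files.readAllBytes(file);
                }
            }
            byte[] body = fetcher.apply(url);
            Files.write(file, body);
            return body;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hex-encoded SHA-256 of the URL, used as a filesystem-safe file name.
    private static String hash(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the entries live on disk, a second run that points at the same directory finds the resources fetched by the first run without contacting the Component Registry again.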

Some, if not all, of the conversion and enrichment takes place with the XSLT 2.0 Saxon processor. To keep the XSLT 2.0 stylesheets generic, they fetch the CMD profile, component or other web resource using the document() or doc() function. Saxon allows one to register a URIResolver, which can interact with the generic CMD2RDF cache.
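
A minimal sketch of a resolver that routes such requests through a cache, assuming the standard JAXP `javax.xml.transform.URIResolver` interface. The class name is hypothetical, and a plain function from URL to bytes stands in for the CMD2RDF cache.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.util.function.Function;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: serves document()/doc() requests from a cache lookup.
class CachingUriResolverSketch implements URIResolver {
    private final Function<String, byte[]> cache; // stand-in for the CMD2RDF cache

    CachingUriResolverSketch(Function<String, byte[]> cache) {
        this.cache = cache;
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        try {
            // Relative references are resolved against the base URI, if there is one.
            String absolute = (base == null || base.isEmpty())
                    ? href
                    : new URI(base).resolve(href).toString();
            byte[] body = cache.apply(absolute);
            // The system id lets the processor resolve further relative references.
            return new StreamSource(new ByteArrayInputStream(body), absolute);
        } catch (Exception e) {
            throw new TransformerException("Cannot resolve " + href, e);
        }
    }
}
```

Such a resolver would be registered with `Transformer.setURIResolver(...)` or `TransformerFactory.setURIResolver(...)`. How a particular Saxon release treats these hooks should be checked against its documentation.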

The cache of CMD profiles needed by the CMD records also indicates which profiles (and the components they use) should be converted to RDFS. The cache should thus be queryable, i.e., return a list of cached URLs starting with http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/.
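
A minimal sketch of the query side, assuming the cache keeps a sorted index of the URLs it has stored. The names are hypothetical, and persisting the index between runs is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical sketch: an index of cached URLs that can be queried by prefix.
class CacheIndexSketch {
    // Sorted and safe for concurrent use, since several worker threads fill the cache.
    private final NavigableSet<String> urls = new ConcurrentSkipListSet<>();

    // Called whenever a resource is stored in the cache.
    void record(String url) {
        urls.add(url);
    }

    // All cached URLs starting with the prefix; in sorted order they form one contiguous run.
    List<String> urlsStartingWith(String prefix) {
        List<String> result = new ArrayList<>();
        for (String url : urls.tailSet(prefix, true)) {
            if (!url.startsWith(prefix)) {
                break;
            }
            result.add(url);
        }
        return result;
    }
}
```

After the record phase, calling `urlsStartingWith` with the profiles prefix of the Component Registry would yield the work list for the profile phase.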

Workflow

Conversion and enrichment might involve several steps of different types or purposes:

  1. XSLT 2.0 transformation
  2. enrichment by some Java code
  3. ...

The actions CMD2RDF takes for a record or a profile/component could be configurable. A first sketch:

<CMD2RDF>
  <config>
    <!-- some general config:
      - path to the harvested records
      - base-uri of the component registry
      - nr of threads
      - output directory
      - Virtuoso or even a general endpoint that supports the Graph Store HTTP protocol (http://www.w3.org/TR/sparql11-http-rdf-update/)
      - ...
    -->
  </config>
  <record>
    <action name="transform">
      <stylesheet href="CMDRecord2RDF.xsl">
        <!-- pass-on the registry defined in the config section above --> 
        <with-param name="registry" select="$registry"/>
      </stylesheet>
    </action>
    <action name="java">
      <class name="eu.clarin.cmd2rdf.findOrganisationNames"/>
    </action>
    <!-- ... other actions ... -->
  </record>
  <profile>
    <action name="transform">
      <stylesheet href="Component2RDF.xsl">
        <!-- pass-on the out directory defined in the config section above --> 
        <with-param name="out" select="$out"/>
      </stylesheet>
    </action>
  </profile>
  <component/>
</CMD2RDF>
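
A configuration of this shape could be read with the standard DOM API. The sketch below, with hypothetical class and method names, lists the action names configured for one section (`record`, `profile` or `component`). The element and attribute names follow the configuration sketch above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: read the ordered action names for one section of the config.
class ConfigReaderSketch {

    static List<String> actionNames(String configXml, String section) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(configXml)));
        List<String> names = new ArrayList<>();
        // Find the section element(s), e.g. <record>, below the root element.
        NodeList sections = doc.getDocumentElement().getElementsByTagName(section);
        for (int i = 0; i < sections.getLength(); i++) {
            // Actions are returned in document order, which is the execution order.
            NodeList actions = ((Element) sections.item(i)).getElementsByTagName("action");
            for (int j = 0; j < actions.getLength(); j++) {
                names.add(((Element) actions.item(j)).getAttribute("name"));
            }
        }
        return names;
    }
}
```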

So the CMD2RDF tool would have 3 main phases:

  1. Process the CMD records (in multiple threads)
    1. for each record process the actions sequentially, where the output of each action is the input of the next
    2. the input of the first action should be a CMD record from the OAI harvester result set
    3. the result of the last action should be RDF/XML to be uploaded to the Graph Store
  2. Process the profiles (as used by the records) (in multiple threads)
    1. for each profile process the actions sequentially
    2. the input of the first action is a CMD profile from the CMD2RDF cache
    3. the result of the last action should be RDF/XML to be uploaded to the Graph Store
  3. Process the components (as used by the profiles) (in multiple threads)
    1. for each component process the actions sequentially
    2. the input of the first action is a CMD component from the CMD2RDF cache
    3. the result of the last action should be RDF/XML to be uploaded to the Graph Store

Note: phase 2 could output the components needed for phase 3, but this might complicate matters, as until now the actions only request resources from the cache and do not store them. Alternatively, phase 2 could just request the used components from the Component Registry and thus trigger caching of them. This puts some extra load on the Component Registry, but might keep the CMD2RDF caching API simpler.

All action types would implement a common interface, so they can be loaded using the Java Reflection API. Next to the output of the previous action (or the initial CMD resource), they get the XML snippet for the action. Using this information they can configure the action, e.g., load the XSLT or Java class, and execute it.
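
A minimal sketch of such an interface and its reflective loading. The interface, its method names and the two stand-in actions are hypothetical. For brevity the XML snippet is passed as a raw string, and records are passed as strings as well.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical common interface for all action types.
interface Action {
    // Receives the XML snippet of the action from the configuration.
    void configure(String configXml);

    // Receives the output of the previous action (or the initial CMD resource).
    String execute(String input) throws Exception;
}

// Stand-in action: prepends its configuration text to the input.
class PrefixAction implements Action {
    private String prefix = "";

    public void configure(String configXml) {
        this.prefix = configXml;
    }

    public String execute(String input) {
        return prefix + input;
    }
}

// Stand-in action: needs no configuration.
class UpperCaseAction implements Action {
    public void configure(String configXml) {
    }

    public String execute(String input) {
        return input.toUpperCase(Locale.ROOT);
    }
}

class ActionPipelineSketch {

    // Load an action class by name via reflection and hand it its configuration.
    static Action load(String className, String configXml) throws ReflectiveOperationException {
        Class<? extends Action> type = Class.forName(className).asSubclass(Action.class);
        Action action = type.getDeclaredConstructor().newInstance();
        action.configure(configXml);
        return action;
    }

    // Run the actions sequentially; each output is the input of the next action.
    static String run(List<Action> actions, String input) throws Exception {
        String current = input;
        for (Action action : actions) {
            current = action.execute(current);
        }
        return current;
    }
}
```

Since an action that fails to load or configure throws before any record is processed, configuration errors would surface at start-up rather than halfway through a run.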

For inspiration: the OAI harvester allows configuring actions on harvested records, e.g., to convert OLAC to CMDI.

TODO: the harvest isn't incremental, but CMD2RDF could be. It could handle only the delta with respect to a previous run, i.e., convert and enrich new CMD records, convert and enrich updated CMD records, and remove graphs for deleted CMD records; the same goes for profiles and components. MD5 checksums might help to determine whether CMD records have changed ...
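
A minimal sketch of the checksum idea, assuming each run keeps a map from record identifier (for example its file path) to MD5 checksum. Comparing the map of the previous run with the current one yields the new, updated and deleted records. All names are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: classify records by comparing checksums of two runs.
class DeltaSketch {
    final Set<String> added = new TreeSet<>();   // convert and enrich
    final Set<String> updated = new TreeSet<>(); // convert and enrich again
    final Set<String> deleted = new TreeSet<>(); // remove the graph

    // Hex-encoded MD5 checksum of a record's content.
    static String md5(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Both maps go from record identifier to checksum.
    static DeltaSketch compare(Map<String, String> previous, Map<String, String> current) {
        DeltaSketch delta = new DeltaSketch();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String old = previous.get(entry.getKey());
            if (old == null) {
                delta.added.add(entry.getKey());
            } else if (!old.equals(entry.getValue())) {
                delta.updated.add(entry.getKey());
            }
        }
        for (String id : previous.keySet()) {
            if (!current.containsKey(id)) {
                delta.deleted.add(id);
            }
        }
        return delta;
    }
}
```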

CMD-RDF

...
