Changes between Version 2 and Version 3 of CMD2RDF/sysarch


Timestamp:
05/21/14 14:26:02 (10 years ago)
Author:
Menzo Windhouwer

  • CMD2RDF/sysarch

    v2 v3  
    44
    55[[Image(CMD2RDF-sysarch.png,80%,align=center)]]
     6
     7The OAI harvester regularly collects all CMDI records from the CLARIN joint metadata domain and stores them on the catalog.clarin.eu server's filesystem for other tools, e.g., the VLO importer or CMD2RDF, to pick up.
     8
     9== CMD2RDF
     10
     11In the initial stages, sample sets of CMD records were converted to RDF, which revealed several bottlenecks and problems:
     12
     13* the conversion of a record needs access to the profile; when this access isn't cached, the Component Registry gets overloaded;
     14* using shell scripts means, when Saxon is used, paying JVM startup time and XSLT compilation costs for each conversion.
     15
     16These costs can be lessened by using [source:SMC SMC] scripts for caching profiles and running the XSLT. Still, we expect that more performance can be gained, and will need to be gained given the growing number of harvested CMD records, by developing a special-purpose multi-threaded Java tool. The [source:CMDIValidator CMDIValidator] tool can serve as a basis or inspiration for efficiently processing multiple CMD records at the same time.
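
The per-conversion XSLT compilation cost mentioned above can be avoided inside a long-running JVM by compiling the stylesheet once and reusing it. The following is a minimal sketch using only the standard JAXP API; the class name `CompiledStylesheet` is hypothetical and not part of CMD2RDF. Note that the XSLT processor built into the JDK only supports XSLT 1.0, so the XSLT 2.0 stylesheets would need Saxon on the classpath.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: compiles a stylesheet once and reuses it for many records. */
final class CompiledStylesheet {
    // A Templates object is the compiled stylesheet; it is safe to share between threads.
    private final Templates templates;

    CompiledStylesheet(String xslt) throws TransformerException {
        // Uses whatever JAXP implementation is configured (the JDK default is XSLT 1.0 only).
        TransformerFactory factory = TransformerFactory.newInstance();
        this.templates = factory.newTemplates(new StreamSource(new StringReader(xslt)));
    }

    /** Transforms one record; only a cheap Transformer is created per call. */
    String transform(String xml) throws TransformerException {
        Transformer transformer = templates.newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

Each thread would call `transform` on the shared instance, so the JVM start-up and the stylesheet compilation are paid once per run instead of once per record.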
     17
     18== Caching
     19The CMD2RDF tool should support caching. Since, next to conversion, enrichment of the RDF with links into the linked data cloud is also envisioned, this caching mechanism should be general, i.e., able to cache arbitrary HTTP (GET?) requests.
     20
     21The cache could have a simple API and should be persistent, i.e., on a subsequent run of CMD2RDF the same CMD profiles, components and possibly other web resources should be available. For web resources HTTP caching directives might be honoured, and might indicate that the cached resource should be refreshed for this run. Public CMD profiles and components are in principle static. However, from time to time Component Registry administrators do get requests to update, for example, a Concept Link. So the cache might decide to refresh the cached public profile or component once in a while.
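
A minimal sketch of such a persistent cache is given below. It assumes a hypothetical `DiskCache` class that stores each response body in a file named after the SHA-256 hash of its URL; the actual HTTP GET is passed in as a function, so the sketch makes no claim about which HTTP client CMD2RDF would use. HTTP caching directives are not handled here; `refresh` only shows where a forced reload could hook in.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Function;

/** Hypothetical persistent cache: one file per URL, kept between runs. */
final class DiskCache {
    private final Path dir;
    private final Function<String, byte[]> fetcher; // assumption: performs the HTTP GET

    DiskCache(Path dir, Function<String, byte[]> fetcher) {
        this.dir = dir;
        this.fetcher = fetcher;
    }

    /** Returns the cached body, fetching and storing it on a cache miss. */
    byte[] get(String url) {
        Path file = dir.resolve(fileNameFor(url));
        try {
            return Files.exists(file) ? Files.readAllBytes(file) : refresh(url);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fetches the resource again and overwrites the cached copy. */
    byte[] refresh(String url) {
        try {
            byte[] body = fetcher.apply(url);
            Files.createDirectories(dir);
            Files.write(dir.resolve(fileNameFor(url)), body);
            return body;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Maps a URL to a filesystem-safe name (hex-encoded SHA-256 of the URL). */
    static String fileNameFor(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

This sketch does not coordinate concurrent misses for the same URL: two threads may both fetch it and write the same content, which a real implementation might want to prevent.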
     22
     23Some, if not all, of the conversion and enrichment takes place with the XSLT 2.0 Saxon processor. To keep the XSLT 2.0 stylesheets generic they fetch the CMD profile, component or other web resource using the document() or doc() function. Saxon allows one [[http://www.saxonica.com/documentation/index.html#!javadoc/net.sf.saxon/Configuration@setURIResolver|to register a URIResolver]], which can interact with the generic CMD2RDF cache.
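
The resolver hook could look roughly as follows. This sketch implements the standard JAXP `javax.xml.transform.URIResolver` interface; the class name `CachingUriResolver` is hypothetical, and the cache is represented only as a function from an absolute URL to the cached bytes. How the resolver is registered with Saxon is left out.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.function.Function;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical resolver that answers document()/doc() requests from the cache. */
final class CachingUriResolver implements URIResolver {
    private final Function<String, byte[]> cache; // assumption: URL -> cached body

    CachingUriResolver(Function<String, byte[]> cache) {
        this.cache = cache;
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        try {
            // Make the requested URI absolute so it can serve as the cache key.
            URI absolute = (base == null || base.isEmpty())
                    ? new URI(href)
                    : new URI(base).resolve(href);
            byte[] body = cache.apply(absolute.toString());
            // The system id lets relative references inside the document resolve.
            return new StreamSource(new ByteArrayInputStream(body), absolute.toString());
        } catch (URISyntaxException e) {
            throw new TransformerException("Cannot resolve " + href, e);
        }
    }
}
```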
     24
     25The cache of CMD profiles needed by the CMD records also indicates which profiles (and the components they use) should be converted to RDFS. The cache should thus be queryable, i.e., return a list of cached URLs starting with {{{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/}}}.
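
One way to make the cache queryable is to keep a sorted index of the cached URLs, in which all URLs sharing a prefix are adjacent. The sketch below assumes a hypothetical `CachedUrlIndex` class; the example URLs in the usage note are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;

/** Hypothetical index of cached URLs that supports prefix queries. */
final class CachedUrlIndex {
    // Sorted and thread-safe, so worker threads can register URLs concurrently.
    private final NavigableSet<String> urls = new ConcurrentSkipListSet<>();

    void add(String url) {
        urls.add(url);
    }

    /** Returns all cached URLs that start with the given prefix, in sorted order. */
    List<String> startingWith(String prefix) {
        List<String> result = new ArrayList<>();
        // In sorted order the matches form one contiguous run starting at the prefix.
        for (String url : urls.tailSet(prefix, true)) {
            if (!url.startsWith(prefix)) {
                break;
            }
            result.add(url);
        }
        return result;
    }
}
```

Calling `startingWith("http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/")` after the record phase would then yield the profiles that need to be converted to RDFS.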
     26
     27== Workflow
     28Conversion and enrichment might involve several steps of different types or purposes:
     29
     301. XSLT 2.0 transformation
     312. enrichment by some Java code
     323. ...
     33
     34The actions CMD2RDF takes for a record or a profile/component could be configurable. A first sketch:
     35
     36{{{
     37#!xml
     38<CMD2RDF>
     39  <config>
     40    <!-- some general config:
     41      - path to the harvested records
     42      - base-uri of the component registry
     43      - nr of threads
     44      - output directory
     45      - Virtuoso or even a general endpoint that supports the Graph Store HTTP protocol (http://www.w3.org/TR/sparql11-http-rdf-update/)
     46      - ...
     47    -->
     48  </config>
     49  <record>
     50    <action name="transform">
     51      <stylesheet href="CMDRecord2RDF.xsl">
     52        <!-- pass-on the registry defined in the config section above -->
     53        <with-param name="registry" select="$registry"/>
     54      </stylesheet>
     55    </action>
     56    <action name="java">
     57      <class name="eu.clarin.cmd2rdf.findOrganisationNames"/>
     58    </action>
     59    <!-- ... other actions ... -->
     60  </record>
     61  <profile>
     62    <action name="transform">
     63      <stylesheet href="Component2RDF.xsl">
     64        <!-- pass-on the out directory defined in the config section above -->
     65        <with-param name="out" select="$out"/>
     66      </stylesheet>
     67    </action>
     68  </profile>
     69  <component/>
     70</CMD2RDF>
     71}}}
     72
     73So the CMD2RDF tool would have 3 main phases:
     74
     751. Process the CMD records (in multiple threads)
     76  a. for each record process the actions sequentially, where the output of each action is the input of the next
     77  b. the result of the last action should be RDF XML to be uploaded to the Graph Store
     782. Process the profiles (as used by the records) (in multiple threads)
     79  a. for each profile process the actions sequentially
     80  b. the result of the last action should be RDF XML to be uploaded to the Graph Store
     813. Process the components (as used by the profiles) (in multiple threads)
     82  a. for each component process the actions sequentially
     83  b. the result of the last action should be RDF XML to be uploaded to the Graph Store
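
Each of these phases follows the same pattern, which can be sketched as below: resources are distributed over a fixed thread pool, and for each resource the actions run sequentially, each receiving the output of the previous one. The class `PhaseRunner` and its method names are hypothetical, actions are modelled as plain string functions, and the upload to the Graph Store is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

/** Hypothetical runner for one phase (records, profiles or components). */
final class PhaseRunner {

    /** Runs the actions in order; the output of each is the input of the next. */
    static String applyAll(String input, List<UnaryOperator<String>> actions) {
        String current = input;
        for (UnaryOperator<String> action : actions) {
            current = action.apply(current);
        }
        return current;
    }

    /** Processes all resources of a phase in parallel; returns id -> final output. */
    static Map<String, String> runPhase(Map<String, String> resources,
                                        List<UnaryOperator<String>> actions,
                                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : resources.entrySet()) {
                String input = entry.getValue();
                Callable<String> task = () -> applyAll(input, actions);
                pending.put(entry.getKey(), pool.submit(task));
            }
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                // The final output would be the RDF/XML to upload to the Graph Store.
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The three phases would then be three calls to `runPhase`, each with its own resources and action list, executed one after the other.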
     84
     85'''Note''': for phase 3, phase 2 could output the components, which might be complicated, as until now the actions only request resources from the cache and do not store them. Phase 2 could also just request the used components from the Component Registry and thus trigger caching of them. This is some extra load for the Component Registry, but might keep the CMD2RDF caching API simpler.
     86
     87Types of actions would all implement a common interface so they can be loaded by using the Java Reflection API. Next to the output of the previous action (or the initial CMD resource) they get the XML snippet for the action. Using this info they can configure the action, e.g., load the XSLT or Java class, and execute it.
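
Loading actions through a common interface could be sketched as follows. The interface `Action`, the loader `ActionLoader` and the example `UpperCaseAction` are all hypothetical names chosen for illustration; the XML snippet is passed as a plain string to keep the sketch short.

```java
/** Hypothetical common interface implemented by all action types. */
interface Action {
    /**
     * @param input         output of the previous action (or the initial CMD resource)
     * @param configSnippet the XML snippet that configures this action
     * @return the output handed to the next action
     */
    String execute(String input, String configSnippet);
}

/** Hypothetical example action, only here to have something to load. */
final class UpperCaseAction implements Action {
    @Override
    public String execute(String input, String configSnippet) {
        return input.toUpperCase();
    }
}

/** Hypothetical loader that instantiates actions by class name via reflection. */
final class ActionLoader {
    static Action load(String className) {
        try {
            // asSubclass rejects classes that do not implement Action.
            return Class.forName(className)
                    .asSubclass(Action.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load action " + className, e);
        }
    }
}
```

An action type such as `transform` or `java` from the configuration sketch would map to one such class, which reads its own XML snippet to find the stylesheet or Java class to run.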
     88
     89For inspiration: the [[https://github.com/TheLanguageArchive/oai-harvest-manager|OAI harvester]] allows actions to be configured on harvested records, e.g., to convert OLAC to CMDI.
     90
     91= CMD-RDF
     92
     93...