Changes between Version 6 and Version 7 of CMD2RDF/sysarch


Timestamp:
05/22/14 06:10:39 (10 years ago)
Author:
Menzo Windhouwer
Comment:

--

  • CMD2RDF/sysarch

    v6 v7  
    1616The costs of this can be lessened by using [source:SMC SMC] scripts for caching profiles and running the XSLT. Still, we expect to gain more performance (and will need to, given the growing number of harvested CMD records) by developing a multi-threaded, special-purpose Java tool. The [source:CMDIValidator CMDIValidator] tool can be used as a basis/inspiration for efficiently processing multiple CMD records at the same time.
    1717
    18 == Caching
     18=== Caching
    1919The CMD2RDF tool should support caching. Since, next to conversion, enrichment of the RDF with links into the linked data cloud is also envisioned, this caching mechanism should be general, i.e., able to cache arbitrary HTTP (GET?) requests.
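
A general cache of this kind could be sketched as follows. This is only an illustration under stated assumptions: the class name `GetCache` and the pluggable `fetcher` function are hypothetical and not part of CMD2RDF or MDService2, and the sketch keeps everything in memory and ignores expiry and HTTP cache headers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: an in-memory cache keyed by the request URL.
// The fetcher performs the actual HTTP GET; it is injected so that the
// cache itself stays independent of any particular HTTP client.
final class GetCache {
    private final Map<String, String> store = new ConcurrentHashMap<>();
    private final Function<String, String> fetcher;

    GetCache(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Returns the cached body for the URL, fetching it only on a miss.
    String get(String url) {
        return store.computeIfAbsent(url, fetcher);
    }

    // Number of distinct URLs currently cached.
    int size() {
        return store.size();
    }
}
```

Because the cache only sees a URL and a response body, the same mechanism would serve profile and component lookups as well as enrichment requests; a persistent store could replace the map without changing the callers.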
    2020
     
    2727There is an [[source:MDService2/trunk/MDService2/src/eu/clarin/cmdi/mdservice/action/Cache.java|implementation of cache]] within the old prototype of [[https://trac.clarin.eu/wiki/CmdiMetadataServices|MDService2]]. It was adapted specifically to the needs of the service (especially its REST interface, i.e., the request parameters), but it might serve as a base/inspiration.
    2828
    29 == Workflow
     29=== Workflow
    3030Conversion and enrichment might involve several steps of different types or purposes:
    3131
     
    6969    </action>
    7070  </profile>
    71   <component/>
     71  <component>
     72    <action name="transform">
     73      <stylesheet href="Component2RDF.xsl">
     74        <!-- pass-on the out directory defined in the config section above -->
     75        <with-param name="out" select="$out"/>
     76      </stylesheet>
     77    </action>
     78  </component>
    7279</config>
    7380}}}
     
    8895  c. the result of the last action should be RDF XML to be uploaded to the Graph Store
    8996
    90 '''Note''': for phase 3 phase 2 could output the components, which might complicated as until now they would just request resources from the cache not store them. Phase 2 could also just request the used components from the Component Registry and thus trigger caching from them. This is some extra load for the Component Registry, but might keep the CMD2RDF caching API simpler.
     97'''Note''': for phase 3, phase 2 could output the components, which might complicate the caching, as until now the actions would only request resources from the cache, not store them. Phase 2 could also just request the used components from the Component Registry and thus trigger caching of them. This is some extra load for the Component Registry, but might keep the CMD2RDF caching API simpler.
    9198
    92 Types of actions would all implement a common interface so they can be loaded by using the Java Reflection API. Next to the output of the previous action (or the initial CMD resource) they get the XML snippet for the action. Using this info they can configure the action, e.g., load the XSLT or Java class, and execute it.
     99Types of actions would all implement a common interface so they can be loaded by using the Java Reflection API (but as a start we can hardcode the action types). Next to the output of the previous action (or the initial CMD resource) they get the XML snippet for the action. Using this info they can configure the action, e.g., load the XSLT or Java class (that would require reflection), and execute it.
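
The common interface and the reflective loading described above might look roughly like this. It is a minimal sketch under assumptions: the names `Action`, `AppendAction` and `ActionRunner` are hypothetical, the XML snippet is passed as a plain string, and the toy action only appends its snippet to show how data flows through the chain.

```java
import java.util.List;

// Hypothetical common interface for all action types.
interface Action {
    // Receives the XML snippet of the <action> element from the config.
    void configure(String configXml);

    // Receives the output of the previous action (or the initial CMD resource).
    String execute(String input);
}

// Toy action used only for illustration: appends its config snippet.
final class AppendAction implements Action {
    private String snippet = "";

    public void configure(String configXml) {
        this.snippet = configXml;
    }

    public String execute(String input) {
        return input + "|" + snippet;
    }
}

final class ActionRunner {
    // Loads an action class by name via reflection and configures it.
    static Action load(String className, String configXml) {
        try {
            Action action = Class.forName(className)
                    .asSubclass(Action.class)
                    .getDeclaredConstructor()
                    .newInstance();
            action.configure(configXml);
            return action;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load action " + className, e);
        }
    }

    // Runs the actions in order, feeding each one the previous output.
    static String run(List<Action> actions, String input) {
        String current = input;
        for (Action action : actions) {
            current = action.execute(current);
        }
        return current;
    }
}
```

Hardcoding the action types as a first step would only replace `load` with a simple switch on the action name; the interface and the `run` loop would stay the same.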
    93100
    94101For inspiration: the [[https://github.com/TheLanguageArchive/oai-harvest-manager|OAI harvester]] allows configuring actions on harvested records, e.g., to convert OLAC to CMDI.
     
    96103'''TODO:''' the harvest isn't incremental, but CMD2RDF could be. It could handle only the delta of a previous run, i.e., convert and enrich new CMD records, convert and enrich updated CMD records, remove graphs for deleted CMD records, the same for profiles and components. MD5 checksums might help to determine if CMD records have been changed ...
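
The delta idea in this TODO could be sketched as below, assuming that each run keeps a map from record identifier to MD5 checksum. The class name `Checksums`, the nested `Delta` type and the map layout are hypothetical; how the checksums of the previous run would be stored is left open.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: classify records by comparing per-record MD5 checksums
// of the previous run with those of the current harvest.
final class Checksums {
    static final class Delta {
        final Set<String> added = new TreeSet<>();   // convert and enrich
        final Set<String> updated = new TreeSet<>(); // convert and enrich again
        final Set<String> deleted = new TreeSet<>(); // remove the graph
    }

    // Returns the MD5 digest of the content as 32 lowercase hex characters.
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Both maps go from record identifier to MD5 checksum.
    static Delta diff(Map<String, String> previous, Map<String, String> current) {
        Delta delta = new Delta();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String old = previous.get(entry.getKey());
            if (old == null) {
                delta.added.add(entry.getKey());
            } else if (!old.equals(entry.getValue())) {
                delta.updated.add(entry.getKey());
            }
        }
        for (String id : previous.keySet()) {
            if (!current.containsKey(id)) {
                delta.deleted.add(id);
            }
        }
        return delta;
    }
}
```

The same comparison would apply unchanged to profiles and components, as long as each of them has a stable identifier to use as the map key.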
    97104
    98 
    99 == Enrichment - Mapping Field Values to Semantic Entities
     105=== Enrichment - Mapping Field Values to Semantic Entities
    100106The biggest challenge, and at the same time the most rewarding/useful step of the process, is the resolution of the literal values to resource/entity references (from other/external vocabularies). This is essentially the "linking" step in "Linked Open Data".
    101107
     
    113119[[Image(SMC_CMD2LOD.png,80%,align=center)]]
    114120
    115 === Modelling issues
     121==== Modelling issues
    116122
    117123In the RDF modelling of the CMD data, we foresee two predicates for every CMD-element, one holding the literal value (e.g. `cmd:hasPerson.OrganisationElementValue`) and one for the corresponding resolved entity (`cmd:hasPerson.OrganisationElementEntity`). The triple with literal value is generated during the basic `CMDRecord2RDF` transformation. The entity predicate has to be generated in a separate action, essentially during the lookup step.
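
As an illustration of the two predicates per element, the following sketch builds both triples as Turtle-style strings. It rests on assumptions: the class `ElementTriples` is hypothetical, the subject is passed in ready-made, the declaration of the `cmd:` prefix is omitted, and the entity IRI stands in for whatever the lookup step would return.

```java
// Hypothetical sketch: one triple for the literal value and one for the
// resolved entity of the same CMD element, written as Turtle-style strings.
final class ElementTriples {
    // Triple holding the literal value, as generated by the basic transformation.
    static String valueTriple(String subject, String element, String literal) {
        String escaped = literal.replace("\\", "\\\\").replace("\"", "\\\"");
        return subject + " cmd:has" + element + "ElementValue \"" + escaped + "\" .";
    }

    // Triple holding the resolved entity, as generated by a separate lookup action.
    static String entityTriple(String subject, String element, String entityIri) {
        return subject + " cmd:has" + element + "ElementEntity <" + entityIri + "> .";
    }
}
```

Keeping the two triples in separate methods mirrors the split between the basic transformation and the later lookup action: the entity triple can be added in a later run without touching the literal one.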