Changes between Version 4 and Version 5 of CMD2RDF/sysarch


Timestamp:
05/21/14 19:12:07 (10 years ago)
Author:
xnrn@gmx.net
Comment:

added section on Enrichment, Questions, reference to old cache implementation, typos

  • CMD2RDF/sysarch

    v4 v5  
    1212
    1313* the conversion of a record needs access to the profile; when this access isn't cached, the Component Registry gets overloaded;
    14 * using using a shell scripts means, when Saxon is used, JVM upstart time and XSLT compilation costs for each conversion.
     14* using a shell script means, when Saxon is used, paying JVM startup time and XSLT compilation costs for each conversion.
    1515
    16 Costs of this can be lessened by using [source:SMC SMC] scripts for caching profiles and running the XSLT. Still its expected that we can gain more, and will need to because the growth of the number of CMD records harvested, performance by developing a multi-threaded special purpose Java tool. The [source:CMDIValidator CMDIValidator] tool can be used as basis/inspiration for efficient processing multiple of CMD records at the same time.
     16Costs of this can be lessened by using [source:SMC SMC] scripts for caching profiles and running the XSLT. Still, it's expected that we can gain more performance (and will need to, because of the growth of the number of CMD records harvested) by developing a multi-threaded special-purpose Java tool. The [source:CMDIValidator CMDIValidator] tool can be used as basis/inspiration for efficiently processing multiple CMD records at the same time.
    1717
    1818== Caching
     
    2424
    2525The cache of CMD profiles needed by the CMD records also indicates which profiles (and the components they use) should be converted to RDFS. The cache should thus be queryable, i.e., return a list of cached URLs starting with {{{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/}}}.
     26
     27There is an [[source:MDService2/trunk/MDService2/src/eu/clarin/cmdi/mdservice/action/Cache.java|implementation of a cache]] within the old prototype of [[https://trac.clarin.eu/wiki/CmdiMetadataServices|MDService2]]. It was adapted specifically to the needs of that service (especially to its REST interface, i.e. the request parameters), but it may still serve as a base/inspiration.
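
A minimal sketch of what "queryable" could mean for the profile cache, assuming an in-memory cache keyed by URL. The class name `ProfileCache`, its methods and the injected `fetch` function are hypothetical and are not taken from the MDService2 implementation.

```python
PROFILE_PREFIX = (
    "http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/"
)


class ProfileCache:
    """Hypothetical in-memory cache of documents fetched by URL."""

    def __init__(self, fetch):
        self._fetch = fetch    # function url -> content, supplied by the caller
        self._entries = {}     # url -> cached content

    def get(self, url):
        """Return the cached content, fetching it only on the first request."""
        if url not in self._entries:
            self._entries[url] = self._fetch(url)
        return self._entries[url]

    def cached_urls(self, prefix=PROFILE_PREFIX):
        """List the cached URLs that start with the given prefix."""
        return sorted(url for url in self._entries if url.startswith(prefix))
```

The result of `cached_urls()` is then the list of profiles that need to be converted to RDFS.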
    2628
    2729== Workflow
     
    9597
    9698
    97 = CMD-RDF
     99== Enrichment - Mapping Field Values to Semantic Entities
     100The biggest challenge, and at the same time the most rewarding/useful step of the process, is the resolution of the literal values to resource/entity references (from other/external vocabularies). This is essentially the "linking" step in "Linked Open Data".
    98101
    99 ...
     102We can identify 5 steps:
     1031. identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task, though see the `@clavas:vocabulary` attribute introduced in CMD 1.2)
     1042. extract distinct `< data category, value >` pairs from the metadata records
     1053. '''lookup''' the individual literal values in given reference data (as indicated by the data category) to retrieve candidate entities, concepts
     1064. assess the reliability of the match
     1075. generate new RDF triples with entity identifiers as object properties
     108
     109'''Note''': It is important to perform the lookup on aggregated data (distinct `< data category, value >` pairs); otherwise the same value is looked up once per occurrence, which introduces a huge inefficiency.
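
Steps 2 and 3 and the note on aggregation can be sketched as follows, assuming each record has already been reduced to a list of `(data category, value)` tuples. The function names and the `lookup` callback are hypothetical.

```python
from collections import Counter


def distinct_pairs(records):
    """Count each distinct (data category, value) pair across all records."""
    counts = Counter()
    for fields in records:
        counts.update(fields)
    return counts


def resolve(records, lookup):
    """Call lookup once per distinct pair, not once per occurrence."""
    return {pair: lookup(*pair) for pair in distinct_pairs(records)}
```

With this shape, the number of lookups grows with the number of distinct pairs rather than with the number of harvested records.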
     110
     111In the RDF modelling of the CMD data, we foresee two predicates for every CMD-element, one holding the literal value (e.g. `cmd:hasPerson.OrganisationElementValue`) and one for the corresponding resolved entity (`cmd:hasPerson.OrganisationElementEntity`). The triple with literal value is generated during the basic `CMDRecord2RDF` transformation. The entity predicate has to be generated in a separate action, essentially during the lookup step.
     112
     113Example of encoding a person's affiliation:
     114{{{
     115_:org                  a   cmd:Person.Organisation ;
     116   cmd:hasPerson.OrganisationElementValue     "MPI"^^xs:string ;
     117   cmd:hasPerson.OrganisationElementEntity    <http://www.mpi.nl/>.
     118<http://www.mpi.nl/>   a   cmd:OrganisationElementEntity .
     119}}}
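
Steps 4 and 5 could then produce the entity triple shown in the example. The sketch below assumes the lookup returns a candidate URI together with a match score; the score, the threshold value and the function name are assumptions made for illustration only.

```python
def entity_triple(subject, element, entity_uri, score, threshold=0.8):
    """Return the entity triple as a string, or None for an unreliable match."""
    if entity_uri is None or score < threshold:
        return None  # keep only the literal-value triple in this case
    return f"{subject} cmd:has{element}ElementEntity <{entity_uri}> ."


# Usage: the affiliation from the example above, with an assumed score.
print(entity_triple("_:org", "Person.Organisation", "http://www.mpi.nl/", 0.95))
```

Matches below the threshold yield no entity triple, so the literal value remains the only statement for that element.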
     120
     121== Questions, Issues, Discussion
     122
     123* Where does the '''normalization''' of values come in?[[BR]]
     124  Naturally, the lookup step is a good place to get a normalized value, as the vocabularies used for lookup mostly maintain all the alternative labels for a given entity. So one would get the entity reference and the normalized value in one go.
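
A minimal sketch of such a combined lookup, assuming the vocabulary is available as a mapping from entity URI to a preferred label plus alternative labels. The data layout, the function names and the labels in the usage example are illustrative assumptions, and matching is reduced to a case-insensitive exact comparison.

```python
def build_index(vocabulary):
    """Map every known label (case-insensitively) to (entity URI, preferred label)."""
    index = {}
    for entity_uri, (pref_label, alt_labels) in vocabulary.items():
        for label in (pref_label, *alt_labels):
            index[label.casefold()] = (entity_uri, pref_label)
    return index


def lookup(index, value):
    """Return (entity URI, normalized value) for a literal, or None if unknown."""
    return index.get(value.strip().casefold())


# Usage with an illustrative one-entry vocabulary.
vocabulary = {
    "http://www.mpi.nl/": (
        "Max Planck Institute for Psycholinguistics",
        ["MPI", "MPI Nijmegen"],
    ),
}
index = build_index(vocabulary)
print(lookup(index, " mpi "))
```

The preferred label returned alongside the URI is the normalized value, so no separate normalization pass is needed for values that resolve.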