Changes between Version 4 and Version 5 of CMD2RDF/sysarch
Timestamp: 05/21/14 19:12:07
CMD2RDF/sysarch
 * the conversion of a record needs access to its profile; when this access isn't cached, the Component Registry gets overloaded;
 * using shell scripts means, when Saxon is used, paying JVM startup time and XSLT compilation costs for each conversion.

These costs can be lessened by using [source:SMC SMC] scripts for caching profiles and running the XSLT. Still, it's expected that we can gain more performance (and will need to, because of the growth of the number of CMD records harvested) by developing a multi-threaded special-purpose Java tool. The [source:CMDIValidator CMDIValidator] tool can be used as a basis/inspiration for efficiently processing multiple CMD records at the same time.

== Caching

…

The cache of CMD profiles needed by the CMD records also indicates which profiles (and the components they use) should be converted to RDFS. The cache should thus be queryable, i.e., return a list of cached URLs starting with {{{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/}}}.

There is an [[source:MDService2/trunk/MDService2/src/eu/clarin/cmdi/mdservice/action/Cache.java|implementation of a cache]] within the old prototype of [[https://trac.clarin.eu/wiki/CmdiMetadataServices|MDService2]]. It was adapted specifically to the needs of that service (especially to its REST interface, i.e. request parameters), but maybe it can be used as a base/inspiration.

== Workflow

…

== Enrichment - Mapping Field Values to Semantic Entities
The biggest challenge, and at the same time the most rewarding/useful step of the process, is the resolution of literal values to resource/entity references (from other/external vocabularies). This is essentially the "linking" step in "Linked Open Data".

We can identify 5 steps:
 1. identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task, though see the `@clavas:vocabulary` attribute introduced in CMD 1.2)
 2. extract distinct `< data category, value >` pairs from the metadata records
 3. '''lookup''' the individual literal values in the given reference data (as indicated by the data category) to retrieve candidate entities, concepts
 4. assess the reliability of the match
 5. generate new RDF triples with entity identifiers as object properties

'''Note''': It is important to perform the lookup on aggregated data (`< data category, value >` pairs); looking up every occurrence of a value in every record would introduce a huge inefficiency.

In the RDF modelling of the CMD data, we foresee two predicates for every CMD-element, one holding the literal value (e.g. `cmd:hasPerson.OrganisationElementValue`) and one for the corresponding resolved entity (`cmd:hasPerson.OrganisationElementEntity`). The triple with the literal value is generated during the basic `CMDRecord2RDF` transformation. The entity predicate has to be generated in a separate action, essentially during the lookup step.

Example of encoding a person's affiliation:
{{{
_:org a cmd:Person.Organisation ;
    cmd:hasPerson.OrganisationElementValue 'MPI'^^xs:string ;
    cmd:hasPerson.OrganisationElementEntity <http://www.mpi.nl/> .
<http://www.mpi.nl/> a cmd:OrganisationElementEntity .
}}}

== Questions, Issues, Discussion

 * Where does the '''normalization''' of values come in?[[BR]]
   Naturally, the lookup step is a good place to get a normalized value, as the vocabularies used for lookup mostly maintain all the alternative labels for a given entity. So one would get the entity reference and the normalized value in one go.
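
A minimal sketch of how the lookup could return the entity reference and the normalized value in one go, while operating on aggregated `< data category, value >` pairs. This is an illustration under stated assumptions, not the CMD2RDF implementation: the `organisation` category key, the in-memory `VOCABULARIES` and `RECORDS` structures, the labels, and all function names are hypothetical, and matching is a plain case-insensitive label comparison (the reliability assessment of step 4 is left out).

```python
from collections import defaultdict

# Hypothetical reference data: data category -> {entity URI: (preferred label, alternative labels)}.
VOCABULARIES = {
    "organisation": {
        "http://www.mpi.nl/": (
            "Max Planck Institute for Psycholinguistics",
            ["MPI", "MPI Nijmegen"],
        ),
    },
}

# Hypothetical extracted fields: record id -> list of (data category, literal value).
RECORDS = {
    "rec1": [("organisation", "MPI")],
    "rec2": [("organisation", "mpi nijmegen")],
    "rec3": [("organisation", "MPI")],
}


def build_label_index(vocabulary):
    """Map every lower-cased label (preferred or alternative) to (entity URI, preferred label)."""
    index = {}
    for uri, (preferred, alternatives) in vocabulary.items():
        for label in [preferred, *alternatives]:
            index[label.strip().lower()] = (uri, preferred)
    return index


def aggregate_pairs(records):
    """Collect the distinct (data category, value) pairs and the records each one occurs in."""
    pairs = defaultdict(list)
    for record_id, fields in records.items():
        for category, value in fields:
            pairs[(category, value)].append(record_id)
    return dict(pairs)


def resolve_pairs(pairs, vocabularies):
    """Look up each distinct pair once; return pair -> (entity URI, normalized value)."""
    indexes = {cat: build_label_index(vocab) for cat, vocab in vocabularies.items()}
    resolved = {}
    for category, value in pairs:
        hit = indexes.get(category, {}).get(value.strip().lower())
        if hit is not None:  # unmatched values simply get no entity
            resolved[(category, value)] = hit
    return resolved


pairs = aggregate_pairs(RECORDS)
resolved = resolve_pairs(pairs, VOCABULARIES)
for (category, value), (uri, normalized) in resolved.items():
    print(category, repr(value), "->", uri, "|", normalized, "| records:", pairs[(category, value)])
```

In this sketch the three sample records yield only two distinct pairs, so only two lookups are performed; the list of record ids kept per pair is what would later allow the entity triples to be attached to every record that used the value.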