Changes between Version 6 and Version 7 of CMD2RDF/sysarch
Timestamp: 05/22/14 06:10:39
CMD2RDF/sysarch
{{{
@@ v6:16 / v7:16 (unmodified)
 Costs of this can be lessened by using [source:SMC SMC] scripts for caching profiles and running the XSLT. Still it's expected that we can gain more performance (and will need to, because of the growth of the number of harvested CMD records) by developing a multi-threaded special purpose Java tool. The [source:CMDIValidator CMDIValidator] tool can be used as basis/inspiration for efficient processing of multiple CMD records at the same time.

@@ v6:18 / v7:18 (modified: heading level)
-== Caching
+=== Caching
 The CMD2RDF tool should support caching. As, next to conversion, enrichment of the RDF with links into the linked data cloud is also envisioned, this caching mechanism should be general, i.e., it should be able to cache arbitrary HTTP (GET?) requests.

@@ v6:27 / v7:27 (unmodified)
 There is an [[source:MDService2/trunk/MDService2/src/eu/clarin/cmdi/mdservice/action/Cache.java|implementation of a cache]] within the old prototype of [[https://trac.clarin.eu/wiki/CmdiMetadataServices|MDService2]]. It was adapted to the specific needs of the service (especially the REST interface, i.e. request parameters), but maybe it can be used as base/inspiration.

@@ v6:29 / v7:29 (modified: heading level)
-== Workflow
+=== Workflow
 Conversion and enrichment might involve several steps of different types or purposes:

@@ v6:69-71 / v7:69-78 (modified: empty <component/> expanded)
     </action>
   </profile>
-  <component/>
+  <component>
+    <action name="transform">
+      <stylesheet href="Component2RDF.xsl">
+        <!-- pass on the out directory defined in the config section above -->
+        <with-param name="out" select="$out"/>
+      </stylesheet>
+    </action>
+  </component>
 </config>

@@ v6:88 / v7:95 (unmodified)
 c. the result of the last action should be RDF/XML to be uploaded to the Graph Store

@@ v6:90 / v7:97 (modified)
-'''Note''': for phase 3 phase 2 could output the components, which might complicated as until now they would just request resources from the cache, not store them. Phase 2 could also just request the used components from the Component Registry and thus trigger caching of them. This is some extra load for the Component Registry, but might keep the CMD2RDF caching API simpler.
+'''Note''': for phase 3 phase 2 could output the components, which might complicate the caching, as until now the actions would only request resources from the cache, not store them. Phase 2 could also just request the used components from the Component Registry and thus trigger caching of them. This is some extra load for the Component Registry, but might keep the CMD2RDF caching API simpler.

@@ v6:92 / v7:99 (modified)
-Types of actions would all implement a common interface so they can be loaded by using the Java Reflection API. Next to the output of the previous action (or the initial CMD resource) they get the XML snippet for the action. Using this info they can configure the action, e.g., load the XSLT or Java class, and execute it.
+Types of actions would all implement a common interface so they can be loaded by using the Java Reflection API (but as a start we can hardcode the action types). Next to the output of the previous action (or the initial CMD resource) they get the XML snippet for the action. Using this info they can configure the action, e.g., load the XSLT or Java class (that would require reflection), and execute it.

@@ v6:94 / v7:101 (unmodified)
 For inspiration: the [[https://github.com/TheLanguageArchive/oai-harvest-manager|OAI harvester]] allows one to configure actions on harvested records, e.g., to convert OLAC to CMDI.

@@ v6:96 / v7:103 (unmodified)
 '''TODO''': the harvest isn't incremental, but CMD2RDF could be. It could handle only the delta of a previous run, i.e., convert and enrich new CMD records, convert and enrich updated CMD records, and remove graphs for deleted CMD records; the same for profiles and components. MD5 checksums might help to determine if CMD records have been changed ...
}}}
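The common action interface and its reflective loading, as described in the changed paragraph above, could be sketched roughly as follows. This is only an illustration of the idea; the names `Action`, `IdentityAction`, and `ActionLoader` are assumptions, not the actual CMD2RDF code:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical common interface all action types would implement.
interface Action {
    // configure the action from its XML snippet (simplified here to a string)
    void configure(String actionConfigXml);
    // transform the output of the previous action (or the initial CMD record)
    InputStream execute(InputStream input) throws Exception;
}

// Trivial example action that passes its input through unchanged.
class IdentityAction implements Action {
    @Override public void configure(String actionConfigXml) { /* nothing to configure */ }
    @Override public InputStream execute(InputStream input) { return input; }
}

public class ActionLoader {
    // instantiate an action class by name via the Java Reflection API
    static Action load(String className, String configXml) throws Exception {
        Action action = (Action) Class.forName(className)
                .getDeclaredConstructor().newInstance();
        action.configure(configXml);
        return action;
    }

    public static void main(String[] args) throws Exception {
        Action a = load("IdentityAction", "<action name=\"identity\"/>");
        byte[] result = a.execute(new ByteArrayInputStream(
                "<CMD/>".getBytes(StandardCharsets.UTF_8))).readAllBytes();
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

As the v7 text notes, the action types could initially be hardcoded (a simple `switch` on the `name` attribute) and `Class.forName` introduced only once custom Java actions are actually configured.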
{{{
@@ v6:99 / v7:105 (modified: heading level)
-== Enrichment - Mapping Field Values to Semantic Entities
+=== Enrichment - Mapping Field Values to Semantic Entities
 The biggest challenge and at the same time the most rewarding/useful step of the process is the resolution of the literal values to resource/entity references (from other/external vocabularies). This is essentially the "linking" step in "Linked Open Data".

@@ v6:113 / v7:119 (unmodified)
 [[Image(SMC_CMD2LOD.png,80%,align=center)]]

@@ v6:115 / v7:121 (modified: heading level)
-=== Modelling issues
+==== Modelling issues
 In the RDF modelling of the CMD data, we foresee two predicates for every CMD element: one holding the literal value (e.g. `cmd:hasPerson.OrganisationElementValue`) and one for the corresponding resolved entity (`cmd:hasPerson.OrganisationElementEntity`). The triple with the literal value is generated during the basic `CMDRecord2RDF` transformation. The entity predicate has to be generated in a separate action, essentially during the lookup step.
}}}
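The two-predicate modelling described above might look as follows in Turtle. This is a hand-made illustration, not actual CMD2RDF output: the `cmd:` namespace URI, the record URI, the organisation name, and the DBpedia entity URI are all assumptions.

{{{
@prefix cmd: <http://www.clarin.eu/cmd#> .   # assumed namespace

# hypothetical CMD record resource
<http://example.org/record/r1>
    # literal value, produced by the basic CMDRecord2RDF transformation
    cmd:hasPerson.OrganisationElementValue "Max Planck Institute for Psycholinguistics" ;
    # resolved entity reference, added later by the lookup action
    cmd:hasPerson.OrganisationElementEntity <http://dbpedia.org/resource/Max_Planck_Institute_for_Psycholinguistics> .
}}}

Keeping the literal triple alongside the entity triple means the lookup action can be re-run (or improved) without touching the output of the basic conversion.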