Changeset 3814 for CMDI-Interoperability
- Timestamp:
- 10/18/13 21:41:04 (11 years ago)
- File:
-
- 1 edited
CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex
r3813 r3814 7 7 \usepackage{framed} 8 8 9 \usepackage{verbatim} % adds environment for commenting out blocks of text & for better verbatim 9 10 10 11 %\newcommand{\comment}[1]{} 11 \newcommand{\comment }[1]{\textcolor{red}{#1}}12 \newcommand{\commentx}[1]{\textcolor{red}{#1}} 12 13 13 14 %%% PAGE DIMENSIONS … … 39 40 { \end{tabular} \end{shaded*} \end{ttfamily} } 40 41 41 \definecolor{shadecolor}{rgb}{0.9 ,0.9,1.0}42 \definecolor{shadecolor}{rgb}{0.95,0.95,1.0} 42 43 43 44 % xml syntax highlighting … … 58 59 \begin{document} 59 60 60 \title{ The CMD Cloud}61 62 \author{M enzo Windhouwer\inst{2} \and Matej Durco\inst{1}}61 \title{Component Metadata to Linked Open Data} 62 63 \author{Matej Durco\inst{1} \and Menzo Windhouwer\inst{2}} 63 64 64 65 \institute{\email{matej.durco@assoc.oeaw.ac.at}\newline … … 73 74 The hype/trend to Web of Data... 74 75 75 Although semantic interoperability has been one of the main motivation for CLARIN Component Metadata Infrastructure, until now there has been no work on the obvious -- bringing CMDI to Semantic Web. We believe that expressing the whole of CMD data as an ontology, linking it with external semantic resources and provide as Linked Open Data, will allow to fully harness the power of semantic technologies and opens a new level of processing and exploring of CMD data. In this paper, we propose an expression of the whole of the CMD data domain (from meta model to individual metadata records) in RDF. 76 Although semantic interoperability has been one of the main motivation for CLARIN Component Metadata Infrastructure, until now there has been no work on the obvious -- bringing CMDI to Semantic Web. We believe that providing the whole of CMD data as Linked Open Data linked with external semantic resources, will allow to fully exploit the power of semantic technologies and opens a new level of processing and exploring of CMD data. 
In this paper, we propose an expression of the whole of the CMD data domain (from meta model to individual metadata records) in RDF. 77 78 \commentx{Menzo: I don't think we can express CMD data automatically as an ontology. For that too many semantics are still hidden in CMDI. We are building blocks (e.g., RR/CLAVAS) that might enable us to do so in the future, but I think its better now to go for CMD as LOD linked into the LOD cloud ...} 76 79 77 80 \end{abstract} … … 84 87 \section{Introduction} 85 88 % 86 \comment {Not sure how much of the introduction, CMD explain + Status of the data domain we want, may and need to reuse between the two papers...}89 \commentx{Not sure how much of the introduction, CMD explain + Status of the data domain we want, may and need to reuse between the two papers...} 87 90 88 91 89 92 The hype/trend to Web of Data... 90 93 91 In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF constituting one large ontologyinterlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data.92 93 % 94 \section{The Component Metadata Infrastructure} 94 In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data. 95 96 % 97 \section{The Component Metadata Infrastructure}\label{CMDI} 95 98 % 96 99 ? 
… … 119 122 Linked Data\cite{TimBL2006}, RDF\cite{RDF2004} 120 123 124 dbpedia, Yago - huge compiled knowledgebases to link to... 125 121 126 Ontology for Language Technology: LT-World \cite{Joerg2010} 122 127 123 128 LOD cloud Cyganiak and Jentzsch\cite{Cyganiak2010}. 124 125 129 126 130 … … 138 142 \subsection{CMD specification} 139 143 140 The main entity of the meta model is the CMD component and is typed as specialization of the \code{owl:Class}. CMD profile is basically a CMD component with some extra features, implying a specialization relation: 144 The main entity of the meta model is the CMD component modelled as \code{rdfs:Class}. CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to a RDF property (as it holds the literal value), but given its complexity (e.g. attributes, relation to the containing component) it too has to be a \code{rdfs:Class}. The actual literal value is a property of given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external vocabularies/ semantic resources, the references to these entities are expressed in parallel properties of type \code{cmdm:ElementEntity}. The attributes are modelled analogously with \code{cmdm:Attribute, cmdm:AttributeValue, cmdm:AttributeEntity}. 145 146 The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}, again analogously for attributes of individual components and elements \code{cmdm:containsAttribute}. 141 147 142 148 \label{table:rdf-spec} 143 149 \begin{example3} 144 cmds:Component & subClassOf & owl:Class. \\ 145 cmds:Profile & subClassOf & cmds:Component. \\ 146 cmds:Element & subClassOf & rdf:Property. \\ 150 @prefix cmdm: \textless http://www.clarin.eu/cmd/general.rdf\#\textgreater . \\ 151 \\ 152 cmdm:Component & a & rdfs:Class . \\ 153 cmdm:Profile & rdfs:subClassOf & cmdm:Component . 
\\ 154 cmdm:Element & a & rdfs:Class . \\ 155 cmdm:Attribute & a & rdfs:Class . \\ 156 \\ 157 cmdm:contains & a & rdf:Property ; \\ 158 & rdfs:domain & cmdm:Component ; \\ 159 & rdfs:range & :Component , :Element . \\ 160 161 %cmdm:containsAttribute & a &rdf:Property; 162 % & rdfs:domain & :Component, :Element; 163 % & rdfs:range & :Attribute. 164 165 cmdm:Value & a & rdfs:Literal . \\ 166 cmdm:Entity & a & rdfs:Class . \\ 167 \\ 168 cmdm:hasElementValue & a & rdf:Property ; \\ 169 & rdfs:domain & cmdm:Element ; \\ 170 & rdfs:range & cmdm:Value . \\ 171 \\ 172 \multicolumn{3}{l}{\# add a parallel separate property for the resolved entities} \\ 173 cmdm:hasElementEntity & a & rdf:Property ; \\ 174 & rdfs:domain & :Element ; \\ 175 & rdfs:range & :Entity . \\ 176 \\ 177 cmdm:hasAttributeValue & a & rdf:Property ; \\ 178 & rdfs:domain & cmdm:Attribute ; \\ 179 & rdfs:range & rdfs:Literal . \\ 180 181 cmdm:hasAttributeEntity & a & rdf:Property ; \\ 182 & rdfs:domain & :Attribute ; \\ 183 & rdfs:range & :Entity . \\ 147 184 \end{example3} 148 185 149 186 \noindent 150 This entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry): 187 This entities are used for modelling the actual profiles, components and elements as they are defined in the Component Registry. 188 For stand-alone/top components, the IDs as issued by Component Registry can be used as entity IRIs. For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the parent top component and dot-path to given component/element (Actor: \code{cr:clarin.eu:cr1:c\_1271859438197\#Actor\_Languages.Actor\_Language}). 189 190 \commentx{Matej: shouldn't we add the name of the component in the IRI for human-readability? 
191 similar to how it is generated in profile XSDs: \textless xs:simpleType name="simpletype-MimeType-clarin.eu.cr1.c\_1290431694511"\textgreater } 192 151 193 152 194 \label{table:rdf-cmd} … … 155 197 & rdfs:label & "collection"; \\ 156 198 & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\ 157 cmd:Actor & a & cmds:Component. \\ 158 cmd:LanguageName & a & cmds:Element. \\ 159 \end{example3} 199 cr:clarin.eu:cr1:c\_1271859438197\#Actor \\ 200 & a &cmdm:Component. \\ 201 \end{example3} 202 203 \commentx{Menzo: we need more context for inner components. In the example LanguageName looks well defined, but take a Component/Element like Title. Is it the title of a book or the title of a person. Only when the semantics are clear, e.g., with a dcr:datcat, one can ignore the context and collapse all Components/Elements to a single RDF class/property.} 204 \commentx{Matej: wouldn't that be remedied by cmdm:contains? or is it too much inferencing?} 160 205 161 206 \begin{notex} 162 Should the ID assigned in the Component Registry for the CMD entities be used as identifier in RDF, or rather the verbose name? (if yes, how to ensure uniqueness -- generate the name from the cmd-path?) 207 Menzo: inner components don't have IDs so I propose a path build from the context up to a shareable component (we need some nice term for that, in the TDS I called it a top notion so maybe a top component. The cmd prefix also needs to be bound to a component specific URI. This URI contains the top component ID, e.g., \furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}. 163 208 \end{notex} 164 209 165 210 \subsection{Data Categories} 166 Windhouwer \cite{Windhouwer2012_LDL} proposes to use the data categories as annotation properties: 211 Windhouwer \cite{Windhouwer2012_LDL} proposes to use the data categories as annotation properties 212 so as to avoid too strong semantic implications. 
167 213 168 214 \begin{example3} … … 179 225 \end{example3} 180 226 227 \begin{comment} 181 228 Encoding data categories as annotation properties is in contrast to the common approach seen with dublincore terms 182 229 used usually directly as data properties: … … 186 233 \end{example3} 187 234 188 \noindent 189 However, we argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications.\cite{Windhouwer2012_LDL} 235 However, e argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, 190 236 In a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals: 191 237 … … 194 240 \#myPOS & owl:equivalentProperty & isocat:DC-1345. \\ 195 241 \#myNoun & owl:sameAs & isocat:DC-1333. \\ 196 \end{example3} 197 242 \end{example3} 243 244 \end{comment} 198 245 199 246 \subsection{RELcat - Ontological relations} 200 As described in \ref{ def:rr}relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples\cite{SchuurmanWindhouwer2011}. A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:247 As described in \ref{CMDI}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples\cite{SchuurmanWindhouwer2011}. 
A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms: 201 248 202 249 \begin{example3} … … 205 252 206 253 \noindent 207 By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. 208 209 \begin{ notex}210 Does this mean, that I would say: 211 \ begin{example3}212 rel:sameAs & owl:equivalentProperty & owl:sameAs 213 \ end{example3}214 215 to enable the inference of the equivalences? 216 254 By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be undrestood as an upper layer of a taxonony of relation types, implying a subtyping: 255 256 \begin{example3} 257 rel:sameAs & rdfs:subPropertyOf & owl:sameAs 258 \end{example3} 259 260 \commentx{Menzo: I would use owl:sameAs rdfs:subPropertyOf rel:sameAs. I see the rel:* properties as an upper layer of a taxonony of relation types. The RELcat types are loose and the OWL ones specific, hence the subtyping. In RELcat you might also query multiple graphs with multiple vocabularies various 'same-as' properties then still need to be distinguishable but the general rel:sameAs need to be created.} 261 262 \commentx{Matej: strip this stipulations - rest of the subsection or just short referrer to SPIN rules ?} 263 \begin{comment} 217 264 Is this correct: 218 265 ?? That means, that to be able to infer that a value in a CMD element also pertains to a given data category, e.g.: … … 222 269 \end{example2} 223 270 271 \commentx{Menzo: yes. 
I do have some of the SPIN rules somewhere to generate those. My idea is that one takes a dcr:datcat annotated graph. This can be using OWL or SKOS or any other RDF vocabulary. This base graph should have been expanded depending on the reasoning one uses, i.e., all entailments are in place. The dcr:datcat can then be translated into rel:sameAs and all equivalences get expanded, so one can also query using ISOcat DCs.} 272 224 273 \noindent 225 274 following facts need to be present in the ontology : … … 229 278 cmd:PublicationYear & owl:equivalentProperty & isocat:DC-2538 \\ 230 279 isocat:DC-2538 & rel:sameAs & dc:created \\ 231 rel:sameAs & owl:equivalentProperty & owl:sameAs \\280 owl:sameAs & rdfs:subPropertyOf & rel:sameAs \\ 232 281 $\rightarrow$ \\ 233 282 <lr1> & dc:created & 2012\^{}\^{}xs:year \\ 234 283 \end{example3} 235 236 \end{notex} 237 238 \noindent 239 What about other relations we may want to express? (Do we need them and if yes, where to put them? -- still in RR?) Examples: 240 241 \begin{example3} 242 cmd:MDCreator & owl:subClassOf & dcterms:Agent \\ 243 clavas:Organization & owl:subClassOf & dcterms:Agent \\ 244 <org1> & a & clavas:Organization \\ 245 \end{example3} 246 284 \end{comment} 285 286 287 %%%%%%%%%%%%%%%%%%%%% 247 288 \subsection{CMD instances} 248 In the next step, we want to express the individual CMD instances, the metadata records: 249 %, based on the previously defined entities on the schema level, but also entities from external ontologies. 289 In the next step, we want to express the individual CMD instances, the metadata records. 
250 290 251 291 \subsubsection {Resource Identifier} 252 292 293 \commentx{Matej: I still yearn for something like cmdm:Resource and cmdm:MDRecord} 294 \begin{example3} 295 <lr1> & a & cmdm:Resource; \\ 296 <lr1.cmd> & a & cmdm:MDRecord; 297 \end{example3} 298 253 299 It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>} from \code{cmd:MdSelfLink} element) could be used as the resource identifier. 254 If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}: 255 256 \begin{example3} 257 \_:anno1 & a & oa:Annotation; \\ 258 & oa:hasTarget & <lr1>; \\ 259 & oa:hasBody & <lr1.cmd>; \\ 260 & oa:motivatedBy & oa:describing \\ 300 If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}. 301 (Note also, that one MD record can describe multiple resources, this can be also easily accommodated in OpenAnnotation: 302 303 \commentx{Menzo: also there can be multiple resource proxies. Maybe we can use an RDF list?} 304 305 \begin{example3} 306 \_:anno1 & a & oa:Annotation ; \\ 307 & oa:hasTarget & <lr1a>, <lr1b> ; \\ 308 & oa:hasBody & <lr1.cmd> ; \\ 309 & oa:motivatedBy & oa:describing . 
\\ 261 310 \end{example3} 262 311 … … 266 315 267 316 \begin{example3} 268 <lr1.cmd> & dcterms:identifier & <lr1.cmd> ; \\269 & dcterms:creator ?? & "\var{\{cmd:MdCreator\}}"; \\270 & dcterms:publisher & <http://clarin.eu> , <provider-oai-accesspoint>; ??\\271 & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" ??\\317 <lr1.cmd> & dcterms:identifier & <lr1.cmd> ; \\ 318 & dcterms:creator & \var{\{cmd:MdCreator\}} ; \\ 319 & dcterms:publisher & <http://clarin.eu> ; \\ 320 & dcterms:created & \var{\{cmd:MdCreated\}} . \\ 321 \end{example3} 273 322 … … 277 326 278 327 \begin{example3} 279 <lr0.cmd> & a & ore:ResourceMap \\280 <lr0.cmd> & ore:describes & <lr0.agg> \\281 <lr0.agg> & a & ore:Aggregation \\282 & ore:aggregates & <lr1.cmd>, <lr2.cmd> ;\\283 \end{example3} 284 285 \ begin{notex}286 ?? Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation? 287 328 <lr0.cmd> & a & ore:ResourceMap . \\ 329 <lr0.cmd> & ore:describes & <lr0.agg> . \\ 330 <lr0.agg> & a & ore:Aggregation ; \\ 331 & ore:aggregates & <lr1.cmd>, <lr2.cmd> . \\ 332 \end{example3} 333 334 \commentx{Matej: Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation?} 335 336 \begin{comment} 288 337 This is rather complicated: skip this?: 289 338 Additionally the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection, of which given resource is part. … … 297 346 & ore:aggregates & <lr1.cmd>, <lr2.cmd>; \\ 298 347 \end{example3} 299 \end{ notex}348 \end{comment} 300 349 301 350 \subsubsection{Components -- nested structures} 302 303 There are two variants to express the tree structure of the CMD records, i.e. 
the containment relation between the components: 304 305 \begin{enumerate}%[a)] 306 \item the components are encoded as object property 307 308 \begin{example3} 309 <lr1> & cmd:Actor & \_:Actor1 \\ 310 <lr1> & cmd:Actor & \_:Actor2 \\ 311 \_:Actor1 & cmd:motherTongue & iso-639:aac \\ 312 \_:Actor2 & cmd:motherTongue & iso-639:deu \\ 313 \_:Actor1 & cmd:role & "Interviewer" \\ 314 \_:Actor2 & cmd:role & "Speaker" \\ 315 \end{example3} 316 317 318 \item a dedicated object property is used 319 320 \begin{example3} 321 \_:Actor1 & a & cmd:Actor \\ 322 <lr1> & cmd:contains & \_:Actor1 \\ 323 \end{example3} 324 325 \end{enumerate} 351 For expressing the tree structure of the CMD records, i.e. the containment relation between the components a dedicated property \code{cmd:contains} is used: 352 353 \begin{example3} 354 \_:actor1 & a & cmd:Actor . \\ 355 ?? <lr1> ? & cmd:contains & \_:actor1 . \\ 356 ?? <lr1.cmd> ? & cmd:contains & \_:actor1 . \\ 357 \end{example3} 326 358 327 359 \subsection{Elements, Fields, Values} 328 360 Finally, we want to integrate also the actual field values in the CMD records into the ontology. 329 330 \subsubsection{Predicates} 331 As explained before CMD elements are typed as \code{rdf:Property} with the corresponding data category expressed as annotation property: 332 361 As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed as \code{cmds:ElementValue} property and the corresponding data category expressed as annotation property. 362 363 While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples with the literal values mapped to semantic entities. Following example show the whole chain of statements from metamodel to literal value. The mapping process is detailed in \ref{sec:values2entities}. 364 365 \begin{example3} 366 cmd:Person & a & cmdm:Component . \\ 367 cmd:Organisation & a & cmdm:Element . 
\\ 368 cmd:hasOrganisationElementValue \\ 369 & rdfs:subProperyOf & cmdm:hasElementValue ; \\ 370 & rdfs:domain & cmd:Organisation ; \\ 371 & rdfs:range & xs:string . \\ 372 cmd:hasOrganisationElementEntity \\ 373 & rdfs:subProperyOf & cmdm:hasElementEntity ; \\ 374 & rdfs:domain & cmd:Organisation ; \\ 375 & rdfs:range & cmd:OrganisationElementEnity .\\ 376 \\ 377 \multicolumn{3}{l}{\# person (mentioned in a MD record) has an affiliation (cmd:Person/cmd:Organisation) } \\ 378 \_:pers & a & cmd:Person ; \\ 379 & cmdm:contains & \_:org . \\ 380 \_:org & a & cmd:Organisation ; \\ 381 & \multicolumn{2}{l}{cmd:hasOrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\ 382 & \multicolumn{2}{l}{ cmd:hasOrganisationElementEntity \quad <http://mpi.nl> . }\\ 383 384 <http://mpi.nl> & a & cmd:OrganisationElementEnity . 385 \end{example3} 386 387 \begin{comment} 333 388 \begin{example3} 334 389 cmd:timeCoverage & a & cmds:Element \\ 390 cmd:timeCoverageValue & a & cmds:ElementValue \\ 335 391 cmd:timeCoverage & dcr:datcat & isocat:DC-2502 \\ 336 <lr1> & cmd:timeCoverage & "19th century" \\ 337 338 \end{example3} 339 340 \subsubsection{Literal values -- data properties} 341 342 To generate triples with literal values is straightforward: 343 344 \begin{example3} 345 \var{lr:Resource} & \var{cmds:Property} & \var{xsd:string }\\ 346 <lr1> & cmd:Organisation & "MPI" 347 \end{example3} 348 349 \subsubsection{Mapping to entities -- object properties} 350 351 The more challenging but also more valuable aspect is to generate objectProperty triples with the literal values mapped to semantic entities: 352 353 \begin{example3} 354 \var{lr:Resource} & \var{cmds:Property} & \var{xsd:anyURI}\\ 355 <lr1> & cmd:Organisation\_? 
& <org1> \\ 356 402 \end{example3} 357 403 … … 360 406 i.e. \code{cmd:Organisation\_} additionally to \code{cmd:Organisation} 361 407 \end{notex} 362 363 The mapping process is detailed in \ref{sec:values2entities} 408 \end{comment} 364 409 365 410 … … 367 412 \section{Mapping field values to semantic entities} 368 413 \label{sec:values2entities} 414 415 \commentx{this is probably definitely too much for one abstract - so we could just anounce the need for this mapping process.} 369 416 370 417 This task is a prerequisite to be able to express also the CMD instance data in RDF. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) matching the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples. It involves following steps: … … 378 425 \end{enumerate} 379 426 380 %\begin{figure*}[!ht] 381 %\includegraphics[width=1\textwidth]{images/SMC_CMD2LOD} 382 %\caption{Sketch of the process of transforming the CMD metadata records to a RDF representation} 383 %\label{fig:smc_cmd2lod} 384 %\end{figure*} 385 386 \subsubsection{Identify vocabularies -- CLAVAS} 387 388 One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use dedicated annotation property (tentatively \code{@clavas:vocabulary}) in the schema or data category definition. For such a mechanism to work, the consuming applications (like metadata editor) need to be made aware of this convention and interpret it accordingly. 
427 This task is basically an application of ontology mapping method, trying to find for our ``anonymous'' concepts semantically equivalent concepts from other semantic resources / vocabularies. 428 % This is almost equivalent to the definition of ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}: ``for each concept (node) in ontology A [tries to] find a corresponding concept (node), which has the same or similar semantics, in ontology B and vice verse''. 429 430 The transformation of the data has been partly described in previous section. It can be trivially automatically converted into RDF triples as : 431 432 \begin{example3} 433 \_:organisation1 & \multicolumn{2}{l}{cmd:hasOrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\ 434 \end{example3} 435 436 However for the needs of the mapping task we propose to reduce and rewrite to retrieve distinct concept-value pairs: 437 438 \begin{example3} 439 \_:1 & a & cmd:OrganisationElementEnity . \\ 440 & skos:altLabel & "MPI"; 441 \end{example3} 442 443 \subsubsection{Identify vocabularies} 444 445 One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use dedicated annotation property in the schema or data category definition (tentatively labeled \code{@clavas:vocabulary}, cf: \emph{CMD 1.2}). 389 446 390 447 The primary provider of relevant vocabularies is \xne{ISOcat} and \xne{CLAVAS} -- a service for managing and providing vocabularies in SKOS format. However, in general we have to assume/consider a number of different sources. 391 448 392 \begin{note} Which sources? \end{note}393 394 Data in \xne{OpenSKOS} is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies, but rather all the entities are \code{skos:Concepts}. 
It also maintains links to other semantic resources:395 396 \begin{example3}397 <org1> & a & skos:Concept; \\398 & skos:exactMatch & <dbpedia/org1>, <lt-world/orgx>;399 \end{example3}400 401 449 \subsubsection{Lookup} 402 450 403 451 In abstract term, the lookup function takes as input the identifier of data category (or CMD element) and a literal string value and returns a list of potentially matching entities. Before actual lookup, there may have to be some string-normalizing preprocessing. 404 452 405 \begin{definition}[{signature of the lookup function}]453 %\begin{definition}[{signature of the lookup function}] 406 454 \begin{equation} 407 455 lookup \ ( \ DataCategory \ , \ Literal \ ) \quad \mapsto \quad ( \ Concept \ | \ Entity \ )* 408 456 \end{equation} 409 \end{definition}457 %\end{definition} 410 458 411 459 In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories, 412 which will be the result of the previous step. 413 414 \begin{definition}[{list available semantic resources for data categories}] 460 which will be the result of the previous step \ 461 462 463 %\begin{definition}{Required configuration data indicating data category to available } 415 464 \begin{equation} 416 465 DataCategory \quad \mapsto \quad SemanticResource+ 417 466 \end{equation} 418 \end{definition} 467 %\end{definition} 468 419 469 420 470 As for the implementation, in the initial setup the system could resort to the \code{find}-interface provided by \xne{OpenSKOS}. 421 However, in the long term a more general solution is required, a kind of hybrid \emph{vocabulary proxy service} that allows to search in a number of datasets, many of them distributed and available via different interfaces. The service has to be able to a) proxy search requests to a number of search interfaces (SRU, SPARQL), b) fetch, cache and search in datasets. 
422 \begin{note} 423 Figure \ref{fig:vocabulary_proxy} sketches the general setup. 424 \end{note} 425 426 %\begin{figure*}[!ht] 427 %\includegraphics[width=1\textwidth]{VocabularyProxy_clientapp} 428 %\caption{Sketch of a general setup for vocabulary lookup via a \xne{VocabularyProxy} service} 429 %\label{fig:vocabulary_proxy} 430 %\end{figure*} 471 However, in the long term a more general solution is required, a kind of hybrid \emph{vocabulary proxy service} that allows to search in a number of datasets, many of them distributed and available via different interfaces. 431 472 432 473 \subsubsection{Candidate evaluation} 433 The lookup is the most sensitive step in the process, being the gate between ``strings'' and semantic entities. In general, the resulting candidates cannot be seen as reliable matches and should undergo further scrutiny to ensure that the match is semantically correct. 434 435 One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences, in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in given resource description. 436 437 In some situation this ambiguities can be resolved algorithmically, but in the end in many cases it will require human curation of the generated data. In this respect, it is worth to note, that the CLARIN search engine VLO provides a feedback link, that allows even the normal user to report on problems or inconsistencies in CMD records. 474 The lookup is the most sensitive step in the process, being the gate between ``strings'' and semantic entities. In general, the resulting candidates cannot be seen as reliable matches and should undergo further scrutiny to ensure that the match is semantically correct. 
In some situation this ambiguities can be resolved algorithmically, but in the end in many cases it will require human curation of the generated data. 475 476 %One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences, in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in given resource description. 438 477 439 478 \section{Implementation} 440 479 441 \subsection{Transformation} 442 443 Set of XSL-stylesheets 444 445 \subsection{LOD Application} 446 447 Virtuoso 480 The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL-stylesheets. 481 Once the data is available it has to be stored and published in a RDF triple store. The most promising solution seems to be \xne{Virtuoso}, a integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). \cite{Haslhofer2011europeana} 482 483 % Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger, than ``just'' the original dataset. 484 448 485 449 486 \section{Conclusions and Future Work} 450 In this paper, we proposed an encoding of the whole of the CMD data domain in RDF, with special focus on the way how to translate the string values in metadata fields to corresponding semantic entities. 
Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration. 487 In this paper, we proposed an encoding of the whole of the CMD data domain in RDF, with special focus on the core model the general component schema. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration. 488 In the near future, a test with the whole CMD dataset will be performed. 489 And work on mapping values to entities. 490 491 With this new enhanced dataset, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility of exploring the dataset using external semantic resources. 492 The user can access the data indirectly by browsing external vocabularies/taxonomies, with which the data will be linked like vocabularies of organizations or taxonomies of resource types. 493 451 494 452 495