Changeset 3814 for CMDI-Interoperability


Timestamp:
10/18/13 21:41:04 (11 years ago)
Author:
vronk
Message:

bigger rewrite: updated according to latest rdf-code; lot of stuff commented out

File:
1 edited

  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex

    r3813 r3814  
    77\usepackage{framed}
    88
     9\usepackage{verbatim} % adds environment for commenting out blocks of text & for better verbatim
    910
    1011%\newcommand{\comment}[1]{}
    11 \newcommand{\comment}[1]{\textcolor{red}{#1}}
     12\newcommand{\commentx}[1]{\textcolor{red}{#1}}
    1213
    1314%%% PAGE DIMENSIONS
     
    3940{  \end{tabular} \end{shaded*} \end{ttfamily} }
    4041
    41 \definecolor{shadecolor}{rgb}{0.9,0.9,1.0}
     42\definecolor{shadecolor}{rgb}{0.95,0.95,1.0}
    4243
    4344% xml syntax highlighting
     
    5859\begin{document}
    5960
    60 \title{The CMD Cloud}
    61 
    62 \author{Menzo Windhouwer\inst{2} \and Matej Durco\inst{1}}
     61\title{Component Metadata to Linked Open Data}
     62
     63\author{Matej Durco\inst{1} \and Menzo Windhouwer\inst{2}}
    6364
    6465\institute{\email{matej.durco@assoc.oeaw.ac.at}\newline
     
    7374The hype/trend to Web of Data...
    7475
    75 Although semantic interoperability has been one of the main motivation for CLARIN Component Metadata Infrastructure, until now there has been no work on the obvious -- bringing CMDI to Semantic Web. We believe that expressing the whole of CMD data as an ontology, linking it with external semantic resources and provide as Linked Open Data, will allow to fully harness the power of semantic technologies and opens a new level of processing and exploring of CMD data. In this paper, we propose an expression of the whole of the CMD data domain (from meta model to individual metadata records) in RDF.
     76Although semantic interoperability has been one of the main motivations for the CLARIN Component Metadata Infrastructure, until now there has been no work on the obvious -- bringing CMDI to the Semantic Web. We believe that providing the whole of CMD data as Linked Open Data, linked with external semantic resources, will allow us to fully exploit the power of semantic technologies and open a new level of processing and exploring of CMD data. In this paper, we propose an expression of the whole of the CMD data domain (from meta model to individual metadata records) in RDF.
     77
     78\commentx{Menzo: I don't think we can express CMD data automatically as an ontology. For that, too many semantics are still hidden in CMDI. We are building blocks (e.g., RR/CLAVAS) that might enable us to do so in the future, but I think it's better now to go for CMD as LOD linked into the LOD cloud ...}
    7679
    7780\end{abstract}
     
    8487\section{Introduction}
    8588%
    86 \comment{Not sure how much of the introduction, CMD explain + Status of the data domain we want, may and need to reuse between the two papers...}
     89\commentx{Not sure how much of the introduction, CMD explain + Status of the data domain we want, may and need to reuse between the two papers...}
    8790
    8891
    8992The hype/trend to Web of Data...
    9093
    91 In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data.
    92 
    93 %
    94 \section{The Component Metadata Infrastructure}
     94In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF and interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This conversion lays the groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data.
     95
     96%
     97\section{The Component Metadata Infrastructure}\label{CMDI}
    9598%
    9699?
     
    119122Linked Data\cite{TimBL2006}, RDF\cite{RDF2004}
    120123
     124dbpedia, Yago - huge compiled knowledgebases to link to...
     125
    121126Ontology for Language Technology: LT-World \cite{Joerg2010}
    122127
    123128LOD cloud Cyganiak and Jentzsch\cite{Cyganiak2010}.
    124 
    125129
    126130
     
    138142\subsection{CMD specification}
    139143
    140 The main entity of the meta model is the CMD component and is typed as specialization of the \code{owl:Class}. CMD profile is basically a CMD component with some extra features, implying a specialization relation:
     144The main entity of the meta model is the CMD component, modelled as \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to an RDF property (as it holds the literal value), but given its complexity (e.g. attributes, relation to the containing component) it too has to be a \code{rdfs:Class}. The actual literal value is a property of the given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external vocabularies/semantic resources, the references to these entities are expressed in parallel properties of type \code{cmdm:ElementEntity}. The attributes are modelled analogously with \code{cmdm:Attribute, cmdm:AttributeValue, cmdm:AttributeEntity}.
     145
     146The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}, and analogously, for attributes of individual components and elements, with \code{cmdm:containsAttribute}.
    141147
    142148\label{table:rdf-spec}
    143149\begin{example3}
    144 cmds:Component & subClassOf  & owl:Class. \\
    145 cmds:Profile & subClassOf  & cmds:Component. \\
    146 cmds:Element & subClassOf  & rdf:Property. \\
     150@prefix cmdm: \textless http://www.clarin.eu/cmd/general.rdf\#\textgreater . \\
     151\\
     152cmdm:Component & a & rdfs:Class . \\
     153cmdm:Profile & rdfs:subClassOf & cmdm:Component . \\
     154cmdm:Element & a & rdfs:Class . \\
     155cmdm:Attribute & a & rdfs:Class .  \\
     156\\
     157cmdm:contains & a & rdf:Property ; \\
     158        & rdfs:domain & cmdm:Component ; \\
     159        & rdfs:range & cmdm:Component , cmdm:Element . \\
     160
     161%cmdm:containsAttribute & a &rdf:Property;
     162%          & rdfs:domain & :Component, :Element;
     163%          & rdfs:range & :Attribute.
     164
     165cmdm:Value & a & rdfs:Literal .  \\
     166cmdm:Entity & a & rdfs:Class .  \\
     167\\
     168cmdm:hasElementValue & a & rdf:Property ;  \\
     169              & rdfs:domain & cmdm:Element ;  \\
     170              & rdfs:range & cmdm:Value .  \\
     171 \\
     172\multicolumn{3}{l}{\# add a parallel separate property for the resolved entities}  \\
     173cmdm:hasElementEntity & a & rdf:Property ;  \\
     174              & rdfs:domain & cmdm:Element ;  \\
     175              & rdfs:range & cmdm:Entity .   \\
     176 \\
     177cmdm:hasAttributeValue & a & rdf:Property ;  \\
     178              & rdfs:domain & cmdm:Attribute ;  \\
     179              & rdfs:range & rdfs:Literal .  \\
     180
     181cmdm:hasAttributeEntity & a & rdf:Property ;  \\
     182              & rdfs:domain & cmdm:Attribute ;  \\
     183              & rdfs:range & cmdm:Entity .  \\
    147184\end{example3}
    148185
    149186\noindent
    150 This entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry):
     187These entities are used for modelling the actual profiles, components and elements as they are defined in the Component Registry.
     188For stand-alone/top components, the IDs as issued by the Component Registry can be used as entity IRIs. For ``inner'' components (that are defined as part of another component) and elements, the identifier is a concatenation of the ID of the parent top component and the dot-path to the given component/element (Actor: \code{cr:clarin.eu:cr1:c\_1271859438197\#Actor\_Languages.Actor\_Language}).
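
The identifier scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name \code{entity\_iri} is hypothetical, and the \code{cr:} prefix, the \code{\#} separator and the dot-joined path are simply taken over from the Actor example above.

```python
def entity_iri(top_component_id, path=()):
    """Build the (prefixed) IRI of a CMD component or element.

    top_component_id -- ID issued by the Component Registry for the
                        stand-alone/top component
    path             -- names leading from the top component down to the
                        inner component/element (empty for the top component)
    """
    for name in path:
        # Hypothetical safety check: a name containing a separator would
        # make the resulting dot-path ambiguous.
        if not name or "." in name or "#" in name:
            raise ValueError("invalid path segment: %r" % (name,))
    base = "cr:" + top_component_id
    if not path:
        return base
    return base + "#" + ".".join(path)


print(entity_iri("clarin.eu:cr1:c_1271859438197",
                 ["Actor_Languages", "Actor_Language"]))
```

For a top component the path is empty and the function returns just the prefixed registry ID.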
     189
     190\commentx{Matej: shouldn't we add the name of the component in the IRI for human-readability?
     191similar to how it is generated in profile XSDs: \textless xs:simpleType name="simpletype-MimeType-clarin.eu.cr1.c\_1290431694511"\textgreater }
     192
    151193
    152194\label{table:rdf-cmd}
     
    155197 & rdfs:label & "collection"; \\
    156198 & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\
    157 cmd:Actor       & a & cmds:Component. \\
    158 cmd:LanguageName  & a & cmds:Element. \\
    159 \end{example3}
     199cr:clarin.eu:cr1:c\_1271859438197\#Actor  \\
     200& a &cmdm:Component. \\
     201\end{example3}
     202
     203\commentx{Menzo: we need more context for inner components. In the example LanguageName looks well defined, but take a Component/Element like Title. Is it the title of a book or the title of a person. Only when the semantics are clear, e.g., with a dcr:datcat, one can ignore the context and collapse all Components/Elements to a single RDF class/property.}
     204\commentx{Matej: wouldn't that be remedied by cmdm:contains? or is it too much inferencing?}
    160205
    161206\begin{notex}
    162 Should the ID assigned in the Component Registry  for the CMD entities be used as identifier in RDF, or rather the verbose name? (if yes, how to ensure uniqueness – generate the name from the cmd-path?)
     207Menzo: inner components don't have IDs, so I propose a path built from the context up to a shareable component (we need some nice term for that; in the TDS I called it a top notion, so maybe a top component). The cmd prefix also needs to be bound to a component-specific URI. This URI contains the top component ID, e.g., \furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}.
    163208\end{notex}
    164209
    165210\subsection{Data Categories}
    166 Windhouwer \cite{Windhouwer2012_LDL} proposes to use the data categories as annotation properties:
     211Windhouwer \cite{Windhouwer2012_LDL} proposes to use the data categories as annotation properties
     212so as to avoid too strong semantic implications.
    167213
    168214\begin{example3}
     
    179225\end{example3}
    180226
     227\begin{comment}
    181228Encoding data categories as annotation properties is in contrast to the common approach seen with dublincore terms
    182229used usually directly as data properties:
     
    186233\end{example3}
    187234
    188 \noindent
    189 However, we argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications.\cite{Windhouwer2012_LDL}
     235However, we argue against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties.
    190236In a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:
    191237
     
    194240\#myPOS & owl:equivalentProperty & isocat:DC-1345. \\
    195241\#myNoun & owl:sameAs & isocat:DC-1333. \\
    196 \end{example3}
    197 
     242\end{example3}
     243 
     244\end{comment}
    198245
    199246\subsection{RELcat - Ontological relations}
    200 As described in \ref{def:rr} relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples\cite{SchuurmanWindhouwer2011}. A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
     247As described in \ref{CMDI}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module, the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples\cite{SchuurmanWindhouwer2011}. The following sample from the \xne{CMDI} relation set expresses a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
    201248
    202249\begin{example3}
     
    205252
    206253\noindent
    207 By design, the relations in Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications.
    208 
    209 \begin{notex}
    210 Does this mean, that I would say:
    211 \begin{example3}
    212 rel:sameAs & owl:equivalentProperty & owl:sameAs
    213 \end{example3}
    214 
    215 to enable the inference of the equivalences?
    216 
     254By design, the relations in the Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim to avoid too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications. The \code{rel:*} properties can be understood as an upper layer of a taxonomy of relation types, implying a subtyping:
     255
     256\begin{example3}
     257rel:sameAs & rdfs:subPropertyOf & owl:sameAs
     258\end{example3}
     259
     260\commentx{Menzo: I would use owl:sameAs rdfs:subPropertyOf rel:sameAs. I see the rel:* properties as an upper layer of a taxonomy of relation types. The RELcat types are loose and the OWL ones specific, hence the subtyping. In RELcat you might also query multiple graphs with multiple vocabularies; the various 'same-as' properties then still need to be distinguishable, but the general rel:sameAs needs to be created.}
     261
     262\commentx{Matej: strip these stipulations - rest of the subsection or just short referrer to SPIN rules ?}
     263\begin{comment}
    217264Is this correct:
    218265?? That means, that to be able to infer that a value in a CMD element also pertains to a given data category, e.g.:
     
    222269\end{example2}
    223270
     271\commentx{Menzo: yes. I do have some of the SPIN rules somewhere to generate those. My idea is that one takes a dcr:datcat annotated graph. This can be using OWL or SKOS or any other RDF vocabulary. This base graph should have been expanded depending on the reasoning one uses, i.e., all entailments are in place. The dcr:datcat can then be translated into rel:sameAs and all equivalences get expanded, so one can also query using ISOcat DCs.}
     272
    224273\noindent
    225274following facts need to be present in the ontology :
     
    229278cmd:PublicationYear &  owl:equivalentProperty & isocat:DC-2538 \\
    230279isocat:DC-2538 & rel:sameAs & dc:created \\
    231 rel:sameAs & owl:equivalentProperty &  owl:sameAs \\
     280owl:sameAs & rdfs:subPropertyOf &  rel:sameAs \\
    232281$\rightarrow$ \\
    233282<lr1> & dc:created & 2012\^{}\^{}xs:year \\
    234283\end{example3}
    235 
    236 \end{notex}
    237 
    238 \noindent
    239 What about other relations we may want to express? (Do we need them and if yes, where to put them? – still in RR?) Examples:
    240 
    241 \begin{example3}
    242 cmd:MDCreator   & owl:subClassOf & dcterms:Agent \\
    243 clavas:Organization & owl:subClassOf & dcterms:Agent \\
    244 <org1> & a & clavas:Organization \\
    245 \end{example3}
    246 
     284\end{comment}
     285
     286
     287%%%%%%%%%%%%%%%%%%%%%
    247288\subsection{CMD instances}
    248 In the next step, we want to express the individual CMD instances, the metadata records:
    249 %, based on the previously defined entities on the schema level, but also entities from external ontologies.
     289In the next step, we want to express the individual CMD instances, the metadata records.
    250290
    251291\subsubsection {Resource Identifier}
    252292
     293\commentx{Matej: I still yearn for something like cmdm:Resource and cmdm:MDRecord}
     294\begin{example3}
     295<lr1> & a  & cmdm:Resource; \\
     296<lr1.cmd> & a & cmdm:MDRecord;
     297\end{example3}
     298
    253299It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>}  from \code{cmd:MdSelfLink} element) could be used as the resource identifier.
    254 If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}:
    255 
    256 \begin{example3}
    257 \_:anno1  & a & oa:Annotation; \\
    258  & oa:hasTarget  & <lr1>; \\
    259  & oa:hasBody  & <lr1.cmd>; \\
    260  & oa:motivatedBy  & oa:describing \\
     300If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}.
     301Note also that one MD record can describe multiple resources; this can also be easily accommodated in OpenAnnotation:
     302
     303\commentx{Menzo: also there can be multiple resource proxies. Maybe we can use an RDF list?}
     304
     305\begin{example3}
     306\_:anno1  & a & oa:Annotation ; \\
     307 & oa:hasTarget  & <lr1a>, <lr1b> ; \\
     308 & oa:hasBody  & <lr1.cmd> ; \\
     309 & oa:motivatedBy  & oa:describing . \\
    261310\end{example3}
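
As a minimal sketch of how such an annotation could be serialized for one MD record and any number of described resources, consider the following Python helper. The function name \code{describing\_annotation} is hypothetical, and the sketch assumes that the \code{oa:} prefix is declared elsewhere in the output and that the identifiers are already valid IRI references.

```python
def describing_annotation(record_iri, resource_iris, node="_:anno1"):
    """Return Turtle stating that the MD record describes the resources.

    record_iri    -- identifier of the MD record (annotation body)
    resource_iris -- identifiers of the described resources (targets)
    node          -- blank node label used for the annotation itself
    """
    if not resource_iris:
        raise ValueError("at least one described resource is required")
    targets = ", ".join("<%s>" % iri for iri in resource_iris)
    lines = [
        "%s a oa:Annotation ;" % node,
        "    oa:hasTarget %s ;" % targets,
        "    oa:hasBody <%s> ;" % record_iri,
        "    oa:motivatedBy oa:describing .",
    ]
    return "\n".join(lines)


print(describing_annotation("lr1.cmd", ["lr1a", "lr1b"]))
```

Called with the identifiers from the example above, the helper reproduces the same four statements.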
    262311
     
    266315
    267316\begin{example3}
    268 <lr1.cmd> & dcterms:identifier  & <lr1.cmd>;  \\
    269  & dcterms:creator ??  & "\var{\{cmd:MdCreator\}}";  \\
    270  & dcterms:publisher  & <http://clarin.eu>, <provider-oai-accesspoint>; ?? \\
    271  & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" ?? \\
     317<lr1.cmd> & dcterms:identifier  & <lr1.cmd> ;  \\
     318 & dcterms:creator  & \var{\{cmd:MdCreator\}} ;  \\
     319 & dcterms:publisher  & <http://clarin.eu> ; \\
     320 & dcterms:created & \var{\{cmd:MdCreated\}} . \\
    272321\end{example3}
    273322
     
    277326
    278327\begin{example3}
    279 <lr0.cmd>  & a   & ore:ResourceMap \\
    280 <lr0.cmd> & ore:describes & <lr0.agg> \\
    281 <lr0.agg> & a   & ore:Aggregation \\
    282 & ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
    283 \end{example3}
    284 
    285 \begin{notex}
    286 ?? Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation?
    287 
     328<lr0.cmd>  & a   & ore:ResourceMap . \\
     329<lr0.cmd> & ore:describes & <lr0.agg> . \\
     330<lr0.agg> & a   & ore:Aggregation ; \\
     331& ore:aggregates  & <lr1.cmd>, <lr2.cmd> . \\
     332\end{example3}
     333
     334\commentx{Matej: Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation?}
     335
     336\begin{comment}
    288337This is rather complicated: skip this?:
    289338Additionally the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection, of which given resource is part.
     
    297346 & ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
    298347\end{example3}
    299 \end{notex}
     348\end{comment}
    300349       
    301350\subsubsection{Components – nested structures}
    302 
    303 There are two variants to express the tree structure of the CMD records, i.e. the containment relation between the components:
    304 
    305 \begin{enumerate}%[a)]
    306 \item the components are encoded as object property
    307 
    308 \begin{example3}
    309 <lr1>  & cmd:Actor  & \_:Actor1 \\
    310 <lr1>  & cmd:Actor  & \_:Actor2 \\
    311 \_:Actor1  & cmd:motherTongue  & iso-639:aac \\
    312 \_:Actor2  & cmd:motherTongue  & iso-639:deu \\
    313 \_:Actor1  & cmd:role & "Interviewer" \\
    314 \_:Actor2 & cmd:role & "Speaker" \\
    315 \end{example3}
    316 
    317 
    318 \item a dedicated object property is used
    319 
    320 \begin{example3}
    321 \_:Actor1  & a & cmd:Actor \\
    322 <lr1> & cmd:contains & \_:Actor1 \\
    323 \end{example3}
    324 
    325 \end{enumerate}
     351For expressing the tree structure of the CMD records, i.e. the containment relation between the components, a dedicated property \code{cmd:contains} is used:
     352
     353\begin{example3}
     354\_:actor1  & a & cmd:Actor . \\
     355?? <lr1> ? & cmd:contains & \_:actor1 . \\
     356?? <lr1.cmd> ? & cmd:contains & \_:actor1 . \\
     357\end{example3}
    326358
    327359\subsection{Elements, Fields, Values}
    328360Finally, we want to integrate also the actual field values in the CMD records into the ontology.
    329 
    330 \subsubsection{Predicates}
    331 As explained before CMD elements are typed as \code{rdf:Property} with the corresponding data category expressed as annotation property:
    332 
     361As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed as \code{cmds:ElementValue} property and the corresponding data category expressed as annotation property.
     362
     363While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples with the literal values mapped to semantic entities. The following example shows the whole chain of statements from metamodel to literal value. The mapping process is detailed in \ref{sec:values2entities}.
     364
     365\begin{example3}
     366cmd:Person & a & cmdm:Component . \\
     367cmd:Organisation & a &  cmdm:Element . \\
     368cmd:hasOrganisationElementValue  \\
     369& rdfs:subPropertyOf & cmdm:hasElementValue ; \\
     370        & rdfs:domain & cmd:Organisation ; \\
     371        & rdfs:range & xs:string . \\
     372cmd:hasOrganisationElementEntity \\
     373        & rdfs:subPropertyOf & cmdm:hasElementEntity ; \\
     374        & rdfs:domain & cmd:Organisation ; \\
     375        & rdfs:range & cmd:OrganisationElementEnity .\\
     376\\
     377\multicolumn{3}{l}{\# person (mentioned in a MD record) has an affiliation (cmd:Person/cmd:Organisation) } \\
     378\_:pers  & a & cmd:Person ; \\
     379        & cmdm:contains & \_:org . \\
     380\_:org & a & cmd:Organisation ; \\
     381        & \multicolumn{2}{l}{cmd:hasOrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\
     382        & \multicolumn{2}{l}{ cmd:hasOrganisationElementEntity  \quad <http://mpi.nl> . }\\
     383
     384<http://mpi.nl> & a  & cmd:OrganisationElementEnity .
     385\end{example3}
     386
     387\begin{comment}
    333388\begin{example3}
    334389cmd:timeCoverage  & a   & cmds:Element \\
     390cmd:timeCoverageValue & a & cmds:ElementValue \\
    335391cmd:timeCoverage  & dcr:datcat  & isocat:DC-2502 \\
    336 <lr1>  & cmd:timeCoverage  & "19th century" \\
    337 
    338 \end{example3}
    339 
    340 \subsubsection{Literal values -- data properties}
    341 
    342 To generate triples with literal values is straightforward:
    343 
    344 \begin{example3}
    345 \var{lr:Resource} & \var{cmds:Property} & \var{xsd:string }\\
    346 <lr1> & cmd:Organisation & "MPI"
    347 \end{example3}
    348 
    349 \subsubsection{Mapping to entities -- object properties}
    350 
    351 The more challenging but also more valuable aspect is to generate objectProperty triples with the literal values mapped to semantic entities:
    352 
    353 \begin{example3}
    354 \var{lr:Resource} & \var{cmds:Property} & \var{xsd:anyURI}\\
    355 <lr1> & cmd:Organisation\_? & <org1> \\
     392<lr1> & cmd:contains & \_:timeCoverage1 \\
     393\_:timeCoverage1 & a & cmd:timeCoverage \\
     394\_:timeCoverage1 & cmd:timeCoverageValue & "19th century" \\
     395\end{example3}
     396
     397\commentx{Menzo: no need to repeat dcr:datcat in the instance.}
     398
     399\begin{example3}
     400\var{cmds:Element} & \var{cmds:ElementValue\_?} & \var{xsd:anyURI}\\
     401\_:organisation1 & cmd:OrganisationValue\_? & <org1> \\
    356402\end{example3}
    357403
     
    360406i.e. \code{cmd:Organisation\_} additionally to \code{cmd:Organisation}
    361407\end{notex}
    362 
    363 The mapping process is detailed in \ref{sec:values2entities}
     408\end{comment}
    364409
    365410
     
    367412\section{Mapping field values to semantic entities}
    368413\label{sec:values2entities}
     414
     415\commentx{this is probably too much for one abstract - so we could just announce the need for this mapping process.}
    369416
    370417This task is a prerequisite to be able to express also the CMD instance data in RDF. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) matching the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples. It involves following steps:
     
    378425\end{enumerate}
    379426
    380 %\begin{figure*}[!ht]
    381 %\includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
    382 %\caption{Sketch of the process of transforming the CMD metadata records to a RDF representation}
    383 %\label{fig:smc_cmd2lod}
    384 %\end{figure*}
    385 
    386 \subsubsection{Identify vocabularies  – CLAVAS}
    387 
    388 One generic way to indicate vocabularies for given metadata fields or data categories being discussed in the CMD community is to use dedicated annotation property (tentatively \code{@clavas:vocabulary}) in the schema or data category definition. For such a mechanism to work, the consuming applications (like metadata editor) need to be made aware of this convention and interpret it accordingly.
     427This task is basically an application of an ontology mapping method: for our ``anonymous'' concepts, it tries to find semantically equivalent concepts in other semantic resources / vocabularies.
     428% This is almost equivalent to the definition of ontology mapping function as given by \cite{EhrigSure2004, amrouch2012survey}: ``for each concept (node) in ontology A [tries to] find a corresponding concept (node), which has the same or similar semantics, in ontology B and vice verse''.
     429
     430The transformation of the data has been partly described in the previous section. The data can be trivially and automatically converted into RDF triples such as:
     431
     432\begin{example3}
     433\_:organisation1 & \multicolumn{2}{l}{cmd:hasOrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\
     434\end{example3}
     435
     436However, for the needs of the mapping task we propose to reduce and rewrite these triples to retrieve distinct concept-value pairs:
     437
     438\begin{example3}
     439\_:1 & a  & cmd:OrganisationElementEnity ; \\
     440   & skos:altLabel & "MPI" .
     441\end{example3}
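
The reduction to distinct concept-value pairs can be illustrated with a small Python sketch. It is not the actual conversion code: the function name \code{distinct\_concept\_values} is hypothetical, and the input is assumed to be a plain sequence of (element class, literal value) pairs already extracted from the instance triples.

```python
from collections import Counter


def distinct_concept_values(pairs):
    """Collapse repeated (concept, value) pairs into distinct ones.

    pairs -- iterable of (element class, literal value) tuples, one per
             occurrence in the metadata records
    Returns a sorted list of ((concept, value), number_of_occurrences).
    """
    counts = Counter(pairs)
    return sorted(counts.items())


occurrences = [
    ("cmd:Organisation", "MPI"),
    ("cmd:Organisation", "MPI"),
    ("cmd:Organisation", "MPI Nijmegen"),
]
for (concept, value), n in distinct_concept_values(occurrences):
    print(concept, repr(value), n)
```

Keeping the occurrence count alongside each pair is a design choice of this sketch; it could help to prioritise frequent values in the subsequent lookup step.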
     442
     443\subsubsection{Identify vocabularies}
     444
     445One generic way to indicate vocabularies for given metadata fields or data categories, currently being discussed in the CMD community, is to use a dedicated annotation property in the schema or data category definition (tentatively labeled \code{@clavas:vocabulary}, cf: \emph{CMD 1.2}).
    389446
    390447The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} – a service for managing and providing vocabularies in SKOS format. However, in general we have to consider a number of different sources.
    391448
    392 \begin{note} Which sources? \end{note}
    393 
    394 Data in \xne{OpenSKOS} is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies, but rather all the entities are \code{skos:Concepts}. It also maintains links to other semantic resources:
    395 
    396 \begin{example3}
    397 <org1> & a   & skos:Concept; \\
    398 & skos:exactMatch  & <dbpedia/org1>, <lt-world/orgx>;
    399 \end{example3}
    400 
    401449\subsubsection{Lookup}
    402450
    403451In abstract terms, the lookup function takes as input the identifier of a data category (or CMD element) and a literal string value and returns a list of potentially matching entities. Before the actual lookup, some string-normalizing preprocessing may be necessary.
    404452
    405 \begin{definition}[{signature of the lookup function}]
     453%\begin{definition}[{signature of the lookup function}]
    406454\begin{equation}
    407455lookup \ ( \ DataCategory \ ,  \ Literal \ )  \quad \mapsto \quad ( \ Concept \ | \ Entity \ )*
    408456\end{equation}
    409 \end{definition}
     457%\end{definition}
    410458
    411459In the implementation, there needs to be additional initial configuration input, identifying datasets for given data categories,
    412 which will be the result of the previous step.
    413 
    414 \begin{definition}[{list available semantic resources for data categories}]
     460which will be the result of the previous step.
     461
     462
     463%\begin{definition}{Required configuration data indicating data category to available }
    415464\begin{equation}
    416465DataCategory \quad \mapsto \quad SemanticResource+
    417466\end{equation}
    418 \end{definition}
     467%\end{definition}
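
The two mappings above can be combined into a small Python sketch of the lookup step. Everything here is an assumption made for illustration: the function names, the in-memory dictionaries standing in for the configuration and for the vocabularies, and the chosen normalization (whitespace collapsing and case folding). A real implementation would query a vocabulary service instead.

```python
def normalize(literal):
    """Collapse whitespace and ignore case before comparing labels."""
    return " ".join(literal.split()).casefold()


def lookup(data_category, literal, resources_for, labels_in):
    """Return candidate entities for a (data category, literal) pair.

    resources_for -- dict: data category -> list of semantic resource names
                     (the configuration, DataCategory -> SemanticResource+)
    labels_in     -- dict: semantic resource name ->
                     dict: normalized label -> list of entity IRIs
    """
    key = normalize(literal)
    candidates = []
    for resource in resources_for.get(data_category, []):
        for entity in labels_in.get(resource, {}).get(key, []):
            if entity not in candidates:  # keep order, drop duplicates
                candidates.append(entity)
    return candidates


# Toy configuration and vocabulary; all names below are illustrative only.
RESOURCES_FOR = {"organisation": ["org-vocabulary"]}
LABELS_IN = {"org-vocabulary": {"mpi": ["http://mpi.nl"]}}

print(lookup("organisation", "  MPI ", RESOURCES_FOR, LABELS_IN))
```

An unknown data category simply yields an empty candidate list, which matches the $(\ Concept \ | \ Entity \ )*$ result type of the signature.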
     468
    419469
    420470As for the implementation, in the initial setup the system could resort to the \code{find}-interface provided by \xne{OpenSKOS}.
    421 However, in the long term a more general solution is required, a kind of hybrid \emph{vocabulary proxy service} that allows to search in a number of datasets, many of them distributed and available via different interfaces. The service has to be able to a) proxy search requests to a number of search interfaces (SRU, SPARQL), b) fetch, cache and search in datasets.
    422 \begin{note}
    423 Figure \ref{fig:vocabulary_proxy} sketches the general setup.
    424 \end{note}
    425 
    426 %\begin{figure*}[!ht]
    427 %\includegraphics[width=1\textwidth]{VocabularyProxy_clientapp}
    428 %\caption{Sketch of a general setup for vocabulary lookup via a \xne{VocabularyProxy} service}
    429 %\label{fig:vocabulary_proxy}
    430 %\end{figure*}
     471However, in the long term a more general solution is required: a kind of hybrid \emph{vocabulary proxy service} that allows searching in a number of datasets, many of them distributed and available via different interfaces.
    431472
    432473\subsubsection{Candidate evaluation}
    433 The lookup is the most sensitive step in the process, being the gate between ``strings'' and semantic entities. In general, the resulting candidates cannot be seen as reliable matches and should undergo further scrutiny to ensure that the match is semantically correct.
    434 
    435 One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences, in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in given resource description.
    436 
    437 In some situation this ambiguities can be resolved algorithmically, but in the end in many cases it will require human curation of the generated data. In this respect, it is worth to note, that the CLARIN search engine VLO provides a feedback link, that allows even the normal user to report on problems or inconsistencies in CMD records.
     474The lookup is the most sensitive step in the process, being the gate between ``strings'' and semantic entities. In general, the resulting candidates cannot be seen as reliable matches and should undergo further scrutiny to ensure that the match is semantically correct. In some situations these ambiguities can be resolved algorithmically, but in many cases human curation of the generated data will be required.
     475
     476%One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences, in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in given resource description.
    438477
    439478\section{Implementation}
    440479
    441 \subsection{Transformation}
    442 
    443 Set of XSL-stylesheets
    444 
    445 \subsection{LOD Application}
    446 
    447 Virtuoso
     480The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL-stylesheets.
     481Once the data is available, it has to be stored and published in an RDF triple store. The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store'')\cite{Haslhofer2011europeana}.
     482
     483% Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger, than ``just'' the original dataset.
     484
    448485
    449486\section{Conclusions and Future Work}
    450 In this paper, we proposed an encoding of the whole of the CMD data domain in RDF, with special focus on the way how to translate the string values in metadata fields to corresponding semantic entities. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration.
     487In this paper, we proposed an encoding of the whole of the CMD data domain in RDF, with special focus on the core model, the general component schema. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data and the implications for real semantic ontology-based data exploration.
     488In the near future, a test with the whole CMD dataset will be performed.
     489Work on mapping values to entities will follow.
     490
     491With this new enhanced dataset, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility of exploring the dataset using external semantic resources.
     492The user can access the data indirectly by browsing external vocabularies/taxonomies with which the data will be linked, such as vocabularies of organizations or taxonomies of resource types.
     493
    451494
    452495