Timestamp:
09/30/13 11:54:57 (11 years ago)
Author:
vronk
Message:

major reorganization, detailing of Design-chapters; abstract_en

File:
1 edited

  • SMC4LRT/chapters/Design_SMCinstance.tex

    r3553 r3638  
    1 \chapter{System design - mapping on instance level}
     1\chapter{Mapping on instance level, CMD as LOD}
    22\label{ch:design-instance}
     3
    34\begin{quotation}
    4 I do think that ISOcat, CLAVAS, RELcat, an actual language
     5I do think that ISOcat, CLAVAS, RELcat and actual language
    56resource all provide a part of the semantic network.
    67
     
    1112relevant parts in a triple store and do your SPARQL/reasoning on it. Well
    1213that's where I'm ultimately heading with all these registries related to
    13 semantic interoperability ... I hope ;-)
     14semantic interoperability ... I hope ;-)\cite{Menzo2013mail}
    1415\end{quotation}
    15 \cite{Menzo2013mail}
    16 
    17 
    18 Linked Data - Express dataset in RDF
    19 
    20 
    21 Partly as by-product of the entities-mapping effort we will get the metadata rendered in RDF, linked with
    22 So theoretically we then only need to provide them ``on the web'', to make them a nucleus of the LinkedData-Cloud.
    23 
    24 
    25 Technical aspects (RDF-store?) / interface (ontology browser?)
    26 
    27 \todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}
    28 
    29 \todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
    30 
    31 defining the Mapping:
    32 \begin{enumerate}
    33 \item convert to RDF
    34 translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal]
    35 \item map: \#mdrecord \#property literal  $\rightarrow$ [\#mdrecord \#property \#entity]
    36 \end{enumerate}
    37 
    38 \begin{figure*}[!ht]
    39 \includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
    40 \caption{The process of transforming the CMD metadata records to and RDF representation}
    41 \label{fig:smc_cmd2lod}
    42 \end{figure*}
    43 
     16
     17As described in previous chapters (\ref{ch:infrastructure}, \ref{ch:design_schema}), semantic interoperability is one of the main motivations for the CMD infrastructure. However, this machinery pertains mostly to the schema level; the actual values in the fields of CMD instances remain ``just strings''. This is the case even though the problem of different labels for semantically equivalent or even identical entities is even more virulent on the instance level. While for a number of metadata fields the value domain can be enforced through schema validation, some important fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities (as the instance data shows, some organizations are referred to by more than 20 different labels or spelling variants), prompting an urgent need for better means of harmonizing the constrained-field values.
     18
     19One potential remedy is the use of reference datasets -- controlled vocabularies, taxonomies, ontologies and such. In fact, this is a very common approach, be it the authority files in the library world, or domain-specific reference vocabularies maintained by practically every research community. Not being as strict as schema definitions, they cannot be used for validation, but they still help to harmonize the data by offering preferred labels and identifiers for entities.
     20
     21In this chapter, we explore how this general approach can be employed for our specific problem of harmonizing the (literal) values in selected instance fields and mapping them to entities defined in corresponding vocabularies. This proposal is furthermore embedded in a more general effort to \textbf{express the whole of the CMD data domain (model and instances) in RDF} constituting one large ontology interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies). This result lays a foundation for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006}
     22as well as for real semantic (ontology-driven) search and exploration of the data.
     23
     24The following section \ref{sec:cmd2rdf} lays out how individual parts of the CMD framework can be expressed in RDF.
     25In \ref{sec:values2entities} we investigate in further detail the above-mentioned critical aspect of the effort, namely the task of translating the string values in metadata fields to corresponding semantic entities. Finally, the technical aspects of providing the resulting ontology as LOD and the implications for an ontology-driven semantic search are tackled briefly in \ref{sec:lod} and \ref{semantic-search} respectively.
    4426
    4527\section{CMD to RDF}
    46 \label{ch:cmd2rdf}
    47 
    48 A few modules/components of the CMD infrastructure are dedicated to semantic interoperability. The DCR as global registry for concepts, CLAVAS for maintaining controlled vocabularies in SKOS format, RR for expressing arbitrary relations between concepts.
    49 However, the actual values in the CMD instances are ``just strings'' and for the most part cannot be validated by the schema, although they often could be mapped to a corresponding controlled vocabulary.
    50 
    51 Thus one aim of this work is to express the whole of the CMD data (model and instances) in RDF. This would allow to map the string values in selected fields to semantic entities, which in turn would allow real semantic (ontology-driven) search and bring about a linking with the web of data \todocite{Web of Data, TimBL}
    52 
    53 The following chapter lays out, how individual parts of the CMD framework can be expressed in RDF
     28\label{sec:cmd2rdf}
     29In this section, RDF encoding is proposed for all levels of the CMD data domain:
     30
     31\begin{itemize}
     32\item CMD meta model
     33\item profile definitions
     34\item the administrative and structural information of CMD records
     35\item individual values in the fields of the CMD records
     36\end{itemize}
    5437
    5538\subsection{CMD specification}
    56 The meta model
     39
     40The main entity of the meta model is the CMD component, which is typed as a specialization of \code{owl:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation:
    5741
    5842\label{table:rdf-spec}
    59 \begin{example}
    60 cmd\_spec:Profile & subClassOf  & owl:Class. \\
    61 cmd\_spec:Component & subClassOf  & owl:Class. \\
    62 cmd\_spec:Element & subClassOf  & rdf:Property. \\
    63 \end{example}
    64 
    65 Typing the profiles, components and elements:
     43\begin{example3}
     44cmds:Component & subClassOf  & owl:Class. \\
     45cmds:Profile & subClassOf  & cmds:Component. \\
     46cmds:Element & subClassOf  & rdf:Property. \\
     47\end{example3}
     48
     49\noindent
     50These entities are used for typing the actual profiles, components and elements (as they are defined in the Component Registry):
    6651
    6752\label{table:rdf-cmd}
    68 \begin{example}
    69 cmd:collection & a & cmd\_spec:Profile; \\
    70  & rdfs:label & `collection'; \\
     53\begin{example3}
     54cmd:collection & a & cmds:Profile; \\
     55 & rdfs:label & "collection"; \\
    7156 & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\
    72 cmd:Actor       & a & cmd\_spec:Component. \\
    73 cmd:LanguageName  & a & cmd\_spec:Element. \\
    74 \end{example}
     57cmd:Actor       & a & cmds:Component. \\
     58cmd:LanguageName  & a & cmds:Element. \\
     59\end{example3}
    7560
    7661\begin{note}
    77 Should the ID assigned in the component registry  for the CMD entities  used as ID in rdf, or rather the verbose name? (if yes, how to ensure uniqueness – generate the name from the cmd-path?)
     62Should the ID assigned in the Component Registry for the CMD entities be used as the identifier in RDF, or rather the verbose name? (In the latter case, how to ensure uniqueness – generate the name from the cmd-path?)
    7863\end{note}
    7964
    8065\subsection{Data Categories}
    81 Windhouwer (2012) proposes to use the data categories as annotation properties.
    82 Definition of the annotation property \code{dcr:datcat}
    83 
    84 \begin{example}
     66Windhouwer \cite{Windhouwer2012_LDL} proposes to use the data categories as annotation properties:
     67
     68\begin{example3}
    8569dcr:datcat & a  & owl:AnnotationProperty ; \\
    8670 & rdfs:label  & "data category"@en ; \\
    87  & rdfs:comment  & "This resource is equivalent to  \\
    88 this data category."@en ; \\
    89  & skos:note  & "The data category should be  \\
    90  &   & identified by its PID."@en ; \\
    91 \end{example}
    92 
    93 Still, leaving open the possibility for “a stronger semantic link” :
    94 \begin{quotation}
    95 By default the RDF export inserts \code{dcr:datcat} annotation properties to maintain the link between the generated RDF resources and the used Data Categories. However, it is possible to also maintain a stronger semantic link when the RDF resources will be used as OWL (2) classes, properties or individuals.
    96 \end{quotation}
    97 
    98 For classes the OWL 2 \code{owl:equivalentClass} can be used, for example:
    99 
    100 \begin{example}
    101 \#myPOS & owl:equivalentClass & isocat:DC-1345. \\
    102 \end{example}
    103 
    104 For properties OWL 2 provides \code{owl:equivalentProperty}, for example:
    105 
    106 \begin{example}
    107 \#myPOS & owl:equivalentProperty & isocat:DC-1345. \\
    108 \end{example}
    109 
    110 Finally \code{owl:sameAs} can be used for individuals, for example:
    111 
    112 \begin{example}
    113 \#myNoun & owl:sameAs & isocat:DC-1333. \\
    114 \end{example}
    115 
    116 
    117 ISOcat provides a RDF representation of the data categories :
    118 
    119 \begin{example}
     71 & rdfs:comment  & "This resource is equivalent to  this data category."@en ; \\
     72 & skos:note  & "The data category should be identified by its PID."@en ; \\
     73\end{example3}
     74
     75That implies that the \code{@ConceptLink} attribute on CMD elements and components as used in the CMD profiles to reference the data category would be modelled as:
     76
     77\begin{example3}
     78cmd:LanguageName & dcr:datcat & isocat:DC-2484. \\
     79\end{example3}
     80
     81Encoding data categories as annotation properties contrasts with the common approach seen with dublincore terms,
     82which are usually used directly as data properties:
     83
     84\begin{example3}
     85<lr1> & dc:title & "Language Resource 1"
     86\end{example3}
     87
     88\noindent
     89Analogously, we could model \xne{ISOcat} data categories as data properties, i.e. metadata elements referencing ISOcat data categories could be encoded as follows:
     90
     91\begin{example3}
     92<lr1> & isocat:DC-2502 & "19th century"
     93\end{example3}
     94
     95\noindent
     96However, Windhouwer\cite{Windhouwer2012_LDL} argues against direct mapping of complex data categories to data properties and in favour of modelling data categories as annotation properties, so as to avoid too strong semantic implications.
     97
     98This raises the converse question of whether to handle all data categories uniformly, which would mean encoding dublincore terms as annotation properties as well. However, the pragmatic view is to encode the data in line with the prevailing approach, i.e. to express dublincore terms directly as data properties.
     99
     100
     101\noindent
     102The REST web service of \xne{ISOcat} provides an RDF representation of the data categories:
     103
     104\begin{example3}
    120105isocat:languageName & dcr:datcat & isocat:DC-2484; \\
    121106 & rdfs:label & "language name"@en; \\
    122107 & rdfs:comment & "A human understandable..."@en; \\
    123108 & 
  \\
    124 \end{example}
     109\end{example3}
     110
     111However, this is only meant as a template, as stated in the explanatory comment of the exported data:
     112
     113\begin{quotation}
     114By default the RDF export inserts \code{dcr:datcat} annotation properties to maintain the link between the generated RDF resources and the used Data Categories. However, it is possible to also maintain a stronger semantic link when the RDF resources will be used as OWL (2) classes, properties or individuals.
     115\end{quotation}
     116
     117So in a specific (OWL 2) application the relation with the data categories can be expressed as \code{owl:equivalentClass} for classes, \code{owl:equivalentProperty} for properties or \code{owl:sameAs} for individuals:
     118
     119\begin{example3}
     120\#myPOS & owl:equivalentClass & isocat:DC-1345. \\
     121\#myPOS & owl:equivalentProperty & isocat:DC-1345. \\
     122\#myNoun & owl:sameAs & isocat:DC-1333. \\
     123\end{example3}
     124
     125
     126\subsection{RELcat - Ontological relations}
     127As described in \ref{def:rr}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module, the Relation Registry \xne{RELcat}. The relations there are grouped into relation sets and stored as RDF triples\cite{SchuurmanWindhouwer2011}. The following sample relation is taken from the \xne{CMDI} relation set, which expresses a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
     128
     129\begin{example3}
     130isocat:DC-2538 & rel:sameAs & dct:date
     131\end{example3}
     132
     133\noindent
     134By design, the relations in the Relation Registry are not expressed with predicates from known vocabularies like \xne{SKOS} or \xne{OWL}, again with the aim of avoiding too strong semantic implications. This leaves leeway for further specialization of the relations in specific applications.
    125135
    126136\begin{note}
    127 Output from isocat is only meant as template!
    128 
    129 In the RDF representation, the data categories seem to be referenced by their mnemonicIdentifier (rdf:ID=”languageName”) how is this guaranteed URI and how is the data category meant to be referred to?
     137Does this mean, that I would say:
     138\begin{example3}
     139rel:sameAs & owl:equivalentProperty & owl:sameAs
     140\end{example3}
     141
     142to enable the inference of the equivalences?
     143
     144Is this correct:
    130145\end{note}
    131 
    132 Finally, the ConceptLink attribute used in the CMD profiles to reference the data category is modelled as:
    133 
    134 \begin{example}
    135 cmd:LanguageName & dcr:datcat & isocat:DC-248. \\
    136 \end{example}
    137 
     146?? That means, that to be able to infer that a value in a CMD element also pertains to a given data category, e.g.:
     147
     148\begin{example2}
     149 cmd:PublicationYear = 2012 $\rightarrow$ & dc:created = 2012
     150\end{example2}
     151
     152\noindent
     153the following facts need to be present in the ontology:
     154
     155\begin{example3}
     156<lr1> & cmd:PublicationYear & 2012\^{}\^{}xs:gYear \\
     157cmd:PublicationYear &  owl:equivalentProperty & isocat:DC-2538 \\
     158isocat:DC-2538 & rel:sameAs & dc:created \\
     159rel:sameAs & owl:equivalentProperty &  owl:sameAs \\
     160$\rightarrow$ \\
     161<lr1> & dc:created & 2012\^{}\^{}xs:gYear \\
     162\end{example3}
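
The inference chain above can be traced with a small forward-chaining sketch. This is a simplified approximation and not full OWL semantics: it assumes that \code{owl:equivalentProperty} and \code{owl:sameAs} between two properties both allow one property to be substituted for the other in the predicate position. The function name \code{infer} and the plain-string encoding of the terms are hypothetical, introduced only for illustration.

```python
EQUIVALENT = "owl:equivalentProperty"
SAME_AS = "owl:sameAs"


def infer(facts):
    """Naive forward chaining: substitute equivalent predicates until nothing new appears.

    Assumption (not full OWL semantics): owl:equivalentProperty and owl:sameAs
    are both treated as symmetric predicate-substitution rules.
    """
    closure = set(facts)
    while True:
        # Pairs of terms that may replace each other in predicate position.
        aliases = {(a, b) for a, p, b in closure if p in (EQUIVALENT, SAME_AS)}
        aliases |= {(b, a) for a, b in aliases}
        # Rewrite every triple whose predicate has an alias.
        new = {(s, q, o) for s, p, o in closure for a, q in aliases if a == p}
        if new <= closure:
            return closure
        closure |= new


facts = {
    ("<lr1>", "cmd:PublicationYear", "2012"),
    ("cmd:PublicationYear", EQUIVALENT, "isocat:DC-2538"),
    ("isocat:DC-2538", "rel:sameAs", "dc:created"),
    ("rel:sameAs", EQUIVALENT, SAME_AS),
}

closure = infer(facts)
print(("<lr1>", "dc:created", "2012") in closure)
```

Under these assumptions the last of the four facts is what turns the RELcat relation into an \code{owl:sameAs} statement; without it the chain stops at \code{isocat:DC-2538}.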
     163
     164\noindent
     165What about other relations we may want to express? (Do we need them and if yes, where to put them? – still in RR?) Examples:
     166
     167\begin{example3}
     168cmd:MDCreator   & rdfs:subClassOf & dcterms:Agent \\
     169clavas:Organization & rdfs:subClassOf & dcterms:Agent \\
     170<org1> & a & clavas:Organization \\
     171\end{example3}
    138172
    139173\subsection{CMD instances}
    140 
     174In the next step, we want to express the individual CMD instances, the metadata records, making use of the previously defined entities on the schema level, but also entities from external ontologies.
    141175
    142176\subsubsection {Resource Identifier}
    143177
    144 It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID . Alternatively we could use the PID of the MD record ( \code{<lr1.cmd>}  from \code{<cmd:MdSelfLink>}) as the resource identifier.
    145 The relationship between the resource and the metadata record could be expressed as an annotation :
    146 
    147 \begin{example}
     178It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>}  from \code{cmd:MdSelfLink} element) could be used as the resource identifier.
     179If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}:
     180
     181\begin{example3}
    148182\_:anno1  & a & oa:Annotation; \\
    149183 & oa:hasTarget  & <lr1>; \\
    150184 & oa:hasBody  & <lr1.cmd>; \\
    151185 & oa:motivatedBy  & oa:describing \\
    152 \end{example}
    153 
    154 \subsection{Provenance}
    155 
    156 Use the information from CMD-Header for information about the modelled data  :
    157 
    158 \begin{example}
    159 <lr1.cmd>
    160  & dcterms:identifier  & <lr1.cmd>;  \\
    161  & dcterms:creator ??  & "\{<cmd:MdCreator>\}";  \\
    162 \end{example}
    163 
    164 Other proposed fields:
    165 
    166 \begin{example}
    167  & dcterms:publisher  & <http://clarin.eu>,  \\
    168  & <provider-oai-accesspoint>; ?? \\
    169  & dcterms:created/modified “\{<cmd:MdCreated>\}” ?? \\
    170 \end{example}
     186\end{example3}
     187
     188\subsubsection{Provenance}
     189
     190The information from \code{cmd:Header} represents the provenance information about the modelled data:
     191
     192\begin{example3}
     193<lr1.cmd> & dcterms:identifier  & <lr1.cmd>;  \\
     194 & dcterms:creator ??  & "\var{\{cmd:MdCreator\}}";  \\
     195 & dcterms:publisher  & <http://clarin.eu>, <provider-oai-accesspoint>; ?? \\
     196 & dcterms:created /dcterms:modified? & "\var{\{cmd:MdCreated\}}" ?? \\
     197\end{example3}
    171198
    172199\subsubsection{Hierarchy ( Resource Proxy – IsPartOf)}
    173 In CMD, <cmd:ResourceProxyList> is used to express both collection hierarchy and point to resource(s) described by the MD record. This can be modeled as OAI-ORE Aggregation\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}
    174 \furl{http://openannotation.org/spec/core/core.html\#Motivations}
     200In CMD, the \code{cmd:ResourceProxyList} structure is used both to express the collection hierarchy and to point to the resource(s) described by the MD record. This can be modelled as \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}
    175201:
    176202
    177 \begin{example}
     203\begin{example3}
    178204<lr0.cmd>  & a   & ore:ResourceMap \\
    179205<lr0.cmd> & ore:describes & <lr0.agg> \\
    180206<lr0.agg> & a   & ore:Aggregation \\
    181 ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
    182 \end{example}
    183 
    184  
     207& ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
     208\end{example3}
     209
     210\noindent
    185211?? Should both collection hierarchy and resource-pointers (collection and resource MD records) be encoded as ore:Aggregation?
    186 Additionally the flat header field <cmd:MdCollectionDisplayName> has been introduced to indicate by simple means the collection, of which given resource is part.
    187 This information can be used to generate a separate one-level grouping of the resources, in which the value from the <cmd:MdCollectionDisplayName> element would be used as the label of an otherwise undefined ore:ResourceMap.
     212Additionally, the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection of which the given resource is part.
     213This information can be used to generate a separate one-level grouping of the resources, in which the value from the \code{cmd:MdCollectionDisplayName} element would be used as the label of an otherwise undefined \code{ore:ResourceMap}.
    188214Even the identifier/URI for these collections is not clear. Although these collections should match the ResourceProxy hierarchy, there is no guarantee for this, thus a 1:1 mapping cannot be expected.
     215
    189216\todocode{check consistency for MdCollectionDisplayName vs. IsPartOf in the instance data}
    190217
    191 \begin{example}
     218\begin{example3}
    192219\_:mdcoll  & a   & ore:ResourceMap; \\
    193220 & rdfs:label & "Collection 1"; \\
    194 \_:mdcoll\#aggregation & a   & ore:Aggregation \\
     221\_:mdcoll\#aggreg & a   & ore:Aggregation \\
    195222 & ore:aggregates  & <lr1.cmd>, <lr2.cmd>; \\
    196 \end{example}
    197 
     223\end{example3}
    198224       
    199225\subsubsection{Components – nested structures}
    200226
    201 \begin{note}
    202 ?? Model (instance) components as blank nodes via objectProperty:
    203 \end{note}
    204 
    205 \begin{example}
     227There are two variants to express the tree structure of the CMD records, i.e. the containment relation between the components:
     228
     229\begin{enumerate}[a)]
     230\item the components are encoded as object properties
     231
     232\begin{example3}
    206233<lr1>  & cmd:Actor  & \_:Actor1 \\
    207234<lr1>  & cmd:Actor  & \_:Actor2 \\
     
    210237\_:Actor1  & cmd:role & "Interviewer" \\
    211238\_:Actor2 & cmd:role & "Speaker" \\
    212 \end{example}
    213 
    214 ?? or rather as Classes (and express the containement hierarchy with some extra predicate):
    215 \begin{example}
     239\end{example3}
     240
     241\item a dedicated object property is used
     242
     243\begin{example3}
    216244\_:Actor1  & a & cmd:Actor \\
    217245<lr1> & cmd:contains & \_:Actor1 \\
    218 \end{example}
    219 
    220 \subsubsection{Elements, Fields, Values}
    221 
    222 There are two steps to the modeling of the actual values in the fields of CMD records in RDF. The first one is to express the values as triples with literal values, then for selected fields – using the literal values – try to find corresponding entities in appropriate controlled vocabularies and generate new triples.
    223 There seems to need to be a separate property (predicate) for fields that are mapped to entities, like:
    224 
    225 \begin{example}
    226 <lr1> & cmd:Organisation & "MPI" \\
    227 <lr1> & cmd:Organisation\_? & <org1> \\
    228 \end{example}
    229 
    230 %\subsubsection{Literal Values}
    231 \paragraph{Literal Values}
    232 
    233 Usually, RDF-mapping of dublincore descriptions is to data properties (cf. OLAC-DcmiTerms profile )
    234 
    235 \begin{example}
    236 <lr1> & dct:title & "Language Resource 1"
    237 \end{example}
    238 
    239 Analogously, we could model isocat data categories  as data properties . Metadata elements referencing ISOcat datacategories could be encoded as follows:
    240 
    241 \begin{example}
    242 <lr1> & isocat:DC-2502 & "19th century"
    243 \end{example}
    244 
    245 However, Windhouwer (2012) argues against direct mapping of complex data categories to data properties, but proposes to rather model data categories as annotation properties.
    246 
    247 \begin{example}
    248 cmd:timeCoverage  & a   & cmd\_spec:Element \\
     246\end{example3}
     247
     248\end{enumerate}
     249
     250\subsection{Elements, Fields, Values}
     251Finally, we also want to integrate the actual field values in the CMD records into the ontology.
     252
     253\subsubsection{Predicates}
     254As explained before, CMD elements are typed as \code{rdf:Property} with the corresponding data category expressed as an annotation property:
     255
     256\begin{example3}
     257cmd:timeCoverage  & a   & cmds:Element \\
    249258cmd:timeCoverage  & dcr:datcat  & isocat:DC-2502 \\
    250259<lr1>  & cmd:timeCoverage  & "19th century" \\
    251 ...
    252 \end{example}
    253 
    254 This raises the vice-versa question, whether to rather handle all data categories uniformly, thus encoding dublincore terms also as annotation properties.
    255 
    256 %\subsubsection{Mapping to entities – Vocabularies  – CLAVAS}
    257 \paragraph{Mapping to entities – Vocabularies  – CLAVAS}
    258 
    259 A major (if not the main) motivation for the CMD to RDF mapping is the wish to have better control over  and better quality of values in metadata fields with constrained value domain like organization or resource type. As the allowed values for these fields often cannot be explicitly enumerated, it is not possible to restrict them by means of an XML schema. This leads to inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.)
    260 Thus, one goal of this work is to map (string) values in selected fields to entities defined in corresponding vocabularies. The main provider of relevant vocabularies is ISOcat and CLAVAS  – a service for managing and providing vocabularies in SKOS format. Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS and also other relevant vocabularies shall be ingested into this system, so that for our purposes we can assume OpenSKOS as the one source of vocabularies.
    261 Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies, but rather all the entities are \xne{skos:Concepts}:
    262 
    263 \begin{example}
     260
     261\end{example3}
     262
     263\subsubsection{Literal values -- data properties}
     264
     265Generating triples with literal values is straightforward:
     266
     267\begin{definition}{Literal triples}
     268lr:Resource \ \quad cmds:Property \ \quad xsd:string
     269\end{definition}
     270
     271\begin{example3}
     272<lr1> & cmd:Organisation & "MPI" \\
     273\end{example3}
     274
     275\subsubsection{Mapping to entities -- object properties}
     276
     277The more challenging but also more valuable aspect is to generate objectProperty triples with the literal values mapped to semantic entities:
     278
     279\begin{definition}{new RDF triples}
     280lr:Resource \ \quad cmd:Property \ \quad xsd:anyURI
     281\end{definition}
     282
     283\begin{example3}
     284<lr1> & cmd:Organisation\_? & <org1> \\
     285\end{example3}
     286
     287\begin{note}
     288Don't we need a separate property (predicate) for the triples with object properties pointing to entities,
     289i.e. \code{cmd:Organisation\_} in addition to \code{cmd:Organisation}?
     290\end{note}
     291
     292The mapping process is detailed in \ref{sec:values2entities}.
     293
     294%%%%%%%%%%%%%%%%%55
     295\section{Mapping field values to semantic entities}
     296\label{sec:values2entities}
     297
     298This task is a prerequisite for expressing the CMD instance data in RDF as well. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) matching the literal values in the metadata records. The obtained entity identifiers are then used to generate new RDF triples. It involves the following steps:
     299
     300\begin{enumerate}
     301\item identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task)
     302\item extract \emph{distinct data category, value pairs} from the metadata records
     303\item actual \textbf{lookup} of the individual literal values in given reference data (as indicated by the data category) to retrieve candidate entities, concepts
     304\item assess the reliability of the match
     305\item generate new RDF triples with entity identifiers as object properties
     306\end{enumerate}
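
As a minimal sketch of the second step above, the distinct (data category, value) pairs can be collected together with their frequencies, so that each pair needs to be looked up only once. The input structure, the function name \code{distinct_pairs} and all sample values other than those appearing in this chapter are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical flattened field values: (resource, data category or CMD element, literal).
field_values = [
    ("<lr1>", "cmd:Organisation", "MPI"),
    ("<lr2>", "cmd:Organisation", "MPI"),
    ("<lr2>", "cmd:Organisation", "MPI for Psycholinguistics"),  # assumed spelling variant
    ("<lr1>", "isocat:DC-2502", "19th century"),
]


def distinct_pairs(values):
    """Count every distinct (data category, literal value) pair across all records."""
    return Counter((category, literal) for _resource, category, literal in values)


pairs = distinct_pairs(field_values)
for (category, literal), count in sorted(pairs.items()):
    print(category, repr(literal), count)
```

The counts are not required by the lookup itself, but they indicate which label variants are most frequent and are therefore worth checking manually first.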
     307
     308\begin{figure*}[!ht]
     309\includegraphics[width=1\textwidth]{images/SMC_CMD2LOD}
     310\caption{Sketch of the process of transforming the CMD metadata records to an RDF representation}
     311\label{fig:smc_cmd2lod}
     312\end{figure*}
     313
     314\subsubsection{Identify vocabularies  – CLAVAS}
     315
     316\todoin{Identify related ontologies, vocabularies? - see DARIAH:CV}
     317LT-World \cite{Joerg2010}
     318
     319One generic way to indicate vocabularies for given metadata fields or data categories, which is being discussed in the CMD community, is to use a dedicated annotation property (tentatively \code{@clavas:vocabulary}) in the schema or data category definition. For such a mechanism to work, the consuming applications (like a metadata editor) need to be made aware of this convention and interpret it accordingly.
     320
     321The primary providers of relevant vocabularies are \xne{ISOcat} and \xne{CLAVAS} – a service for managing and providing vocabularies in SKOS format (cf. \ref{def:CLAVAS}). Closed and corresponding simple data categories are already being exported from ISOcat in SKOS format and imported into CLAVAS/OpenSKOS, and other relevant vocabularies are to be ingested into this system as well, so that we can assume OpenSKOS as a first source of vocabularies. However, certainly not all of the existing reference data will be hosted by OpenSKOS, so in general we have to consider a number of different sources (cf. \ref{refdata}).
     322
     323Data in OpenSKOS is modelled purely in SKOS, so there is no more specific typing of the entities in the vocabularies; rather, all the entities are of type \code{skos:Concept}:
     324
     325\begin{example3}
    264326<org1> & a   & skos:Concept \\
    265 \end{example}
    266 
    267 We may want to add some more typing and introduce classes for entities from individual vocabularies like clavas:Organization or similar.
    268 As far as CLAVAS will also maintain mappings/links to other datasets:
    269 
    270 \begin{example}
    271 <org1>   skos:exactMatch    <dbpedia/org1>, <lt-world/orgx>;
    272 \end{example}
    273 
     327\end{example3}
     328
     329\noindent
     330We may want to add some more typing and introduce classes for entities from individual vocabularies like \code{clavas:Organization} or similar. Insofar as CLAVAS will also maintain mappings/links to other datasets
     331
     332\begin{example3}
     333<org1> & skos:exactMatch  & <dbpedia/org1>, <lt-world/orgx>;
     334\end{example3}
     335
     336\noindent
    274337we could use it to expand the data with alternative identifiers, fostering the interlinking of data:
    275338
    276 \begin{example}
    277 <org1>   dcterms:identifier <org1>, <dbpedia/org1>, <lt-world/orgx>;
    278 \end{example}
    279 
    280 
    281 
    282 \paragraph{Mapping from strings to Entities}
    283 
    284 Find matching entities in selected Ontologies based on the textual values in the metadata records.
    285 
    286 
    287 Identify related ontologies:
    288 LT-World \cite{Joerg2010}
    289 
    290 task:
    291 \begin{enumerate}
    292 \item  express MDRecords in RDF
    293 \item  identify related ontologies/vocabularies (category $\rightarrow$ vocabulary)
    294 \item  use a lookup/mapping function (Vocabulary Alignement Service? CATCH-PLUS?)
    295 
    296 %\fbox{ function lookup: Category x String -> ConceptualDomain}
    297 \begin{eqnarray*}
    298 lookup(Category, Literal) \rightarrow ConceptualDomain??
    299 \end{eqnarray*}
    300 
    301 
    302 Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc.
    303 \end{enumerate}
    304 
    305 
    306 
    307 \subsection{RELcat - Ontological relations}
    308 Information in RELcat is already stored in RDF \cite{SchuurmanWindhouwer2011}.  One relation from the example relation set for CMDI :
    309 
    310 \begin{example}
    311 isocat:DC-2538 rel:sameAs dct:date
    312 \end{example}
    313 
    314 Should we generate the redundant triples based on the relations defined between data categories?  I.e.  if there is a relation and a resource has value:
    315 
    316 \begin{example}
    317 <lr1> isocat:DC-2538 2012^^xs:year
    318 \end{example}
    319 
    320 should we generate
    321 
    322 \begin{example}
    323 <lr1> dct:date 2012^^xs:year
    324 \end{example}
    325 
    326 ?
    327 
    328 What about other relations we may want to express? (Do we need them and if yes, where to put them? – still in RR?) Examples:
    329 
    330 \begin{example}
    331 cmd:MDCreator   & owl:subClassOf & dcterms:Agent \\
    332 clavas:Organization & owl:subClassOf & dcterms:Agent \\
    333 <org1> & a & clavas:Organization \\
    334 \end{example}
    335 
    336 
    337 
     339\begin{example3}
     340<org1>  & dcterms:identifier  & <org1>, <dbpedia/org1>, <lt-world/orgx>;
     341\end{example3}
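
\noindent
The expansion shown above can be sketched in a few lines of Python. This is a minimal illustration only, assuming that triples are held as plain string tuples; the function name \code{expand\_identifiers} is hypothetical and not part of CLAVAS or OpenSKOS.

```python
from typing import List, Tuple

# Assumption: a triple is a plain (subject, predicate, object) tuple of strings.
Triple = Tuple[str, str, str]


def expand_identifiers(triples: List[Triple]) -> List[Triple]:
    """Derive dcterms:identifier triples from skos:exactMatch links."""
    derived: List[Triple] = []
    seen = set()
    for subject, predicate, obj in triples:
        if predicate != "skos:exactMatch":
            continue
        # The entity's own IRI is listed as an identifier too,
        # mirroring the example above.
        for identifier in (subject, obj):
            triple = (subject, "dcterms:identifier", identifier)
            if triple not in seen:
                seen.add(triple)
                derived.append(triple)
    return derived
```

In a real setup the same derivation would more likely be expressed as a query or rule over the RDF store; the tuple-based version only makes the intended result explicit.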
     342
     343\subsubsection{Lookup}
     344
     345In abstract terms, the lookup function takes as input the identifier of a data category (or CMD element) and a literal string value and returns a list of potentially matching entities. Before the actual lookup, some string-normalizing preprocessing may be necessary.
     346
     347\begin{definition}{signature of the lookup function}
     348lookup \ ( \ DataCategory \ ,  \ Literal \ )  \quad \mapsto \quad ( \ Concept \ | \ Entity \ )*
     349\end{definition}
     350
     351In the implementation, additional initial configuration input is needed that identifies the datasets to search for a given data category;
     352this configuration is the result of the previous step.
     353
     354\begin{definition}{Required configuration data mapping data categories to available datasets}
     355DataCategory \quad \mapsto \quad Dataset+
     356\end{definition}
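
The two definitions above can be illustrated with a small Python sketch. It is not the actual implementation: the configuration table, the in-memory vocabulary, the dataset name \code{clavas-organizations} and the normalization rules are all assumptions made for the example.

```python
import unicodedata
from typing import Dict, List

# Hypothetical configuration: DataCategory -> Dataset+
DATASETS_FOR_CATEGORY: Dict[str, List[str]] = {
    "organization": ["clavas-organizations"],
}

# Hypothetical in-memory vocabularies: dataset -> {normalized label: entities}
VOCABULARIES: Dict[str, Dict[str, List[str]]] = {
    "clavas-organizations": {
        "academy of sciences": ["<org1>", "<org2>"],
    },
}


def normalize(literal: str) -> str:
    """Strip accents, fold case and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", literal)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_marks.casefold().split())


def lookup(data_category: str, literal: str) -> List[str]:
    """Return all candidate entities for a (data category, literal) pair."""
    key = normalize(literal)
    candidates: List[str] = []
    # Only the datasets configured for this data category are searched.
    for dataset in DATASETS_FOR_CATEGORY.get(data_category, []):
        candidates.extend(VOCABULARIES.get(dataset, {}).get(key, []))
    return candidates
```

A call such as \code{lookup("organization", "Academy of  Sciences")} returns every entity whose normalized label matches. An unconfigured data category yields an empty list rather than an error, which keeps the function total over arbitrary input.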
     357
     358As for the implementation, in the initial setup the system could resort to the \code{find}-interface provided by \xne{OpenSKOS}.
     359However, in the long term a more general solution is required: a kind of hybrid \emph{vocabulary proxy service} that allows searching across a number of datasets, many of them distributed and available via different interfaces. Figure \ref{fig:vocabulary_proxy} sketches the general setup. The service has to be able to a) proxy search requests to a number of search interfaces (SRU, SPARQL), and b) fetch, cache and search in datasets.
     360
     361\begin{figure*}[!ht]
     362\includegraphics[width=1\textwidth]{images/VocabularyProxy_clientapp}
     363\caption{Sketch of a general setup for vocabulary lookup via a \xne{VocabularyProxy} service}
     364\label{fig:vocabulary_proxy}
     365\end{figure*}
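
A minimal Python sketch of such a proxy follows. It is an illustration under stated assumptions, not a proposed design: every backend (SRU, SPARQL, or a locally fetched dataset) is reduced to a callable that takes a query string and returns matching entities, the class name \code{VocabularyProxy} is taken from the figure, and all method names are hypothetical. For brevity it caches query results per dataset instead of caching whole datasets.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: each search interface is wrapped as a function
# from query string to a list of matching entities.
Backend = Callable[[str], List[str]]


class VocabularyProxy:
    """Forward search requests to registered backends and cache results."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}
        self._cache: Dict[Tuple[str, str], List[str]] = {}

    def register(self, dataset: str, backend: Backend) -> None:
        """Make a dataset searchable through the given backend."""
        self._backends[dataset] = backend

    def search(self, datasets: List[str], query: str) -> Dict[str, List[str]]:
        """Search the requested datasets; unknown datasets are skipped."""
        results: Dict[str, List[str]] = {}
        for dataset in datasets:
            backend = self._backends.get(dataset)
            if backend is None:
                continue
            key = (dataset, query)
            if key not in self._cache:
                # Only the first request for a query reaches the backend.
                self._cache[key] = backend(query)
            results[dataset] = self._cache[key]
        return results
```

Hiding the protocol behind a callable keeps the client application independent of whether a dataset is reached via SRU, SPARQL or a local copy.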
     366
     367\subsubsection{Candidate evaluation}
     368The lookup is the most sensitive step in the process, as that is the gate between strings and semantic entities. In general, the resulting candidates cannot be seen as reliable and should undergo further scrutiny to ensure that the match is semantically correct.
     369
     370One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in a given resource description.
     371
     372In some situations these ambiguities can be resolved algorithmically, but in many cases human curation of the generated data will ultimately be required. In this respect, it is worth noting that the CLARIN search engine VLO provides a feedback link that allows even ordinary users to report problems or inconsistencies in CMD records.
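
One possible heuristic of this kind is sketched below in Python. It is an assumption-laden illustration, not the evaluation procedure of this work: candidates and the record context are modelled as flat dictionaries, the attribute names (such as \code{country}) are hypothetical, and a candidate is accepted only if it matches the context strictly better than all others. Everything else is left for human curation.

```python
from typing import Dict, List, Optional


def disambiguate(
    candidates: List[Dict[str, str]], context: Dict[str, str]
) -> Optional[str]:
    """Return the id of the single best candidate, or None if undecided."""

    def score(candidate: Dict[str, str]) -> int:
        # Count context attributes that the candidate shares.
        return sum(
            1 for key, value in context.items() if candidate.get(key) == value
        )

    ranked = sorted(
        ((score(c), c["id"]) for c in candidates), reverse=True
    )
    if not ranked or ranked[0][0] == 0:
        return None  # no candidates, or no supporting evidence at all
    if len(ranked) > 1 and ranked[0][0] == ranked[1][0]:
        return None  # tie between the top candidates: needs human curation
    return ranked[0][1]
```

Returning \code{None} instead of guessing keeps unreliable matches out of the generated data and marks exactly those cases that need manual review.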
     373
     374
     375%%%%%%%%%%%%%%%%%%%%%
    338376\section{SMC LOD - Semantic Web Application}
     377\label{sec:lod}
    339378
    340379\todoin{read: Europeana RDF Store Report}
    341380
     381Technical aspects (RDF-store?): Virtuoso
     382
    342383\todocode{install Jena +  fuseki}\furl{http://jena.apache.org}\furl{http://jena.apache.org/documentation/serving_data/index.html}\furl{http://csarven.ca/how-to-create-a-linked-data-site}
    343384
     
    345386
    346387\todocode{check install siren}\furl{http://siren.sindice.com/}
     388
     389
     390\todocode{check/install: raptor for generating dot out of rdf}\furl{http://librdf.org/raptor/}
     391
     392\todocode{check/install: Linked Data browser: LoD p. 81; Haystack}\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
     393
     394 / interface (ontology browser?)
    347395
    348396semantic search component in the Linked Media Framework
     
    353401
    354402\section {Full semantic search - concept-based + ontology-driven ?}
     403\label{semantic-search}
    355404
    356405With the new enhanced dataset, as detailed in section \ref{ch:cmd2rdf}, the groundwork is laid for the full-blown semantic search as proposed in the original goals, i.e. the possibility for ontology-driven or at least `semantic resources assisted' exploration of the dataset.
    357406
    358407Namely to enhance it by employing ontological resources.
    359 Mainly this enhancement shall mean, that the user can access the data indirectly by browsing one or multiple  ontologies, with which the data will then be linked. These could be for example ontologies of Organizations and Projects.
    360 
     408This enhancement mainly means that the user can access the data indirectly by browsing one or more ontologies with which the data will be linked. These could be, for example, ontologies of Organizations and Projects.
     409
     410
     411SPARQL
     412
     413rechercheisidore, dbpedia, ...
    361414
    362415\section{Summary}
    363 
    364 
    365 
     416In this chapter, an expression of the entire CMD data domain in RDF was proposed, with special focus on how to translate the string values in metadata fields to corresponding semantic entities. Additionally, some technical considerations were discussed regarding the exposure of this dataset as Linked Open Data and the implications for real semantic, ontology-based data exploration.
     417