Changeset 3837 for CMDI-Interoperability


Timestamp:
10/22/13 12:00:38
Author:
vronk
Message:

further reduced paper (stripped: relcat and mapping values to entities)

File:
1 edited

  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex

    r3818 r3837  
    110110\section{LOD -- Linked Open Data}
    111111%
    112 Linked Data\cite{TimBL2006}, RDF\cite{RDF2004}
    113 
    114 dbpedia, Yago - huge compiled knowledgebases to link to...
    115 
    116 Ontology for Language Technology: LT-World \cite{Joerg2010}
    117 
    118 LOD cloud Cyganiak and Jentzsch\cite{Cyganiak2010}.
     112The main added value of LOD\cite{TimBL2006} is the interconnection of disparate datasets.
     113In the broader context of LOD, a subcommunity has emerged that specializes
     114in linking linguistic data \cite{ldl2012}; it offers an obvious pool of candidate
     115datasets to link the CMD data with\footnote{\url{http://linguistics.okfn.org/resources/llod/}}.
     116Among these, \xne{lexvo} seems the most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. with the ISO-639-3 language identifiers, as they are used in CMD data.
     117\xne{lexvo} also seems suitable because it is already linked with a number of LDL datasets, among others \xne{WALS}, \xne{lingvoj} and \xne{Glottolog}.
     118Of course, language is just one dimension to use for linking/mapping.
     119Step by step, we will link other categories like countries, geolocations, organisations, etc.
     120to some of the central nodes of the LOD cloud \cite{Cyganiak2010}, like \xne{dbpedia}, \xne{Yago} or \xne{geonames},
     121but also to domain-specific semantic resources like the ontology for language technology \xne{LT-World} \cite{Joerg2010} developed at DFKI.
    119122
    120123
     
    181184\noindent
    182185These entities are used for modelling the actual profiles, components and elements as they are defined in the Component Registry.
    183 For stand-alone/top components, the IRI is the exact path into the CR to get the RDF representation for the profile/component. For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the parent top component IRI and dot-path to given component/element (Actor: \code{cr:clarin.eu:cr1:c\_1271859438197/rdf\#Actor.Actor\_Languages.Actor\_Language}).\footnote{For the sake of readability, we will collapse the component IRIs, stripping them of the identifier and referring to them just by their name, prefixed with \code{cmd:}.}
    184 
    185 \label{table:rdf-cmd}
     186For stand-alone/top components, the IRI\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf} is the exact path into the CR to get the RDF representation for the profile/component. For ``inner'' components (that are defined as part of another component) and elements, the identifier is a concatenation of the parent top component IRI and the dot-path to the given component/element (Actor: \code{cr:clarin.eu:cr1:c\_1271859438197/rdf\#Actor.Actor\_Languages.Actor\_Language}).\footnote{For the sake of readability, we will collapse the component IRIs and refer to them just by their name, prefixed with \code{cmd:}.}
     187
    186188\begin{example3}
    187189cmd:collection & a & cmdm:Profile; \\
    188190 & rdfs:label & "collection"; \\
    189191 & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\
    190 cr:clarin.eu:cr1:c\_1271859438197\#Actor  \\
     192cmd:Actor  \\
    191193& a &cmdm:Component. \\
    192194\end{example3}
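
The identifier scheme described above (top component IRI, plus a dot-path fragment for inner components and elements) can be illustrated with a minimal Python sketch. It is not part of the changeset. The function names are hypothetical, and the sketch assumes that the \code{cr:} prefix expands to the Component Registry REST base URL quoted in the text.

```python
from typing import Sequence

# Assumption: the cr: prefix expands to this Component Registry base URL,
# as suggested by the IRI quoted in the text.
CR_BASE = "http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/"


def top_component_iri(component_id: str) -> str:
    """IRI of a stand-alone/top component: the path to its RDF representation."""
    return f"{CR_BASE}{component_id}/rdf"


def inner_iri(top_component_id: str, path: Sequence[str]) -> str:
    """IRI of an inner component or element.

    Concatenates the parent top component IRI and the dot-path
    (component names joined by '.') as a fragment identifier.
    """
    if not path:
        raise ValueError("path must contain at least one component name")
    return f"{top_component_iri(top_component_id)}#{'.'.join(path)}"


if __name__ == "__main__":
    print(inner_iri("clarin.eu:cr1:c_1271859438197",
                    ["Actor", "Actor_Languages", "Actor_Language"]))
```

Called with the Actor example from the text, the helper yields the expanded form of \code{cr:clarin.eu:cr1:c\_1271859438197/rdf\#Actor.Actor\_Languages.Actor\_Language}.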
    193195
    194196\commentx{Menzo: we need more context for inner components. In the example LanguageName looks well defined, but take a Component/Element like Title. Is it the title of a book or the title of a person. Only when the semantics are clear, e.g., with a dcr:datcat, one can ignore the context and collapse all Components/Elements to a single RDF class/property.}
    195 
    196 \begin{notex}
    197 Menzo: inner components don't have IDs so I propose a path build from the context up to a shareable component (we need some nice term for that, in the TDS I called it a top notion so maybe a top component. The cmd prefix also needs to be bound to a component specific URI. This URI contains the top component ID, e.g., \furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}.
    198 \end{notex}
    199 \commentx{Matej: see above - treated sufficiently?}
    200197
    201198\subsection{Data Categories}
     
    217214
    218215
    219 \subsection{RELcat - Ontological relations}
    220 \commentx{for now we could probably skip all of relcat (although it is the future of semantic mapping ;) - we spare something for the next paper.}
    221 
    222 As described in \ref{CMDI}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module the Relation Registry \xne{RELcat}. The relations here are grouped into relation sets and stored as RDF triples \cite{SchuurmanWindhouwer2011}. A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
     216%\subsection{RELcat - Ontological relations}
     217% \commentx{for now we could probably skip all of relcat (although it is the future of semantic mapping ;) - we spare something for the next paper.}
     218
     219As described in \ref{CMDI}, relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in a dedicated module, the Relation Registry \xne{RELcat}, as RDF triples \cite{SchuurmanWindhouwer2011} with dedicated predicates (like \code{rel:*}). In the final paper, we will provide more details on the role of this important building block in the endeavour.
     220
     221\begin{comment}
     222A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
    223223
    224224\begin{example3}
     
    232232rel:sameAs & rdfs:subPropertyOf & owl:sameAs
    233233\end{example3}
    234 
    235 \commentx{Menzo: I would use owl:sameAs rdfs:subPropertyOf rel:sameAs. I see the rel:* properties as an upper layer of a taxonony of relation types. The RELcat types are loose and the OWL ones specific, hence the subtyping. In RELcat you might also query multiple graphs with multiple vocabularies various 'same-as' properties then still need to be distinguishable but the general rel:sameAs need to be created.}
    236 
     234\end{comment}
    237235
    238236%%%%%%%%%%%%%%%%%%%%%
     
    243241with \code{cmdm:hasResourceType} and  \code{cmdm:hasMimeType} predicates to type the resources.
    244242
    245 \commentx{not sure about the use of \code{cmdm:Resource}. both as rdf:type and :hasResourcetype? - see comment in general.ttl}
    246 
    247243\begin{example3}
    248244<lr1> & a  & cmdm:Resource; \\
    249245& cmdm:hasResourceType & cmdm:Resource. \\
    250 <lr1.cmd> & a  & cmdm:Resource; \\
    251 & cmdm:hasResourceType & cmdm:Metadata. \\
    252246\end{example3}
    253247
     
    257251If identifiers are present for both resource and metadata, the relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}.
    258252(Note that one MD record can describe multiple resources. This can also be easily accommodated in OpenAnnotation.)
    259 
    260 \commentx{Menzo: also there can be multiple resource proxies. Maybe we can use an RDF list?}
    261253
    262254\begin{example3}
     
    325317As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed as a \code{cmds:ElementValue} property and the corresponding data category expressed as an annotation property.
    326318
    327 While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples (predicate \code{cmdm:hasElementEntity}) with the literal values mapped to semantic entities. Following example shows the whole chain of statements from metamodel to literal value and corresponding semantic entity. The mapping process is detailed in next section \ref{sec:values2entities}.
     319While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples (predicate \code{cmdm:hasElementEntity}) with the literal values mapped to semantic entities. The following example shows the whole chain of statements from metamodel to literal value and corresponding semantic entity.
     320
     321The actual mapping from values to entities is a complex and challenging task that will be tackled in more detail in the full paper. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) corresponding to the literal values in the metadata records. The obtained entity identifiers are then used to generate new RDF triples, representing outbound links.
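
The lookup idea in the preceding paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions, not the mapping procedure of the paper: the reference dataset is a toy in-memory dictionary, all \code{example.org} IRIs and function names are hypothetical, and ambiguous matches are simply skipped rather than resolved by further heuristics.

```python
def normalize(value: str) -> str:
    """Lower-case the literal and collapse whitespace before lookup."""
    return " ".join(value.lower().split())


# Hypothetical reference dataset: (category, normalized literal) -> candidate entity IRIs.
REFERENCE = {
    ("language", "english"): ["http://example.org/entity/lang/eng"],
    ("organisation", "academy of sciences"): [
        "http://example.org/entity/org/academy-a",
        "http://example.org/entity/org/academy-b",
    ],
}


def link_values(records, reference):
    """Generate (subject, predicate, object) triples for unambiguous matches.

    records: iterable of (subject, category, literal) tuples taken from
    metadata records. Literals with zero or several candidates yield no triple.
    """
    triples = []
    for subject, category, literal in records:
        candidates = reference.get((category, normalize(literal)), [])
        if len(candidates) == 1:
            triples.append((subject, "cmdm:hasElementEntity", candidates[0]))
    return triples


if __name__ == "__main__":
    records = [
        ("<lr1>", "language", "  English "),
        ("<lr1>", "organisation", "Academy of Sciences"),  # ambiguous, skipped
    ]
    for triple in link_values(records, REFERENCE):
        print(triple)
```

Skipping ambiguous literals keeps the generated outbound links conservative; a real implementation would need a disambiguation step for values that match several entities.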
     322
    328323
    329324\begin{example3}
     
    350345
    351346
     347\begin{comment}
    352348%%%%%%%%%%%%%%%%%
    353349\section{Mapping field values to semantic entities}
     
    414410%One example: A lookup with the pair \code{<organization, "Academy of sciences">} would probably return a list of organizations, as there is a national Academy of Sciences, in a number of countries. It would require further heuristics, e.g. checking the corresponding department, contact or -- less reliably -- the language of the described resource, to determine which specific Academy of Sciences is meant in given resource description.
    415411
     412\end{comment}
     413
    416414\section{Implementation}
    417415
     
    424422
    425423\section{Conclusions and Future Work}
    426 In this abstract, we sketched the work on encoding of the whole of the CMD data domain in RDF, with special focus on the core model -- the general component schema and the mapping of element values to semantic entities. Additionally, some technical considerations were discussed regarding exposing this dataset as Linked Open Data.
     424In this abstract, we sketched the work on encoding the whole of the CMD data domain in RDF, with special focus on the core model -- the general component schema. In the full paper we will also elaborate on the task of mapping element values to semantic entities. Additionally, some technical considerations will be discussed regarding exposing this dataset as Linked Open Data.
    427425
    428426With this new enhanced dataset, the groundwork is laid for a full-blown \emph{semantic search}, i.e. the possibility of exploring the dataset indirectly using external semantic resources (like vocabularies of organizations or taxonomies of resource types) to which the CMD data will then be linked.