Changeset 4833 for CMDI-Interoperability


Timestamp:
03/28/14 09:39:11 (10 years ago)
Author:
Menzo Windhouwer
Message:

M 2014-LREC/CMD2RDF.bib

  • added Thorsten's paper

M 2014-LREC/CMD2RDF.tex

  • fixed some minor issues
  • replace <lr0.agg> by _:agg0
  • extended section 5
Location:
CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC
Files:
2 edited

  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.bib

    r4463 r4833  
    619619}
    620620
     621@incollection{Zinn+2012,
     622year={2012},
     623isbn={978-3-642-30283-1},
     624booktitle={The Semantic Web: Research and Applications},
     625volume={7295},
     626series={Lecture Notes in Computer Science},
     627editor={Simperl, Elena and Cimiano, Philipp and Polleres, Axel and Corcho, Oscar and Presutti, Valentina},
     628doi={10.1007/978-3-642-30284-8_26},
     629title={The ISOcat Registry Reloaded},
     630url={http://dx.doi.org/10.1007/978-3-642-30284-8_26},
     631publisher={Springer Berlin Heidelberg},
     632author={Zinn, Claus and Hoppermann, Christina and Trippel, Thorsten},
     633pages={285-299}
     634}
     635
    621636@comment{jabref-meta: selector_publisher:}
    622637
  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex

    r4829 r4833  
    7272
    7373\abstract{In the European CLARIN infrastructure a growing number of resources are described with Component Metadata. In this paper we
    74 describe a transformation to make this metadata available as linked data. After this first step it becomes possible to connect the CLARIN Component Metadata with other valuable knowledge sources in the Linked Data Cloud. \\ \newline \Keywords{Linked Open Data, RDF, metadata}}
     74describe a transformation to make this metadata available as linked data. After this first step it becomes possible to connect the CLARIN Component Metadata with other valuable knowledge sources in the Linked Data Cloud. \\ \newline \Keywords{Linked Open Data, RDF, component metadata}}
    7575
    7676%
     
    107107
    108108\subsubsection{CMD Profiles }
    109 Currently 146 public profiles and 857 components are defined in the CR.
     109Currently\footnote{All numbers are as of 2014-03.} 153 public profiles and 859 components are defined in the CR.
    110110Next to the `native' ones a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema.
    111111%The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel.
    112 The individual profiles differ also very much in their structure -- next to simple flat profiles
     112The individual profiles also differ very much in their structure -- next to simple flat profiles
    113113%with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles)
    114114there are complex ones with up to 10 levels %(\textit{ExperimentProfile}, profiles for describing Web Services)
     
    119119
    120120The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
    121 regularly collects records from the -- currently 58 -- providers, all in all over 600.000 records.
    122 Some 20 of the providers offer CMDI records, the rest provides around 140.000 OLAC/DC records, that are converted into the corresponding CMD profile.
     121regularly collects records from the -- currently 56 -- providers, all in all over 600.000 records.
     122Some 20 of the providers offer CMDI records, the rest provides around 44.000 OLAC/DC records, that are converted into the corresponding CMD profile.
    123123%Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
    124124%On the other hand, some
    125 Some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different ones), so that overall instance data for more than 60 profiles is present.
     125Some of the providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different ones), so that overall instance data for more than 60 profiles is present.
    126126%So we encounter both situations: one profile being used by many providers and one provider using many profiles.
    127127
     
    138138\subsection{CMD specification}\label{sec:CMDM}
    139139
    140 The main entity of the meta model is the CMD component modelled as a \code{rdfs:Class} (see Figure \ref{fig:CMDM-RDF}). A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to a RDF property (as it holds the literal value), but given its complexity (e.g., attributes,\footnote{Although the encoding has been done, due to space considerations, we will not further discuss attributes.} relation to the containing component)  it too has to be expressed as \code{rdfs:Class}. The actual literal value is a property of given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external semantic resources, the references to these entities are expressed in parallel object properties of type \code{cmdm:hasElementEntity} (constituting outbound links). The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}.
     140The main entity of the meta model is the CMD component modelled as a \code{rdfs:Class} (see Figure \ref{fig:CMDM-RDF}). A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to a RDF property (as it holds the literal value), but given its complexity, i.e., attributes,\footnote{Although the modelling work has been done, due to space considerations, we will not further discuss attributes.} it too has to be expressed as a \code{rdfs:Class}. The actual literal value is a property of given element of type \code{cmdm:hasElementValue}. For values that can be mapped to entities defined in external semantic resources, the references to these entities are expressed in parallel object properties of type \code{cmdm:hasElementEntity} (constituting outbound links). The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}.
    141141
    142142\begin{figure*}
     
    192192\subsection{CMD profile and component definitions}
    193193These top-level classes and properties are subsequently used for modelling the actual profiles, components and elements as they are defined in the CR.
    194 For stand-alone components, the IRI is the (future) path into the CR to get the RDF representation for the profile/component.\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438125/rdf} For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the nearest ancestor stand-alone component's IRI and the dot-path to given component/element (e.g., Actor:\\ \code{cr:clarin.eu:cr1:c\_1271859438197/rdf \#Actor.Actor\_Languages.Actor\_Language}\footnote{For the sake of readability, in the examples we will collapse the component IRIs, referring to them just by their name, prefixed with \code{cmd:}})
     194For stand-alone components, the IRI is the (future) path into the CR to get the RDFS representation for the profile/component.\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438125/rdf} For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the nearest ancestor stand-alone component's IRI and the dot-path to given component/element (e.g., Actor:\\ \code{cr:clarin.eu:cr1:c\_1271859438197/rdf \#Actor.Actor\_Languages.Actor\_Language}\footnote{For the sake of readability, in the examples we will collapse the component IRIs, referring to them just by their name, prefixed with \code{cmd:}})
    195195
    196196\begin{example2}
     
    204204
    205205\subsubsection{Data Categories}
    206 The primary concept registry in use by CMDI for its concept links is ISOcat. The recommended approach to link to the data categories is via an annotation property \cite{Windhouwer2012_LDL}.
     206The primary concept registry in use by CMDI is ISOcat. The recommended approach to link to the data categories is via an annotation property \cite{Windhouwer2012_LDL}.
    207207
    208208\begin{example2}
     
    221221\end{example2}
    222222
     223Later on, this information can be used, e.g., in combination with ontological relationships for these data categories available in the RELcat Relation Registry \cite{WINDHOUWER12.954}, to map to other vocabularies.
     224
    223225%\subsection{RELcat - Ontological relations}
    224226% \commentx{for now we could probably skip all of relcat (although it is the future of semantic mapping ;) - we spare something for the next paper.}
     
    289291\begin{example3}
    290292\textless lr0.cmd \textgreater  & a   & ore:ResourceMap . \\
    291 \textless lr0.cmd\textgreater & ore:describes & \textless  lr0.agg\textgreater . \\
    292 \textless lr0.agg\textgreater & a   & ore:Aggregation ; \\
     293\textless lr0.cmd\textgreater & ore:describes & \_:agg0 . \\
     294\_:agg0 & a   & ore:Aggregation ; \\
    293295& ore:aggregates  & \textless lr1.cmd\textgreater, \textless lr2.cmd\textgreater . \\
    294296\end{example3}
     
    308310       
    309311\subsubsection{Components -- nested structures}
    310 For expressing the tree structure of the CMD records, i.e. the containment relation between the components a dedicated property \code{cmd:contains} is used:
     312For expressing the tree structure of the CMD records, i.e., the containment relation between the components, a dedicated property \code{cmd:contains} is used:
    311313
    312314\begin{example3}
     
    359361
    360362\subsubsection{Elements, Fields, Values}\label{sec:values}
    361 Finally, we want to integrate also the actual field values in the CMD records into the linked data.
    362 As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed as \code{cmds:ElementValue}, and they are related by a \code{cmdm:hasElementValue} property.
     363Finally, we also want to integrate the actual values in the CMD records into the linked data.
     364As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed as \code{cmdm:ElementValue}, and they are related by a \code{cmdm:hasElementValue} property.
    363365
    364366While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples (predicate \code{cmdm:hasElementEntity}) with the literal values mapped to semantic entities. The example in Figure \ref{fig:final-example} shows the whole chain of statements from metamodel to literal value and corresponding semantic entity.
     
    442444The main added value of LOD \cite{TimBL2006} is the interconnecting of disparate datasets in the so called LOD cloud \cite{Cyganiak2010}.
    443445
    444 The actual mapping process from CMDI values (see Section \ref{sec:values}) to entities is a complex and challenging task. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) corresponding to the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples, representing outbound links.
    445 
    446 In the broader context of LOD Cloud there is the Open Knowledge Foundation’s Working Group on Linked Data in Linguistics, that represents an obvious pool of candidate
     446The actual mapping process from CMDI values (see Section \ref{sec:values}) to entities is a complex and challenging task. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) corresponding to the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples, representing outbound links. Within CMDI the SKOS-based vocabulary service CLAVAS,\furl{https://openskos.meertens.knaw.nl/} which will be supported in the upcoming new version of CMDI, can be used as a starting point, e.g., for organisations. In the broader context of LOD Cloud there is the Open Knowledge Foundation’s Working Group on Linked Data in Linguistics, that represents an obvious pool of candidate
    447447datasets to link the CMD data with.\furl{http://linguistics.okfn.org/resources/llod/}  Within these \xne{lexvo} seems a most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. based on the ISO-639-3 language identifiers which are also used in CMD records.
    448448\xne{lexvo} also seems suitable as it is already linked with a number of other LOD linguistic datasets like \xne{WALS}, \xne{lingvoj} and \xne{Glottolog}.
     
    452452but also to domain-specific semantic resource like the ontology for language technology \xne{LT-World} \cite{Joerg2010} developed at DFKI.
    453453
     454Next to entities, predicates can also be shared across datasets. The CMD Infrastructure already provides facilities in the form of ISOcat and RELcat. RELcat, for example, already has sets that relate data categories to Dublin Core terms. This can be extended with the ontology for metadata concepts described in \cite{Zinn+2012}, which does not provide common predicates but would allow more generic or more specific searches.
     455
    454456\section{Conclusions}
    455457In this paper, we sketched the work on encoding of the whole of the CMD data domain in RDF, with special focus on the core model -- the general component schema. In the future we will extend this with mapping element values to semantic entities.