Changeset 3843 for CMDI-Interoperability


Timestamp:
10/22/13 18:23:50 (11 years ago)
Author:
mwindhouwer
Message:

M 2014-LREC/CMD2RDF.tex

  • filled the CMDI section
  • some other stuff

(Greetings from Frankfurt airport :-)

File:
1 edited

  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex

    r3837 r3843  
    8181\section{Motivation}
    8282%
    83 Although semantic interoperability has been one of the main motivation for CLARIN Component Metadata Infrastructure, until now there has been no work on the obvious -- bringing CMDI to Semantic Web. We believe that providing the whole of CMD data  as Linked Open Data linked with external semantic resources, will open a whole new level of processing and exploring of the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF and interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies).
     83Although semantic interoperability has been one of the main motivations for CLARIN Component Metadata Infrastructure, until now there has been no work on the obvious -- bringing CMDI to Semantic Web. We believe that providing the whole of CMD data  as Linked Open Data linked with external semantic resources, will open a whole new level of processing and exploring of the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF and interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies).
    8484%This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data.
    8585
     
    8787\section{The Component Metadata Infrastructure}\label{CMDI}
    8888%
    89 ?
     89
     90The natural building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components. A coherent component, e.g., a component to capture information on a contact person or one for project information, can be reused and is stored for that purpose in the Component Registry (CR). A metadata modeller selects components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile can be used as the schema for a metadata record. CLARIN centers offer these CMD records to the joint metadata domain. There are some generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observatory. These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used registries are the Dublin Core metadata elements and terms and the ISOcat Data Category Registry. These concept links allow profiles, while being diverse in structure, to share semantics. Generic tools can use these semantics to overcome differences in terminology and also in structure.
     91
     92\commentx{Menzo: would be nice to include one of the UML diagrams.}
    9093
    9194%
    9295\subsection{Current status of the joint CMD Domain}
    9396%
    94 To provide a frame of reference for the proportions of the undertaking in the following section, a few numbers about the data in the CMD domain, both on the schema level, i.e. with regard to the defined profiles and data categories used, as well as on the instance level, the actual CMD records.
     97To provide a frame of reference for the proportions of the undertaking, this section gives a few numbers about the data in the CMD domain, both on the schema level, i.e. with regard to the defined profiles and data categories used, as well as on the instance level, the actual CMD records.
    9598
    9699\subsubsection{CMD Profiles }
    97 In the CR 133\footnote{All numbers are as of 2013-09 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time.
    98 
     100In the CR 133\footnote{All numbers are as of 2013-09 if not stated otherwise} public Profiles and 696 Components are defined.
    99101Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles prove the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 117 components and 337 elements.
    100102
     
    106108On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different profiles.) So we encounter both situations: one profile being used by many providers and one provider using many profiles.
    107109
    108 
    109110%
    110111\section{LOD -- Linked Open Data}
     
    114115in linking linguistic data \cite{ldl2012}, that renders an obvious pool of candidate
    115116datasets to link the CMD data with\footnote{\url{http://linguistics.okfn.org/resources/llod/}}.
    116 Within these \xne{lexvo} seems the most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. with the ISO-639-3 language identifiers, as they are used in CMD data.
     117Within these \xne{lexvo} seems the most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. for the ISO-639-3 language identifiers which  are also used in CMD records.
    117118\xne{lexvo} also seems suitable as it is already linked with a number of LDL datasets among others \xne{WALS}, \xne{lingvoj}, \xne{Glottolog}.
    118119Of course, language is just one dimension to use for linking/mapping.
     
    121122but also domain-specific semantic resource like the ontology for language technology \xne{LT-World} \cite{Joerg2010} developed at DFKI.
    122123
    123 
    124124\section{CMD to RDF}
    125125\label{sec:cmd2rdf}
    126 In the following, RDF encoding is proposed for all levels of the CMD data domain:
    127 
     126In the following an RDF encoding is proposed for all levels of the CMD data domain:
    128127\begin{itemize}
    129128\item CMD meta model
     
    135134\subsection{CMD specification}
    136135
    137 The main entity of the meta model is the CMD component modelled as \code{rdfs:Class}. CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to an RDF property (as it holds the literal value), but given its complexity (e.g. attributes, relation to the containing component)  it too has to be a \code{rdfs:Class}. The actual literal value is a property of given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external vocabularies/ semantic resources, the references to these entities are expressed in parallel properties of type \code{cmdm:ElementEntity}. The attributes are modelled analogously with \code{cmdm:Attribute, cmdm:AttributeValue, cmdm:AttributeEntity}.
     136The main entity of the meta model is the CMD component modelled as a \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to an RDF property (as it holds the literal value), but given its complexity (e.g. attributes, relation to the containing component)  it too has to be a \code{rdfs:Class}. The actual literal value is a property of given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external vocabularies/ semantic resources, the references to these entities are expressed in parallel properties of type \code{cmdm:hasElementEntity}. The attributes are modelled analogously with \code{cmdm:Attribute, cmdm:hasAttributeValue, cmdm:hasAttributeEntity}.
    138137
    139138The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}, attributes of individual components and elements are bound with \code{cmdm:containsAttribute}.
     
    173172              & rdfs:range & :Entity .   \\
    174173 \\
     174\multicolumn{3}{l}{\# analogue for attributes ...}  \\
    175175%cmdm:hasAttributeValue & a & rdf:Property ;  \\
    176176%              & rdfs:domain & cmdm:Attribute ;  \\