Changeset 4829 for CMDI-Interoperability


Timestamp:
03/26/14 08:27:31
Author:
Menzo Windhouwer
Message:

M 2014-LREC/CMD2RDF.tex

  • small fixes as suggested by reviewers
File:
1 edited

  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex

    r4463 r4829  
    7171               matej.durco@oeaw.ac.at, menzo.windhouwer@dans.knaw.nl\\}
    7272
    73 \abstract{In the european CLARIN infrastructure a growing number of resources are described with Component Metadata. In this paper we
     73\abstract{In the European CLARIN infrastructure a growing number of resources are described with Component Metadata. In this paper we
    7474describe a transformation to make this metadata available as linked data. After this first step it becomes possible to connect the CLARIN Component Metadata with other valuable knowledge sources in the Linked Data Cloud. \\ \newline \Keywords{Linked Open Data, RDF, metadata}}
    7575
     
    8181\section{Motivation}
    8282%
    83 Although semantic interoperability has been one of the main motivations for CLARIN's Component Metadata Infrastructure (CMDI) \cite{Broeder+2010} \furl{http://www.clarin.eu/cmdi/}, until now there has been no work on the obvious -- bringing CMDI to the Semantic Web. We believe that providing the CLARIN CMD records as Linked Open Data (LOD) interlinked with external semantic resources, will open up new dimensions of processing and exploring of the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD data domain can be expressed in RDF and made ready to be interlinked with existing external semantic resources (ontologies, taxonomies, knowledge bases,  vocabularies).
     83Although semantic interoperability has been one of the main motivations for CLARIN's Component Metadata Infrastructure (CMDI) \cite{Broeder+2010},\furl{http://www.clarin.eu/cmdi/} until now there has been no work on the obvious -- bringing CMDI to the Semantic Web. We believe that providing the CLARIN CMD records as Linked Open Data (LOD) interlinked with external semantic resources, will open up new dimensions of processing and exploring of the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD data domain can be expressed in RDF and made ready to be interlinked with existing external semantic resources (ontologies, taxonomies, knowledge bases,  vocabularies).
    8484%This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data.
    8585
     
    8888%
    8989
    90 The basic building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components (see Figure \ref{fig:CMDM}). Components are stored in the Component Registry (CR), where they can be reused by other modellers. Thus a metadata modeller selects or creates components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile serves as blueprint for a schema for metadata records. CLARIN centres offer these CMD records describing their resources to the joint metadata domain. There are a number of generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observatory\furl{http://www.clarin.eu/vlo/}. These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used concept registries are the Dublin Core metadata %elements and
     90The basic building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components (see Figure \ref{fig:CMDM}). Components are stored in the Component Registry (CR), where they can be reused by other modellers. Thus a metadata modeller selects or creates components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile serves as blueprint for a schema for metadata records. CLARIN centres offer these CMD records describing their resources to the joint metadata domain. There are a number of generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observatory.\furl{http://www.clarin.eu/vlo/} These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used concept registries are the Dublin Core metadata %elements and
    9191terms and the ISOcat Data Category Registry. These concept links allow profiles, while being diverse in structure, to share semantics. Generic tools can use this semantic linkage to overcome differences in terminology and also in structure.
    9292
     
    130130In the following a RDF encoding is proposed for all levels of the CMD data domain:
    131131\begin{itemize}
    132 \item CMD meta model,
     132\item CMD meta model (see Figure \ref{fig:CMDM}),
    133133\item profile and component definitions,
    134134\item administrative and structural information of CMD records and
     
    138138\subsection{CMD specification}\label{sec:CMDM}
    139139
    140 The main entity of the meta model is the CMD component modelled as a \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to a RDF property (as it holds the literal value), but given its complexity (e.g., attributes\footnote{Due to space considerations we will not further discuss attributes.}, relation to the containing component)  it too has to be expressed as \code{rdfs:Class}. The actual literal value is a property of given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external semantic resources, the references to these entities are expressed in parallel object properties of type \code{cmdm:hasElementEntity} (constituting outbound links). The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}.
     140The main entity of the meta model is the CMD component modelled as a \code{rdfs:Class} (see Figure \ref{fig:CMDM-RDF}). A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to a RDF property (as it holds the literal value), but given its complexity (e.g., attributes,\footnote{Although the encoding has been done, due to space considerations, we will not further discuss attributes.} relation to the containing component)  it too has to be expressed as \code{rdfs:Class}. The actual literal value is a property of given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external semantic resources, the references to these entities are expressed in parallel object properties of type \code{cmdm:hasElementEntity} (constituting outbound links). The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}.
    141141
    142142\begin{figure*}
     
    187187\end{center}
    188188\caption{The CMD meta model in RDF}
    189 \label{fig:final-example}
     189\label{fig:CMDM-RDF}
    190190\end{figure*}
    191191
    192192\subsection{CMD profile and component definitions}
    193 This top-level classes and properties are subsequently used for modelling the actual profiles, components and elements as they are defined in the CR.
    194 For stand-alone components, the IRI is the (future) path into the CR to get the RDF representation for the profile/component\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}. For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the nearest ancestor stand-alone component's IRI and the dot-path to given component/element (e.g., Actor:\\ \code{cr:clarin.eu:cr1:c\_1271859438197/rdf \#Actor.Actor\_Languages.Actor\_Language}\footnote{For the sake of readability, in the examples we will collapse the component IRIs, refering to them just by their name, prefixed with \code{cmd:}})
     193These top-level classes and properties are subsequently used for modelling the actual profiles, components and elements as they are defined in the CR.
     194For stand-alone components, the IRI is the (future) path into the CR to get the RDF representation for the profile/component.\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438125/rdf} For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the nearest ancestor stand-alone component's IRI and the dot-path to given component/element (e.g., Actor:\\ \code{cr:clarin.eu:cr1:c\_1271859438197/rdf \#Actor.Actor\_Languages.Actor\_Language}\footnote{For the sake of readability, in the examples we will collapse the component IRIs, refering to them just by their name, prefixed with \code{cmd:}})
    195195
    196196\begin{example2}
     
    259259If identifiers are present for both resource and metadata, \end{comment}
    260260The PID of a Language Resource ( \code{<lr1>} ) is used as the IRI for the described resource in the RDF representation.
    261 The relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html}.
     261The relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary.\furl{http://openannotation.org/spec/core/core.html}
    262262(Note, that one MD record can describe multiple resources. This can be also easily accommodated in OpenAnnotation.)
    263263
     
    284284\subsubsection{Collection hierarchy}  % ( Resource Proxy – IsPartOf)}
    285285
    286 In CMD, there are dedicated generic elements -- the \code{cmd:ResourceProxyList} structure -- used to express both the collection hierarchy and to point to resource(s) described by the CMD record. The collection hierarchy can be modelled as an \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer#Foundations}. (The links to resources are handled by \code{oa:hasTarget}.)
     286In CMD, there are dedicated generic elements -- the \code{cmd:ResourceProxyList} structure -- used to express both the collection hierarchy and to point to resource(s) described by the CMD record. The collection hierarchy can be modelled as an \xne{OAI-ORE Aggregation}.\furl{http://www.openarchives.org/ore/1.0/primer#Foundations} (The links to resources are handled by \code{oa:hasTarget}.)
    287287:
    288288
     
    333333cmd:Person.Organisation & a &  cmdm:Element . \\
    334334cmd:hasPerson.OrganisationElementValue  \\
    335 & rdfs:subProperyOf & cmdm:hasElementValue ; \\
     335& rdfs:subPropertyOf & cmdm:hasElementValue ; \\
    336336        & rdfs:domain & cmd:Person.Organisation ; \\
    337337        & rdfs:range & xs:string . \\
    338338cmd:hasPerson.OrganisationElementEntity \\
    339         & rdfs:subProperyOf & cmdm:hasElementEntity ; \\
     339        & rdfs:subPropertyOf & cmdm:hasElementEntity ; \\
    340340        & rdfs:domain & cmd:Person.Organisation ; \\
    341341        & rdfs:range & cmd:Person.OrganisationElementEntity .\\
     
    350350        & \multicolumn{2}{l}{ cmd:hasPerson.OrganisationElementEntity  \quad \textless http://www.mpi.nl/\textgreater . }\\
    351351
    352 \textless http://www.mpi.nl/\textgreater & a  & cmd:OrganisationElementEnity .
     352\textless http://www.mpi.nl/\textgreater & a  & cmd:OrganisationElementEntity .
    353353\end{example3}
    354354\end{center}
     
    445445
    446446In the broader context of LOD Cloud there is the Open Knowledge Foundation’s Working Group on Linked Data in Linguistics, that represents an obvious pool of candidate
    447 datasets to link the CMD data with\footnote{\url{http://linguistics.okfn.org/resources/llod/}}.  Within these \xne{lexvo} seems a most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. based on the ISO-639-3 language identifiers which are also used in CMD records.
     447datasets to link the CMD data with.\furl{http://linguistics.okfn.org/resources/llod/}  Within these \xne{lexvo} seems a most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. based on the ISO-639-3 language identifiers which are also used in CMD records.
    448448\xne{lexvo} also seems suitable as it is already linked with a number of other LOD linguistic datasets like \xne{WALS}, \xne{lingvoj} and \xne{Glottolog}.
    449449Of course, language is just one dimension to use for mapping.