Changeset 4451 for CMDI-Interoperability


Timestamp:
02/05/14 20:27:38
Author:
mwindhouwer
Message:

M 2014-LREC/CMD2RDF.tex

  • preparing for a LDL short paper

M 2014-LREC/CMD2RDF.bib

  • added CMDI-1 ISO/DIS
Location:
CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC
Files:
2 edited

  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.bib

    r3868 r4451  
    604604}
    605605
     606@STANDARD{ISODIS24622-1_2013,
     607  title = {Language resource management -- Component Metadata Infrastructure -- Part 1: The Component Metadata Model (CMDI-1)},
     608  organization = {ISO},
     609  author = {{ISO/DIS 24622-1}},
     610  type = {International Standard},
     611  number = {24622-1},
     612  address = {Geneva, Switzerland},
     613  year = {2013},
     614  url = {http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=37336},
     615  abstract = {},
     616  owner = {m},
     617  publisher = {ISO},
     618  timestamp = {2014.02.05}
     619}
     620
     621
    606622@comment{jabref-meta: selector_publisher:}
    607623
  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex

    r4428 r4451  
    11\documentclass[10pt, a4paper]{article}
    22\usepackage{lrec2006}
     3
    34\usepackage{color}
    45\usepackage{graphicx}
     
    1415
    1516%%% PAGE DIMENSIONS
    16 \usepackage{geometry} % to change the page dimensions
    17 \geometry{a4paper} % or letterpaper (US) or a5paper or....
    18 \geometry{margin=2.5cm} % for example, change the margins to 2 inches all round
     17%\usepackage{geometry} % to change the page dimensions
     18%\geometry{a4paper} % or letterpaper (US) or a5paper or....
     19%\geometry{margin=2.5cm} % for example, change the margins to 2 inches all round
    1920%\topmargin=-0.6in
    20 \textheight=700pt
     21%\textheight=700pt
    2122% \geometry{landscape} % set up the page for landscape
    2223%   read geometry.pdf for detailed page layout information
     
    3233\begin{sffamily} \begin{shaded*} \noindent
    3334 \begin{tabular}{@{\hspace{-1mm}} p{0.3\textwidth}  p{0.7\textwidth} } }
     35{\end{tabular} \end{shaded*} \end{sffamily} }
     36
     37\newenvironment{example2a}
     38{ \footnotesize
     39\begin{sffamily} \begin{shaded*} \noindent
     40 \begin{tabular}{@{\hspace{-1mm}} p{0.4\textwidth}  p{0.6\textwidth} } }
    3441{\end{tabular} \end{shaded*} \end{sffamily} }
    3542
     
    5663{ \end{textit} \normalsize}
    5764
    58 \title{From Component Metadata to Linked Open Data}
     65\title{From CLARIN Component Metadata to Linked Open Data}
    5966
    6067\name{Matej Durco, Menzo Windhouwer}
     
    6471               matej.durco@assoc.oeaw.ac.at, menzo.windhouwer@dans.knaw.nl\\}
    6572
    66 \abstract{To be done ...}
     73\abstract{In the European CLARIN infrastructure a growing number of resources are described with Component Metadata. In this paper we
     74describe a transformation to make this metadata available as linked data. After this first step it becomes possible to connect the CLARIN Component Metadata with other valuable knowledge sources in the Linked Data Cloud. \\ \newline \Keywords{Linked Open Data, RDF, metadata}}
    6775
    6876%
     
    7381\section{Motivation}
    7482%
    75 Although semantic interoperability has been one of the main motivations for CLARIN Component Metadata Infrastructure (CMDI), until now there has been no work on the obvious -- bringing CMDI to Semantic Web. We believe that providing the whole of CMD data as Linked Open Data (LOD) interlinked with external semantic resources, will open up new dimensions of processing and exploring of the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF and interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies).
     83Although semantic interoperability has been one of the main motivations for CLARIN's Component Metadata Infrastructure (CMDI), until now there has been no work on the obvious -- bringing CMDI to the Semantic Web. We believe that providing the CLARIN CMD records as Linked Open Data (LOD) interlinked with external semantic resources will open up new dimensions of processing and exploring the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD Infrastructure can be expressed in RDF and made ready to be interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies).
    7684%This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data.
    7785
     
    8795\hspace{-0.1\textwidth}\includegraphics[width=0.8\textwidth]{CMDM}
    8896\end{center}
    89 \caption{Component Metadata Model}
     97\caption{Component Metadata Model \cite{ISODIS24622-1_2013}}
    9098\label{fig:CMDM}
    9199\end{figure*}
     
    118126%So we encounter both situations: one profile being used by many providers and one provider using many profiles.
    119127
    120 %
    121 \section{LOD -- Linked Open Data}
    122 %
    123 The main added value of LOD\cite{TimBL2006} is the interconnecting of disparate datasets.
    124 In the broader context of LOD there is meanwhile an Open Knowledge Foundation’s Working Group on Linked Data in Linguistics, that represents an obvious pool of candidate
    125 datasets to link the CMD data with\footnote{\url{http://linguistics.okfn.org/resources/llod/}}.
    126 Within these \xne{lexvo} seems most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. based on the ISO-639-3 language identifiers which are also used in CMD records.
    127 \xne{lexvo} also seems suitable as it is already linked with a number of other LOD linguistic datasets like \xne{WALS}, \xne{lingvoj} and \xne{Glottolog}.
    128 Of course, language is just one dimension to use for mapping.
    129 Step by step we will link other categories like countries, geographica, organisations, etc.
    130 to some of the central nodes of the LOD cloud \cite{Cyganiak2010}, like \xne{dbpedia}, \xne{Yago} or \xne{geonames},
    131 but also to domain-specific semantic resource like the ontology for language technology \xne{LT-World} \cite{Joerg2010} developed at DFKI.
    132 
    133128\section{CMD to RDF}
    134129\label{sec:cmd2rdf}
     
    143138\subsection{CMD specification}\label{sec:CMDM}
    144139
    145 The main entity of the meta model is the CMD component modelled as a \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to a RDF property (as it holds the literal value), but given its complexity (e.g., attributes\footnote{Due to space considerations the abstract will not further discuss attributes.}, relation to the containing component)  it too has to be expressed as \code{rdfs:Class}. The actual literal value is a property of given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external semantic resources, the references to these entities are expressed in parallel object properties of type \code{cmdm:hasElementEntity} (constituting outbound links). The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}.
     140The main entity of the meta model is the CMD component modelled as a \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to an RDF property (as it holds the literal value), but given its complexity (e.g., attributes\footnote{Due to space considerations we will not further discuss attributes.}, relation to the containing component)  it too has to be expressed as an \code{rdfs:Class}. The actual literal value is a property of the given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external semantic resources, the references to these entities are expressed in parallel object properties of type \code{cmdm:hasElementEntity} (constituting outbound links). The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}.
    146141
    147142\begin{figure*}
     
    160155cmdm:contains & a & rdf:Property ; \\
    161156        & rdfs:domain & cmdm:Component ; \\
    162         & rdfs:range & :Component , :Element . \\
     157        & rdfs:range & cmdm:Component , cmdm:Element . \\
    163158
    164159%cmdm:containsAttribute & a &rdf:Property;
    165 %          & rdfs:domain & :Component, :Element;
    166 %          & rdfs:range & :Attribute.
     160%          & rdfs:domain & cmdm:Component, cmdm:Element;
     161%          & rdfs:range & cmdm:Attribute.
    167162
    168163\multicolumn{3}{l}{\# values}  \\
     
    178173\\
    179174cmdm:hasElementEntity & a & rdf:Property ;  \\
    180               & rdfs:domain & :Element ;  \\
    181               & rdfs:range & :Entity .   \\
     175              & rdfs:domain & cmdm:Element ;  \\
     176              & rdfs:range & cmdm:Entity .   \\
    182177% \\
    183178%\multicolumn{3}{l}{\# analogue for attributes ...}  \\
     
    187182
    188183%cmdm:hasAttributeEntity & a & rdf:Property ;  \\
    189 %              & rdfs:domain & :Attribute ;  \\
    190 %              & rdfs:range & :Entity .  \\
     184%              & rdfs:domain & cmdm:Attribute ;  \\
     185%              & rdfs:range & cmdm:Entity .  \\
    191186\end{example3}
    192187\end{center}
     
    197192\subsection{CMD profile and component definitions}
    198193These top-level classes and properties are subsequently used for modelling the actual profiles, components and elements as they are defined in the CR.
    199 For stand-alone components, the IRI is the exact path into the CR to get the RDF representation for the profile/component\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}. For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the nearest ancestor stand-alone component's IRI and the dot-path to given component/element (e.g., Actor:\\ \code{cr:clarin.eu:cr1:c\_1271859438197/rdf \#Actor.Actor\_Languages.Actor\_Language}\footnote{For the sake of readability, in the examples we will collapse the component IRIs, refering to them just by their name, prefixed with \code{cmd:}})
     194For stand-alone components, the IRI is the (future) path into the CR to get the RDF representation for the profile/component\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}. For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the nearest ancestor stand-alone component's IRI and the dot-path to the given component/element (e.g., Actor:\\ \code{cr:clarin.eu:cr1:c\_1271859438197/rdf \#Actor.Actor\_Languages.Actor\_Language}\footnote{For the sake of readability, in the examples we will collapse the component IRIs, referring to them just by their name, prefixed with \code{cmd:}.}).
    200195
    201196\begin{example2}
    202197cmd:collection \\
    203 $\;$ a & cmdm:Profile; \\
    204 $\;$ rdfs:label & "collection"; \\
    205 $\;$ dc:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\
     198$\;$ a & cmdm:Profile ; \\
     199$\;$ rdfs:label & "collection" ; \\
     200$\;$ dc:identifier & cr:clarin.eu:cr1:p\_1345561703620 . \\
    206201cmd:Actor \\
    207 $\;$ a &cmdm:Component. \\
     202$\;$ a &cmdm:Component . \\
    208203\end{example2}
    209204
     
    214209dcr:datcat \\
    215210$\;$ a  & owl:AnnotationProperty ; \\
    216 $\;$ rdfs:label  & "data category"@en ; \\
     211$\;$ rdfs:label  & "data category"@en . \\
    217212% & rdfs:comment  & "This resource is equivalent to  this data category."@en ; \\
    218213% & skos:note  & "The data category should be identified by its PID."@en ; \\
     
    223218\begin{example2}
    224219cmd:LanguageName \\
    225 $\;$ dcr:datcat & isocat:DC-2484. \\
     220$\;$ dcr:datcat & isocat:DC-2484 . \\
    226221\end{example2}
    227222
     
    254249\begin{example3}
    255250\textless lr1\textgreater \\
    256 $\enspace \,$ a  & & cmdm:Resource; \\
    257 \multicolumn{2}{l}{cmdm:hasMimeType } & "audio/wav". \\
     251$\enspace \,$ a  & & cmdm:Resource ; \\
     252\multicolumn{2}{l}{cmdm:hasMimeType } & "audio/wav" . \\
    258253\end{example3}
    259254
     
    267262(Note that one MD record can describe multiple resources. This can also be easily accommodated in OpenAnnotation.)
    268263
    269 \begin{example2}
     264\begin{example2a}
    270265\_:anno1  \\
    271266 $\:$ a & oa:Annotation ; \\
    272267 $\:$ oa:hasTarget  & \textless lr1a \textgreater, \textless lr1b\textgreater ; \\
    273268 $\:$ oa:hasBody  & \_:topComponent1 ; \\
    274   $\:$ oa:motivatedBy & oa:describing . \\
    275 \end{example2}
     269 $\:$ oa:motivatedBy & oa:describing . \\
     270\end{example2a}
    276271
    277272\subsubsection{Provenance}
     
    282277\_:topComponent1  \\
    283278 $\:$ dc:identifier  & \textless lr1.cmd \textgreater ;  \\
    284  $\:$ dc:creator  & \var{\{cmd:MdCreator\}} ;  \\
     279 $\:$ dc:creator  & "John Doe" ;  \\
    285280 $\:$ dc:publisher  & \textless http://clarin.eu\textgreater  ; \\
    286  $\:$ dc:created & \var{\{cmd:MdCreated\}} .  \\
     281 $\:$ dc:created & "2014-02-05"\^{}\^{}xs:date .  \\
    287282\end{example2}
    288283
    289284\subsubsection{Collection hierarchy}  % ( Resource Proxy – IsPartOf)}
    290285
    291 In CMD, there are dedicated generic elements -- the \code{cmd:ResourceProxyList} structure -- used to express both the collection hierarchy and to point to resource(s) described by the CMD record. The collection hierarchy can be modelled as an \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}. (The links to resources are handled by \code{oa:hasTarget}.)
     286In CMD, there are dedicated generic elements -- the \code{cmd:ResourceProxyList} structure -- used to express both the collection hierarchy and to point to resource(s) described by the CMD record. The collection hierarchy can be modelled as an \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer#Foundations}. (The links to resources are handled by \code{oa:hasTarget}.)
    292287:
    293288
     
    317312\begin{example3}
    318313\_:actor1  & a & cmd:Actor . \\
    319 \_:actor1lang1  & a & cmd:Actor \\
    320  &  & .Actor\_Language . \\
     314\_:actor1lang1  & a & cmd:Actor. \\
     315 &  & Actor\_Language . \\
    321316\_:actor1  & cmd:contains & \_:actor1lang1 . \\
    322317\end{example3}
     
    338333cmd:Person & a & cmdm:Component . \\
    339334cmd:Person.Organisation & a &  cmdm:Element . \\
    340 cmd:hasOrganisationElementValue  \\
     335cmd:hasPerson.OrganisationElementValue  \\
    341336& rdfs:subPropertyOf & cmdm:hasElementValue ; \\
    342         & rdfs:domain & cmd:Organisation ; \\
     337        & rdfs:domain & cmd:Person.Organisation ; \\
    343338        & rdfs:range & xs:string . \\
    344 cmd:hasOrganisationElementEntity \\
     339cmd:hasPerson.OrganisationElementEntity \\
    345340        & rdfs:subPropertyOf & cmdm:hasElementEntity ; \\
    346         & rdfs:domain & cmd:Organisation ; \\
    347         & rdfs:range & cmd:OrganisationElementEntity .\\
    348 cmd:OrganisationElementEntity \\
     341        & rdfs:domain & cmd:Person.Organisation ; \\
     342        & rdfs:range & cmd:Person.OrganisationElementEntity .\\
     343cmd:Person.OrganisationElementEntity \\
    349344& a & cmdm:Entity . \\
    350345\\
     
    353348        & cmdm:contains & \_:org . \\
    354349\_:org & a & cmd:Person.Organisation ; \\
    355         & \multicolumn{2}{l}{cmd:hasOrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\
    356         & \multicolumn{2}{l}{ cmd:hasOrganisationElementEntity  \quad <http://www.mpi.nl/> . }\\
     350        & \multicolumn{2}{l}{cmd:hasPerson.OrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\
     351        & \multicolumn{2}{l}{ cmd:hasPerson.OrganisationElementEntity  \quad <http://www.mpi.nl/> . }\\
    357352
    358353<http://www.mpi.nl/> & a  & cmd:OrganisationElementEntity .
     
    364359
    365360
    366 \subsubsection{Elements, Fields, Values}
     361\subsubsection{Elements, Fields, Values}\label{sec:values}
    367362Finally, we also want to integrate the actual field values in the CMD records into the linked data.
    368 As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed as \code{cmds:ElementValue}, and they are related bua a \code{cmdm:hasElementValue} property.
    369 
    370 While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples (predicate \code{cmdm:hasElementEntity}) with the literal values mapped to semantic entities. Following example shows the whole chain of statements from metamodel to literal value and corresponding semantic entity.
    371 
    372 The actual mapping process from values to entities is a complex challenging task and will be tackled in more detail in the full paper. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) corresponding to the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples, representing outbound links.
     363As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed as \code{cmdm:ElementValue}, and they are related by a \code{cmdm:hasElementValue} property.
     364
     365While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples (predicate \code{cmdm:hasElementEntity}) with the literal values mapped to semantic entities. The example in Figure \ref{fig:final-example} shows the whole chain of statements from metamodel to literal value and corresponding semantic entity.
     366
    373367
    374368\begin{comment}
     
    441435\section{Implementation}
    442436
    443 The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL-stylesheets, that are currently being tested on a sample dataset. Once ready, they will be integrated into the CMDI core infrastructure, e.g., the CR.
    444 %And in the near future, a test on the instances in the complete CLARIN joint metadata domain will be performed.
    445 Once the linked data is available it has to be stored and published in a RDF triple store. We will elaborate on this aspect further in the final paper.
     437The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL-stylesheets. In the future, when the mapping has been tested extensively, they will be integrated into the CMD core infrastructure, e.g., the CR. A linked data representation of the CLARIN joint metadata domain can then be stored in an RDF triple store and exposed via a SPARQL endpoint.
    446438%The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). \cite{Haslhofer2011europeana}
    447439
    448440% Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger, than ``just'' the original dataset.
    449441
    450 \section{Conclusions and Future Work}
    451 In this abstract, we sketched the work on encoding of the whole of the CMD data domain in RDF, with special focus on the core model -- the general component schema. In the full paper we will also elaborate on the task of mapping element values to semantic entities. Additionally, some technical considerations will be discussed regarding exposing this dataset as Linked Open Data.
     442%
     443\section{CMDI's future in the LOD Cloud}
     444%
     445The main added value of LOD \cite{TimBL2006} is the interconnecting of disparate datasets in the so-called LOD cloud \cite{Cyganiak2010}.
     446
     447The actual mapping process from CMDI values (see Section \ref{sec:values}) to entities is a complex and challenging task. The main idea is to find entities in selected reference datasets (controlled vocabularies, ontologies) corresponding to the literal values in the metadata records. The obtained entity identifiers are further used to generate new RDF triples, representing outbound links.
     448
     449In the broader context of the LOD cloud there is the Open Knowledge Foundation’s Working Group on Linked Data in Linguistics, which represents an obvious pool of candidate
     450datasets to link the CMD data with\footnote{\url{http://linguistics.okfn.org/resources/llod/}}.  Among these, \xne{lexvo} seems the most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e., based on the ISO-639-3 language identifiers, which are also used in CMD records.
     451\xne{lexvo} also seems suitable as it is already linked with a number of other LOD linguistic datasets like \xne{WALS}, \xne{lingvoj} and \xne{Glottolog}.
     452Of course, language is just one dimension to use for mapping.
     453Step by step we will link other categories like countries, geographica, organisations, etc.
     454to some of the central nodes of the LOD cloud, like \xne{dbpedia}, \xne{Yago} or \xne{geonames},
     455but also to domain-specific semantic resources like the ontology for language technology \xne{LT-World} \cite{Joerg2010} developed at DFKI.
     456
     457\section{Conclusions}
     458In this paper, we sketched the work on encoding the whole CMD data domain in RDF, with special focus on the core model -- the general component schema. In the future we will extend this by mapping element values to semantic entities.
    452459
    453460With this new enhanced dataset, the groundwork is laid for a full-blown \emph{semantic search}, i.e., the possibility of exploring the dataset indirectly using external semantic resources (like vocabularies of organizations or taxonomies of resource types) to which the CMD data will then be linked.