Changeset 3872 for CMDI-Interoperability


Timestamp:
10/24/13 10:48:52
Author:
mwindhouwer
Message:

M 2014-LREC/CMD2RDF.tex

  • made another pass leading to small fixes here and there
  • word count: 1925
File:
1 edited

  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex

    r3871 r3872  
    8181\section{Motivation}
    8282%
    83 Although semantic interoperability has been one of the main motivations for CLARIN Component Metadata Infrastructure, until now there has been no work on the obvious -- bringing CMDI to Semantic Web. We believe that providing the whole of CMD data as Linked Open Data interlinked with external semantic resources, will open up new dimensions of processing and exploring of the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF and interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies).
     83Although semantic interoperability has been one of the main motivations for the CLARIN Component Metadata Infrastructure (CMDI), until now there has been no work on the obvious -- bringing CMDI to the Semantic Web. We believe that providing the whole of CMD data as Linked Open Data (LOD) interlinked with external semantic resources will open up new dimensions of processing and exploring the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF and interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies).
    8484%This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data.
    8585
     
    8888%
    8989
    90 The basic building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components. Components are stored in the Component Registry (CR), where they can be reused by other modellers. Thus a metadata modeller selects or creates components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile is a blueprint of a schema for a metadata record. CLARIN centres offer these CMD records describing their resources to the joint metadata domain. There are a number of generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observatory. These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used concept registries are the Dublin Core metadata %elements and
     90The basic building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components (see Figure \ref{fig:CMDM}). Components are stored in the Component Registry (CR), where they can be reused by other modellers. Thus a metadata modeller selects or creates components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile is a blueprint of a schema for a metadata record. CLARIN centres offer these CMD records describing their resources to the joint metadata domain. There are a number of generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observatory\furl{http://www.clarin.eu/vlo/}. These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used concept registries are the Dublin Core metadata %elements and
    9191terms and the ISOcat Data Category Registry. These concept links allow profiles, while being diverse in structure, to share semantics. Generic tools can use this semantic linkage to overcome differences in terminology and also in structure.
    9292
     
    12212216 of the providers offer CMDI records, the other 53 provide around 140.000 OLAC/DC records, that are converted into the corresponding CMD profile.
    123123%Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
    124 On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different ones), so that overall instance data for more than 60 profiles is present.
     124%On the other hand, some
     125Some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different ones), so that overall instance data for more than 60 profiles is present.
    125126%So we encounter both situations: one profile being used by many providers and one provider using many profiles.
    126127
     
    132133datasets to link the CMD data with\footnote{\url{http://linguistics.okfn.org/resources/llod/}}.
    133134Within these, \xne{lexvo} seems the most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e., based on the ISO-639-3 language identifiers which are also used in CMD records.
    134 \xne{lexvo} also seems suitable as it is already linked with a number of LDL datasets among others \xne{WALS}, \xne{lingvoj}, \xne{Glottolog}.
     135\xne{lexvo} also seems suitable as it is already linked with a number of other LOD linguistic datasets like \xne{WALS}, \xne{lingvoj} and \xne{Glottolog}.
    135136Of course, language is just one dimension to use for mapping.
    136137Step by step we will link other categories like countries, geographica, organisations, etc.
     
    142143In the following, an RDF encoding is proposed for all levels of the CMD data domain:
    143144\begin{itemize}
    144 \item CMD meta model
    145 \item profile definitions
    146 \item the administrative and structural information of CMD records
    147 \item individual values in the fields of the CMD records
     145\item the CMD meta model,
     146\item the profile definitions,
     147\item the administrative and structural information of CMD records, and
     148\item the individual values in the fields of the CMD records.
    148149\end{itemize}
    149150
    150151\subsection{CMD specification}\label{sec:CMDM}
    151152
    152 The main entity of the meta model is the CMD component modelled as a \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to a RDF property (as it holds the literal value), but given its complexity (e.g. attributes\footnote{Due to space considerations the abstract will not further discuss attributes.}, relation to the containing component)  it too has to be a \code{rdfs:Class}. The actual literal value is a property of given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external semantic resources, the references to these entities are expressed in parallel object properties of type \code{cmdm:hasElementEntity} (outbound links). The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}.
     153The main entity of the meta model is the CMD component, modelled as a \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to an RDF property (as it holds the literal value), but given its complexity (e.g., attributes\footnote{Due to space considerations the abstract will not further discuss attributes.}, relation to the containing component), it too has to be a \code{rdfs:Class}. The actual literal value is a property of the given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external semantic resources, the references to these entities are expressed in parallel object properties of type \code{cmdm:hasElementEntity} (outbound links). The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}.
    153154
    154155\label{table:rdf-spec}
     
    197198
    198199\noindent
    199 This top-level are subsequently used for modelling the actual profiles, components and elements as they are defined in the CR.
     200These top-level classes and properties are subsequently used for modelling the actual profiles, components and elements as they are defined in the CR.
    200201For stand-alone components, the IRI is the exact path into the CR to get the RDF representation for the profile/component\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}. For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the nearest ancestor stand-alone component's IRI and a dot-path to the given component/element (e.g., Actor:\\ \code{cr:clarin.eu:cr1:c\_1271859438197/rdf\#Actor.Actor\_Languages.Actor\_Language}\footnote{For the sake of readability, in the examples we will collapse the component IRIs, referring to them just by their name, prefixed with \code{cmd:}.}.)
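
The identifier scheme for inner components and elements can be illustrated with a minimal Python sketch. It only reproduces the concatenation seen in the Actor example (stand-alone IRI, then \code{\#}, then the dot-path); the function name \code{inner_component_iri} and its parameters are hypothetical and not part of CMD2RDF.

```python
from typing import Sequence


def inner_component_iri(standalone_iri: str, path: Sequence[str]) -> str:
    """Build the identifier of an inner component or element.

    standalone_iri: IRI of the nearest ancestor stand-alone component.
    path: names leading from that component down to the target; they are
          joined with dots and appended after '#', as in the Actor example.
    """
    if not path:
        # No inner path: the stand-alone component is identified by its own IRI.
        return standalone_iri
    return f"{standalone_iri}#{'.'.join(path)}"


# Reproduces the identifier quoted in the text (without LaTeX escaping).
example = inner_component_iri(
    "cr:clarin.eu:cr1:c_1271859438197/rdf",
    ["Actor", "Actor_Languages", "Actor_Language"],
)
print(example)
```

Whether the dot-path starts with the stand-alone component's own name, as it does in the Actor example, is left to the caller in this sketch.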
    201202
     
    272273\subsubsection{Provenance}
    273274
    274 The information from CMD record \code{cmd:Header} represents the provenance information about the modelled data:
     275The information from the CMD record \code{cmd:Header} represents the provenance information about the modelled data.
    275276
    276277\begin{example3}
     
    283284\subsubsection{Collection hierarchy}  % ( Resource Proxy – IsPartOf)}
    284285
    285 In CMD, there are dedicated generic elements -- the \code{cmd:ResourceProxyList} structure -- used to express both the collection hierarchy and to point to resource(s) described by the CMD record. The collection hierarchy can be modelled as \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations} (The links to resources are handled by \code{oa:hasTarget}.)
     286In CMD, there are dedicated generic elements -- the \code{cmd:ResourceProxyList} structure -- used to express both the collection hierarchy and to point to resource(s) described by the CMD record. The collection hierarchy can be modelled as an \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}. (The links to resources are handled by \code{oa:hasTarget}.)
    286287:
    287288
     
    326327
    327328\subsubsection{Elements, Fields, Values}
    328 Finally, we want to integrate also the actual field values in the CMD records into the ontology.
    329 As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed as \code{cmds:ElementValue} property and the corresponding data category expressed as annotation property.
     329Finally, we want to integrate also the actual field values in the CMD records into the linked data.
     330As explained before, CMD elements have to be typed as \code{rdfs:Class}, the actual value expressed as \code{cmds:ElementValue}, and they are related by a \code{cmdm:hasElementValue} property.
    330331
    331332While generating triples with literal values seems straightforward, the more challenging but also more valuable aspect is to generate object property triples (predicate \code{cmdm:hasElementEntity}) with the literal values mapped to semantic entities. The following example shows the whole chain of statements from metamodel to literal value and corresponding semantic entity.
     
    344345        & rdfs:domain & cmd:Organisation ; \\
    345346        & rdfs:range & cmd:OrganisationElementEntity .\\
     347cmd:OrganisationElementEntity \\
     348& a & cmdm:Entity . \\
    346349\\
    347350\multicolumn{3}{l}{\# person (mentioned in a MD record) has an affiliation (cmd:Person/cmd:Organisation) } \\
     
    350353\_:org & a & cmd:Person.Organisation ; \\
    351354        & \multicolumn{2}{l}{cmd:hasOrganisationElementValue \quad 'MPI'\^{}\^{}xs:string ;} \\
    352         & \multicolumn{2}{l}{ cmd:hasOrganisationElementEntity  \quad <http://mpi.nl> . }\\
    353 
    354 <http://mpi.nl> & a  & cmd:OrganisationElementEnity .
     355        & \multicolumn{2}{l}{ cmd:hasOrganisationElementEntity  \quad <http://www.mpi.nl/> . }\\
     356
     357<http://www.mpi.nl/> & a  & cmd:OrganisationElementEntity .
    355358\end{example3}
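
As a rough illustration of how the instance-level statements in the example above could be generated, the following Python sketch emits the literal-value triple and, when a mapping is available, the parallel entity triples. It assumes the naming pattern visible in the example (\code{cmd:has<Name>ElementValue}, \code{cmd:has<Name>ElementEntity}, \code{cmd:<Name>ElementEntity}); the function \code{element_triples} and the plain-tuple representation are hypothetical, not the CMD2RDF implementation.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]


def element_triples(
    node: str,
    element_class: str,
    name: str,
    literal: str,
    entity_iri: Optional[str] = None,
) -> List[Triple]:
    """Return (subject, predicate, object) tuples for one CMD element value."""
    triples: List[Triple] = [
        # The element instance is typed with its element class.
        (node, "a", element_class),
        # The literal value as found in the CMD record.
        (node, f"cmd:has{name}ElementValue", f"'{literal}'^^xs:string"),
    ]
    if entity_iri is not None:
        # Outbound link to the semantic entity the literal was mapped to.
        triples.append((node, f"cmd:has{name}ElementEntity", f"<{entity_iri}>"))
        triples.append((f"<{entity_iri}>", "a", f"cmd:{name}ElementEntity"))
    return triples


triples = element_triples(
    "_:org", "cmd:Person.Organisation", "Organisation", "MPI", "http://www.mpi.nl/"
)
for s, p, o in triples:
    print(s, p, o, ".")
```

Keeping the entity triples optional reflects that the literal triple can always be produced, whereas the entity link depends on a successful mapping of the value.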
    356359
     
    434437In this abstract, we sketched the work on encoding of the whole of the CMD data domain in RDF, with special focus on the core model -- the general component schema. In the full paper we will also elaborate on the task of mapping element values to semantic entities. Additionally, some technical considerations will be discussed regarding exposing this dataset as Linked Open Data.
    435438
    436 With this new enhanced dataset, the groundwork is laid for a full-blown \emph{semantic search}, i.e. the possibility of exploring the dataset indirectly using external semantic resources (like vocabularies of organizations or taxonomies of resource types) to which the CMD data will then be linked to.
     439With this new enhanced dataset, the groundwork is laid for a full-blown \emph{semantic search}, i.e., the possibility of exploring the dataset indirectly using external semantic resources (like vocabularies of organizations or taxonomies of resource types) to which the CMD data will then be linked.
    437440
    438441\bibliographystyle{splncs}