Changeset 3868 for CMDI-Interoperability


Timestamp:
10/23/13 20:56:43 (11 years ago)
Author:
vronk
Message:

comment relcat, minor reformulations

Location:
CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC
Files:
2 edited

  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.bib

    r3864 r3868  
    214214}
    215215
    216 %  booktitle = {Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)},
    217 %  publisher = {European Language Resources Association (ELRA)},
    218216@INPROCEEDINGS{Joerg2010,
    219   author = {Brigitte Jörg and Hans Uszkoreit and Alastair Burt},
     217  author = {Brigitte J\"{o}rg and Hans Uszkoreit and Alastair Burt},
    220218  title = {LT World: Ontology and Reference Information Portal},
    221219  booktitle = {LREC},
    222220  year = {2010},
    223   editor = {Nicoletta Calzolari and Khalid Choukri and Bente Maegaard and Joseph
    224         Mariani and Jan Odjik and Stelios Piperidis and Mike Rosner and Daniel
    225         Tapias},
     221  editor = {Nicoletta Calzolari and Khalid Choukri et al.},
    226222  address = {Valletta, Malta},
    227223  month = {May},
  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex

    r3862 r3868  
    8181\section{Motivation}
    8282%
    83 Although semantic interoperability has been one of the main motivations for CLARIN Component Metadata Infrastructure, until now there has been no work on the obvious -- bringing CMDI to Semantic Web. We believe that providing the whole of CMD data  as Linked Open Data linked with external semantic resources, will opens a whole new level of processing and exploring of the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF and interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies).
     83Although semantic interoperability has been one of the main motivations for the CLARIN Component Metadata Infrastructure, until now there has been no work on the obvious -- bringing CMDI to the Semantic Web. We believe that providing the whole of the CMD data as Linked Open Data, interlinked with external semantic resources, will open up new dimensions of processing and exploring the CMD data by employing the power of semantic technologies. In this paper, we lay out how individual parts of the CMD framework can be expressed in RDF and interlinked with existing external semantic resources (ontologies, knowledge bases, vocabularies).
    8484%This conversion lays a foundation / is groundwork for providing the original dataset as a \emph{Linked Open Data} nucleus within the \emph{Web of Data}\cite{TimBL2006} as well as for real semantic (ontology-driven) search and exploration of the data.
    8585
     
    8888%
    8989
    90 The natural building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components. A coherent component, e.g., a component to capture information on a contact person or one for project information, can be reused and is stored for that in the Component Registry (CR). A metadata modeller selects components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile can be used as the schema for a metadata record. CLARIN centres offer these CMD records to the joint metadata domain. There are some generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observatory. These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used registries are the Dublin Core metadata %elements and
    91 terms and the ISOcat Data Category Registry. These concept links allow profiles, while being diverse in structure, to share semantics. Generic tools can use these semantics to overcome differences in terminology and also in structure.
     90The basic building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components. Components are stored in the Component Registry (CR), where they can be reused by other modellers. Thus a metadata modeller selects or creates components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile is a blueprint of a schema for a metadata record. CLARIN centres offer these CMD records describing their resources to the joint metadata domain. There are a number of generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observatory. These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used concept registries are the Dublin Core metadata %elements and
     91terms and the ISOcat Data Category Registry. These concept links allow profiles, while being diverse in structure, to share semantics. Generic tools can use this semantic linkage to overcome differences in terminology and also in structure.
    9292
    9393\begin{figure*}
     
    107107
    108108\subsubsection{CMD Profiles }
    109 In the CR 133 public profiles and 772 components are defined.
    110 Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema.
     109Currently 133 public profiles and 772 components are defined in the CR.
     110Next to the `native' ones a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema.
    111111%The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel.
    112112The individual profiles differ also very much in their structure -- next to simple flat profiles
     
    119119
    120120The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
    121 regularly collects records from the providers -- currently 69 over 550.000 records.
     121regularly collects records from the -- currently 69 -- providers, all in all over 550.000 records.
    12212216 of the providers offer CMDI records, the other 53 provide around 140.000 OLAC/DC records, that are converted into the corresponding CMD profile.
    123123%Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
    124 On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different ones), so that all in all there is instance data for more than 60 profiles.
     124On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different ones), so that overall instance data for more than 60 profiles is present.
    125125%So we encounter both situations: one profile being used by many providers and one provider using many profiles.
    126126
     
    129129%
    130130The main added value of LOD\cite{TimBL2006} is the interconnecting of disparate datasets.
    131 In the broader context of LOD, there is meanwhile a Open Knowledge Foundation’s Working Group on Open Data in Linguistics, that renders an obvious pool of candidate
     131In the broader context of LOD, there is now the Open Knowledge Foundation’s Working Group on Open Data in Linguistics, which represents an obvious pool of candidate
    132132datasets to link the CMD data with\footnote{\url{http://linguistics.okfn.org/resources/llod/}}.
    133 Within these \xne{lexvo} seems most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. for the ISO-639-3 language identifiers which are also used in CMD records.
     133Within these, \xne{lexvo} seems the most promising starting point, as it features URIs like \url{http://lexvo.org/id/term/eng/}, i.e. based on the ISO-639-3 language identifiers which are also used in CMD records.
    134134\xne{lexvo} also seems suitable as it is already linked with a number of LDL datasets among others \xne{WALS}, \xne{lingvoj}, \xne{Glottolog}.
    135 Of course, language is just one dimension to use for linking/mapping.
    136 Step by step we will link other categories like countries, geolocations, organisations, etc.
     135Of course, language is just one dimension to use for mapping.
     136Step by step we will link other categories like countries, geographica, organisations, etc.
    137137to some of the central nodes of the LOD cloud \cite{Cyganiak2010}, like \xne{dbpedia}, \xne{Yago} or \xne{geonames},
    138138but also domain-specific semantic resource like the ontology for language technology \xne{LT-World} \cite{Joerg2010} developed at DFKI.
     
    150150\subsection{CMD specification}\label{sec:CMDM}
    151151
    152 The main entity of the meta model is the CMD component modelled as A \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to a RDF property (as it holds the literal value), but given its complexity (e.g. attributes\footnote{Due to space considerations the remainder of the paper will not discuss attributes.}, relation to the containing component)  it to has to be a \code{rdfs:Class}. The actual literal value is a property of given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external vocabularies/ semantic resources, the references to these entities are expressed in parallel properties of type \code{cmdm:hasElementEntity}. The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}.
     152The main entity of the meta model is the CMD component, modelled as an \code{rdfs:Class}. A CMD profile is basically a CMD component with some extra features, implying a specialization relation. It may seem natural to translate a CMD element to an RDF property (as it holds the literal value), but given its complexity (e.g. attributes\footnote{Due to space considerations the abstract will not further discuss attributes.}, relation to the containing component) it too has to be an \code{rdfs:Class}. The actual literal value is a property of the given element of type \code{cmdm:ElementValue}. For values that can be mapped to entities defined in external semantic resources, the references to these entities are expressed in parallel object properties of type \code{cmdm:hasElementEntity} (outbound links). The containment relation between components and elements is expressed with a dedicated property \code{cmdm:contains}.
    153153
    154154\label{table:rdf-spec}
    155155\begin{example3}
    156 @prefix cmdm: \textless http://www.clarin.eu/cmd/general.rdf\#\textgreater . \\
     156\multicolumn{3}{l}{@prefix cmdm: \textless http://www.clarin.eu/cmd/general.rdf\#\textgreater . }\\
    157157\\
    158158\multicolumn{3}{l}{\# basic building blocks of CMD Model}  \\
     
    197197
    198198\noindent
    199 This entities are used for modelling the actual profiles, components and elements as they are defined in the CR.
    200 For stand-alone/top components, the IRI\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf} is the exact path into the CR to get the RDF representation for the profile/component. For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the parent top component IRI and dot-path to given component/element (Actor: \code{cr:clarin.eu:cr1:c\_1271859438197/rdf\#Actor.Actor\_Languages.Actor\_Language}).\footnote{For the sake of readability, we will collapse the component IRIs, refer to them just by their name, prefixed with \code{cmd:}.}
    201 
    202 \begin{example3}
    203 cmd:collection & a & cmdm:Profile; \\
     199These top-level entities are subsequently used for modelling the actual profiles, components and elements as they are defined in the CR.
     200For stand-alone components, the IRI is the exact path into the CR to get the RDF representation for the profile/component\furl{http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c\_1271859438125/rdf}. For ``inner'' components (that are defined as part of another component) and elements the identifier is a concatenation of the nearest ancestor stand-alone component's IRI and a dot-path to the given component/element (e.g., Actor:\\ \code{cr:clarin.eu:cr1:c\_1271859438197/rdf\#Actor.Actor\_Languages.Actor\_Language}\footnote{For the sake of readability, in the examples we will collapse the component IRIs, referring to them just by their name, prefixed with \code{cmd:}.}).
     201
     202\begin{example3}
     203cmd:collection& a & cmdm:Profile; \\
    204204 & rdfs:label & "collection"; \\
    205205 & dcterms:identifier & cr:clarin.eu:cr1:p\_1345561703620. \\
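
The identifier scheme described in the paragraph above can be sketched in a few lines of Python. This is only an illustration and not part of the changeset: the helper names `component_iri` and `inner_iri` are hypothetical, and the sketch assumes that the `cr:` prefix expands to the Component Registry path shown in the footnote URL and that the dot-path is attached as a fragment after `#`, as in the `Actor` example.

```python
# Assumption: this is what the cr: prefix stands for (taken from the footnote URL).
CR_BASE = "http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/"


def component_iri(component_id):
    """IRI of a stand-alone component or profile (hypothetical helper)."""
    return CR_BASE + component_id + "/rdf"


def inner_iri(ancestor_id, *path):
    """IRI of an inner component or element.

    Concatenates the IRI of the nearest stand-alone ancestor with the
    dot-path leading to the inner component/element.
    """
    if not path:
        raise ValueError("the dot-path needs at least one name")
    return component_iri(ancestor_id) + "#" + ".".join(path)


if __name__ == "__main__":
    print(inner_iri("clarin.eu:cr1:c_1271859438197",
                    "Actor", "Actor_Languages", "Actor_Language"))
```

Keeping the dot-path in the fragment means that every inner identifier still resolves to the RDF document of its stand-alone ancestor.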
     
    208208
    209209\subsubsection{Data Categories}
    210 One of the semantic registries in use by CMDI for its concept links is ISOcat. In \cite{Windhouwer2012_LDL} proposes to link to the data categories via an annotation property.
     210The primary concept registry in use by CMDI for its concept links is ISOcat. The recommended approach is to link to the data categories via an annotation property \cite{Windhouwer2012_LDL}.
    211211
    212212\begin{example3}
     
    217217\end{example3}
    218218
    219 The \code{@ConceptLink} attribute on CMD elements and components referencing the data category can be modelled as:
     219Consequently, the \code{@ConceptLink} attribute on CMD elements and components referencing the data category can be modelled as:
    220220
    221221\begin{example3}
     
    225225%\subsection{RELcat - Ontological relations}
    226226% \commentx{for now we could probably skip all of relcat (although it is the future of semantic mapping ;) - we spare something for the next paper.}
    227 
     227\begin{comment}
    228228Relations between data categories are not stored directly in the \xne{ISOcat} DCR, but rather in the dedicated Relation Registry \xne{RELcat} as RDF triples \cite{WINDHOUWER12.954} with dedicated predicates based on an extensible taxonomy of relation types. In the final paper, we will provide more details on the role of this important building block in the endeavour.
    229229
    230 \begin{comment}
     230
    231231A sample relation from the \xne{CMDI} relation set expressing a number of equivalences between \xne{ISOcat} data categories and \xne{dublincore} terms:
    232232
     
    259259It seems natural to use the PID of a Language Resource ( \code{<lr1>} ) as the resource identifier for the subject in the RDF representation. While this seems semantically sound, not every resource has to have a PID. (This is especially the case for ``virtual'' resources like collections, that are solely defined by their constituents and don't have any data on their own.) As a fall-back the PID of the MD record ( \code{<lr1.cmd>}  from \code{cmd:MdSelfLink} element) could be used as the resource identifier.
    260260If identifiers are present for both resource and metadata, \end{comment}
     261The PID of a Language Resource ( \code{<lr1>} ) is used as the IRI for the described resource in the RDF representation.
    261262The relationship between the resource and the metadata record can be expressed as an annotation using the \xne{OpenAnnotation} vocabulary\furl{http://openannotation.org/spec/core/core.html\#Motivations}.
    262263(Note, that one MD record can describe multiple resources. This can be also easily accommodated in OpenAnnotation.)
     
    265266\_:anno1  & a & oa:Annotation ; \\
    266267 & oa:hasTarget  & <lr1a>, <lr1b> ; \\
    267  & oa:hasBody  & <lr1.cmd> ; \\
     268 & oa:hasBody  & \_:topComponent1 ; \\
    268269 & oa:motivatedBy  & oa:describing . \\
    269270\end{example3}
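
As a rough illustration of how such an annotation could be generated for one metadata record describing several resources, the following Python sketch emits the Turtle statements shown in the example. It is not part of the changeset: the function name `annotation_turtle` is hypothetical, the `oa:` prefix is assumed to be declared elsewhere, and the identifiers are the placeholders used in the example.

```python
def annotation_turtle(annotation, targets, body):
    """Serialise one oa:Annotation as Turtle (hypothetical helper).

    annotation -- node for the annotation, e.g. a blank node label
    targets    -- one or more described resources (already in Turtle syntax)
    body       -- node representing the describing metadata
    """
    if not targets:
        raise ValueError("an annotation needs at least one target")
    lines = [
        annotation + " a oa:Annotation ;",
        "    oa:hasTarget " + ", ".join(targets) + " ;",
        "    oa:hasBody " + body + " ;",
        "    oa:motivatedBy oa:describing .",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(annotation_turtle("_:anno1", ["<lr1a>", "<lr1b>"], "_:topComponent1"))
```

Listing all targets in a single `oa:hasTarget` statement mirrors the case noted above, where one MD record describes multiple resources.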
     
    280281\end{example3}
    281282
    282 \subsubsection{Hierarchy ( Resource Proxy – IsPartOf)}
    283 In CMD, the \code{cmd:ResourceProxyList} structure is used to express both the collection hierarchy and point to resource(s) described by the CMD record. This can be modelled as \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}
     283\subsubsection{Collection hierarchy}  % ( Resource Proxy – IsPartOf)}
     284
     285In CMD, there are dedicated generic elements -- the \code{cmd:ResourceProxyList} structure -- used to express both the collection hierarchy and to point to resource(s) described by the CMD record. The collection hierarchy can be modelled as \xne{OAI-ORE Aggregation}\furl{http://www.openarchives.org/ore/1.0/primer\#Foundations}
    284286:
    285287
     
    292294
    293295\begin{comment}
    294 This is rather complicated: skip this?:
    295296Additionally the flat header field \code{cmd:MdCollectionDisplayName} has been introduced to indicate by simple means the collection, of which given resource is part.
    296297This information can be used to generate a separate one-level grouping of the resources, in which the value from the \code{cmd:MdCollectionDisplayName} element would be used as the label of an otherwise undefined \code{ore:ResourceMap}.
     
    305306\end{comment}
    306307       
    307 \subsubsection{Components – nested structures}
     308\subsubsection{Components -- nested structures}
    308309For expressing the tree structure of the CMD records, i.e. the containment relation between the components a dedicated property \code{cmd:contains} is used:
    309310
     
    314315\end{example3}
    315316
    316 Additionally, we have to hook the top component to its containing metadata record.
     317\noindent
     318We use \code{cmdm:describesResource} if the \code{@res} attribute, i.e., one or more references to a resource (via a proxy), is used on a component:
    317319
    318320\begin{example3}
    319321\_:coll1 & a & cmd:collection. \\
    320 \_:coll1 & cmdm:describesResource & <lr1.cmd> . \\
     322\_:coll1 & cmdm:describesResource & <lr1> . \\
    321323\end{example3}
    322324
     
    363365
    364366\begin{enumerate}
    365 \item identify appropriate controlled vocabulares for individual metadata fields or data categories (manual task)
     367\item identify appropriate controlled vocabularies for individual metadata fields or data categories (manual task)
    366368\item extract \emph{distinct data category, value pairs} from the metadata records
    367369\item actual \textbf{lookup} of the individual literal values in given reference data (as indicated by the data category) to retrieve candidate entities, concepts
     
    422424The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL-stylesheets, that are currently being tested on a sample dataset. Once ready they will be integrated into the CMDI core infrastructure, e.g., the CR.
    423425%And in the near future, a test on the instances in the complete CLARIN joint metadata domain will be performed.
    424 
    425 Once the linked data is available it has to be stored and published in a RDF triple store, which we will tackle in the final paper.
     426Once the linked data is available it has to be stored and published in an RDF triple store. We will elaborate on this aspect further in the final paper.
    426427%The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). \cite{Haslhofer2011europeana}
    427428