Changeset 3860 for CMDI-Interoperability


Timestamp:
10/23/13 14:25:33
Author:
vronk
Message:

stripped data section

File:
1 edited

  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex

    r3847 r3860  
    8888%
    8989
    90 The natural building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components. A coherent component, e.g., a component to capture information on a contact person or one for project information, can be reused and is stored for that in the Component Registry (CR). A metadata modeller selects components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile can be used as the schema for a metadata record. CLARIN centers offer these CMD records to the joint metadata domain. There are some generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observer. These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used registries are the Dublin Core metadata elements and terms and the ISOcat Data Category Registry. These concept links allow profiles, while being diverse in structure, to share semantics. Generic tools can use these semantics to overcome differences in terminology and also in structure.
     90The natural building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components. A coherent component, e.g., a component to capture information on a contact person or one for project information, can be reused and is stored for that in the Component Registry (CR). A metadata modeller selects components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile can be used as the schema for a metadata record. CLARIN centres offer these CMD records to the joint metadata domain. There are some generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observatory. These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used registries are the Dublin Core metadata %elements and
     91terms and the ISOcat Data Category Registry. These concept links allow profiles, while being diverse in structure, to share semantics. Generic tools can use these semantics to overcome differences in terminology and also in structure.
    9192
    9293\commentx{Menzo: would be nice to include one of the UML diagrams.}
     
    9596\subsection{Current status of the joint CMD Domain}
    9697%
    97 To provide a frame of reference for the proportions of the undertaking, this section gives a few numbers about the data in the CMD domain, both on the schema level, i.e. with regard to the defined profiles and data categories used, as well as on the instance level, the actual CMD records.
     98To provide a frame of reference for the proportions of the undertaking, this section gives a few numbers about the data in the CMD domain.
     99%, both on the schema level, i.e. with regard to the defined profiles and data categories used, as well as on the instance level, the actual CMD records.
    98100
    99101\subsubsection{CMD Profiles }
    100 In the CR 133\footnote{All numbers are as of 2013-09 if not stated otherwise} public Profiles and 696 Components are defined.
    101 Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 117 components and 337 elements.
     102In the CR 133 public profiles and 772 components are defined.
     103Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema.
     104%The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel.
     105The individual profiles differ also very much in their structure -- next to simple flat profiles
     106%with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles)
     107there are complex ones with up to 10 levels %(\textit{ExperimentProfile}, profiles for describing Web Services)
     108and a few hundred elements.
     109%The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 117 components and 337 elements.
    102110
    103111\subsubsection{Instance Data}
    104112
    105113The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
    106 collects records from 69 providers on daily basis. The complete dataset amounts to over half a million records.
    107 16 of the providers offer CMDI records, the other 53 provide OLAC/DC records\label{info:olac-records}, that are being converted into the corresponding CMD profile after harvesting. Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
    108 On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different profiles.) So we encounter both situations: one profile being used by many providers and one provider using many profiles.
     114regularly collects records from the providers -- currently 69 over 550.000 records.
     11516 of the providers offer CMDI records, the other 53 provide around 140.000 OLAC/DC records, that are converted into the corresponding CMD profile.
     116%Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
     117On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different ones), so that all in all there is instance data for more than 60 profiles.
     118%So we encounter both situations: one profile being used by many providers and one provider using many profiles.
    109119
    110120%
     
    403413\section{Implementation}
    404414
    405 The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL-stylesheets, that are currently being tested on a sample dataset. The mappings described for the CMD specification (see section \ref{sec:CMDM}) have to be integrated into the CMDI core infrastructure, e.g., the CR. And in the near future, a test on the instances in the complete CLARIN joint metadata domain will be performed.
    406 
    407 Once the linked data is available it has to be stored and published in a RDF triple store. The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). \cite{Haslhofer2011europeana}
     415The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL-stylesheets, that are currently being tested on a sample dataset. The mappings described for the CMD specification (see section \ref{sec:CMDM}) have to be integrated into the CMDI core infrastructure, e.g., the CR.
     416%And in the near future, a test on the instances in the complete CLARIN joint metadata domain will be performed.
     417
     418Once the linked data is available it has to be stored and published in a RDF triple store, which we will tackle in the final paper.
     419%The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). \cite{Haslhofer2011europeana}
    408420
    409421% Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger, than ``just'' the original dataset.