Changeset 3860 for CMDI-Interoperability
- Timestamp: 10/23/13 14:25:33 (11 years ago)
- File: 1 edited
CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC/CMD2RDF.tex
r3847 r3860
   88    88   %
   89    89
   90       - The natural building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components. A coherent component, e.g., a component to capture information on a contact person or one for project information, can be reused and is stored for that in the Component Registry (CR). A metadata modeller selects components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile can be used as the schema for a metadata record. CLARIN centers offer these CMD records to the joint metadata domain. There are some generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observer. These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used registries are the Dublin Core metadata elements and terms and the ISOcat Data Category Registry. These concept links allow profiles, while being diverse in structure, to share semantics. Generic tools can use these semantics to overcome differences in terminology and also in structure.
         90 + The natural building blocks of CMDI are components. Components are used to group elements and attributes, which can take values, and also other components. A coherent component, e.g., a component to capture information on a contact person or one for project information, can be reused and is stored for that in the Component Registry (CR). A metadata modeller selects components and combines them into a profile targeted at a specific resource type, a collection of resources or a project, tool or service. A profile can be used as the schema for a metadata record. CLARIN centres offer these CMD records to the joint metadata domain. There are some generic tools which operate on all the CMD records in this domain, e.g., the Virtual Language Observatory. These tools have to deal with the variety of CMD profiles. They can do so by operating on a semantic level, as components, elements and values can all be annotated with links to concepts in various registries. Currently used registries are the Dublin Core metadata %elements and
         91 + terms and the ISOcat Data Category Registry. These concept links allow profiles, while being diverse in structure, to share semantics. Generic tools can use these semantics to overcome differences in terminology and also in structure.
   91    92
   92    93   \commentx{Menzo: would be nice to include one of the UML diagrams.}
    …     …
   95    96   \subsection{Current status of the joint CMD Domain}
   96    97   %
   97       - To provide a frame of reference for the proportions of the undertaking, this section gives a few numbers about the data in the CMD domain, both on the schema level, i.e. with regard to the defined profiles and data categories used, as well as on the instance level, the actual CMD records.
         98 + To provide a frame of reference for the proportions of the undertaking, this section gives a few numbers about the data in the CMD domain.
         99 + %, both on the schema level, i.e. with regard to the defined profiles and data categories used, as well as on the instance level, the actual CMD records.
   98   100
   99   101   \subsubsection{CMD Profiles }
  100       - In the CR 133\footnote{All numbers are as of 2013-09 if not stated otherwise} public Profiles and 696 Components are defined.
  101       - Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 117 components and 337 elements.
        102 + In the CR 133 public profiles and 772 components are defined.
        103 + Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema.
        104 + %The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel.
        105 + The individual profiles differ also very much in their structure -- next to simple flat profiles
        106 + %with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles)
        107 + there are complex ones with up to 10 levels %(\textit{ExperimentProfile}, profiles for describing Web Services)
        108 + and a few hundred elements.
        109 + %The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 117 components and 337 elements.
  102   110
  103   111   \subsubsection{Instance Data}
  104   112
  105   113   The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
  106       - collects records from 69 providers on daily basis. The complete dataset amounts to over half a million records.
  107       - 16 of the providers offer CMDI records, the other 53 provide OLAC/DC records\label{info:olac-records}, that are being converted into the corresponding CMD profile after harvesting. Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
  108       - On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different profiles.) So we encounter both situations: one profile being used by many providers and one provider using many profiles.
        114 + regularly collects records from the providers -- currently 69 over 550.000 records.
        115 + 16 of the providers offer CMDI records, the other 53 provide around 140.000 OLAC/DC records, that are converted into the corresponding CMD profile.
        116 + %Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
        117 + On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different ones), so that all in all there is instance data for more than 60 profiles.
        118 + %So we encounter both situations: one profile being used by many providers and one provider using many profiles.
  109   119
  110   120   %
    …     …
  403   413   \section{Implementation}
  404   414
  405       - The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL-stylesheets, that are currently being tested on a sample dataset. The mappings described for the CMD specification (see section \ref{sec:CMDM}) have to be integrated into the CMDI core infrastructure, e.g., the CR. And in the near future, a test on the instances in the complete CLARIN joint metadata domain will be performed.
  406       -
  407       - Once the linked data is available it has to be stored and published in a RDF triple store. The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). \cite{Haslhofer2011europeana}
        415 + The transformation of profiles and instances into RDF/XML is accomplished by a set of XSL-stylesheets, that are currently being tested on a sample dataset. The mappings described for the CMD specification (see section \ref{sec:CMDM}) have to be integrated into the CMDI core infrastructure, e.g., the CR.
        416 + %And in the near future, a test on the instances in the complete CLARIN joint metadata domain will be performed.
        417 +
        418 + Once the linked data is available it has to be stored and published in a RDF triple store, which we will tackle in the final paper.
        419 + %The most promising solution seems to be \xne{Virtuoso}, an integrated feature-rich hybrid data store, able to deal with different types of data (``Universal Data Store''). \cite{Haslhofer2011europeana}
  408   420
  409   421   % Although the distributed nature of the data is one of the defining features of LOD and theoretically one should be able to follow the data by dereferencable URIs, in practice it is mostly necessary to pool into one data store linked datasets from different sources that shall be queried together due to performance reasons. This implies that the data to be kept by the data store will be decisively larger, than ``just'' the original dataset.