Changeset 4825 for CMDI-Interoperability


Timestamp: 03/22/14 18:03:05
Author: xnrn@gmx.net
Message:

minor changes on instance data numbers.
squeezed to 4 pages.

Location: CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud
Files: 3 edited

  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud/CMDcloud.bib

    r4756 r4825  
    116 116    author = {Daan Broeder and Marc Kemps-Snijders and others},
    117 117    title = {A Data Category Registry- and Component-based Metadata Framework},
    118        booktitle = {LREC},
        118    booktitle = {LREC 2010},
    119 119    year = {2010},
    120        editor = {Nicoletta Calzolari and Khalid Choukri and others},
    121 120    address = {Valletta},
    122 121    month = {May},
     
    268 267    title = {The {META-SHARE} Metadata Schema for the Description of Language
    269 268          Resources},
    270        booktitle = {LREC},
        269    booktitle = {LREC 2012},
    271 270    year = {2012},
    272        editor = {Nicoletta Calzolari and Khalid Choukri and others},
    273 271    address = {Istanbul},
    274 272    month = {May},
  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud/CMDcloud.tex

    r4822 r4825  
     42  42  \abstract{
     43  43  The CLARIN Component Metadata Infrastructure (CMDI) established means for flexible resource descriptions for the domain of language resources with sound provisions for semantic interoperability weaved deeply into the meta model and the infrastructure. Based on this solid grounding, the infrastructure accommodates a growing collection of metadata records.
     44      In this paper, we give a short overview of the current status in the CMD data domain on the schema and instance level and harness the installed mechanisms for semantic interoperability to explore the similarity relations between individual profiles/schemas. We propose a method to use the semantic links shared among the profiles to generate/compile a similarity graph. This information is further rendered in an interactive graph viewer: the SMC Browser. The resulting interactive graph offers an intuitive view on the complex interrelations of the discussed dataset revealing clusters of more similar profiles. This information is useful both for metdata modellers, for metadata curation tasks as well as for general audience seeking for a 'big picture' of the complex CMD data domain. \\ \newline
         44  In this paper, we give a short overview of the current status in the CMD data domain on the schema and instance level and harness the installed mechanisms for semantic interoperability to explore the similarity relations between individual profiles/schemas. We propose a method to use the semantic links shared among the profiles to generate/compile a similarity graph. This information is further rendered in an interactive graph viewer: the SMC Browser. The resulting interactive graph offers an intuitive view on the complex interrelations of the discussed dataset revealing clusters of more similar profiles. This information is useful both for metadata modellers, for metadata curation tasks as well as for general audience seeking for a 'big picture' of the complex CMD data domain. \\ \newline
     45  45  \Keywords{semantic mapping, metadata, research infrastructure}
     46  46  }
     
     74  74    \begin{tabular}{ l | r | r | r | r | r}
     75  75      \hline
     76           & 2011-01 & 2012-06 & 2013-01 & 2013-06  & 2014-01 \\
         76       & 2011-01 & 2012-06 & 2013-01 & 2013-06  & 2014-03 \\
     77  77      \hline
     78      Profiles & 40 & 53 & 87 & 124 &  158\\
         78  Profiles & 40 & 53 & 87 & 124 &  153\\
     79  79  Components & 164 & 298 & 542 & 828 & 1110 \\
     80  80  %Expanded Components & 1055 & 1536 & 2904 & 5757 \\
     
     95  95  %
     96  96  Our task of determining similarity between schemas can be formulated as the schema/ontology matching problem. % -- trying to find correspondences between two schemas.
     97      There is a plethora of work on methods and technology in the field of \emph{schema and ontology matching} as witnessed by a sizable number of publications providing overviews, surveys and classifications of existing work %\cite{Kalfoglou2003,Shvaiko2008,Noy2005_ontologyalignment,Noy2004_semanticintegration,Shvaiko2005_classification}
     98      (\cite{Kalfoglou2003,Noy2005_ontologyalignment,shvaiko2012ontology,amrouch2012survey} and more).
         97  There is a plethora of work on methods and technology in this field as witnessed by a sizable number of publications providing overviews, surveys and classifications of existing work %\cite{Kalfoglou2003,Shvaiko2008,Noy2005_ontologyalignment,Noy2004_semanticintegration,Shvaiko2005_classification}
         98  (\cite{Noy2005_ontologyalignment,shvaiko2012ontology,amrouch2012survey} and more).
     99  99  %(\cite{shvaiko2012ontology} even somewhat self-critically asks if after years of research``the field of ontology matching [is] still making progress?'')
    100 100
     
    137 137  Naturally the core of CMDI consists of components. These components group metadata elements and possibly other components. The reusable components are managed by the Component Registry (CR). To describe a resource types a metadata modeller combines existing and, when needed, new components from the CR into a metadata profile.
    138 138  %A profile is a component which basically defines the root of the metadata records that instantiate the profile.
    139      Due to the flexibility of this model the metadata structures can be very  specific to an organization, project or resource type. Although structures can thus vary considerably they are still within the domain of metadata for linguistic resources and thus share many key semantics. To deal with the variety general CMDI tools, e.g., the Virtual Language Observatory\footnote{\url{http://www.clarin.eu/vlo/}} which is a facetted browser/search for CMD records, operate on a shared semantics layer. To establish these shared semantics CMD components, elements and values can be linked to so-called data categories (DC) defined in separate concept registries. The major concept registries currently in use by CMDI are the Dublin Core metadata elements and terms \cite{DCMI:2005} and the ISOcat Data Category Registry (DCR) \cite{Windhouwer+2012}. While the Dublin Core set of elements and terms is closed the ISOcat DCR is an open registry, which means that any metadata modeller can register the concepts it needs. Due to both the use of several concept registries and the open nature of some of these, multiple equivalent concepts can be created. CMDI uses the RELcat Relation Registry (RR) to create near sameness groups of these concepts.
        139  Due to the flexibility of this model the metadata structures can be very  specific to an organization, project or resource type. Although structures can thus vary considerably they are still within the domain of metadata for linguistic resources and thus share many key semantics. To deal with the variety general CMDI tools, e.g., the Virtual Language Observatory\footnote{\url{http://www.clarin.eu/vlo/}} which is a facetted browser/search for CMD records, operate on a shared semantics layer. To establish these shared semantics CMD components, elements and values can be linked to so-called data categories (DC) defined in separate concept registries. The major concept registries currently in use by CMDI are the Dublin Core metadata elements and terms %\cite{DCMI:2005}
        140  and the ISOcat Data Category Registry (DCR) \cite{Windhouwer+2012}. While the Dublin Core set of elements and terms is closed the ISOcat DCR is an open registry, which means that any metadata modeller can register the concepts it needs. Due to both the use of several concept registries and the open nature of some of these, multiple equivalent concepts can be created. CMDI uses the RELcat Relation Registry (RR) to create near sameness groups of these concepts.
    140 141
    141 142  %
     
    145 146
    146 147  \subsection{CMD Profiles }
    147      In the CR 153\footnote{All numbers are as of 2014-03 if not stated otherwise} public\footnote{Users of the CR create components and profiles in their private workspace, and they can make them public when the components or profiles are ready for production.} Profiles and 859 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time.
    148
    149      Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements, e.g., the maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora has 117 components and 337 elements.
        148  In the CR 153\footnote{All numbers are as of 2014-03 if not stated otherwise} public\footnote{Users of the CR create components and profiles in their private workspace, and they can make them public when the components or profiles are ready for production.} Profiles and 859 Components are defined. Table \ref{table:dev} shows the CR population over time.
        149
        150  Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements, e.g., the maximum schema from the META-SHARE \cite{Gavrilidou2012meta} for corpora with 117 components and 337 elements.
    150 151  %(when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).
    151 152
     
    155 156  The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
    156 157  collects records from 57 providers on a daily basis. The complete dataset amounts to around 600,000 records.
    157      20 of the providers offer CMDI records, the other 37 provide OLAC/DC records\label{info:olac-records}, that are being converted into the corresponding CMD profile after harvesting, amounting to round 44.000 records. %Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
        158  20 of the providers offer CMDI records, the other 37 provide OLAC/DC records\label{info:olac-records}, that are being converted into the corresponding CMD profile after harvesting, amounting to round 44.000 records. Next to these original OLAC records, there are a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all more than 130.000 OLAC or DCMI-terms records are being collected.
    158 159  On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles.
    159 160
     
    220 221  %\end{table}
    221 222
    222      We can also observe a large disparity on the amount of records between individual providers and profiles. Almost 250,000 records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from the \textit{The Language Archive}). On the other hand there are 25 profiles that have less than 10 instances. This can be owing both to the state of the respective project (resources and records still being prepared) and the modelled granularity level (collection vs. individual resource). There is ongoing work to make the various granularity levels more explicit.
        223  We can also observe a large disparity on the amount of records between individual providers and profiles. Almost 250,000 records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 160,000 by MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from the \textit{The Language Archive}). On the other hand there are 25 profiles that have less than 10 instances. This can be owing both to the state of the respective project (records still being prepared) and the modelled granularity level (collection vs. individual resource). There is ongoing work to make the various granularity levels more explicit.
    223 224
    224 225  \section{CMD cloud}
     
    280 281  The SMC Browser and CMD cloud were developed primarily for assisting the task of metadata modelling. A modeller can get a quick overview of the existing profiles, their structure and their interrelations, allowing her to choose the most suitable one for describing the resources at hand.
    281 282
    282      When enriched with statistical information about instance data it can also serve as an alternative advanced interface for exploring the joint CLARIN metadata domain. It will offer the much needed 'big picture' for this huge heterogeneous collection of resources, an intuitively comprehensible visualization of its complex interrelations. This makes the tool also applicable for the metadata curation task, allowing to easily recognize structures and values that are being reused often ('hot spots') in contrast to outliers ('weak links'). With appropriate linking established the user can get from the structural overview (graph) directly to the corresponding records.
        283  When enriched with statistical information about instance data it can also serve as an alternative advanced interface for exploring the joint CLARIN metadata domain. It will offer the much needed 'big picture' for this huge heterogeneous collection of resources, an intuitively comprehensible visualization of its complex interrelations. This makes the tool also applicable for the metadata curation task, allowing to easily recognize structures and values that are being reused often in contrast to outliers. With appropriate linking established the user can also get from the structural overview (graph) directly to the corresponding records.
    283 284
    284 285  \begin{figure*}