Changeset 4821 for CMDI-Interoperability

Timestamp:

03/22/14 11:01:37

Author:

Menzo Windhouwer

Message:

M CMDcloud.pdf
M CMDcloud.tex

added keywords
added VLO URL in footnote
updated stats
added footnote on public vs private
other (minor) changes

Location:

CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud

Files:

2 edited

CMDcloud.pdf (modified) (previous)
CMDcloud.tex (modified) (5 diffs)


CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud/CMDcloud.tex

-                      r4820
+                      r4821
 \abstract{
-The CLARIN Component Metadata Infrastructure (CMDI) established means for flexible resource descriptions for the domain of language resources with sound provisions for semantic interoperability weaved deeply into the meta model and the modules of the infrastructure. Based on this solid grounding, the infrastructure accommodates a growing collection of metadata records.
-In this paper, we give a short overview of the current status in the CMD data domain on the schema and instance level and harness the installed mechanisms for semantic interoperability to explore the similarity relations between individual profiles/schemas. We propose a method to use the semantic links shared among the profiles to generate/compile a similarity graph. This information is further rendered in an interactive graph viewer -- the SMC Browser. The resulting interactive graph offers an intuitive view on the complex interrelations of the discussed dataset revealing clusters of more similar profiles. This information is useful both for metdata modeller, for metadata curation task as well as for general audience seeking for a 'big picture' of the complex CMD data domain.
+The CLARIN Component Metadata Infrastructure (CMDI) established means for flexible resource descriptions for the domain of language resources with sound provisions for semantic interoperability weaved deeply into the meta model and the infrastructure. Based on this solid grounding, the infrastructure accommodates a growing collection of metadata records.
+In this paper, we give a short overview of the current status in the CMD data domain on the schema and instance level and harness the installed mechanisms for semantic interoperability to explore the similarity relations between individual profiles/schemas. We propose a method to use the semantic links shared among the profiles to generate/compile a similarity graph. This information is further rendered in an interactive graph viewer -- the SMC Browser. The resulting interactive graph offers an intuitive view on the complex interrelations of the discussed dataset revealing clusters of more similar profiles. This information is useful both for metdata modellers, for metadata curation tasks as well as for general audience seeking for a 'big picture' of the complex CMD data domain. \\ \newline
+\Keywords{semantic mapping, metadata, research infrastructure}
+}
-%%
+%
 %\begin{keywords}
 %semantic mapping, metadata, research infrastructure
-%%metamodel, research infrastructure
+%metamodel, research infrastructure
 %\end{keywords}
+%
 …
 \section{The Component Metadata Infrastructure}
+%
-Naturally the core of CMDI consists of components. These components group metadata elements and possibly other components. The reusable components are managed by the Component Registry (CR). To describe a resource types a metadata modeller combines components from the CR into a metadata profile.
+Naturally the core of CMDI consists of components. These components group metadata elements and possibly other components. The reusable components are managed by the Component Registry (CR). To describe a resource types a metadata modeller combines existing and, when needed, new components from the CR into a metadata profile.
 %A profile is a component which basically defines the root of the metadata records that instantiate the profile.
-Due to the flexibility of this model the metadata structures can be very  specific to an organization, project or resource type. Although structures can thus vary considerably they are still within the domain of metadata for linguistic resources and thus share many key semantics. To deal with the variety general CMDI tools, e.g., the Virtual Language Observatory which is a facetted browser/search for CMD records, operate on a shared semantics layer. To establish these shared semantics CMD components, elements and values can be linked to so-called data categories (DC) defined in separate concept registries. The major concept registries currently in use by CMDI are the Dublin Core metadata elements and terms \cite{DCMI:2005} and the ISOcat Data Category Registry (DCR) \cite{Windhouwer+2012}. While the Dublin Core set of elements and terms is closed the ISOcat DCR is an open registry, which means that any metadata modeller can register the concepts it needs. Due to both the use of several concept registries and the open nature of some of these, multiple equivalent concepts can be created. CMDI uses the RELcat Relation Registry (RR) to create near sameness groups of these concepts.
+Due to the flexibility of this model the metadata structures can be very  specific to an organization, project or resource type. Although structures can thus vary considerably they are still within the domain of metadata for linguistic resources and thus share many key semantics. To deal with the variety general CMDI tools, e.g., the Virtual Language Observatory\footnote{\url{http://www.clarin.eu/vlo/}} which is a facetted browser/search for CMD records, operate on a shared semantics layer. To establish these shared semantics CMD components, elements and values can be linked to so-called data categories (DC) defined in separate concept registries. The major concept registries currently in use by CMDI are the Dublin Core metadata elements and terms \cite{DCMI:2005} and the ISOcat Data Category Registry (DCR) \cite{Windhouwer+2012}. While the Dublin Core set of elements and terms is closed the ISOcat DCR is an open registry, which means that any metadata modeller can register the concepts it needs. Due to both the use of several concept registries and the open nature of some of these, multiple equivalent concepts can be created. CMDI uses the RELcat Relation Registry (RR) to create near sameness groups of these concepts.
+%
 …
 \subsection{CMD Profiles }
-In the CR 133\footnote{All numbers are as of 2013-09 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time.
+In the CR 153\footnote{All numbers are as of 2014-03 if not stated otherwise} public\footnote{Users of the CR create components and profiles in their private workspace, and they can make them public when the components or profiles are ready for production.} Profiles and 859 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time.
 Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements, e.g., the maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora has 117 components and 337 elements.
 …
 The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
-collects records from 69 providers on a daily basis. The complete dataset amounts to around half a million records.
-of the providers offer CMDI records, the other 53 provide OLAC/DC records\label{info:olac-records}, that are being converted into the corresponding CMD profile after harvesting, amounting to round 139.000 records. %Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
+collects records from 57 providers on a daily basis. The complete dataset amounts to around 600,000 records.
+of the providers offer CMDI records, the other 37 provide OLAC/DC records\label{info:olac-records}, that are being converted into the corresponding CMD profile after harvesting, amounting to round 44.000 records. %Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
 On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles.
 …
 %\end{table}
-We can also observe a large disparity on the amount of records between individual providers and profiles. Almost half of all records is provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from the \textit{The Language Archive}). On the other hand there are 25 profiles that have less than 10 instances. This can be owing both to the state of the respective project (resources and records still being prepared) and the modelled granularity level (collection vs. individual resource). There is ongoing work to make the various granularity levels more explicit.
+We can also observe a large disparity on the amount of records between individual providers and profiles. Almost 250,000 records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from the \textit{The Language Archive}). On the other hand there are 25 profiles that have less than 10 instances. This can be owing both to the state of the respective project (resources and records still being prepared) and the modelled granularity level (collection vs. individual resource). There is ongoing work to make the various granularity levels more explicit.
 \section{CMD cloud}
