Changeset 4821 for CMDI-Interoperability
- Timestamp:
- 03/22/14 11:01:37 (10 years ago)
- Location:
- CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud
- Files:
-
- 2 edited
CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud/CMDcloud.tex
r4820 r4821 41 41 42 42 \abstract{ 43 The CLARIN Component Metadata Infrastructure (CMDI) established means for flexible resource descriptions for the domain of language resources with sound provisions for semantic interoperability weaved deeply into the meta model and the modules of the infrastructure. Based on this solid grounding, the infrastructure accommodates a growing collection of metadata records. 44 In this paper, we give a short overview of the current status in the CMD data domain on the schema and instance level and harness the installed mechanisms for semantic interoperability to explore the similarity relations between individual profiles/schemas. We propose a method to use the semantic links shared among the profiles to generate/compile a similarity graph. This information is further rendered in an interactive graph viewer -- the SMC Browser. The resulting interactive graph offers an intuitive view on the complex interrelations of the discussed dataset revealing clusters of more similar profiles. This information is useful both for metdata modeller, for metadata curation task as well as for general audience seeking for a 'big picture' of the complex CMD data domain. 43 The CLARIN Component Metadata Infrastructure (CMDI) established means for flexible resource descriptions for the domain of language resources with sound provisions for semantic interoperability weaved deeply into the meta model and the infrastructure. Based on this solid grounding, the infrastructure accommodates a growing collection of metadata records. 44 In this paper, we give a short overview of the current status in the CMD data domain on the schema and instance level and harness the installed mechanisms for semantic interoperability to explore the similarity relations between individual profiles/schemas. We propose a method to use the semantic links shared among the profiles to generate/compile a similarity graph.
This information is further rendered in an interactive graph viewer -- the SMC Browser. The resulting interactive graph offers an intuitive view on the complex interrelations of the discussed dataset revealing clusters of more similar profiles. This information is useful both for metdata modellers, for metadata curation tasks as well as for general audience seeking for a 'big picture' of the complex CMD data domain. \\ \newline 45 \Keywords{semantic mapping, metadata, research infrastructure} 45 46 } 46 47 47 % %48 % 48 49 %\begin{keywords} 49 50 %semantic mapping, metadata, research infrastructure 50 % %metamodel, research infrastructure 51 %metamodel, research infrastructure 51 52 %\end{keywords} 52 53 % … … 134 135 \section{The Component Metadata Infrastructure} 135 136 % 136 Naturally the core of CMDI consists of components. These components group metadata elements and possibly other components. The reusable components are managed by the Component Registry (CR). To describe a resource types a metadata modeller combines components from the CR into a metadata profile. 137 Naturally the core of CMDI consists of components. These components group metadata elements and possibly other components. The reusable components are managed by the Component Registry (CR). To describe a resource types a metadata modeller combines existing and, when needed, new components from the CR into a metadata profile. 137 138 %A profile is a component which basically defines the root of the metadata records that instantiate the profile. 138 Due to the flexibility of this model the metadata structures can be very specific to an organization, project or resource type. Although structures can thus vary considerably they are still within the domain of metadata for linguistic resources and thus share many key semantics. To deal with the variety general CMDI tools, e.g., the Virtual Language Observatory which is a facetted browser/search for CMD records, operate on a shared semantics layer.
To establish these shared semantics CMD components, elements and values can be linked to so-called data categories (DC) defined in separate concept registries. The major concept registries currently in use by CMDI are the Dublin Core metadata elements and terms \cite{DCMI:2005} and the ISOcat Data Category Registry (DCR) \cite{Windhouwer+2012}. While the Dublin Core set of elements and terms is closed the ISOcat DCR is an open registry, which means that any metadata modeller can register the concepts it needs. Due to both the use of several concept registries and the open nature of some of these, multiple equivalent concepts can be created. CMDI uses the RELcat Relation Registry (RR) to create near sameness groups of these concepts. 139 Due to the flexibility of this model the metadata structures can be very specific to an organization, project or resource type. Although structures can thus vary considerably they are still within the domain of metadata for linguistic resources and thus share many key semantics. To deal with the variety general CMDI tools, e.g., the Virtual Language Observatory\footnote{\url{http://www.clarin.eu/vlo/}} which is a facetted browser/search for CMD records, operate on a shared semantics layer. To establish these shared semantics CMD components, elements and values can be linked to so-called data categories (DC) defined in separate concept registries. The major concept registries currently in use by CMDI are the Dublin Core metadata elements and terms \cite{DCMI:2005} and the ISOcat Data Category Registry (DCR) \cite{Windhouwer+2012}. While the Dublin Core set of elements and terms is closed the ISOcat DCR is an open registry, which means that any metadata modeller can register the concepts it needs. Due to both the use of several concept registries and the open nature of some of these, multiple equivalent concepts can be created. CMDI uses the RELcat Relation Registry (RR) to create near sameness groups of these concepts.
139 140 140 141 % … … 144 145 145 146 \subsection{CMD Profiles } 146 In the CR 133\footnote{All numbers are as of 2013-09 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time. 147 In the CR 153\footnote{All numbers are as of 2014-03 if not stated otherwise} public\footnote{Users of the CR create components and profiles in their private workspace, and they can make them public when the components or profiles are ready for production.} Profiles and 859 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time. 147 148 148 149 Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements, e.g., the maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora has 117 components and 337 elements. … … 153 154 154 155 The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}} 155 collects records from 69 providers on a daily basis. The complete dataset amounts to around half a million records. 156 16 of the providers offer CMDI records, the other 53 provide OLAC/DC records\label{info:olac-records}, that are being converted into the corresponding CMD profile after harvesting, amounting to round 139.000 records.
%Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152. 156 collects records from 57 providers on a daily basis. The complete dataset amounts to around 600,000 records. 157 20 of the providers offer CMDI records, the other 37 provide OLAC/DC records\label{info:olac-records}, that are being converted into the corresponding CMD profile after harvesting, amounting to round 44.000 records. %Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152. 157 158 On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles. 158 159 … … 219 220 %\end{table} 220 221 221 We can also observe a large disparity on the amount of records between individual providers and profiles. Almost half of all records is provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from the \textit{The Language Archive}). On the other hand there are 25 profiles that have less than 10 instances. This can be owing both to the state of the respective project (resources and records still being prepared) and the modelled granularity level (collection vs. individual resource). There is ongoing work to make the various granularity levels more explicit. 222 We can also observe a large disparity on the amount of records between individual providers and profiles.
Almost 250,000 records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from the \textit{The Language Archive}). On the other hand there are 25 profiles that have less than 10 instances. This can be owing both to the state of the respective project (resources and records still being prepared) and the modelled granularity level (collection vs. individual resource). There is ongoing work to make the various granularity levels more explicit. 222 223 223 224 \section{CMD cloud}