Changeset 4825
- Timestamp:
- 03/22/14 18:03:05 (11 years ago)
- Location:
- CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud
- Files:
-
- 3 edited
Legend:
- Unmodified
- Added
- Removed
-
CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud/CMDcloud.bib
r4756 r4825 116 116 author = {Daan Broeder and Marc Kemps-Snijders and others}, 117 117 title = {A Data Category Registry- and Component-based Metadata Framework}, 118 booktitle = {LREC },118 booktitle = {LREC 2010}, 119 119 year = {2010}, 120 editor = {Nicoletta Calzolari and Khalid Choukri and others},121 120 address = {Valletta}, 122 121 month = {May}, … … 268 267 title = {The {META-SHARE} Metadata Schema for the Description of Language 269 268 Resources}, 270 booktitle = {LREC },269 booktitle = {LREC 2012}, 271 270 year = {2012}, 272 editor = {Nicoletta Calzolari and Khalid Choukri and others},273 271 address = {Istanbul}, 274 272 month = {May}, -
CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud/CMDcloud.tex
r4822 r4825 42 42 \abstract{ 43 43 The CLARIN Component Metadata Infrastructure (CMDI) established means for flexible resource descriptions for the domain of language resources with sound provisions for semantic interoperability weaved deeply into the meta model and the infrastructure. Based on this solid grounding, the infrastructure accommodates a growing collection of metadata records. 44 In this paper, we give a short overview of the current status in the CMD data domain on the schema and instance level and harness the installed mechanisms for semantic interoperability to explore the similarity relations between individual profiles/schemas. We propose a method to use the semantic links shared among the profiles to generate/compile a similarity graph. This information is further rendered in an interactive graph viewer: the SMC Browser. The resulting interactive graph offers an intuitive view on the complex interrelations of the discussed dataset revealing clusters of more similar profiles. This information is useful both for met data modellers, for metadata curation tasks as well as for general audience seeking for a 'big picture' of the complex CMD data domain. \\ \newline44 In this paper, we give a short overview of the current status in the CMD data domain on the schema and instance level and harness the installed mechanisms for semantic interoperability to explore the similarity relations between individual profiles/schemas. We propose a method to use the semantic links shared among the profiles to generate/compile a similarity graph. This information is further rendered in an interactive graph viewer: the SMC Browser. The resulting interactive graph offers an intuitive view on the complex interrelations of the discussed dataset revealing clusters of more similar profiles. This information is useful both for metadata modellers, for metadata curation tasks as well as for general audience seeking for a 'big picture' of the complex CMD data domain. \\ \newline 45 45 \Keywords{semantic mapping, metadata, research infrastructure} 46 46 } … … 74 74 \begin{tabular}{ l | r | r | r | r | r} 75 75 \hline 76 & 2011-01 & 2012-06 & 2013-01 & 2013-06 & 2014-0 1\\76 & 2011-01 & 2012-06 & 2013-01 & 2013-06 & 2014-03 \\ 77 77 \hline 78 Profiles & 40 & 53 & 87 & 124 & 15 8\\78 Profiles & 40 & 53 & 87 & 124 & 153\\ 79 79 Components & 164 & 298 & 542 & 828 & 1110 \\ 80 80 %Expanded Components & 1055 & 1536 & 2904 & 5757 \\ … … 95 95 % 96 96 Our task of determining similarity between schemas can be formulated as the schema/ontology matching problem. % -- trying to find correspondences between two schemas. 97 There is a plethora of work on methods and technology in th e field of \emph{schema and ontology matching}as witnessed by a sizable number of publications providing overviews, surveys and classifications of existing work %\cite{Kalfoglou2003,Shvaiko2008,Noy2005_ontologyalignment,Noy2004_semanticintegration,Shvaiko2005_classification}98 (\cite{ Kalfoglou2003,Noy2005_ontologyalignment,shvaiko2012ontology,amrouch2012survey} and more).97 There is a plethora of work on methods and technology in this field as witnessed by a sizable number of publications providing overviews, surveys and classifications of existing work %\cite{Kalfoglou2003,Shvaiko2008,Noy2005_ontologyalignment,Noy2004_semanticintegration,Shvaiko2005_classification} 98 (\cite{Noy2005_ontologyalignment,shvaiko2012ontology,amrouch2012survey} and more). 99 99 %(\cite{shvaiko2012ontology} even somewhat self-critically asks if after years of research``the field of ontology matching [is] still making progress?'') 100 100 … … 137 137 Naturally the core of CMDI consists of components. These components group metadata elements and possibly other components. The reusable components are managed by the Component Registry (CR). To describe a resource types a metadata modeller combines existing and, when needed, new components from the CR into a metadata profile. 138 138 %A profile is a component which basically defines the root of the metadata records that instantiate the profile. 139 Due to the flexibility of this model the metadata structures can be very specific to an organization, project or resource type. Although structures can thus vary considerably they are still within the domain of metadata for linguistic resources and thus share many key semantics. To deal with the variety general CMDI tools, e.g., the Virtual Language Observatory\footnote{\url{http://www.clarin.eu/vlo/}} which is a facetted browser/search for CMD records, operate on a shared semantics layer. To establish these shared semantics CMD components, elements and values can be linked to so-called data categories (DC) defined in separate concept registries. The major concept registries currently in use by CMDI are the Dublin Core metadata elements and terms \cite{DCMI:2005} and the ISOcat Data Category Registry (DCR) \cite{Windhouwer+2012}. While the Dublin Core set of elements and terms is closed the ISOcat DCR is an open registry, which means that any metadata modeller can register the concepts it needs. Due to both the use of several concept registries and the open nature of some of these, multiple equivalent concepts can be created. CMDI uses the RELcat Relation Registry (RR) to create near sameness groups of these concepts. 139 Due to the flexibility of this model the metadata structures can be very specific to an organization, project or resource type. Although structures can thus vary considerably they are still within the domain of metadata for linguistic resources and thus share many key semantics. To deal with the variety general CMDI tools, e.g., the Virtual Language Observatory\footnote{\url{http://www.clarin.eu/vlo/}} which is a facetted browser/search for CMD records, operate on a shared semantics layer. To establish these shared semantics CMD components, elements and values can be linked to so-called data categories (DC) defined in separate concept registries. The major concept registries currently in use by CMDI are the Dublin Core metadata elements and terms %\cite{DCMI:2005} 140 and the ISOcat Data Category Registry (DCR) \cite{Windhouwer+2012}. While the Dublin Core set of elements and terms is closed the ISOcat DCR is an open registry, which means that any metadata modeller can register the concepts it needs. Due to both the use of several concept registries and the open nature of some of these, multiple equivalent concepts can be created. CMDI uses the RELcat Relation Registry (RR) to create near sameness groups of these concepts. 140 141 141 142 % … … 145 146 146 147 \subsection{CMD Profiles } 147 In the CR 153\footnote{All numbers are as of 2014-03 if not stated otherwise} public\footnote{Users of the CR create components and profiles in their private workspace, and they can make them public when the components or profiles are ready for production.} Profiles and 859 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time.148 149 Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements, e.g., the maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora has117 components and 337 elements.148 In the CR 153\footnote{All numbers are as of 2014-03 if not stated otherwise} public\footnote{Users of the CR create components and profiles in their private workspace, and they can make them public when the components or profiles are ready for production.} Profiles and 859 Components are defined. Table \ref{table:dev} shows the CR population over time. 149 150 Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements, e.g., the maximum schema from the META-SHARE \cite{Gavrilidou2012meta} for corpora with 117 components and 337 elements. 150 151 %(when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}). 151 152 … … 155 156 The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}} 156 157 collects records from 57 providers on a daily basis. The complete dataset amounts to around 600,000 records. 157 20 of the providers offer CMDI records, the other 37 provide OLAC/DC records\label{info:olac-records}, that are being converted into the corresponding CMD profile after harvesting, amounting to round 44.000 records. %Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.158 20 of the providers offer CMDI records, the other 37 provide OLAC/DC records\label{info:olac-records}, that are being converted into the corresponding CMD profile after harvesting, amounting to round 44.000 records. Next to these original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all more than 130.000 OLAC or DCMI-terms records are being collected. 158 159 On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles. 159 160 … … 220 221 %\end{table} 221 222 222 We can also observe a large disparity on the amount of records between individual providers and profiles. Almost 250,000 records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from the \textit{The Language Archive}). On the other hand there are 25 profiles that have less than 10 instances. This can be owing both to the state of the respective project (resources andrecords still being prepared) and the modelled granularity level (collection vs. individual resource). There is ongoing work to make the various granularity levels more explicit.223 We can also observe a large disparity on the amount of records between individual providers and profiles. Almost 250,000 records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 160,000 by MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from the \textit{The Language Archive}). On the other hand there are 25 profiles that have less than 10 instances. This can be owing both to the state of the respective project (records still being prepared) and the modelled granularity level (collection vs. individual resource). There is ongoing work to make the various granularity levels more explicit. 223 224 224 225 \section{CMD cloud} … … 280 281 The SMC Browser and CMD cloud were developed primarily for assisting the task of metadata modelling. A modeller can get a quick overview of the existing profiles, their structure and their interrelations, allowing her to choose the most suitable one for describing the resources at hand. 281 282 282 When enriched with statistical information about instance data it can also serve as an alternative advanced interface for exploring the joint CLARIN metadata domain. It will offer the much needed 'big picture' for this huge heterogeneous collection of resources, an intuitively comprehensible visualization of its complex interrelations. This makes the tool also applicable for the metadata curation task, allowing to easily recognize structures and values that are being reused often ('hot spots') in contrast to outliers ('weak links'). With appropriate linking established the user canget from the structural overview (graph) directly to the corresponding records.283 When enriched with statistical information about instance data it can also serve as an alternative advanced interface for exploring the joint CLARIN metadata domain. It will offer the much needed 'big picture' for this huge heterogeneous collection of resources, an intuitively comprehensible visualization of its complex interrelations. This makes the tool also applicable for the metadata curation task, allowing to easily recognize structures and values that are being reused often in contrast to outliers. With appropriate linking established the user can alos get from the structural overview (graph) directly to the corresponding records. 283 284 284 285 \begin{figure*}
Note: See TracChangeset
for help on using the changeset viewer.