source: CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud/CMDcloud.tex @ 4821

Last change on this file since 4821 was 4821, checked in by Menzo Windhouwer, 10 years ago

\documentclass[10pt, a4paper]{article}
\usepackage{lrec2014}

\usepackage{color}
\usepackage{graphicx}
\usepackage{amsmath}
%\usepackage{framed}
\usepackage{url}

%\documentclass{article}
%\documentclass{llncs}
%\usepackage{llncsdoc}
%\usepackage{color}
%\usepackage{graphicx}
%\usepackage{amsmath}

%\newcommand{\comment}[1]{}
\newcommand{\comment}[1]{\textcolor{red}{#1}}

%%% PAGE DIMENSIONS
%\usepackage{geometry} % to change the page dimensions
%\geometry{a4paper} % or letterpaper (US) or a5paper or....
%\geometry{margin=2.5cm} % for example, change the margins to 2 inches all round
%\topmargin=-0.6in
%\textheight=700pt
% \geometry{landscape} % set up the page for landscape
%   read geometry.pdf for detailed page layout information


%

\title{The CMD Cloud}

\name{Matej \v{D}ur\v{c}o, Menzo Windhouwer}


\address{ Institute for Corpus Linguistics and Text Technology (ICLTT), The Language Archive - DANS \\
               Vienna, Austria, The Hague, The Netherlands \\
               matej.durco@oeaw.ac.at, menzo.windhouwer@dans.knaw.nl\\}


\abstract{
The CLARIN Component Metadata Infrastructure (CMDI) has established means for flexible resource descriptions in the domain of language resources, with sound provisions for semantic interoperability woven deeply into the meta model and the infrastructure. Based on this solid grounding, the infrastructure accommodates a growing collection of metadata records.
In this paper, we give a short overview of the current status of the CMD data domain on the schema and instance level and harness the installed mechanisms for semantic interoperability to explore the similarity relations between individual profiles/schemas. We propose a method that uses the semantic links shared among the profiles to compile a similarity graph. This information is then rendered in an interactive graph viewer -- the SMC Browser. The resulting interactive graph offers an intuitive view of the complex interrelations of the discussed dataset, revealing clusters of more similar profiles. This information is useful for metadata modellers, for metadata curation tasks, and for a general audience seeking a `big picture' of the complex CMD data domain. \\ \newline
\Keywords{semantic mapping, metadata, research infrastructure}
}

%
%\begin{keywords}
%semantic mapping, metadata, research infrastructure
%metamodel, research infrastructure
%\end{keywords}
%

\begin{document}

\maketitleabstract

\section{Introduction}
%

The Component Metadata Infrastructure (CMDI, \cite{Broeder+2010}), conceived within the CLARIN project, is now five years old and thriving. By allowing a flexible yet harmonized definition of metadata schemas, it has offered a robust common framework for consolidating the scattered landscape of resource descriptions in the LRT community, without trying to prescribe one schema to cover all resources (which seems futile in light of the variety of resources to be described).

A look into the data domain shows that the basic concept of a flexible metamodel with an integrated semantic layer is being taken up by the community. Metadata modellers are increasingly making use not only of the infrastructure, but also of the modelling work done so far.

In this paper, we first briefly summarize previous work as a methodological foundation, then give a short overview of the current status of the infrastructure on both the schema and the instance level.
As the main contribution -- grounded in the semantic mapping mechanisms of CMDI -- we propose a mechanism to compute and explore the similarity among the profiles defined in CMD, delivering a bigger overall picture of the domain.

\begin{table*}
%\begin{table}[t]
\caption{The development of defined profiles and DCs over time.}
\label{table:dev}
\begin{center}
  \begin{tabular}{ l | r | r | r | r | r}
    \hline
     & 2011-01 & 2012-06 & 2013-01 & 2013-06  & 2014-01 \\
    \hline
Profiles & 40 & 53 & 87 & 124 &  158\\
Components & 164 & 298 & 542 & 828 & 1110 \\
%Expanded Components & 1055 & 1536 & 2904 & 5757 \\
Elements & 511 & 893 & 1505 & 2399 & 3101 \\
% Expanded Elements & 1971 & 3030 & 5754 & 13232 \\
Distinct data categories & 203 & 266 & 436 & 499 & 737 \\
% Data categories in the Metadata profile & 277 & 712 & 774 & 791 \\
Ratio of elements without DCs & 24.7\% & 17.6\% & 21.5\% & 26.5\% & 24.2\%\\
% Components with DCs & 28 & 67 & 115 & 140 \\
    \hline
  \end{tabular}
\end{center}
%\end{table}
\end{table*}

%
\section{Previous work}\label{lit}
%
Our task of determining similarity between schemas can be formulated as a schema/ontology matching problem. % -- trying to find correspondences between two schemas.
There is a plethora of work on methods and technology in the field of \emph{schema and ontology matching}, as witnessed by a sizable number of publications providing overviews, surveys and classifications of existing work %\cite{Kalfoglou2003,Shvaiko2008,Noy2005_ontologyalignment,Noy2004_semanticintegration,Shvaiko2005_classification}
(\cite{Kalfoglou2003,Noy2005_ontologyalignment,shvaiko2012ontology,amrouch2012survey} and more).
%(\cite{shvaiko2012ontology} even somewhat self-critically asks if after years of research``the field of ontology matching [is] still making progress?'')

%However there is a fundamental difference between the common approaches and the work presented here, in that the semantic %layer of the CMDI makes the shared semantics explicit, rendering complex matching algorithms unnecessary.

%\comment{OR (or some combination of the two)}

%due to the fact that we can harness
%Although the semantic interoperability layer built into the core of the CMD Infrastructure, integrating the task of identifying semantic
The semantic layer of the CMD Infrastructure integrates the task of identifying semantic correspondences directly into the process of schema creation and thereby largely obviates the need for complex a posteriori schema matching/mapping techniques. Nevertheless, some of these techniques remain relevant for the task of schema similarity discussed here.
In particular, we would like to point out the work by Ehrig \cite{EhrigSure2004,Ehrig2006}, who defines \emph{ontology mapping} as a function on individual ontology entities based on a \emph{similarity} function that, for a pair of entities from two ontologies, computes a ratio indicating their semantic proximity. This ratio is further used to derive the \emph{ontology similarity}, operationalized as a weighted aggregation function \cite{ehrig2004qom} combining individual similarity measures.
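
As an illustration only (the notation is introduced here and is not taken from the cited work), a weighted aggregation of $n$ individual similarity measures $sim_k$ with weights $w_k \geq 0$ can, in its simplest linear form, be written as
\begin{equation*}
 sim_{agg}(e_1, e_2) := \cfrac{\sum_{k=1}^{n} w_k \cdot sim_k(e_1, e_2)}{\sum_{k=1}^{n} w_k}
\end{equation*}
so that the aggregated value stays within $[0,1]$ whenever every $sim_k$ does (assuming at least one $w_k > 0$). The choice of measures and weights is left open in this sketch.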


%Or put in terms of the schema matching methodology, the system relies on explicitly set concept equivalences as base for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.

%\comment{Menzo: I suggest to trim the following section as in the previous paragraph we say these methods are mostly obsolete for CMDI and then we zoom in on them. In the final paper we can maybe make more clear how they are still relevant.}
%Still, we would like to point out the work by Ehrig on \emph{ontology alignment} \cite{EhrigSure2004,Ehrig2006}.
%%defining \var{ontology mapping} as a function applied on individual ontology entities that ``for each concept (node) in ontology A [tries to] find a corresponding concept (node), which has the same or similar semantics, in ontology B and vice verse''.
%Ehrig introduces \emph{ontology mapping} as a function on individual ontology entities based on a \emph{similarity} function, that for a pair of entities from two ontologies computes a ratio indicating their semantic proximity.
%This \emph{similarity} function %over single entities
%is used to derive the notion of \emph{ontology similarity}, operationalized as a weighted aggregation function \cite{ehrig2004qom}, combining individual similarity measures.  computed for pairs of single entities again into one value (from the \emph{[0,1]} range) expressing the similarity ratio of the two ontologies being compared.
%%Thus, \emph{ontology similarity} is a much weaker assertion, than \emph{ontology alignment}. In fact, the computed similarity is interpreted to assert ontology alignment: the aggregated similarity above a defined threshold indicates an alignment.
%Based on this abstraction a large number of different comparison features, as summarized in \cite{Shvaiko2005_classification,Algergawy2010,shvaiko2012ontology}, can be integrated into one coherent model.

%%\begin{defcap}[!ht]
%%\caption{\emph{map} function for single entities and underlying \emph{similarity} function }
%%\begin{align*}
%\begin{equation}\begin{split}
%& map \ : O_{i1}  \rightarrow O_{i2} \\
%& map( e_{i_{1}j_{1}}) = e_{i_{2}j_{2}}\text{, if } sim(e_{i_{1}j_{1}},e_{i_{2}j_{2}}) \ \textgreater \ t  \text{ with } t \text{ being the threshold} \\
%& sim \ : E \times E \times O \times O \rightarrow [0,1]
%\end{split}\end{equation}
%%\end{align*} \end{defcap}

A further inspiration for this work was the well-known LOD cloud\footnote{\url{http://lod-cloud.net/}} \cite{Cyganiak2010}.

%
\section{The Component Metadata Infrastructure}
%
Naturally, the core of CMDI consists of components. These components group metadata elements and possibly other components. The reusable components are managed by the Component Registry (CR). To describe a resource type, a metadata modeller combines existing and, when needed, new components from the CR into a metadata profile.
%A profile is a component which basically defines the root of the metadata records that instantiate the profile.
Due to the flexibility of this model, the metadata structures can be very specific to an organization, project or resource type. Although the structures can thus vary considerably, they are still within the domain of metadata for linguistic resources and therefore share many key semantics. To deal with this variety, general CMDI tools, e.g., the Virtual Language Observatory\footnote{\url{http://www.clarin.eu/vlo/}}, a faceted browser and search interface for CMD records, operate on a shared semantics layer. To establish these shared semantics, CMD components, elements and values can be linked to so-called data categories (DC) defined in separate concept registries. The major concept registries currently in use by CMDI are the Dublin Core metadata elements and terms \cite{DCMI:2005} and the ISOcat Data Category Registry (DCR) \cite{Windhouwer+2012}. While the Dublin Core set of elements and terms is closed, the ISOcat DCR is an open registry, which means that any metadata modeller can register the concepts they need. Due to both the use of several concept registries and the open nature of some of these, multiple equivalent concepts can be created. CMDI uses the RELcat Relation Registry (RR) to create near-sameness groups of these concepts.

%
\section{Current status of the joint CMD Domain}
%
In the following section, we give an overview of the current status of the CMD domain, both on the schema level, i.e. with regard to the defined profiles and the data categories used, and on the instance level, i.e. the actual CMD records.

\subsection{CMD Profiles }
The CR currently defines 153\footnote{All numbers are as of 2014-03 if not stated otherwise} public\footnote{Users of the CR create components and profiles in their private workspace, and they can make them public when the components or profiles are ready for production.} profiles and 859 components. Table \ref{table:dev} shows the development of the CR and DCR population over time.

Next to the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles demonstrate the flexibility and expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ considerably in their structure: next to flat profiles with just one level of components or elements and 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles), there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements, e.g., the maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora has 117 components and 337 elements.
%(when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).


\subsection{Instance Data}

The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
collects records from 57 providers on a daily basis. The complete dataset amounts to around 600,000 records.
Of these providers, 20 offer CMDI records; the other 37 provide OLAC/DC records\label{info:olac-records}, which are converted into the corresponding CMD profile after harvesting and amount to around 44,000 records. %Next to these 81.226 original OLAC records, there a few providers offering their OLAC or DCMI-terms records already converted into CMDI, thus all in all OLAC, DCMI-terms records amount to 139.152.
On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles.

%\begin{table}
%\caption{Top 20 profiles, with the respective number of records}
%\begin{center}
%  \begin{tabular}{ r | l }
%\# records & profile \\
%   \hline
%155.403 & Song \\
%138.257 & Session \\
%92.996 & OLAC-DcmiTerms \\
%46.156 & DcmiTerms \\
%28.448 & SongScan \\
%21.256 & SourceScan \\
%19.059 & LiteraryCorpusProfile \\
%16519 & Source \\
%13626 & imdi-corpus \\
%10610 & media-session-profile \\
%7961 & SongAudio \\
%7557 & SymbolicMusicNotation \\
%4485 & LCC DataProviderProfile \\
%4485 & SourceProfile \\
%4417 & Text \\
%1982 & Soundbites-recording \\
%1530 & Performer \\
%1475 & ArthurianFiction \\
%939 & LrtInventoryResource \\
%873 & teiHeader \\
%    \hline
%  \end{tabular}
%\end{center}
%\end{table}

%\begin{table}
%\caption{Top 20 collections, with the respective number of records}
%\begin{center}
% \begin{tabular}{ r | l }
%\# records & collection \\
%    \hline
%243.129 & Meertens collection: Liederenbank \\
%46.658 & DK-CLARIN Repository \\
%46.156 & Nederlands Instituut voor Beeld en Geluid Academia collectie \\
%29.266 & childes \\
%24.583 & DoBeS archive \\
%23.185 & Language and Cognition \\
%14.593 & talkbank \\
%14.363 & Acquisition \\
%14.320 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
%12.893 & MPI CGN \\
%10.628 & Bavarian Archive for Speech Signals (BAS) \\
%7.964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures\\
%7.348 & WALS RefDB \\
%5.689 & Lund Corpora \\
%4.640 & Oxford Text Archive \\
%4.492 & Leipzig Corpora Collection \\
%3.539 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
%3.280 & A Digital Archive of Research Papers in Computational Linguistics \\
%3.147 & CLARIN NL \\
%3.081 & MPI für Bildungsforschung \\
%\hline
%  \end{tabular}
%\end{center}
%\end{table}

We can also observe a large disparity in the number of records between individual providers and profiles. Almost 250,000 records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from \textit{The Language Archive}). On the other hand, there are 25 profiles that have fewer than 10 instances. This can be due both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource). There is ongoing work to make the various granularity levels more explicit.

\section{CMD cloud}

As the data set keeps growing both in size and in complexity, there is a rising need for advanced ways to explore it.
In this work, we present a method to analyze and visualize the relations among defined CMD profiles,
with \emph{schema matching} -- in particular, the mapping and similarity function proposed by \cite{EhrigSure2004,Ehrig2006} -- serving as the methodological basis.
%formulated as an application of the \emph{schema matching} task. In particular,  the work on mapping and underlying similarity function as introduced in \cite{EhrigSure2004,Ehrig2006} shall serve as methodical basis.

\subsection{SMC Browser}
The technological base for the presented method is the \textit{SMC Browser}\footnote{\url{http://clarin.oeaw.ac.at/smc-browser}}, a web application being developed by the CMDI team that lets the metadata modeller explore the information about profiles, components, elements and the usage of DCs as an interactive graph. This makes it possible, for example, to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g., counting how many elements a profile contains, or in how many profiles a DC is used.

\subsection{Basic approach}
The basic idea for constructing the CMD cloud is to
1) collect the size of each profile (as the number of components and elements, or the number of distinct data categories used); 2) compute the pairwise similarity ratio between the profiles based on some similarity measure; 3) generate a graph with profiles as nodes and the pairwise similarity relation expressed as weighted edges between them.
When rendered, the size of the nodes in the graph reflects the size of the profile as computed before.
The absolute number of matching identities is expressed as edge weight and the similarity ratio
as \emph{link strength} (inversely proportional to link distance), drawing more similar profiles closer together.
Additionally, a variable threshold governs the level of similarity that is rendered as a link.
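
The construction can be sketched more formally as follows. The notation is introduced here purely for illustration and is an assumption, not a description of the SMC Browser implementation. Let $P$ be the set of profiles, $D_p$ the set of distinct data categories used in profile $p$, $sim$ a similarity measure with values in $[0,1]$, and $t$ the threshold:
\begin{equation*}\begin{split}
 V &:= P, \qquad size(p) := |D_p| \\
 E_t &:= \{\, \{p_i, p_j\} \mid p_i \neq p_j,\ sim(p_i, p_j) \geq t \,\} \\
 weight(p_i, p_j) &:= |D_{p_i} \cap D_{p_j}|
\end{split}\end{equation*}
In this sketch the number of distinct data categories is taken as the size, one of the two options named in step 1, and matches are counted on data categories. Raising $t$ can only remove links, so the graphs for increasing thresholds are nested subgraphs of one another.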

\subsection{Similarity ratio}
At the core of the discussed method is the concept of similarity between entities and the challenge of how to operationalize it.
In the initial step, the similarity ratio is based on the most reliable information, the reuse of % components and/or
data categories, computed as the average of the quotients of matching distinct data categories for each of the two profiles.

\begin{equation}\begin{split}
 sim_{p1} &:= \cfrac{count(distinct(Datcats_{match}))}{count(distinct(Datcats_{p1}))} \\
 sim_{p2} &:= \cfrac{count(distinct(Datcats_{match}))}{count(distinct(Datcats_{p2}))} \\
 sim &:= \frac{(sim_{p1} + sim_{p2})}{2}
\end{split}\end{equation}
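
To illustrate the formula with purely hypothetical numbers, assume a small profile $p1$ using 10 distinct data categories and a large profile $p2$ using 40, with 8 data categories in common:
\begin{equation*}\begin{split}
 sim_{p1} &= \cfrac{8}{10} = 0.8 \\
 sim_{p2} &= \cfrac{8}{40} = 0.2 \\
 sim &= \frac{(0.8 + 0.2)}{2} = 0.5
\end{split}\end{equation*}
The averaging thus tempers the high ratio seen from the side of the smaller profile. With a threshold of 0.6, this pair would not be rendered as a link.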

Note, though, that there are a number of other features and formulas that can be used to assess the similarity of two schemas (structures) (cf. Section~\ref{ext}).


%and several factors that need to be considered even with this presumably simple feature:
%\begin{description}
%\item[How to handle data categories used multiple times in one profile?]
  %Count every data category only once.
%\item[How to incorporate reused components?] Count distinct identities, i.e. all of the shared structures, but every component or element only once, even if reused.
%\item[How to cater for missing data categories?] If data categories cover only a small portion of profiles fields, good matches with other profiles can result, even though the profiles can be quite different. This can be resolved by including the data-category-coverage-ratio into the calculation.

% \comment{Mate: I am afraid this is too bare of any statistical methods}
%\item[What to use as base/denominator for the quotient?] The overall number of used data categories in one of the profiles, the sum of both or average of the individual quotients.

%denominator of the quotient is the sum of data category counts from both profiles:
%\begin{equation}\begin{split}
% sim := \cfrac{count(distinct(Datcats_{match}))}{count(distinct(datcats_{p1})) + count(distinct(datcats_{p2}))}
%\end{split}\end{equation}
%\end{description}


\subsection{Results}
The basic result is the graph of profiles with links based on their similarity. There are various ways to render this information.
As the SMC Browser allows selecting different subgraphs and adapting layout options, Figure \ref{fig:CMDcloud} depicts just one possible visual output of the analysis. This view clearly shows the clusters of strongly related profiles in contrast to the greater distances between more loosely connected profiles. The SMC Browser also features alternative, more detailed views that make it possible to see which components and data categories are shared by which profiles. In a way, these views zoom in on the links between the nodes in the CMD cloud.

Without a threshold, the generated graph exhibits a very high degree of interconnectedness (there are 7,835 links between the 157 profiles, whereas a fully connected graph would have 12,403 edges), resulting from the fact that every profile shares at least one or two data categories with many other profiles. However, besides making the rendered graph illegible and difficult to lay out, such a result is also not a good answer to the question of similarity. Therefore, a threshold was introduced to consider only links above a certain similarity ratio.
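
For orientation, a complete undirected graph on $n$ nodes has $n(n-1)/2$ edges. The edge count quoted above for the fully connected graph is obtained for $n = 158$ (for $n = 157$ the formula gives 12246), and the observed number of links then corresponds to a density of roughly 63\%:
\begin{equation*}\begin{split}
 \cfrac{158 \cdot 157}{2} &= 12403 \\
 \cfrac{7835}{12403} &\approx 0.63
\end{split}\end{equation*}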

\subsection{Applications}
The SMC Browser and the CMD cloud were developed primarily to assist the task of metadata modelling. A modeller can get a quick overview of the existing profiles, their structure and their interrelations, allowing her to choose the most suitable one for describing the resources at hand.

When enriched with statistical information about instance data, the graph can also serve as an alternative, advanced interface for exploring the joint CLARIN metadata domain. It will offer the much needed `big picture' of this huge heterogeneous collection of resources, an intuitively comprehensible visualization of its complex interrelations. This also makes the tool applicable to the metadata curation task, making it easy to recognize structures and values that are reused often (`hot spots') in contrast to outliers (`weak links'). With appropriate linking established, the user can get from the structural overview (graph) directly to the corresponding records.

\begin{figure*}
\begin{center}
%\hspace{-0.1\textwidth}
\includegraphics[width=\textwidth]{just_profiles_9}
\end{center}
\caption{A graph view of the similarity relations between CMD profiles (\textit{threshold=0.6})}
\label{fig:CMDcloud}
\end{figure*}

\subsection{Planned extensions}
\label{ext}
There are a number of further factors that can be taken into account when computing the similarity of profiles.
The obvious next step is to consider component reuse. Applying the relations between data categories as defined in the Relation Registry would further raise the similarity ratios. We also need to cater for profiles with low data category coverage. This can be addressed by including the data category coverage ratio in the calculation.
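
A minimal sketch of such a coverage ratio, in notation assumed here for illustration: with $E_p$ the set of elements of profile $p$ and $E_p^{dc} \subseteq E_p$ those elements that are linked to a data category,
\begin{equation*}
 cov_p := \cfrac{count(E_p^{dc})}{count(E_p)}
\end{equation*}
Taken over all elements, the 2014-01 column of Table \ref{table:dev} (24.2\% of elements without DCs) corresponds to an overall coverage of about 0.758. How $cov_p$ should enter the similarity is left open. One conceivable option, given only as an example and not evaluated, would be to scale the similarity by the lower of the two coverage ratios, $sim' := sim \cdot \min(cov_{p1}, cov_{p2})$.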

We also plan to adopt more sophisticated approaches to computing entity and aggregated schema similarity as proposed in \cite{ehrig2004qom,Ehrig2006}, such as string or structural similarity between `nodes'.

A very important planned addition, opening a whole new field of applications, is to integrate statistical information about instance data into the generation of the graph. In this `instance' mode, node size would represent the number of instances of a given profile and edge width the amount of data in the shared data categories. On the instance level, the ratio of shared values between fields/elements could also be computed and used as another similarity indicator (though this is computationally very demanding).


\section{Conclusions} % and Future Work is already in the extensions
In this paper, we gave a short overview of the current status of the CMD data domain as a basis for the main contribution: an analysis of the semantic similarity between the profiles.
%In our view, this work does not just render a fancy picture, but
This work, offering a bird's-eye view of the CMD data domain, can serve as an alternative starting point for exploring the dataset and provides valuable input for metadata modellers and for the metadata curation task.

\bibliographystyle{lrec2014}
\bibliography{CMDcloud}

\end{document}