Changeset 4757 for CMDI-Interoperability


Timestamp:
03/18/14 20:20:45 (10 years ago)
Author:
xnrn@gmx.net
Message:

updated layout (to lrec2014 template)
reactivated some comments for the full version

Location:
CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud
Files:
2 added
1 edited

  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud/CMDcloud.tex

    r4754 r4757  
    1 %\documentclass{article}
    2 \documentclass{llncs}
    3 \usepackage{llncsdoc}
     1\documentclass[10pt, a4paper]{article}
     2\usepackage{lrec2014}
     3
    44\usepackage{color}
    55\usepackage{graphicx}
    66\usepackage{amsmath}
     7%\usepackage{framed}
     8\usepackage{url}
     9
     10%\documentclass{article}
     11%\documentclass{llncs}
     12%\usepackage{llncsdoc}
     13%\usepackage{color}
     14%\usepackage{graphicx}
     15%\usepackage{amsmath}
    716
    817%\newcommand{\comment}[1]{}
     
    1019
    1120%%% PAGE DIMENSIONS
    12 \usepackage{geometry} % to change the page dimensions
    13 \geometry{a4paper} % or letterpaper (US) or a5paper or....
    14 \geometry{margin=2.5cm} % for example, change the margins to 2 inches all round
     21%\usepackage{geometry} % to change the page dimensions
     22%\geometry{a4paper} % or letterpaper (US) or a5paper or....
     23%\geometry{margin=2.5cm} % for example, change the margins to 2 inches all round
    1524%\topmargin=-0.6in
    16 \textheight=700pt
     25%\textheight=700pt
    1726% \geometry{landscape} % set up the page for landscape
    1827%   read geometry.pdf for detailed page layout information
     
    2029
    2130%
    22 \begin{document}
    2331
    2432\title{The CMD Cloud}
    2533
    26 \author{Matej Durco\inst{1} \and Menzo Windhouwer\inst{2}}
    27 
    28 \institute{\email{matej.durco@assoc.oeaw.ac.at}\newline
    29 Institute for Corpus Linguistics and Text Technology (ICLTT), Vienna, Austria
    30 \and
    31 \email{menzo.windhouwer@dans.knaw.nl}\newline
    32 The Language Archive - DANS, The Hague, The Netherlands}
    33 
    34 \maketitle
    35 %
    36 %\begin{abstract}
    37 %The CLARIN Component Metadata Infrastructure (CMDI) established means for flexible resouce descriptions for the domain of language resources with sound provisions for semantic interoperability weaved deeply into the meta model and the modules of the infrastructure. Based on this solid grounding, the infrastructure accomodates a growing collection of metadata records.
    38 %In this paper, we give a short overview of the current status in the CMD data domain and harness the installed mechanisms for semantic interoperability to explore the relations/ similarity between individual profiles/schemas.
    39 %
    40 %\end{abstract}
     34\name{Matej \v{D}ur\v{c}o, Menzo Windhouwer}
     35
     36
     37\address{ Institute for Corpus Linguistics and Text Technology (ICLTT), The Language Archive - DANS \\
     38               Vienna, Austria, The Hague, The Netherlands \\
     39               matej.durco@oeaw.ac.at, menzo.windhouwer@dans.knaw.nl\\}
     40
     41
     42\abstract{
     43The CLARIN Component Metadata Infrastructure (CMDI) established means for flexible resource descriptions for the domain of language resources, with sound provisions for semantic interoperability woven deeply into the meta model and the modules of the infrastructure. Based on this solid grounding, the infrastructure accommodates a growing collection of metadata records.
     44In this paper, we give a short overview of the current status of the CMD data domain and harness the installed mechanisms for semantic interoperability to explore the relations and similarity between individual profiles/schemas.
     45}
     46
    4147%%
    4248%\begin{keywords}
     
    4551%\end{keywords}
    4652%
     53
     54\begin{document}
     55
     56\maketitleabstract
     57
    4758\section{Introduction}
    4859%
     
    5566As the main contribution -- grounded in the semantic mapping mechanisms of CMDI -- we propose a mechanism to compute and explore the relation/similarity among the profiles defined in CMD, delivering a bigger overall picture of the domain.
    5667
    57 %
    58 \section{Previous work}\label{lit}
    59 %
    60 Our task of determining similarity between schemas is a variant of the schema/ontology matching problem. % -- trying to find correspondences between two schemas.
    61 There is a plethora of work on methods and technology in the field of \emph{schema and ontology matching} as witnessed by a sizable number of publications providing overviews, surveys and classifications of existing work %\cite{Kalfoglou2003,Shvaiko2008,Noy2005_ontologyalignment,Noy2004_semanticintegration,Shvaiko2005_classification}
    62 (\cite{Kalfoglou2003,Noy2005_ontologyalignment,shvaiko2012ontology,amrouch2012survey} and more).
    63 %(\cite{shvaiko2012ontology} even somewhat self-critically asks if after years of research``the field of ontology matching [is] still making progress?'')
    64 
    65 %However there is a fundamental difference between the common approaches and the work presented here, in that the semantic %layer of the CMDI makes the shared semantics explicit, rendering complex matching algorithms unnecessary.
    66 
    67 %\comment{OR (or some combination of the two)}
    68 
    69 %due to the fact that we can harness
    70 %Although the semantic interoperability layer built into the core of the CMD Infrastructure, integrating the task of identifying semantic
    71 Although the semantic layer of the CMD Infrastructure, which integrates the task of identifying semantic correspondences directly into the process of schema creation, makes to a high degree obsolete the need for complex a posteriori schema matching/mapping techniques, still, for the discussed task of schema similarity some of the techniques are relevant.
    72 In particular, we would like to point out the work by Ehrig \cite{EhrigSure2004,Ehrig2006} who defines \emph{ontology mapping} as a function on individual ontology entities based on a \emph{similarity} function, that for a pair of entities from two ontologies computes a ratio indicating their semantic proximity. This ratio is further used to derive the \emph{ontology similarity}, operationalized as a weighted aggregation function \cite{ehrig2004qom}, combining individual similarity measures.
    73 
    74 
    75 %Or put in terms of the schema matching methodology, the system relies on explicitly set concept equivalences as base for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.
    76 
    77 %\comment{Menzo: I suggest to trim the following section as in the previous paragraph we say these methods are mostly obsolete for CMDI and then we zoom in on them. In the final paper we can maybe make more clear how they are still relevant.}
    78 %Still, we would like to point out the work by Ehrig on \emph{ontology alignment} \cite{EhrigSure2004,Ehrig2006}.
    79 %%defining \var{ontology mapping} as a function applied on individual ontology entities that ``for each concept (node) in ontology A [tries to] find a corresponding concept (node), which has the same or similar semantics, in ontology B and vice verse''.
    80 %Ehrig introduces \emph{ontology mapping} as a function on individual ontology entities based on a \emph{similarity} function, that for a pair of entities from two ontologies computes a ratio indicating their semantic proximity.
    81 %This \emph{similarity} function %over single entities
    82 %is used to derive the notion of \emph{ontology similarity}, operationalized as a weighted aggregation function \cite{ehrig2004qom}, combining individual similarity measures.  computed for pairs of single entities again into one value (from the \emph{[0,1]} range) expressing the similarity ratio of the two ontologies being compared.
    83 %%Thus, \emph{ontology similarity} is a much weaker assertion, than \emph{ontology alignment}. In fact, the computed similarity is interpreted to assert ontology alignment: the aggregated similarity above a defined threshold indicates an alignment.
    84 %Based on this abstraction a large number of different comparison features, as summarized in \cite{Shvaiko2005_classification,Algergawy2010,shvaiko2012ontology}, can be integrated into one coherent model.
    85 
    86 %%\begin{defcap}[!ht]
    87 %%\caption{\emph{map} function for single entities and underlying \emph{similarity} function }
    88 %%\begin{align*}
    89 %\begin{equation}\begin{split}
    90 %& map \ : O_{i1}  \rightarrow O_{i2} \\
    91 %& map( e_{i_{1}j_{1}}) = e_{i_{2}j_{2}}\text{, if } sim(e_{i_{1}j_{1}},e_{i_{2}j_{2}}) \ \textgreater \ t  \text{ with } t \text{ being the threshold} \\
    92 %& sim \ : E \times E \times O \times O \rightarrow [0,1]
    93 %\end{split}\end{equation}
    94 %%\end{align*} \end{defcap}
    95 
    96 One inspiration for this work was also the well-known LOD cloud\url{\footnote{http://lod-cloud.net/}} \cite{Cyganiak2010}.
    97 
    98 %
    99 \section{The Component Metadata Infrastructure}
    100 %
    101 Naturally the core of CMDI consists of components. These components group metadata elements and possibly other components. The reusable components are managed by the Component Registry (CR). To describe a resource types a metadata modeller combines components from the CR into a metadata profile.
    102 %A profile is a component which basically defines the root of the metadata records that instantiate the profile.
    103 Due to the flexibility of this model the metadata structures can be very  specific to an organization, project or resource type. Although structures can thus vary considerably they are still within the domain of metadata for linguistic resources and thus share many key semantics. To deal with the variety general CMDI tools, e.g., the Virtual Language Observatory which is a facetted browser/search for CMD records, operate on a shared semantics layer. To establish these shared semantics CMD components, elements and values can be linked to so-called data categories (DC) defined in separate concept registries. The major concept registries currently in use by CMDI are the Dublin Core metadata elements and terms \cite{DCMI:2005} and the ISOcat Data Category Registry (DCR) \cite{Windhouwer+2012}. While the Dublin Core set of elements and terms is closed the ISOcat DCR is an open registry, which means that any metadata modeller can register the concepts it needs. Due to both the use of several concept registries and the open nature of some of these, multiple equivalent concepts can be created. CMDI uses the RELcat Relation Registry (RR) to create near sameness groups of these concepts.
    104 
    105 %
    106 \section{Current status of the joint CMD Domain}
    107 %
    108 %In the following section, we give an overview of the current status in the CMD domain, both on the schema level, i.e. with regard to the defined profiles and data categories used, as well as on the instance level, the actual CMD records.
    109 
    110 \subsubsection{CMD Profiles }
    111 In the CR 133\footnote{All numbers are as of 2013-09 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time.
    112 
    113 Next to the `native' CMD profiles a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles proof the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles differ also very much in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements, e.g., the maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora has 117 components and 337 elements.
    114 %(when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).
    115 
    116 
    117 \begin{table}
     68\begin{table*}
     69%\begin{table}[t]
    11870\caption{The development of defined profiles and DCs over time.}
    11971\label{table:dev}
     
    13587  \end{tabular}
    13688\end{center}
    137 \end{table}
     89%\end{table}
     90\end{table*}
     91
     92%
     93\section{Previous work}\label{lit}
     94%
     95Our task of determining similarity between schemas is a variant of the schema/ontology matching problem. % -- trying to find correspondences between two schemas.
     96There is a plethora of work on methods and technology in the field of \emph{schema and ontology matching} as witnessed by a sizable number of publications providing overviews, surveys and classifications of existing work %\cite{Kalfoglou2003,Shvaiko2008,Noy2005_ontologyalignment,Noy2004_semanticintegration,Shvaiko2005_classification}
     97(\cite{Kalfoglou2003,Noy2005_ontologyalignment,shvaiko2012ontology,amrouch2012survey} and more).
     98%(\cite{shvaiko2012ontology} even somewhat self-critically asks if after years of research``the field of ontology matching [is] still making progress?'')
     99
     100%However there is a fundamental difference between the common approaches and the work presented here, in that the semantic %layer of the CMDI makes the shared semantics explicit, rendering complex matching algorithms unnecessary.
     101
     102%\comment{OR (or some combination of the two)}
     103
     104%due to the fact that we can harness
     105%Although the semantic interoperability layer built into the core of the CMD Infrastructure, integrating the task of identifying semantic
     106Although the semantic layer of the CMD Infrastructure, which integrates the task of identifying semantic correspondences directly into the process of schema creation, makes complex a posteriori schema matching/mapping techniques largely obsolete, some of these techniques remain relevant for the task of schema similarity discussed here.
     107In particular, we would like to point out the work by Ehrig \cite{EhrigSure2004,Ehrig2006}, who defines \emph{ontology mapping} as a function on individual ontology entities based on a \emph{similarity} function that, for a pair of entities from two ontologies, computes a ratio indicating their semantic proximity. This ratio is further used to derive the \emph{ontology similarity}, operationalized as a weighted aggregation function \cite{ehrig2004qom} combining individual similarity measures.
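As a minimal sketch of such a weighted aggregation, one possible form is a normalised weighted sum. The symbols $sim_k$, $w_k$ and $n$ are introduced here for illustration only and are not taken from the cited work: $sim_k$ stands for the $k$-th individual similarity measure and $w_k$ for its weight.
\begin{equation}
sim(O_1,O_2) = \frac{\sum_{k=1}^{n} w_k \, sim_k(O_1,O_2)}{\sum_{k=1}^{n} w_k},
\qquad sim_k(O_1,O_2) \in [0,1],\quad w_k \geq 0,\quad \sum_{k=1}^{n} w_k > 0
\end{equation}
Under these assumptions the aggregated value again lies in the range $[0,1]$, so it can be compared against a threshold in the same way as the individual measures.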
     108
     109
     110%Or put in terms of the schema matching methodology, the system relies on explicitly set concept equivalences as base for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.
     111
     112%\comment{Menzo: I suggest to trim the following section as in the previous paragraph we say these methods are mostly obsolete for CMDI and then we zoom in on them. In the final paper we can maybe make more clear how they are still relevant.}
     113%Still, we would like to point out the work by Ehrig on \emph{ontology alignment} \cite{EhrigSure2004,Ehrig2006}.
     114%%defining \var{ontology mapping} as a function applied on individual ontology entities that ``for each concept (node) in ontology A [tries to] find a corresponding concept (node), which has the same or similar semantics, in ontology B and vice verse''.
     115%Ehrig introduces \emph{ontology mapping} as a function on individual ontology entities based on a \emph{similarity} function, that for a pair of entities from two ontologies computes a ratio indicating their semantic proximity.
     116%This \emph{similarity} function %over single entities
     117%is used to derive the notion of \emph{ontology similarity}, operationalized as a weighted aggregation function \cite{ehrig2004qom}, combining individual similarity measures.  computed for pairs of single entities again into one value (from the \emph{[0,1]} range) expressing the similarity ratio of the two ontologies being compared.
     118%%Thus, \emph{ontology similarity} is a much weaker assertion, than \emph{ontology alignment}. In fact, the computed similarity is interpreted to assert ontology alignment: the aggregated similarity above a defined threshold indicates an alignment.
     119%Based on this abstraction a large number of different comparison features, as summarized in \cite{Shvaiko2005_classification,Algergawy2010,shvaiko2012ontology}, can be integrated into one coherent model.
     120
     121%%\begin{defcap}[!ht]
     122%%\caption{\emph{map} function for single entities and underlying \emph{similarity} function }
     123%%\begin{align*}
     124%\begin{equation}\begin{split}
     125%& map \ : O_{i1}  \rightarrow O_{i2} \\
     126%& map( e_{i_{1}j_{1}}) = e_{i_{2}j_{2}}\text{, if } sim(e_{i_{1}j_{1}},e_{i_{2}j_{2}}) \ \textgreater \ t  \text{ with } t \text{ being the threshold} \\
     127%& sim \ : E \times E \times O \times O \rightarrow [0,1]
     128%\end{split}\end{equation}
     129%%\end{align*} \end{defcap}
     130
     131One inspiration for this work was also the well-known LOD cloud\footnote{\url{http://lod-cloud.net/}} \cite{Cyganiak2010}.
     132
     133%
     134\section{The Component Metadata Infrastructure}
     135%
     136Naturally, the core of CMDI consists of components. These components group metadata elements and possibly other components. The reusable components are managed by the Component Registry (CR). To describe a resource type, a metadata modeller combines components from the CR into a metadata profile.
     137%A profile is a component which basically defines the root of the metadata records that instantiate the profile.
     138Due to the flexibility of this model, the metadata structures can be very specific to an organization, project or resource type. Although structures can thus vary considerably, they are still within the domain of metadata for linguistic resources and thus share many key semantics. To deal with this variety, general CMDI tools, e.g., the Virtual Language Observatory, a faceted browser/search for CMD records, operate on a shared semantics layer. To establish these shared semantics, CMD components, elements and values can be linked to so-called data categories (DC) defined in separate concept registries. The major concept registries currently in use by CMDI are the Dublin Core metadata elements and terms \cite{DCMI:2005} and the ISOcat Data Category Registry (DCR) \cite{Windhouwer+2012}. While the Dublin Core set of elements and terms is closed, the ISOcat DCR is an open registry, which means that any metadata modeller can register the concepts they need. Due to both the use of several concept registries and the open nature of some of these, multiple equivalent concepts can be created. CMDI uses the RELcat Relation Registry (RR) to create near-sameness groups of these concepts.
     139
     140%
     141\section{Current status of the joint CMD Domain}
     142%
     143%In the following section, we give an overview of the current status in the CMD domain, both on the schema level, i.e. with regard to the defined profiles and data categories used, as well as on the instance level, the actual CMD records.
     144
     145\subsubsection{CMD Profiles }
     146In the CR 133\footnote{All numbers are as of 2013-09 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time.
     147
     148Next to the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles demonstrate the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ considerably in their structure -- next to flat profiles with just one level of components or elements with 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles), there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements; e.g., the maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora has 117 components and 337 elements.
     149%(when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).
    138150
    139151
     
    217229
    218230\subsection{SMC browser}
    219 The technological base for the presented method is the \textit{SMC browser}\footnote{\url{http://clarin.aac.ac.at/smc-browser}}, a web application being developed by the CMDI team, that lets the metadata modeller explore the information about profiles, components, elements and the usage of DCs as an interactive graph. This allows for example to examine the reuse of components or DCs in different profiles. % The graph is accompanied by statistical information about individual `nodes', e.g., counting how many elements a profiles contains, or in how many profiles a DC is used.
     231The technological base for the presented method is the \textit{SMC browser}\footnote{\url{http://clarin.aac.ac.at/smc-browser}}, a web application being developed by the CMDI team that lets the metadata modeller explore the information about profiles, components, elements and the usage of DCs as an interactive graph. This makes it possible, for example, to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g., how many elements a profile contains, or in how many profiles a DC is used.
    220232
    221233\subsection{Basic approach}
     
    255267%\end{description}
    256268
    257 %Initial results showed that there is a very high degree of interconnectedness in the generated graph, %(There are 7.835 links between the 157 profiles. A fully connected graph would have 12.403 edges.)
    258 %resulting from the fact, that every profile shares at least one or two data categories with many other profiles. However, besides making the resulting graph illegible and difficult to lay out, such a result is also not a good answer to the question of similarity. Therefore a threshold has to be introduced to only consider links above a certain similarity ratio.
     269Initial results showed a very high degree of interconnectedness in the generated graph (there are 7,835 links between the 157 profiles; a fully connected graph would have 12,403 edges), resulting from the fact that every profile shares at least one or two data categories with many other profiles. However, besides making the resulting graph illegible and difficult to lay out, such a result is also not a good answer to the question of similarity. Therefore, a threshold has to be introduced so that only links above a certain similarity ratio are considered.
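The thresholding step can be sketched as follows. The notation is introduced here for illustration only: $P$ denotes the set of profiles, $sim$ the profile similarity ratio, $t$ the threshold and $E_t$ the set of links retained in the graph.
\begin{equation}
E_t = \bigl\{\, \{p_i,p_j\} \;\bigm|\; p_i,p_j \in P,\ p_i \neq p_j,\ sim(p_i,p_j) > t \,\bigr\}
\end{equation}
With the figures quoted above, the unfiltered graph contains $7835 / 12403 \approx 0.63$ of all possible links; raising $t$ removes the weakest links first and thins the graph out accordingly.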
    259270
    260271\subsection{Result}
     
    273284\label{ext}
    274285There are a number of further factors that could be taken into account when computing the profile similarity.
    275 The obvious next step is to incorporate the component reuse. Applying the relations between data categories as defined in Relation Registry would further grow the similarity ratios. Also, we need to cater for profiles with little data categories coverage. This can be resolved by including the data-category-coverage-ratio into the calculation. One could compute the graph on the basis of the instance data, with the node size representing the number of instances and edge width the amount of data in the shared data categories.
     286The obvious next step is to consider component reuse. Applying the relations between data categories as defined in the Relation Registry would further increase the similarity ratios. We also need to cater for profiles with low data category coverage, which can be addressed by including the data category coverage ratio in the calculation. One could also compute the graph on the basis of the instance data, with the node size representing the number of instances and the edge width the amount of data in the shared data categories.
    276287
    277288We also plan to adopt more sophisticated approaches to compute entity and aggregated schema similarity as proposed in \cite{ehrig2004qom,Ehrig2006}.
     
    282293This work, offering a bird's eye view of the CMD data domain, can serve as an alternative starting point for exploring the dataset and provides valuable input for metadata modellers and the metadata curation task.
    283294
    284 \bibliographystyle{splncs}
     295\bibliographystyle{lrec2014}
    285296\bibliography{CMDcloud}
    286297