Changeset 4820


Timestamp:
03/22/14 07:49:48 (11 years ago)
Author:
xnrn@gmx.net
Message:

update for final version
added abstract and some paragraphs

Location:
CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud
Files:
1 added
1 edited

  • CMDI-Interoperability/CMD2RDF/trunk/docs/papers/2014-LREC-CMDcloud/CMDcloud.tex

    r4757 r4820  
    4141
    4242\abstract{
    43 The CLARIN Component Metadata Infrastructure (CMDI) established means for flexible resouce descriptions for the domain of language resources with sound provisions for semantic interoperability weaved deeply into the meta model and the modules of the infrastructure. Based on this solid grounding, the infrastructure accomodates a growing collection of metadata records.
    44 In this paper, we give a short overview of the current status in the CMD data domain and harness the installed mechanisms for semantic interoperability to explore the relations/ similarity between individual profiles/schemas.
     43The CLARIN Component Metadata Infrastructure (CMDI) established means for flexible resource descriptions for the domain of language resources with sound provisions for semantic interoperability weaved deeply into the meta model and the modules of the infrastructure. Based on this solid grounding, the infrastructure accommodates a growing collection of metadata records.
     44In this paper, we give a short overview of the current status in the CMD data domain on the schema and instance level and harness the installed mechanisms for semantic interoperability to explore the similarity relations between individual profiles/schemas. We propose a method to use the semantic links shared among the profiles to generate/compile a similarity graph. This information is further rendered in an interactive graph viewer – the SMC Browser. The resulting interactive graph offers an intuitive view of the complex interrelations of the discussed dataset, revealing clusters of more similar profiles. This information is useful for metadata modellers, for metadata curation tasks, and for a general audience seeking a 'big picture' of the complex CMD data domain.
    4545}
    4646
     
    6161The Component Metadata Infrastructure (CMDI, \cite{Broeder+2010}) conceived within the CLARIN project is now 5 years old and thriving. By allowing a flexible yet harmonized definition of metadata schemas, it has offered a robust common framework for consolidating the scattered landscape of resource descriptions in the LRT community, without trying to impose/prescribe one schema to cover all the resources (which seems futile in the light of the variety of resources to be described).
    6262
    63 A look into the data domain shows that the fundamental concept of a flexible metamodel with integrated semantic layer is being taken up by the community. Metadata modellers are increasingly making use not only of the infrastructure, but are also reusing the modelling work done so far.
     63A look into the data domain shows that the basic concept of a flexible metamodel with integrated semantic layer is being taken up by the community. Metadata modellers are increasingly making use not only of the infrastructure, but are also reusing the modelling work done so far.
    6464
    6565In this paper, we first -- for methodical foundation -- briefly summarize previous work, then give a short overview of the current status of the infrastructure both on the schema and instance level.
     
    7171\label{table:dev}
    7272\begin{center}
    73   \begin{tabular}{ l | r | r | r | r }
     73  \begin{tabular}{ l | r | r | r | r | r}
    7474    \hline
    75 date     & 2011-01 & 2012-06 & 2013-01 & 2013-06 \\
     75     & 2011-01 & 2012-06 & 2013-01 & 2013-06  & 2014-01 \\
    7676    \hline
    77 Profiles & 40 & 53 & 87 & 124 \\
    78 Components & 164 & 298 & 542 & 828 \\
     77Profiles & 40 & 53 & 87 & 124 &  158\\
     78Components & 164 & 298 & 542 & 828 & 1110 \\
    7979%Expanded Components & 1055 & 1536 & 2904 & 5757 \\
    80 Elements & 511 & 893 & 1505 & 2399 \\
     80Elements & 511 & 893 & 1505 & 2399 & 3101 \\
    8181% Expanded Elements & 1971 & 3030 & 5754 & 13232 \\
    82 Distinct data categories & 203 & 266 & 436 & 499 \\
    83 Data categories in the Metadata profile & 277 & 712 & 774 & 791 \\
    84 Ratio of elements without DCs & 24,7\% & 17,6\% & 21,5\% & 26,5\% \\
     82Distinct data categories & 203 & 266 & 436 & 499 & 737 \\
     83% Data categories in the Metadata profile & 277 & 712 & 774 & 791 \\
     84Ratio of elements without DCs & 24,7\% & 17,6\% & 21,5\% & 26,5\% & 24,2\%\\
    8585% Components with DCs & 28 & 67 & 115 & 140 \\
    8686    \hline
     
    9393\section{Previous work}\label{lit}
    9494%
    95 Our task of determining similarity between schemas is a variant of the schema/ontology matching problem. % -- trying to find correspondences between two schemas.
     95Our task of determining similarity between schemas can be formulated as the schema/ontology matching problem. % -- trying to find correspondences between two schemas.
    9696There is a plethora of work on methods and technology in the field of \emph{schema and ontology matching} as witnessed by a sizable number of publications providing overviews, surveys and classifications of existing work %\cite{Kalfoglou2003,Shvaiko2008,Noy2005_ontologyalignment,Noy2004_semanticintegration,Shvaiko2005_classification}
    9797(\cite{Kalfoglou2003,Noy2005_ontologyalignment,shvaiko2012ontology,amrouch2012survey} and more).
     
    129129%%\end{align*} \end{defcap}
    130130
    131 One inspiration for this work was also the well-known LOD cloud\url{\footnote{http://lod-cloud.net/}} \cite{Cyganiak2010}.
     131One inspiration for this work was also the well-known LOD cloud\footnote{\url{http://lod-cloud.net/}} \cite{Cyganiak2010}.
    132132
    133133%
     
    141141\section{Current status of the joint CMD Domain}
    142142%
    143 %In the following section, we give an overview of the current status in the CMD domain, both on the schema level, i.e. with regard to the defined profiles and data categories used, as well as on the instance level, the actual CMD records.
    144 
    145 \subsubsection{CMD Profiles }
     143In the following section, we give an overview of the current status in the CMD domain, both on the schema level, i.e. with regard to the defined profiles and data categories used, as well as on the instance level, the actual CMD records.
     144
     145\subsection{CMD Profiles }
    146146In the CR 133\footnote{All numbers are as of 2013-09 if not stated otherwise} public Profiles and 696 Components are defined. Table \ref{table:dev} shows the development of the CR and DCR population over time.
    147147
     
    150150
    151151
    152 \subsubsection{Instance Data}
     152\subsection{Instance Data}
    153153
    154154The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
     
    229229
    230230\subsection{SMC browser}
    231 The technological base for the presented method is the \textit{SMC browser}\footnote{\url{http://clarin.aac.ac.at/smc-browser}}, a web application being developed by the CMDI team, that lets the metadata modeller explore the information about profiles, components, elements and the usage of DCs as an interactive graph. This allows for example to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g., counting how many elements a profiles contains, or in how many profiles a DC is used.
     231The technological base for the presented method is the \textit{SMC browser}\footnote{\url{http://clarin.oeaw.ac.at/smc-browser}}, a web application being developed by the CMDI team, that lets the metadata modeller explore the information about profiles, components, elements and the usage of DCs as an interactive graph. This allows for example to examine the reuse of components or DCs in different profiles. The graph is accompanied by statistical information about individual `nodes', e.g., counting how many elements a profiles contains, or in how many profiles a DC is used.
    232232
    233233\subsection{Basic approach}
    234234The basic idea for constructing the CMD cloud is to
    235 1) collect the size of each profile (as the number of components and elements, or number of distinct data categories used); 2) compute the pairwise similarity ratio between the profiles based on some similarity measure; 3) generate a graph with profiles as nodes and the pairwise similarity relation expressed as edges between them.
    236 The size of the nodes in the graph reflects the size of the profile as computed before.
     2351) collect the size of each profile (as the number of components and elements, or number of distinct data categories used); 2) compute the pairwise similarity ratio between the profiles based on some similarity measure; 3) generate a graph with profiles as nodes and the pairwise similarity relation expressed as weighted edges between them.
     236When rendered, the size of the nodes in the graph reflects the size of the profile as computed before.
    237237The absolute number of matching identities is expressed as edge weight and the similarity ratio
    238238as \emph{link strength} (inversely proportional to link distance), drawing more similar profiles nearer together.
     
    240240
    241241\subsection{Similarity ratio}
    242 In the first level of this experiment, the similarity ratio is based on the most reliable information, the reuse of % components and/or
     242At the core of the discussed method is the concept of similarity between entities and the challenge of how to operationalize it.
     243In the initial step, the similarity ratio is based on the most reliable information, the reuse of % components and/or
    243244data categories, computed as the average of the quotients of matching distinct data categories for each of the two profiles.
    244245
     
    250251
    251252Note though, that there is a number of other features and formulas that can be used to assess the similarity of two schemas (structures) (cf. \ref{ext}). 
     253
    252254
    253255%and several factors that need to be considered even with this presumably simple feature:
     
    267269%\end{description}
    268270
    269 Initial results showed that there is a very high degree of interconnectedness in the generated graph, (There are 7.835 links between the 157 profiles. A fully connected graph would have 12.403 edges.) resulting from the fact, that every profile shares at least one or two data categories with many other profiles. However, besides making the resulting graph illegible and difficult to lay out, such a result is also not a good answer to the question of similarity. Therefore a threshold has to be introduced to only consider links above a certain similarity ratio.
    270 
    271 \subsection{Result}
    272 As SMC browser allows to select different subgraphs and adapt layout options, figure \ref{fig:CMDcloud} depicts just one possible visual output of the analysis. The graph shows nicely the clusters of strongly related profiles in contrast to the greater distances between more loosely connected profiles.
    273 
     271
     272\subsection{Results}
     273The basic result is the graph of profiles with links based on their similarity. There are various ways to render this information.
     274As the SMC Browser allows selecting different subgraphs and adapting layout options, figure \ref{fig:CMDcloud} depicts just one possible visual output of the analysis. This view clearly shows the clusters of strongly related profiles in contrast to the greater distances between more loosely connected profiles. The SMC Browser also features alternative, more detailed views that make it possible to see which components and data categories are shared by which profiles, in effect zooming in on the links between the nodes in the CMD cloud.
     275
     276The generated graph manifests a very high degree of interconnectedness (there are 7.835 links between the 157 profiles; a fully connected graph would have 12.403 edges), resulting from the fact that every profile shares at least one or two data categories with many other profiles. However, besides making the rendered graph illegible and difficult to lay out, such a result is also not a good answer to the question of similarity. Therefore a threshold was introduced to only consider links above a certain similarity ratio.
     277
     278\subsection{Applications}
     279The SMC Browser and CMD cloud were developed primarily for assisting the task of metadata modelling. A modeller can get a quick overview of the existing profiles, their structure and their interrelations, allowing her to choose the most suitable one for describing the resources at hand.
     280
     281When enriched with statistical information about instance data, it can also serve as an alternative advanced interface for exploring the joint CLARIN metadata domain. It will offer the much needed 'big picture' for this huge heterogeneous collection of resources, an intuitively comprehensible visualization of its complex interrelations. This also makes the tool applicable to the metadata curation task, making it easy to recognize structures and values that are reused often ('hot spots') in contrast to outliers ('weak links'). With appropriate linking established, the user can get from the structural overview (graph) directly to the corresponding records.
    274282
    275283\begin{figure*}
    276284\begin{center}
    277 \hspace{-0.1\textwidth}\includegraphics[width=1.1\textwidth]{just_profiles_9}
     285%\hspace{-0.1\textwidth}
     286\includegraphics[width=\textwidth]{just_profiles_9}
    278287\end{center}
    279288\caption{A graph view of the similarity relations between CMD profiles (\textit{threshold=0.6})}
     
    281290\end{figure*}
    282291
    283 \subsection{Possible extensions}
     292\subsection{Planned extensions}
    284293\label{ext}
    285 There are a number of further factors, that could be taken into account, when computing the profiles similarity. 
    286 The obvious next step is to consider the component reuse. Applying the relations between data categories as defined in Relation Registry would further grow the similarity ratios. Also, we need to cater for profiles with little data categories coverage. This can be resolved by including the data-category-coverage-ratio into the calculation. One could compute the graph on the basis of the instance data, with the node size representing the number of instances and edge width the amount of data in the shared data categories.
    287 
    288 We also plan to adopt more sophisticated approaches to compute entity and aggregated schema similarity as proposed in \cite{ehrig2004qom,Ehrig2006}.
     294There are a number of further factors that can be taken into account when computing profile similarity. 
     295The obvious next step is to consider component reuse. Applying the relations between data categories as defined in the Relation Registry would further raise the similarity ratios. Also, we need to cater for profiles with low data category coverage. This can be resolved by including the data-category-coverage-ratio into the calculation.
     296
     297We also plan to adopt more sophisticated approaches to compute entity and aggregated schema similarity as proposed in \cite{ehrig2004qom,Ehrig2006}, like string or structural similarity between 'nodes'.
     298
     299A very important planned addition, opening a whole new field of applications, is to integrate statistical information about instance data into the generation of the graph. In the 'instance' mode, node size would represent the number of instances for a given profile and edge width the amount of data in the shared data categories. On the instance level, the ratio of shared values between fields/elements could also be computed and used as another similarity indicator (though this is computationally very demanding).
     300
    289301
    290302\section{Conclusions} % and Future Work is already in the extensions
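
The similarity ratio and the thresholded graph described in the changed paragraphs can be illustrated with a minimal sketch. It reads "the average of the quotients of matching distinct data categories for each of the two profiles" as (|A∩B|/|A| + |A∩B|/|B|) / 2, which is an interpretation and not necessarily the exact formula in the paper or in the SMC Browser. The profile names, the data category identifiers, the function names and the use of `>=` for the threshold are all hypothetical.

```python
from itertools import combinations


def similarity_ratio(dcs_a, dcs_b):
    """Average of the two quotients |A & B| / |A| and |A & B| / |B|.

    Assumed reading of the paper's description; returns 0.0 when either
    profile uses no data categories at all.
    """
    if not dcs_a or not dcs_b:
        return 0.0
    shared = len(dcs_a & dcs_b)
    return (shared / len(dcs_a) + shared / len(dcs_b)) / 2


def similarity_graph(profiles, threshold=0.6):
    """Return edges (a, b, weight, ratio) for profile pairs at or above threshold.

    weight is the absolute number of shared data categories (edge weight);
    ratio is the similarity ratio (link strength).
    """
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        shared = len(profiles[a] & profiles[b])
        ratio = similarity_ratio(profiles[a], profiles[b])
        if shared and ratio >= threshold:
            edges.append((a, b, shared, ratio))
    return edges


def complete_graph_edges(n):
    """Number of edges in a fully connected undirected graph with n nodes."""
    return n * (n - 1) // 2


# Hypothetical toy data: profile name -> set of distinct data categories used.
profiles = {
    "profileA": {"dc1", "dc2", "dc3", "dc4"},
    "profileB": {"dc1", "dc2", "dc3"},
    "profileC": {"dc4", "dc9"},
}

# Step 1: node size as the number of distinct data categories per profile.
node_sizes = {name: len(dcs) for name, dcs in profiles.items()}

# Steps 2 and 3: pairwise ratios, kept as weighted edges above the threshold.
edges = similarity_graph(profiles, threshold=0.6)
print(node_sizes)
print(edges)
print(complete_graph_edges(158))
```

With this toy data only the pair profileA/profileB survives the 0.6 threshold used in the figure caption; profileA/profileC share one data category but stay below it. The helper `complete_graph_edges` reproduces the upper bound against which the text compares the number of links: for 158 nodes it yields 12403.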