Timestamp: 12/01/13 19:04:51 (11 years ago)
Author: vronk
Message: minor orthographic corrections

File: 1 edited
Legend: unmarked lines are unmodified, lines prefixed with '-' were removed, lines prefixed with '+' were added; '[…]' marks omitted unchanged lines.
  • SMC4LRT/chapters/Design_SMCschema.tex

r3776 → r4117

- \chapter{System design -- concept-based mapping on schema level}
+ \chapter{System Design -- Concept-based Mapping on Schema Level}
\label{ch:design}

[…]

We start by drawing an overall view of the system, introducing its individual components and the dependencies among them.
- In the next section, the internal data model is presented and explained. In section \ref{sec:cx} the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{sec:qx} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support this enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
+ In the next section, the internal data model is presented and explained. In section \ref{sec:cx}, the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{sec:qx}, we elaborate on a search functionality that builds upon the aforementioned service in terms of an appropriate query language, a search engine to integrate the search in, and the peculiarities of the user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser}, an advanced interactive user interface for exploring the CMD data domain is proposed.

\section{System Architecture}
[…]
\begin{figure*}
\includegraphics[width=0.8\textwidth]{images/SMC_modules.png}
- \caption{The component view on the SMC - modules and their inter-dependencies}
+ \caption{The component view on the SMC - modules and their interdependencies}
\label{fig:smc_modules}
\end{figure*}
[…]
The component diagram in \ref{fig:smc_modules} depicts the dependencies between the components of the system. The \xne{crosswalk service} uses the set of XSL-stylesheets \xne{smc-xsl} and accesses the CMDI registries: \xne{Component Registry}, \xne{ISOcat DCR} and \xne{RELcat} to retrieve the data. It exposes an interface \xne{cx} to be used by third party applications. The \xne{query expansion} module uses the crosswalk service to rewrite queries, also exposing a corresponding API \xne{qx}.

- \xne{SMC Browser} consists of two parts the \xne{smc-stats} and \xne{smc-graph} and also uses the set of stylesheets for processing the data. \xne{smc-graph} is build on top of a library for interactive visualization of graphs.
+ \xne{SMC Browser} consists of two parts, the \xne{smc-stats} and \xne{smc-graph}, and also uses the set of stylesheets for processing the data. \xne{smc-graph} is built on top of a library for interactive visualization of graphs.

For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}.

- \section{Data model}
+ \section{Data Model}

Before we get to the definition of the actual service, we define the internal data model, divided into two parts:
[…]
In this section, we describe \var{smcIndex} -- the data type to denote indexes used by the components of the system internally, as well as input and output on the interfaces.

- An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix is derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. \ref{cql}), e.g. \concept{dc.title} and b) on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms, that may not contain whitespaces.
+ An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix is derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. \ref{cql}), e.g. \concept{dc.title} and b) on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms that may not contain whitespace.

\begin{defcap}
[…]
It is important to note that in general an \var{smcIndex} can be ambiguous, meaning it can refer to multiple concepts or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed to be unique.
Although it may seem problematic and undesirable to have an ambiguous reference, this is an intentional design decision. The labels are needed for human-readability and ambiguity can be useful, as long as one is aware of it.
- However there needs to be also the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows to reference indexes by the corresponding identifier. Following are some explanations to the individual constituents of the grammar:
-
- \var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However despite its name, it is not guaranteed unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist, the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.
+ However, there also needs to be the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows referencing indexes by the corresponding identifier. Following are some explanations of the individual constituents of the grammar:
+
+ \var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of a given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However, despite its name, it is not guaranteed to be unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist; the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.

\var{profile} is a reference to a CMD profile. Again, it can be either the name of the profile \var{profileName} or -- for a guaranteed unambiguous reference -- its identifier \var{profileId} as issued by the Component Registry (e.g. \var{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier, it may and should be prefixed by its name to still ensure human-readability. Or, seen the other way round, the name is disambiguated by suffixing it with the identifier:
[…]

%\noindent
- \var{dotPath} allows to address a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This allows to easily express search in whole components, instead of having to list all individual fields. The paths don't need to start from the root entity (the profile), they can reference any subtree structure. However longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still this mechanism does not guarantee unique references, it only allows to specify context and thus narrow down the semantic ambiguity.
+ \var{dotPath} allows addressing a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This makes it easy to express a search over whole components, instead of having to list all individual fields. The paths don't need to start from the root entity (the profile); they can reference any subtree structure. However, longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still, this mechanism does not guarantee unique references; it only allows specifying context and thus narrowing down the semantic ambiguity.

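To make the two index shapes more tangible, the following minimal Python sketch distinguishes a DCR-prefixed data-category reference (e.g. dc.title) from a dot-path into the CMD structure (e.g. Session.Actor.Role). It is only an illustration under assumed conventions (the set of recognized registry prefixes, the handling of identifiers); the authoritative definition is the grammar in the defcap above, and the actual processing in SMC is done in XSLT/XQuery, not Python.

    import re

    # Assumed registry shortcuts (dcrIDs); the real list is configuration-driven.
    KNOWN_DCR_PREFIXES = {"dc", "dublincore", "isocat"}

    def parse_smc_index(index: str) -> dict:
        """Rough classification of an smcIndex string (illustrative only)."""
        if re.search(r"\s", index):
            raise ValueError("an smcIndex is a single term and may not contain whitespace")
        steps = index.split(".")
        if not all(steps):
            raise ValueError(f"empty path step in {index!r}")
        if steps[0] in KNOWN_DCR_PREFIXES:
            # e.g. "dc.title" -- registry prefix plus data category label
            return {"kind": "datcat", "dcr": steps[0], "label": ".".join(steps[1:])}
        # e.g. "Session.Location.Country" -- path addressing CMD components/elements
        return {"kind": "cmd-path", "steps": steps}

    for ix in ("dc.title", "Session.Location.Country", "Actor.Name"):
        print(ix, "->", parse_smc_index(ix))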
\subsection{Terms}
[…]
\subsubsection{Type \code{Term}}

- \code{Term} is a polymorph data type, that can have different sets of attributes depending on the type of data it represents.
+ \code{Term} is a polymorphic data type that can have different sets of attributes depending on the type of data it represents.

\begin{table}[h]
- \caption{Attributes of \code{Term} when encoding data category}
+ \caption{Attributes of \code{Term} when encoding a data category (enclosed in \code{Concept})}
\label{table:terms-attributes-datcat}
 \begin{tabu}{ p{0.1\textwidth} p{0.4\textwidth} >{\footnotesize}X }
[…]
\rowfont{\itshape\small}   attribute & allowed values & sample value\\
\hline
-   \var{concept-id} &  PID given by DCR  & \code{isocat:DC-2522} \\
+ %  \var{concept-id} &  PID given by DCR  & \code{isocat:DC-2522} \\
  \var{set} & identifier of the DCR \emph{dcrID}  & \code{isocat} \\
  \var{type} &  one of ['id', 'label', 'mnemonic'] & \code{id}, \code{label}\\
[…]

\subsubsection{Type \code{Relation}}
- As explained in \ref{def:rr}, the framework allows to express relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by SMC upon initialization. Type \code{Relation} is the internal representation of this information. It has attribute \var{type} indicating the type of the relation as delivered by RR (currently only \code{sameAs}).  The relations of one relation set are enclosed in \code{Termset} element carrying the identifier of the relation set. The content of \code{Relation} is a sequence of at least two \code{Concepts}. Currently, it is always exactly two \code{Concepts} corresponding to the pairs delivered from RR, but by traversing the equivalence relation concept clusters (or ``cliques'') could be generated, that contain more than two equivalent concepts.
+ As explained in \ref{def:rr}, the framework allows expressing relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by SMC upon initialization. Type \code{Relation} is the internal representation of this information. It has an attribute \var{type} indicating the type of the relation as delivered by RR (currently only \code{sameAs}). The relations of one relation set are enclosed in a \code{Termset} element carrying the identifier of the relation set. The content of \code{Relation} is a sequence of at least two \code{Concepts}. Currently, it is always exactly two \code{Concepts} corresponding to the pairs delivered from RR, but by traversing the equivalence relation, concept clusters (or ``cliques'') could be generated that contain more than two equivalent concepts.

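The clique-building step hinted at in the last sentence amounts to computing the transitive closure of the pairwise sameAs relations. A small Python sketch of that idea (the concept identifiers are made up; in the system the relation pairs come from the Relation Registry):

    from collections import defaultdict

    def equivalence_cliques(pairs):
        """Collapse pairwise sameAs relations into clusters of equivalent concepts."""
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        def union(a, b):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

        for a, b in pairs:
            union(a, b)

        cliques = defaultdict(set)
        for concept in parent:
            cliques[find(concept)].add(concept)
        return [clique for clique in cliques.values() if len(clique) > 1]

    # hypothetical sameAs pairs -- two pairs collapse into one three-element clique
    same_as = [("isocat:DC-2522", "dc:identifier"), ("dc:identifier", "olac:identifier")]
    print(equivalence_cliques(same_as))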
% role="about"
[…]

%%%%%%%%%%%%%%%%%%%%%%
- \section{cx -- crosswalk service}
+ \section{cx -- Crosswalk Service}
\label{sec:cx}

- The crosswalk service offers the functionality, that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were weaved into the underlying meta-model as well as all the modules of the infrastructure.
+ The crosswalk service offers the functionality that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were woven into the underlying meta-model as well as all the modules of the infrastructure.
Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}.

The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks}, between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications, representing the basis for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{sec:qx}).

- The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf. \ref{lit:schema-matching}), but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields.
+ The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf. \ref{lit:schema-matching}), but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields.

\subsection{Interface Specification}
[…]
The documentation of the XSLT stylesheets and the build process is found in appendix \ref{sec:smc-xsl-docs}.

- The service is implemented as a RESTful service, however only supporting the GET operation, as it operates on a data set, that the users cannot change directly. (The changes have to be performed in the upstream registries.)
+ The service is implemented as a RESTful service; however, it only supports the GET operation, as it operates on a data set that the users cannot change directly. (The changes have to be performed in the upstream registries.)


[…]
\item[\xne{termets}] a list of all available Termsets compiled from the CMD profiles and the available DCRs; for \xne{ISOcat}, a termset is generated for every available language
\item[\xne{cmd-terms}] a flat list of \code{Term} elements representing all components and elements in all known profiles; grouped in \code{Termset} elements representing the profiles
- \item[\xne{cmd-terms-nested}] as above, however the \code{Term} elements are nested reflecting the component structure in the profile
+ \item[\xne{cmd-terms-nested}] as above, however, the \code{Term} elements are nested, reflecting the component structure in the profile
\item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories with nested \code{Term} elements encoding their properties (\code{id, label})
\item[\xne{dcr-cmd-map}] the main inverted index -- a list of concepts as in \xne{dcr-terms}, but with additional \code{Term} elements included in the \code{Concept} elements representing the CMD components or elements corresponding to the given data category (cf. listing \ref{lst:dcr-cmd-map})
- \item[\xne{rr-terms}] Additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute
+ \item[\xne{rr-terms}] Additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute).
\end{description}

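The dcr-cmd-map dataset is essentially an inverted index from data categories to the CMD components and elements that reference them. In SMC it is produced by the XSLT stylesheets from cmd-terms; the following Python sketch only illustrates the grouping idea, with made-up paths and annotations:

    from collections import defaultdict

    # Hypothetical flat term list (in SMC this is the cmd-terms dataset, in XML).
    cmd_terms = [
        {"path": "Session.Actor.Role",     "concept": "isocat:DC-2522"},
        {"path": "teiHeader.author",       "concept": "dc:creator"},
        {"path": "OLAC-DcmiTerms.creator", "concept": "dc:creator"},
        {"path": "Session.Notes",          "concept": None},   # unannotated element
    ]

    def build_dcr_cmd_map(terms):
        """Group CMD elements/components by the data category they reference."""
        index = defaultdict(list)
        for term in terms:
            if term["concept"]:              # only annotated entities contribute
                index[term["concept"]].append(term["path"])
        return dict(index)

    print(build_dcr_cmd_map(cmd_terms))
    # e.g. {'isocat:DC-2522': ['Session.Actor.Role'],
    #       'dc:creator': ['teiHeader.author', 'OLAC-DcmiTerms.creator']}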
\subsubsection{Operation}
- For the actual service operation a minimal application has been implemented, that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing depending on requested format.
+ For the actual service operation, a minimal application has been implemented that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing, depending on the requested format.
The application implements the interface as defined in \ref{def:cx-interface} as an XQuery module based on the \xne{restxq} library within an \xne{eXist} XML database.

[…]
Also, the use of \emph{other than equivalence} relations will necessitate more complex logic in the query expansion and accordingly also a more complex response of the crosswalk service, either returning the relation types themselves as well or equipping the list of indexes with some kind of similarity ratio.

- \section{qx -- concept-based search}
+ \section{qx -- Concept-based Search}
\label{sec:qx}
To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata.
- In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
+ In this section, we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.

The emphasis lies on the query language and the corresponding query input interface. A crucial aspect is the question of how to integrate the additional processing, i.e. how to deal with the even greater amount of information in a user-friendly way without overwhelming the user, while still being verbose about the applied processing on demand, so that the user can understand how the result came about and, even more importantly, manipulate the processing easily.

- Note, that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is dealt with in \ref{semantic-search}.
+ Note that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is tackled in \ref{sec:values2entities} (and also there only rather superficially).

Note also that \emph{query expansion} needs to be distinguished from \emph{query translation}, the task of expressing an input query in another query language (e.g. a CQL query expressed as XPath).
[…]
\label{cql}
As the base query language to build upon, the \emph{Context Query Language} (CQL) is used, a well-established standard designed with extensibility in mind.
- CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50\cite{Lynch1991}, which is very widely spread in the library networks.
- It was introduced 2002 \cite{Morgan04}. The maintenance of SRU/CQL has been
- transfered from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012\cite{OASIS2012sru}.)
+ CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50 \cite{Lynch1991}, which is very widespread in library networks.
+ It was introduced in 2002 \cite{Morgan04}. The maintenance of SRU/CQL has been
+ transferred from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012 \cite{OASIS2012sru}.

Coming from the library world, the protocol has a certain bias in favor of bibliographic metadata.
[…]
The query language part (CQL - Context Query Language) defines a relatively complex and complete query language.
The decisive feature of the query language is its inherent extensibility, allowing the definition of custom indexes and operators.
- In particular, CQL introduces so-called \emph{context sets} -- a kind of application profiles that allow to define new indexes or even comparison operators in own namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}.
+ In particular, CQL introduces the so-called \emph{context sets} -- a kind of application profiles that allow defining new indexes or even comparison operators in their own namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}.

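To illustrate what such an integration could look like at query time, the following Python sketch rewrites a single CQL search clause whose index is an smcIndex into an OR-group over the concrete indexes reported by the crosswalk service. The crosswalk table and the expanded paths are invented for the example; in the actual system this information would come from the cx interface.

    # Hypothetical crosswalk data standing in for a cx response:
    # data-category index -> concrete (profile-specific) indexes.
    crosswalk = {
        "dc.title": [
            "LexicalResourceProfile.LexicalResource.Title",
            "Session.Title",
            "teiHeader.titleStmt.title",
        ],
    }

    def expand_clause(index: str, relation: str, term: str) -> str:
        """Schema-level expansion: replace the index, keep relation and term."""
        targets = crosswalk.get(index, [index])   # unknown indexes pass through unchanged
        clauses = [f'{t} {relation} "{term}"' for t in targets]
        return clauses[0] if len(clauses) == 1 else "(" + " or ".join(clauses) + ")"

    print(expand_clause("dc.title", "=", "Don Quijote"))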
The SRU/CQL protocol has also been adopted by the CLARIN community as the basis for a protocol for federated content search\furl{http://clarin.eu/fcs} (FCS) \cite{stehouwer2012fcs}, which is another argument to use this protocol for metadata search as well, given the inherent interrelation between metadata and content search.
[…]

%\begin{note}
- Alternatively to the -- potentially costly -- on the fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories in which values from all metadata fields annotated with given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).
+ Alternatively to the -- potentially costly -- on-the-fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories, in which the values from all metadata fields annotated with a given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).
%\end{note}
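A sketch of this indexing-time alternative, under the assumption of a simple field-to-concept mapping: while a record is indexed, every value of an annotated field is additionally posted to a "virtual" index named after the data category, so that a later concept-based query needs no expansion at all.

    from collections import defaultdict

    # Assumed mapping from concrete metadata fields to data categories.
    field_to_concept = {
        "Session.Title": "dc.title",
        "teiHeader.titleStmt.title": "dc.title",
    }

    virtual_index = defaultdict(list)   # concept -> postings of (record id, value)

    def index_record(rec_id, fields):
        for path, value in fields.items():
            concept = field_to_concept.get(path)
            if concept:
                virtual_index[concept].append((rec_id, value))

    index_record("rec-1", {"Session.Title": "Wiener Dialekt Interviews"})
    index_record("rec-2", {"teiHeader.titleStmt.title": "Wiener Dialekt Interviews"})
    print(virtual_index["dc.title"])    # both records retrievable via one concept index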

- \subsection{SMC as module for Metadata Repository}
+ \subsection{SMC as Module for Metadata Repository}

As a concrete proof of concept, the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI, providing all the metadata records harvested within the CLARIN joint metadata domain (cf. \ref{cmdi_exploitation}).

- Metadata repository itself is implemented as custom project within \xne{cr-xq}, a generic web application developed in XQuery running within the eXist XML-database. \xne{cr-xq} is developed by the author as part of a larger publication framework \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo} within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq}  module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module, that provides a user interface widget for formulating the query.
+ Metadata Repository itself is implemented as a custom project within \xne{cr-xq}, a generic web application developed in XQuery running within the eXist XML-database. \xne{cr-xq} is developed by the author as part of a larger publication framework, \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo}, within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq} module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module that provides a user interface widget for formulating the query.

\begin{figure*}
\begin{center}
\includegraphics[width=0.8\textwidth]{images/modules_mdrepo-smc.png}
- \caption{The component view on the SMC - modules and their inter-dependencies}
+ \caption{The component diagram of the integration of SMC as a module within the Metadata Repository}
\label{fig:modules-mdrepo}
\end{center}
[…]
\subsection{User Interface}

- A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically a an array of tuples of index, comparison operator, terms combined by a boolean operator. This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators to formulate more complex queries.
+ A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically an array of tuples of index, comparison operator and term, combined by boolean operators. This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators to formulate more complex queries.
\begin{definition}{Generic data format for structured queries}
 < index, operation, term, boolean >+
[…]

\noindent
- Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions.
- Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labeling the fields of the results, or when providing facets to drill down the search.
-
- A fundamentally different approach is the "content first" paradigm, that, similiar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is, that the suggestions are typed, so that the user is informed from which index given term comes (\concept{person}, \concept{place}, etc.)
-
- Combining the two approaches, we could arrive at a ``smart'' widget a input field with on the fly query parsing and contextual autocomplete. Though even such a widget would still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}.
+ Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions. Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labelling the fields of the results, or when providing facets to drill down the search.
+
+ A fundamentally different approach is the ``content first'' paradigm that, similar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is that the suggestions are typed, so that the user is informed from which index a given term comes (\concept{person}, \concept{place}, etc.).
+
+ Combining the two approaches, we could arrive at a ``smart'' widget consisting of one input field with on-the-fly query parsing and contextual autocomplete. Though even such a widget would still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}.

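The "content first" behaviour can be sketched as follows: suggestions are looked up across all value indexes and returned together with the index they originate from, so the widget can label each completion. The value lists here are invented for illustration.

    # Toy value store: index name -> indexed values (made up for illustration).
    SUGGESTIONS = {
        "place":  ["Vienna", "Nijmegen", "Utrecht"],
        "person": ["Veenstra", "Virtanen"],
    }

    def suggest(prefix: str, limit: int = 5):
        """Return typed completions: (value, index it comes from)."""
        prefix = prefix.lower()
        hits = [(value, index)
                for index, values in SUGGESTIONS.items()
                for value in values
                if value.lower().startswith(prefix)]
        return sorted(hits)[:limit]

    print(suggest("v"))   # [('Veenstra', 'person'), ('Vienna', 'place'), ('Virtanen', 'person')]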

[…]
As the CMD dataset keeps growing both in numbers and in complexity, the call from the community to provide enhanced ways for its exploration gets stronger. In the following, some design considerations for an application to answer this need are proposed.

- While the Component Registry (cf. \ref{def:CR}) allows to browse, search and view existing profiles and components, it is not possible to easily find out, which components are reused in which profiles and also which data categories are referenced by which elements. However this kind of information is crucial during profile creation as well as for curation of the existing profiles, as it enables the data modeller to recognize a) which components and data categories are those most often used, indicating their adoption and popularity within the community and b) the thematic contexts in which individual components are used, providing a hint about their appropriateness for given research data.
+ While the Component Registry (cf. \ref{def:CR}) allows browsing, searching and viewing existing profiles and components, it is not possible to easily find out which components are reused in which profiles and also which data categories are referenced by which elements. However, this kind of information is crucial during profile creation as well as for curation of the existing profiles, as it enables the data modeller to recognize a) which components and data categories are those most often used, indicating their adoption and popularity within the community, and b) the thematic contexts in which individual components are used, providing a hint about their appropriateness for the given research data.

\subsection{Design}
[…]

\subsubsection{Requirements}
- Given the size of the data set (currently more than 4.000 nodes and growing) it is obvious, that it is not possible to overview the whole of the graph in one view. Thus, a general essential requirement is to be able to select and view subgraphs by various means.
+ Given the size of the data set (currently more than 4,000 nodes and growing) it is obvious that it is not possible to overview the whole of the graph in one view. Thus, a general essential requirement is to be able to select and view subgraphs by various means.

In a basic scenario, a user looks for possibly reusable profiles or components, based on some common terms associated with the type of data to be described (e.g. \code{"corpus"}). If the search yields matching profiles or components, the user should be able to view the whole structure of the profiles, explore the definitions for individual components and see which data categories are being referenced for semantic grounding. Furthermore, it has to be possible to view multiple profiles concurrently, in particular to be able to see the components or data categories they share and, vice versa, in which profiles a given data category is referenced.
[…]
\end{quotation}

- Especially remarkable feature is the possibility to add custom constraints, that are accomodated with the constraints imposed by the base algorithm. This enables flexible customization of the layout, still harnessing the power of the underlying layout algorithm. At the same time this is a quite challenging feature to master, as with different constraint affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout.
+ An especially remarkable feature is the possibility to add custom constraints that are accommodated together with the constraints imposed by the base algorithm. This enables flexible customization of the layout, still harnessing the power of the underlying layout algorithm. At the same time, this is a quite challenging feature to master, as with different constraints affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout.

\subsubsection{Data preprocessing}
\label{smc-browser-data-preprocessing}
- The application operates on a set of static XHTML and JSON data files, that are created in a preprocessing step and deployed with the application. The preprocessing consists of a series of XSLT transformations (cf. figure \ref{fig:smc_processing}), starting from the internal datasets generated during the initialization (cf. \ref{smc_init}). The HTML output for \xne{smc-stats} is generated in two steps (\var{track S})  via an intermediate internal generic XML format for representing tabular data. The JSON data for the \xne{smc-graph} as expected by the \xne{d3} library is also generated in two steps (\var{track G}). First, a XML representation of the graph is generated from the data (\xne{terms2graph.xsl}), on which a generic XSLT-transformation is applied (\xne{graph\_json.xsl}) transforming the XML graph  into required JSON format. In fact, this track is run multiple times generating different variants of the graph, featuring different aspects of the dataset:
+ The application operates on a set of static XHTML and JSON data files that are created in a preprocessing step and deployed with the application. The preprocessing consists of a series of XSLT transformations (cf. figure \ref{fig:smc_processing}), starting from the internal datasets generated during the initialization (cf. \ref{smc_init}). The HTML output for \xne{smc-stats} is generated in two steps (\var{track S}) via an intermediate internal generic XML format for representing tabular data. The JSON data for the \xne{smc-graph} as expected by the \xne{d3} library is also generated in two steps (\var{track G}). First, an XML representation of the graph is generated from the data (\xne{terms2graph.xsl}), on which a generic XSLT-transformation is applied (\xne{graph\_json.xsl}), transforming the XML graph into the required JSON format. In fact, this track is run multiple times, generating different variants of the graph, featuring different aspects of the dataset:

\begin{description}
[…]
\end{description}

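For orientation, the node/link structure that track G ultimately has to deliver to the d3 library looks roughly like the following (built here in plain Python instead of terms2graph.xsl and graph_json.xsl; the attribute names are illustrative assumptions, not the exact SMC format):

    import json

    nodes = [
        {"name": "LexicalResourceProfile", "type": "profile"},
        {"name": "Actor",                  "type": "component"},
        {"name": "Actor.Name",             "type": "element"},
        {"name": "dc:creator",             "type": "datcat"},
    ]
    index_of = {n["name"]: i for i, n in enumerate(nodes)}
    links = [   # edges point from container to contained entity / referenced concept
        {"source": index_of["LexicalResourceProfile"], "target": index_of["Actor"]},
        {"source": index_of["Actor"],      "target": index_of["Actor.Name"]},
        {"source": index_of["Actor.Name"], "target": index_of["dc:creator"]},
    ]
    print(json.dumps({"nodes": nodes, "links": links}, indent=2))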
- Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get a SVG representation of the graph. In an early stage of development, this was actually the only processing path. However soon it became obvious, that the graph is getting to huge to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.
-
- To The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve (multilingual) name and description of data categories referenced in the CMD elements definitions of referenced data categories from DublinCore and ISOcat are fetched.
+ Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get an SVG representation of the graph. In an early stage of development, this was actually the only processing path. However, soon it became obvious that the graph is getting too huge to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot-generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.
+
+ The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve the (multilingual) names and descriptions of the data categories referenced in the CMD elements, the definitions of the referenced data categories are fetched from DublinCore and ISOcat.


[…]

As proposed in the design section, the starting point when using the SMC browser is the node list on the left, listing all nodes grouped by type (profiles, components, elements, data categories) and sorted alphabetically. This list can be filtered by a simple substring search, which is important, as already now there are more than 4,000 nodes in the graph. Individual nodes are selected and deselected by a simple click. All selected nodes are displayed in the main graph pane, represented by a circle with a label. The representation is styled by type. Based on the settings in the navigation bar (cf. figure \ref{fig:navbar}), next to the selected nodes also related nodes are displayed. The \code{depth-before} and \code{depth-after} options govern how many levels in each direction are traversed and displayed starting from the set of selected nodes. The \code{layout} option allows selecting one of the available layouts -- next to the
- basic \code{force} layout there are also directed layouts, that are often better suited for displaying the directed graph.
+ basic \code{force} layout there are also directed layouts that are often better suited for displaying the directed graph.
Other options influence the layout algorithm (\code{link-distance}, \code{charge}, \code{friction}) and the visual representation of the nodes and edges (\code{node-size, labels, curve}).

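The depth-before/depth-after selection can be read as a bounded breadth-first traversal from the selected nodes: incoming edges are followed up to depth-before levels, outgoing edges up to depth-after levels. A small sketch with an assumed edge-list encoding of the graph:

    from collections import deque

    def neighbourhood(edges, selected, depth_before=1, depth_after=1):
        succ, pred = {}, {}
        for a, b in edges:
            succ.setdefault(a, set()).add(b)
            pred.setdefault(b, set()).add(a)

        def walk(adjacency, depth):
            seen, frontier = set(selected), deque((n, 0) for n in selected)
            while frontier:
                node, d = frontier.popleft()
                if d == depth:
                    continue
                for nxt in adjacency.get(node, ()):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, d + 1))
            return seen

        return walk(pred, depth_before) | walk(succ, depth_after)

    edges = [("ProfileA", "Actor"), ("Actor", "Actor.Name"), ("Actor.Name", "dc:creator")]
    print(neighbourhood(edges, {"Actor"}, depth_before=1, depth_after=2))
    # {'ProfileA', 'Actor', 'Actor.Name', 'dc:creator'}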
- One special option is \code{graph}, that allows to switch between different graphs as listed in \ref{smc-browser-data-preprocessing}.
+ One special option is \code{graph}, which allows switching between the different graphs as listed in \ref{smc-browser-data-preprocessing}.

There is user documentation deployed with the application and featured in appendix \ref{sec:smc-browser-userdocs}, where all aspects of interaction with the application (\ref{interaction}) and the options in the navigation bar (\ref{options}) are described.
[…]
\label{smc-browser-extensions}

- Next to the basic setup described above, there is a number of possible additional features, that could enhance the functionality and usefulness of the discussed tool.
+ Next to the basic setup described above, there are a number of possible additional features that could enhance the functionality and usefulness of the discussed tool.

\subsubsection*{Graph operations -- differential views}
[…]
Equipped with a more flexible or modular matching algorithm (in addition to the initially foreseen identity match), the tool could visualize matches between any given schemas, not only CMD-based ones.

- Also, the input format being a graph, with appropriate preprocessing the tool could visualize any structural information, that is suited to be expressed as graph, like cooccurrence analysis, dependency networks, RDF data in general etc.
+ Also, the input format being a graph, with appropriate preprocessing the tool could visualize any structural information that is suited to be expressed as a graph, like co-occurrence analysis, dependency networks, RDF data in general, etc.

\subsubsection*{Viewer for external data}
- The above feature would be even more useful if the application would be enabled to ingest and process external data. The data can be passed either via upload or via a parameter with a URL of the data. This is especially attractive also to providers of other data and applications, who could provide a simple link in their user interface (with the data-parameter appropriately set), that would allow to visualize their data in the SMC browser.
+ The above feature would be even more useful if the application were able to ingest and process external data. The data can be passed either via upload or via a parameter with a URL of the data. This is also especially attractive to providers of other data and applications, who could provide a simple link in their user interface (with the data-parameter appropriately set) that would allow them to visualize their data in the SMC browser.

One prominent visualization application offering this feature is the geobrowser e4D\furl{http://www.informatik.uni-leipzig.de:8080/e4D/} (currently \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo}, developed in the context of the \xne{europeana connect} initiative), accepting data in KML format.

\subsubsection*{Integrate with instance data}
- The usefulness and information gain of the application could be greatly increased by integrating the instance data. I.e. generate and display a variant of the graph which contains only profiles for which there is actually instance data present in the CLARIN joint metadata domain. Obviously, in such a visualization the size of data could be incorporated, in the most simple case number of records being mapped on the radius of the nodes, but there are a number of other metrics that could be applied in the visualizations.
+ The usefulness and information gain of the application could be greatly increased by integrating the instance data, i.e. generating and displaying a variant of the graph which contains only profiles for which there is actually instance data present in the CLARIN joint metadata domain. Obviously, in such a visualization the size of the data could be incorporated, in the simplest case the number of records being mapped onto the radius of the nodes, but there are a number of other metrics that could be applied in the visualizations.

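One plausible way to realize the record-count-to-radius mapping mentioned above (an assumption of one reasonable scale, not the SMC browser's actual behaviour) is to scale the node area rather than the radius, so that very large collections do not visually drown out everything else:

    import math

    def radius(record_count, r_min=4.0, px_per_record=0.02):
        # area-proportional scaling: radius grows with the square root of the count
        return r_min + math.sqrt(record_count) * px_per_record

    for profile, count in [("OLAC-DcmiTerms", 120000), ("teiHeader", 3500)]:   # made-up counts
        print(profile, round(radius(count), 1))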
Also, such a visualization could feature direct search links from individual nodes into the dataset, i.e. from a profile node a link could lead into a search interface listing the metadata records of the given profile.
[…]

%%%%%%%%%%%%%%%%%%%%%%%%%
- \section{Application of \emph{schema matching} techniques in SMC}
+ \section{Application of \emph{Schema Matching} Techniques in SMC}
\label{sec:schema-matching-app}

[…]
Or, put in terms of the schema matching methodology, the system relies on explicitly set concept equivalences as the basis for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.

- However this is only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework the metadata modeller could very well profit from applying schema mapping techniques as pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of promising future challenges to ontology matching.) Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.
+ However, this only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework, the metadata modeller could very well profit from applying schema mapping techniques as a pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of the promising future challenges to ontology matching.) Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.

Let us restate the problem of integrating existing external schemas as an application of the \var{schema matching} method:
The data modeller starts off with an existing schema \var{$S_{x}$}. The system accommodates a set of schemas\footnote{Even though within CMDI the data models are called `profiles', we can still refer to them as `schema', because every profile has an unambiguous expression in an XML Schema.} \var{$S_{1..n}$}.
- It is very improbable, that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
+ It is very improbable that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
Given the heterogeneity of the schemas present in the field of research, full alignments are not achievable at all.
- However thanks to the compositional nature of the CMD data model, data modeller can reuse just parts of any of the schemas -- the
+ However, thanks to the compositional nature of the CMD data model, the data modeller can reuse just parts of any of the schemas -- the
components \var{c}. Thus, the task is to find, for every entity $e_{x} \in S_{x}$, the set of semantically equivalent candidate components $\{c_{y}\}$, which corresponds to the definition of the mapping function for single entities as defined in \cite{EhrigSure2004}.
- Given, that the modeller does not have to reuse the components as they are, but can use existing components as base to create his own, even candidates that are not equivalent can be of interest, thus we can further relax the task and allow even candidates that are just similar to a certain degree (operationalized as threshold $t$ on the output of the \var{similarity} function).
+ Given that the modeller does not have to reuse the components as they are, but can use existing components as a base to create his own, even candidates that are not equivalent can be of interest; thus we can further relax the task and allow even candidates that are just similar to a certain degree (operationalized as threshold $t$ on the output of the \var{similarity} function).
Being only a pre-processing step meant to provide suggestions to the human modeller, the task places higher importance on recall than on precision.

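The relaxed matching task can be sketched as follows: for each entity of the external schema S_x, every known component whose similarity exceeds the threshold t is suggested to the modeller. A plain string similarity stands in here for the richer similarity function discussed in the text, and the component names are invented:

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def candidates(entity: str, components, t: float = 0.5):
        """Suggest every component scoring at least t, best matches first (recall over precision)."""
        scored = [(similarity(entity, c), c) for c in components]
        return sorted(((s, c) for s, c in scored if s >= t), reverse=True)

    registry_components = ["Actor", "Creator", "Contact", "ResourceTitle"]   # hypothetical
    print(candidates("author", registry_components, t=0.5))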
     
the mapping function could be enriched with \emph{extensional} features based on the concept clusters as delivered by the crosswalk service \ref{sec:cx}. It would also be worthwhile to test to what extent the \var{smcIndex} paths as defined in \ref{def:smcIndex} could be used as a feature (compute the longest matching subpath).

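One possible operationalization of the parenthetical above -- the exact definition is left open in the text -- is to take the length of the longest common suffix of two dot-paths (the most specific context sits at the end) and normalize it by the length of the longer path:

    def subpath_similarity(p1: str, p2: str) -> float:
        a, b = p1.split("."), p2.split(".")
        common = 0
        while common < min(len(a), len(b)) and a[-1 - common] == b[-1 - common]:
            common += 1
        return common / max(len(a), len(b))

    print(subpath_similarity("Session.Actor.Name", "Drama.Actor.Name"))   # 2/3
    print(subpath_similarity("Project.Name", "Session.Actor.Name"))       # 1/3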
- Although we examplified on the case of integration of an external schema, the described approach could be applied also to the schemas already integrated in the system. Although there is already a high baseline given thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen on the case of multiple teiHeader profiles, that though they are modelling the same existing metadata format, are completely disconnected in terms of components and data category reuse (cf. \ref{results:tei}).
-
- Note, that in the case of reuse of components, in the normal scenario, the semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with new identity, because the references to data categories are copied as well. Thus, by default, the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components scenarios are thinkable, in which the semantic linking gets broken, or is not established, even though semantic equivalency pervails.
+ Although we exemplified the approach on the case of integrating an external schema, it could also be applied to the schemas already integrated in the system. While there is already a high baseline given thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen in the case of multiple teiHeader profiles that, though they model the same existing metadata format, are completely disconnected in terms of components and data category reuse (cf. \ref{results:tei}).
+
+ Note that in the case of reuse of components, in the normal scenario, the semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with a new identity, because the references to data categories are copied as well. Thus, by default, the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components, scenarios are conceivable in which the semantic linking gets broken, or is not established, even though semantic equivalence prevails.

The question is what to do with the new correspondences that would possibly be determined when, as proposed, we apply the schema matching to the already integrated schemas. One possibility is to add a data category reference, if one element of the pair is still missing one.
- However if both already are linked to a data category, the data category pair could be added to the relation set in Relation Registry (cf. \ref{def:rr}).
+ However, if both are already linked to a data category, the data category pair could be added to the relation set in the Relation Registry (cf. \ref{def:rr}).

Once all the equivalences (and other relations) between the profiles/schemas are found, similarity ratios can be determined.
These new similarity ratios could be applied as alternative weights in the profiles-similarity graph (cf. \ref{sec:smc-cloud}).

- In contrast to the task described here, that -- restricted to matching XML schemas -- can be seen as staying in the ``XML World'',
+ In contrast to the task described here, which -- restricted to matching XML schemas -- can be seen as staying in the ``XML World'',
another aspect within this work is clearly situated in the Semantic Web domain and requires the application of ontology matching methods -- the mapping of field values to semantic entities described in \ref{sec:values2entities}.

- %This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
+ %This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.



\section{Summary}
- In this core chapter, we layed out a design for a system dealing with concept-based crosswalks on schema level.
+ In this core chapter, we laid out a design for a system dealing with concept-based crosswalks on the schema level.
The system consists of three main parts: the crosswalk service, the query expansion module and \xne{SMC Browser} -- a tool for visualizing and exploring the schemas and the corresponding crosswalks.
In addition, we elaborated on the application of schema matching methods to infer mappings between schemas.