Timestamp: 12/01/13 19:04:51 (11 years ago)
Author: vronk
Message: minor orthographic corrections

File: 1 edited
Legend: unmarked lines are unmodified, lines prefixed with '-' were removed, lines prefixed with '+' were added; '[…]' marks omitted unchanged lines.
  • SMC4LRT/chapters/Design_SMCschema.tex

r3776 → r4117

- \chapter{System design -- concept-based mapping on schema level}
+ \chapter{System Design -- Concept-based Mapping on Schema Level}
\label{ch:design}

[…]

We start by drawing an overall view of the system, introducing its individual components and the dependencies among them.
- In the next section, the internal data model is presented and explained. In section \ref{sec:cx} the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{sec:qx} we elaborate on a search functionality that builds upon the aforementioned service in terms of appropriate query language, a search engine to integrate the search in and the peculiarities of the user interface that could support this enhanced search possibilities. Finally, in section \ref{smc-browser} an advanced interactive user interface for exploring the CMD data domain is proposed.
+ In the next section, the internal data model is presented and explained. In section \ref{sec:cx}, the design of the actual main service for serving crosswalks is described, divided into the interface specification and notes on the actual implementation. In section \ref{sec:qx}, we elaborate on a search functionality that builds upon the aforementioned service in terms of an appropriate query language, a search engine to integrate the search in, and the peculiarities of the user interface that could support these enhanced search possibilities. Finally, in section \ref{smc-browser}, an advanced interactive user interface for exploring the CMD data domain is proposed.

\section{System Architecture}
[…]
\begin{figure*}
\includegraphics[width=0.8\textwidth]{images/SMC_modules.png}
- \caption{The component view on the SMC - modules and their inter-dependencies}
+ \caption{The component view on the SMC - modules and their interdependencies}
\label{fig:smc_modules}
\end{figure*}
[…]
The component diagram in \ref{fig:smc_modules} depicts the dependencies between the components of the system. The \xne{crosswalk service} uses the set of XSL-stylesheets \xne{smc-xsl} and accesses the CMDI registries: \xne{Component Registry}, \xne{ISOcat DCR} and \xne{RELcat} to retrieve the data. It exposes an interface \xne{cx} to be used by third party applications. The \xne{query expansion} module uses the crosswalk service to rewrite queries, also exposing a corresponding API \xne{qx}.

- \xne{SMC Browser} consists of two parts the \xne{smc-stats} and \xne{smc-graph} and also uses the set of stylesheets for processing the data. \xne{smc-graph} is build on top of a library for interactive visualization of graphs.
+ \xne{SMC Browser} consists of two parts, the \xne{smc-stats} and \xne{smc-graph}, and also uses the set of stylesheets for processing the data. \xne{smc-graph} is built on top of a library for interactive visualization of graphs.

For broader context see the reference architecture diagram in Figure \ref{fig:ref_arch}.

- \section{Data model}
+ \section{Data Model}

Before we get to the definition of the actual service, we define the internal data model, divided into two parts:
[…]
In this section, we describe \var{smcIndex} -- the data type to denote indexes used by the components of the system internally, as well as input and output on the interfaces.

- An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix is derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. \ref{cql}), e.g. \concept{dc.title} and b) on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms, that may not contain whitespaces.
+ An \var{smcIndex} is a human-readable string adhering to a specific syntax, denoting a search index. The syntax is based on two main ideas drawn from existing work: a) denoting a context by a prefix is derived from the way indices are referenced in CQL-syntax\footnote{Context Query Language, \url{http://www.loc.gov/standards/sru/specs/cql.html}} (analogous to the XML-namespace mechanism, cf. \ref{cql}), e.g. \concept{dc.title} and b) on the dot-notation used in IMDI-browser\footnote{\url{http://www.lat-mpi.eu/tools/imdi}} to denote paths into structured data (analogous to XPath), e.g. \concept{Session.Location.Country}. The grammar generates only single terms that may not contain whitespace.

\begin{defcap}
[…]
It is important to note that in general an \var{smcIndex} can be ambiguous, meaning it can refer to multiple concepts or CMD entities. This is due to the fact that the labels of the data categories and CMD entities are not guaranteed to be unique.
Although it may seem problematic and undesirable to have an ambiguous reference, this is an intentional design decision. The labels are needed for human-readability and ambiguity can be useful, as long as one is aware of it.
- However there needs to be also the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows to reference indexes by the corresponding identifier. Following are some explanations to the individual constituents of the grammar:
-
- \var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However despite its name, it is not guaranteed unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist, the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.
+ However, there also needs to be the possibility to refer to data categories or CMD entities unambiguously. Therefore, the syntax also allows referencing indexes by the corresponding identifier. Following are some explanations of the individual constituents of the grammar:
+
+ \var{dcrID} is a shortcut referring to a data category registry. Next to \xne{ISOcat}, other registries can function as a DCR, in particular, the \xne{dublincore} set of metadata terms. \var{datcatLabel} is the human-readable name of a given data category (e.g. \concept{telephoneNumber}). In the case of \xne{ISOcat} data categories the verbose descriptor \code{mnemonicIdentifier} is used. However, despite its name, it is not guaranteed to be unique. Therefore, \var{datcatID} has to be used if a data category shall be referenced unambiguously. For \xne{dublincore} terms no such distinct identifier and label exist; the concepts are denoted by the lexical term itself, which is unique within the \concept{dublincore} namespace.

\var{profile} is a reference to a CMD profile. Again, it can be either the name of the profile \var{profileName} or -- for a guaranteed unambiguous reference -- its identifier \var{profileId} as issued by the Component Registry (e.g. \var{clarin.eu:cr1:p\_1272022528363} for \concept{LexicalResourceProfile}). Even if a profile is referenced by its identifier, it may and should be prefixed by its name to still ensure human-readability. Or, seen the other way round, the name is disambiguated by suffixing it with the identifier:
[…]

%\noindent
- \var{dotPath} allows to address a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This allows to easily express search in whole components, instead of having to list all individual fields. The paths don't need to start from the root entity (the profile), they can reference any subtree structure. However longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still this mechanism does not guarantee unique references, it only allows to specify context and thus narrow down the semantic ambiguity.
+ \var{dotPath} allows addressing a leaf element (\concept{Session.Actor.Role}), or any intermediary XML element corresponding to a CMD component (\concept{Session.Actor}) within a metadata description. This makes it easy to express a search over whole components, instead of having to list all individual fields. The paths don't need to start from the root entity (the profile); they can reference any subtree structure. However, longer paths are often needed for more specific references, e.g. instead of \concept{Name} one could say \concept{Actor.Name} vs. \concept{Project.Name} or even \concept{Session.Actor.Name} vs. \concept{Drama.Actor.Name}. Still, this mechanism does not guarantee unique references; it only allows specifying context and thus narrowing down the semantic ambiguity.

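To make the two index shapes more tangible, the following minimal Python sketch distinguishes a DCR-prefixed data-category reference (e.g. dc.title) from a dot-path into the CMD structure (e.g. Session.Actor.Role). It is only an illustration under assumed conventions (the set of recognized registry prefixes, the handling of identifiers); the authoritative definition is the grammar in the defcap above, and the actual processing in SMC is done in XSLT/XQuery, not Python.

    import re

    # Assumed registry shortcuts (dcrIDs); the real list is configuration-driven.
    KNOWN_DCR_PREFIXES = {"dc", "dublincore", "isocat"}

    def parse_smc_index(index: str) -> dict:
        """Rough classification of an smcIndex string (illustrative only)."""
        if re.search(r"\s", index):
            raise ValueError("an smcIndex is a single term and may not contain whitespace")
        steps = index.split(".")
        if not all(steps):
            raise ValueError(f"empty path step in {index!r}")
        if steps[0] in KNOWN_DCR_PREFIXES:
            # e.g. "dc.title" -- registry prefix plus data category label
            return {"kind": "datcat", "dcr": steps[0], "label": ".".join(steps[1:])}
        # e.g. "Session.Location.Country" -- path addressing CMD components/elements
        return {"kind": "cmd-path", "steps": steps}

    for ix in ("dc.title", "Session.Location.Country", "Actor.Name"):
        print(ix, "->", parse_smc_index(ix))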
\subsection{Terms}
[…]
\subsubsection{Type \code{Term}}

- \code{Term} is a polymorph data type, that can have different sets of attributes depending on the type of data it represents.
+ \code{Term} is a polymorphic data type that can have different sets of attributes depending on the type of data it represents.

\begin{table}[h]
- \caption{Attributes of \code{Term} when encoding data category}
+ \caption{Attributes of \code{Term} when encoding a data category (enclosed in \code{Concept})}
\label{table:terms-attributes-datcat}
 \begin{tabu}{ p{0.1\textwidth} p{0.4\textwidth} >{\footnotesize}X }
[…]
\rowfont{\itshape\small}   attribute & allowed values & sample value\\
\hline
-   \var{concept-id} &  PID given by DCR  & \code{isocat:DC-2522} \\
+ %  \var{concept-id} &  PID given by DCR  & \code{isocat:DC-2522} \\
  \var{set} & identifier of the DCR \emph{dcrID}  & \code{isocat} \\
  \var{type} &  one of ['id', 'label', 'mnemonic'] & \code{id}, \code{label}\\
[…]

\subsubsection{Type \code{Relation}}
- As explained in \ref{def:rr}, the framework allows to express relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by SMC upon initialization. Type \code{Relation} is the internal representation of this information. It has attribute \var{type} indicating the type of the relation as delivered by RR (currently only \code{sameAs}).  The relations of one relation set are enclosed in \code{Termset} element carrying the identifier of the relation set. The content of \code{Relation} is a sequence of at least two \code{Concepts}. Currently, it is always exactly two \code{Concepts} corresponding to the pairs delivered from RR, but by traversing the equivalence relation concept clusters (or ``cliques'') could be generated, that contain more than two equivalent concepts.
+ As explained in \ref{def:rr}, the framework allows expressing relations between concepts or data categories. These are maintained in the Relation Registry and fetched from there by SMC upon initialization. Type \code{Relation} is the internal representation of this information. It has an attribute \var{type} indicating the type of the relation as delivered by RR (currently only \code{sameAs}). The relations of one relation set are enclosed in a \code{Termset} element carrying the identifier of the relation set. The content of \code{Relation} is a sequence of at least two \code{Concepts}. Currently, it is always exactly two \code{Concepts} corresponding to the pairs delivered from RR, but by traversing the equivalence relation, concept clusters (or ``cliques'') could be generated that contain more than two equivalent concepts.

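The clique-building step hinted at in the last sentence amounts to computing the transitive closure of the pairwise sameAs relations. A small Python sketch of that idea (the concept identifiers are made up; in the system the relation pairs come from the Relation Registry):

    from collections import defaultdict

    def equivalence_cliques(pairs):
        """Collapse pairwise sameAs relations into clusters of equivalent concepts."""
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        def union(a, b):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

        for a, b in pairs:
            union(a, b)

        cliques = defaultdict(set)
        for concept in parent:
            cliques[find(concept)].add(concept)
        return [clique for clique in cliques.values() if len(clique) > 1]

    # hypothetical sameAs pairs -- two pairs collapse into one three-element clique
    same_as = [("isocat:DC-2522", "dc:identifier"), ("dc:identifier", "olac:identifier")]
    print(equivalence_cliques(same_as))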
% role="about"
[…]

%%%%%%%%%%%%%%%%%%%%%%
- \section{cx -- crosswalk service}
+ \section{cx -- Crosswalk Service}
\label{sec:cx}

- The crosswalk service offers the functionality, that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were weaved into the underlying meta-model as well as all the modules of the infrastructure.
+ The crosswalk service offers the functionality that was understood under the term \textit{Semantic Mapping} as conceived in the original plans of the Component Metadata Infrastructure. Semantic interoperability has been one of the main concerns addressed by the CMDI and appropriate provisions were woven into the underlying meta-model as well as all the modules of the infrastructure.
Consequently, the infrastructure has also foreseen this dedicated module, \emph{Semantic Mapping}, that exploits this mechanism to find \textbf{corresponding fields in different metadata schemas}.

The task of the crosswalk service is to collect the relevant information maintained in the registries of the infrastructure and process it to generate the mappings, or \textbf{crosswalks}, between fields in heterogeneous metadata schemas. These crosswalks can be used by other applications, representing the basis for concept-based search in the heterogeneous data collection of the joint CLARIN metadata domain (cf. \ref{sec:qx}).

- The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts, that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf. \ref{lit:schema-matching}), but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields.
+ The core means for semantic interoperability in CMDI are the \emph{data categories} (cf. \ref{def:DCR}), well-defined atomic concepts that are supposed to be referenced in schemas by annotating fields to unambiguously indicate their intended semantics. Drawing upon this system, the crosswalks are not generated directly between the fields of individual schemas by some kind of matching algorithm (cf. \ref{lit:schema-matching}), but rather the data categories are used as reliable bridges for translation. This results in clusters of semantically equivalent metadata fields (with data categories serving as pivotal points) instead of a collection of pair-wise links between fields.

\subsection{Interface Specification}
[…]
The documentation of the XSLT stylesheets and the build process is found in appendix \ref{sec:smc-xsl-docs}.

- The service is implemented as a RESTful service, however only supporting the GET operation, as it operates on a data set, that the users cannot change directly. (The changes have to be performed in the upstream registries.)
+ The service is implemented as a RESTful service; however, it only supports the GET operation, as it operates on a data set that the users cannot change directly. (The changes have to be performed in the upstream registries.)


[…]
\item[\xne{termets}] a list of all available Termsets compiled from the CMD profiles and the available DCRs; for \xne{ISOcat}, a termset is generated for every available language
\item[\xne{cmd-terms}] a flat list of \code{Term} elements representing all components and elements in all known profiles; grouped in \code{Termset} elements representing the profiles
- \item[\xne{cmd-terms-nested}] as above, however the \code{Term} elements are nested reflecting the component structure in the profile
+ \item[\xne{cmd-terms-nested}] as above, however, the \code{Term} elements are nested, reflecting the component structure in the profile
\item[\xne{dcr-terms}] a list of \code{Concept} elements representing the data categories with nested \code{Term} elements encoding their properties (\code{id, label})
\item[\xne{dcr-cmd-map}] the main inverted index -- a list of concepts as in \xne{dcr-terms}, but with additional \code{Term} elements included in the \code{Concept} elements representing the CMD components or elements corresponding to the given data category (cf. listing \ref{lst:dcr-cmd-map})
- \item[\xne{rr-terms}] Additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute
+ \item[\xne{rr-terms}] Additional index generated based on the relations between data categories as defined in the Relation Registry; the \code{Concept} elements representing the pair of related data categories are wrapped with a \code{Relation} element (with a \code{@type} attribute).
\end{description}

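The dcr-cmd-map dataset is essentially an inverted index from data categories to the CMD components and elements that reference them. In SMC it is produced by the XSLT stylesheets from cmd-terms; the following Python sketch only illustrates the grouping idea, with made-up paths and annotations:

    from collections import defaultdict

    # Hypothetical flat term list (in SMC this is the cmd-terms dataset, in XML).
    cmd_terms = [
        {"path": "Session.Actor.Role",     "concept": "isocat:DC-2522"},
        {"path": "teiHeader.author",       "concept": "dc:creator"},
        {"path": "OLAC-DcmiTerms.creator", "concept": "dc:creator"},
        {"path": "Session.Notes",          "concept": None},   # unannotated element
    ]

    def build_dcr_cmd_map(terms):
        """Group CMD elements/components by the data category they reference."""
        index = defaultdict(list)
        for term in terms:
            if term["concept"]:              # only annotated entities contribute
                index[term["concept"]].append(term["path"])
        return dict(index)

    print(build_dcr_cmd_map(cmd_terms))
    # e.g. {'isocat:DC-2522': ['Session.Actor.Role'],
    #       'dc:creator': ['teiHeader.author', 'OLAC-DcmiTerms.creator']}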
\subsubsection{Operation}
- For the actual service operation a minimal application has been implemented, that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing depending on requested format.
+ For the actual service operation, a minimal application has been implemented that accesses the cached internal datasets and optionally applies XSL stylesheets for post-processing, depending on the requested format.
The application implements the interface as defined in \ref{def:cx-interface} as an XQuery module based on the \xne{restxq} library within an \xne{eXist} XML database.

[…]
Also, the use of \emph{other than equivalence} relations will necessitate more complex logic in the query expansion and accordingly also a more complex response of the crosswalk service, either returning the relation types themselves as well or equipping the list of indexes with some kind of similarity ratio.

- \section{qx -- concept-based search}
+ \section{qx -- Concept-based Search}
\label{sec:qx}
To recall, the main goal of this work is to enhance the search capabilities of the search engines serving the metadata.
- In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
+ In this section, we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.

The emphasis lies on the query language and the corresponding query input interface. A crucial aspect is the question of how to integrate the additional processing, i.e. how to deal with the even greater amount of information in a user-friendly way without overwhelming the user, while still being verbose about the applied processing on demand, so that the user can understand how the result came about and, even more importantly, manipulate the processing easily.

- Note, that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is dealt with in \ref{semantic-search}.
+ Note that this chapter deals only with the schema level, i.e. the expansion here pertains only to the indexes to be searched in, not to the search terms. The instance level is tackled in \ref{sec:values2entities} (and also there only rather superficially).

Note also that \emph{query expansion} needs to be distinguished from \emph{query translation}, the task of expressing an input query in another query language (e.g. a CQL query expressed as XPath).
[…]
\label{cql}
As the base query language to build upon, the \emph{Context Query Language} (CQL) is used, a well-established standard designed with extensibility in mind.
- CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50\cite{Lynch1991}, which is very widely spread in the library networks.
- It was introduced 2002 \cite{Morgan04}. The maintenance of SRU/CQL has been
- transfered from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012\cite{OASIS2012sru}.)
+ CQL is the query language defined as part of \xne{SRU/CQL} -- the communication protocol introduced by the Library of Congress. SRU is a simplified, XML- and HTTP-based successor to Z39.50 \cite{Lynch1991}, which is very widespread in library networks.
+ It was introduced in 2002 \cite{Morgan04}. The maintenance of SRU/CQL has been
+ transferred from LoC to OASIS in 2012, and OASIS released a first version of the protocol as Committee Specification in April 2012 \cite{OASIS2012sru}.

Coming from the library world, the protocol has a certain bias in favor of bibliographic metadata.
[…]
The query language part (CQL - Context Query Language) defines a relatively complex and complete query language.
The decisive feature of the query language is its inherent extensibility, allowing the definition of custom indexes and operators.
- In particular, CQL introduces so-called \emph{context sets} -- a kind of application profiles that allow to define new indexes or even comparison operators in own namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}.
+ In particular, CQL introduces the so-called \emph{context sets} -- a kind of application profiles that allow defining new indexes or even comparison operators in their own namespaces. This feature can be employed to integrate the dynamic indexes adhering to the \var{smcIndex} syntax as proposed in \ref{def:smcIndex}.

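To illustrate what such an integration could look like at query time, the following Python sketch rewrites a single CQL search clause whose index is an smcIndex into an OR-group over the concrete indexes reported by the crosswalk service. The crosswalk table and the expanded paths are invented for the example; in the actual system this information would come from the cx interface.

    # Hypothetical crosswalk data standing in for a cx response:
    # data-category index -> concrete (profile-specific) indexes.
    crosswalk = {
        "dc.title": [
            "LexicalResourceProfile.LexicalResource.Title",
            "Session.Title",
            "teiHeader.titleStmt.title",
        ],
    }

    def expand_clause(index: str, relation: str, term: str) -> str:
        """Schema-level expansion: replace the index, keep relation and term."""
        targets = crosswalk.get(index, [index])   # unknown indexes pass through unchanged
        clauses = [f'{t} {relation} "{term}"' for t in targets]
        return clauses[0] if len(clauses) == 1 else "(" + " or ".join(clauses) + ")"

    print(expand_clause("dc.title", "=", "Don Quijote"))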
The SRU/CQL protocol has also been adopted by the CLARIN community as the basis for a protocol for federated content search\furl{http://clarin.eu/fcs} (FCS) \cite{stehouwer2012fcs}, which is another argument to use this protocol for metadata search as well, given the inherent interrelation between metadata and content search.
[…]

%\begin{note}
- Alternatively to the -- potentially costly -- on the fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories in which values from all metadata fields annotated with given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).
+ Alternatively to the -- potentially costly -- on-the-fly expansion, the concept-based equivalence clusters could be applied already during the indexing of the data. That means that ``virtual'' search indexes are defined for individual data categories, in which the values from all metadata fields annotated with a given data category are indexed. Indeed, this approach is already being applied in the search applications VLO and Meertens Institute Search Engine (cf. \ref{cmdi_exploitation}).
%\end{note}
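A sketch of this indexing-time alternative, under the assumption of a simple field-to-concept mapping: while a record is indexed, every value of an annotated field is additionally posted to a "virtual" index named after the data category, so that a later concept-based query needs no expansion at all.

    from collections import defaultdict

    # Assumed mapping from concrete metadata fields to data categories.
    field_to_concept = {
        "Session.Title": "dc.title",
        "teiHeader.titleStmt.title": "dc.title",
    }

    virtual_index = defaultdict(list)   # concept -> postings of (record id, value)

    def index_record(rec_id, fields):
        for path, value in fields.items():
            concept = field_to_concept.get(path)
            if concept:
                virtual_index[concept].append((rec_id, value))

    index_record("rec-1", {"Session.Title": "Wiener Dialekt Interviews"})
    index_record("rec-2", {"teiHeader.titleStmt.title": "Wiener Dialekt Interviews"})
    print(virtual_index["dc.title"])    # both records retrievable via one concept index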

- \subsection{SMC as module for Metadata Repository}
+ \subsection{SMC as Module for Metadata Repository}

As a concrete proof of concept, the functionality of SMC has been integrated into the Metadata Repository, another module of the CMDI, providing all the metadata records harvested within the CLARIN joint metadata domain (cf. \ref{cmdi_exploitation}).

- Metadata repository itself is implemented as custom project within \xne{cr-xq}, a generic web application developed in XQuery running within the eXist XML-database. \xne{cr-xq} is developed by the author as part of a larger publication framework \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo} within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq}  module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module, that provides a user interface widget for formulating the query.
+ Metadata Repository itself is implemented as a custom project within \xne{cr-xq}, a generic web application developed in XQuery running within the eXist XML-database. \xne{cr-xq} is developed by the author as part of a larger publication framework, \xne{corpus\_shell}. As can be seen in figure \ref{fig:modules-mdrepo}, within \xne{cr-xq} the crosswalk service -- implemented as the \xne{smc-xq} module -- is used by the search module \xne{fcs}, which is in turn used by the \xne{query\_input} module that provides a user interface widget for formulating the query.

\begin{figure*}
\begin{center}
\includegraphics[width=0.8\textwidth]{images/modules_mdrepo-smc.png}
- \caption{The component view on the SMC - modules and their inter-dependencies}
+ \caption{The component diagram of the integration of SMC as a module within the Metadata Repository}
\label{fig:modules-mdrepo}
\end{center}
[…]
\subsection{User Interface}

- A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically a an array of tuples of index, comparison operator, terms combined by a boolean operator. This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators to formulate more complex queries.
+ A starting point for our considerations is the traditional structure found in many (``advanced'') search interfaces, which is basically an array of tuples of index, comparison operator and term, combined by boolean operators. This is reflected in the CQL syntax with the basic \var{searchClause} and the boolean operators to formulate more complex queries.
\begin{definition}{Generic data format for structured queries}
 < index, operation, term, boolean >+
[…]

\noindent
- Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions.
- Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labeling the fields of the results, or when providing facets to drill down the search.
-
- A fundamentally different approach is the "content first" paradigm, that, similiar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is, that the suggestions are typed, so that the user is informed from which index given term comes (\concept{person}, \concept{place}, etc.)
-
- Combining the two approaches, we could arrive at a ``smart'' widget a input field with on the fly query parsing and contextual autocomplete. Though even such a widget would still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}.
+ Using data categories from ISOcat as search indexes brings about -- next to solid semantic grounding -- the advantage of multilingual labels and descriptions/definitions. Although we concentrate on query input, the use of indexes has to be consistent across the user interface, be it in labelling the fields of the results, or when providing facets to drill down the search.
+
+ A fundamentally different approach is the ``content first'' paradigm that, similar to the notorious simple search fields found in general search engines, provides suggestions via autocompletion on the fly, when the user starts typing any string. The difference is that the suggestions are typed, so that the user is informed from which index a given term comes (\concept{person}, \concept{place}, etc.).
+
+ Combining the two approaches, we could arrive at a ``smart'' widget consisting of one input field with on-the-fly query parsing and contextual autocomplete. Though even such a widget would still share the underlying data model of \xne{CQL} in combination with \var{smcIndexes}.

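The "content first" behaviour can be sketched as follows: suggestions are looked up across all value indexes and returned together with the index they originate from, so the widget can label each completion. The value lists here are invented for illustration.

    # Toy value store: index name -> indexed values (made up for illustration).
    SUGGESTIONS = {
        "place":  ["Vienna", "Nijmegen", "Utrecht"],
        "person": ["Veenstra", "Virtanen"],
    }

    def suggest(prefix: str, limit: int = 5):
        """Return typed completions: (value, index it comes from)."""
        prefix = prefix.lower()
        hits = [(value, index)
                for index, values in SUGGESTIONS.items()
                for value in values
                if value.lower().startswith(prefix)]
        return sorted(hits)[:limit]

    print(suggest("v"))   # [('Veenstra', 'person'), ('Vienna', 'place'), ('Virtanen', 'person')]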

[…]
As the CMD dataset keeps growing both in numbers and in complexity, the call from the community to provide enhanced ways for its exploration gets stronger. In the following, some design considerations for an application to answer this need are proposed.

- While the Component Registry (cf. \ref{def:CR}) allows to browse, search and view existing profiles and components, it is not possible to easily find out, which components are reused in which profiles and also which data categories are referenced by which elements. However this kind of information is crucial during profile creation as well as for curation of the existing profiles, as it enables the data modeller to recognize a) which components and data categories are those most often used, indicating their adoption and popularity within the community and b) the thematic contexts in which individual components are used, providing a hint about their appropriateness for given research data.
+ While the Component Registry (cf. \ref{def:CR}) allows browsing, searching and viewing existing profiles and components, it is not possible to easily find out which components are reused in which profiles and also which data categories are referenced by which elements. However, this kind of information is crucial during profile creation as well as for curation of the existing profiles, as it enables the data modeller to recognize a) which components and data categories are those most often used, indicating their adoption and popularity within the community, and b) the thematic contexts in which individual components are used, providing a hint about their appropriateness for the given research data.

\subsection{Design}
[…]

\subsubsection{Requirements}
- Given the size of the data set (currently more than 4.000 nodes and growing) it is obvious, that it is not possible to overview the whole of the graph in one view. Thus, a general essential requirement is to be able to select and view subgraphs by various means.
+ Given the size of the data set (currently more than 4,000 nodes and growing) it is obvious that it is not possible to overview the whole of the graph in one view. Thus, a general essential requirement is to be able to select and view subgraphs by various means.

In a basic scenario, a user looks for possibly reusable profiles or components, based on some common terms associated with the type of data to be described (e.g. \code{"corpus"}). If the search yields matching profiles or components, the user should be able to view the whole structure of the profiles, explore the definitions for individual components and see which data categories are being referenced for semantic grounding. Furthermore, it has to be possible to view multiple profiles concurrently, in particular to be able to see the components or data categories they share and, vice versa, in which profiles a given data category is referenced.
[…]
\end{quotation}

- Especially remarkable feature is the possibility to add custom constraints, that are accomodated with the constraints imposed by the base algorithm. This enables flexible customization of the layout, still harnessing the power of the underlying layout algorithm. At the same time this is a quite challenging feature to master, as with different constraint affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout.
+ An especially remarkable feature is the possibility to add custom constraints that are accommodated together with the constraints imposed by the base algorithm. This enables flexible customization of the layout, still harnessing the power of the underlying layout algorithm. At the same time, this is a quite challenging feature to master, as with different constraints affecting the layout algorithm, it is at times difficult to understand the impact of a specific constraint on the layout.

\subsubsection{Data preprocessing}
\label{smc-browser-data-preprocessing}
- The application operates on a set of static XHTML and JSON data files, that are created in a preprocessing step and deployed with the application. The preprocessing consists of a series of XSLT transformations (cf. figure \ref{fig:smc_processing}), starting from the internal datasets generated during the initialization (cf. \ref{smc_init}). The HTML output for \xne{smc-stats} is generated in two steps (\var{track S})  via an intermediate internal generic XML format for representing tabular data. The JSON data for the \xne{smc-graph} as expected by the \xne{d3} library is also generated in two steps (\var{track G}). First, a XML representation of the graph is generated from the data (\xne{terms2graph.xsl}), on which a generic XSLT-transformation is applied (\xne{graph\_json.xsl}) transforming the XML graph  into required JSON format. In fact, this track is run multiple times generating different variants of the graph, featuring different aspects of the dataset:
+ The application operates on a set of static XHTML and JSON data files that are created in a preprocessing step and deployed with the application. The preprocessing consists of a series of XSLT transformations (cf. figure \ref{fig:smc_processing}), starting from the internal datasets generated during the initialization (cf. \ref{smc_init}). The HTML output for \xne{smc-stats} is generated in two steps (\var{track S}) via an intermediate internal generic XML format for representing tabular data. The JSON data for the \xne{smc-graph} as expected by the \xne{d3} library is also generated in two steps (\var{track G}). First, an XML representation of the graph is generated from the data (\xne{terms2graph.xsl}), on which a generic XSLT-transformation is applied (\xne{graph\_json.xsl}), transforming the XML graph into the required JSON format. In fact, this track is run multiple times, generating different variants of the graph, featuring different aspects of the dataset:

\begin{description}
[…]
\end{description}

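For orientation, the node/link structure that track G ultimately has to deliver to the d3 library looks roughly like the following (built here in plain Python instead of terms2graph.xsl and graph_json.xsl; the attribute names are illustrative assumptions, not the exact SMC format):

    import json

    nodes = [
        {"name": "LexicalResourceProfile", "type": "profile"},
        {"name": "Actor",                  "type": "component"},
        {"name": "Actor.Name",             "type": "element"},
        {"name": "dc:creator",             "type": "datcat"},
    ]
    index_of = {n["name"]: i for i, n in enumerate(nodes)}
    links = [   # edges point from container to contained entity / referenced concept
        {"source": index_of["LexicalResourceProfile"], "target": index_of["Actor"]},
        {"source": index_of["Actor"],      "target": index_of["Actor.Name"]},
        {"source": index_of["Actor.Name"], "target": index_of["dc:creator"]},
    ]
    print(json.dumps({"nodes": nodes, "links": links}, indent=2))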
- Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get a SVG representation of the graph. In an early stage of development, this was actually the only processing path. However soon it became obvious, that the graph is getting to huge to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.
-
- To The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve (multilingual) name and description of data categories referenced in the CMD elements definitions of referenced data categories from DublinCore and ISOcat are fetched.
+ Additionally, a detour pass (\var{track D}) is executed, in which the graph is also transformed into the DOT format and run through the \xne{Graphviz dot} tool to get an SVG representation of the graph. In an early stage of development, this was actually the only processing path. However, soon it became obvious that the graph is getting too huge to be displayed in its entirety. Figure \ref{fig:cmd-dep-dotgraph} displays an old version of such a dot-generated graph visualization. Currently, the \xne{dot} output is only used as input for the final graph data, providing initialization coordinates for the nodes in the \code{dot}-layout.
+
+ The graph is constructed from all profiles defined in the Component Registry and related datasets. To resolve the (multilingual) names and descriptions of the data categories referenced in the CMD elements, the definitions of the referenced data categories are fetched from DublinCore and ISOcat.


[…]

As proposed in the design section, the starting point when using the SMC browser is the node list on the left, listing all nodes grouped by type (profiles, components, elements, data categories) and sorted alphabetically. This list can be filtered by a simple substring search, which is important, as already now there are more than 4,000 nodes in the graph. Individual nodes are selected and deselected by a simple click. All selected nodes are displayed in the main graph pane, represented by a circle with a label. The representation is styled by type. Based on the settings in the navigation bar (cf. figure \ref{fig:navbar}), next to the selected nodes also related nodes are displayed. The \code{depth-before} and \code{depth-after} options govern how many levels in each direction are traversed and displayed starting from the set of selected nodes. The \code{layout} option allows selecting one of the available layouts -- next to the
- basic \code{force} layout there are also directed layouts, that are often better suited for displaying the directed graph.
+ basic \code{force} layout there are also directed layouts that are often better suited for displaying the directed graph.
Other options influence the layout algorithm (\code{link-distance}, \code{charge}, \code{friction}) and the visual representation of the nodes and edges (\code{node-size, labels, curve}).

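The depth-before/depth-after selection can be read as a bounded breadth-first traversal from the selected nodes: incoming edges are followed up to depth-before levels, outgoing edges up to depth-after levels. A small sketch with an assumed edge-list encoding of the graph:

    from collections import deque

    def neighbourhood(edges, selected, depth_before=1, depth_after=1):
        succ, pred = {}, {}
        for a, b in edges:
            succ.setdefault(a, set()).add(b)
            pred.setdefault(b, set()).add(a)

        def walk(adjacency, depth):
            seen, frontier = set(selected), deque((n, 0) for n in selected)
            while frontier:
                node, d = frontier.popleft()
                if d == depth:
                    continue
                for nxt in adjacency.get(node, ()):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, d + 1))
            return seen

        return walk(pred, depth_before) | walk(succ, depth_after)

    edges = [("ProfileA", "Actor"), ("Actor", "Actor.Name"), ("Actor.Name", "dc:creator")]
    print(neighbourhood(edges, {"Actor"}, depth_before=1, depth_after=2))
    # {'ProfileA', 'Actor', 'Actor.Name', 'dc:creator'}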
- One special option is \code{graph}, that allows to switch between different graphs as listed in \ref{smc-browser-data-preprocessing}.
+ One special option is \code{graph}, which allows switching between the different graphs as listed in \ref{smc-browser-data-preprocessing}.

There is user documentation deployed with the application and featured in appendix \ref{sec:smc-browser-userdocs}, where all aspects of interaction with the application (\ref{interaction}) and the options in the navigation bar (\ref{options}) are described.
[…]
\label{smc-browser-extensions}

- Next to the basic setup described above, there is a number of possible additional features, that could enhance the functionality and usefulness of the discussed tool.
+ Next to the basic setup described above, there are a number of possible additional features that could enhance the functionality and usefulness of the discussed tool.

\subsubsection*{Graph operations -- differential views}
[…]
Equipped with a more flexible or modular matching algorithm (in addition to the initially foreseen identity match), the tool could visualize matches between any given schemas, not only CMD-based ones.

- Also, the input format being a graph, with appropriate preprocessing the tool could visualize any structural information, that is suited to be expressed as graph, like cooccurrence analysis, dependency networks, RDF data in general etc.
+ Also, the input format being a graph, with appropriate preprocessing the tool could visualize any structural information that is suited to be expressed as a graph, like co-occurrence analysis, dependency networks, RDF data in general, etc.

\subsubsection*{Viewer for external data}
- The above feature would be even more useful if the application would be enabled to ingest and process external data. The data can be passed either via upload or via a parameter with a URL of the data. This is especially attractive also to providers of other data and applications, who could provide a simple link in their user interface (with the data-parameter appropriately set), that would allow to visualize their data in the SMC browser.
+ The above feature would be even more useful if the application were able to ingest and process external data. The data can be passed either via upload or via a parameter with a URL of the data. This is also especially attractive to providers of other data and applications, who could provide a simple link in their user interface (with the data-parameter appropriately set) that would allow them to visualize their data in the SMC browser.

One prominent visualization application offering this feature is the geobrowser e4D\furl{http://www.informatik.uni-leipzig.de:8080/e4D/} (currently \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo}, developed in the context of the \xne{europeana connect} initiative), accepting data in KML format.

\subsubsection*{Integrate with instance data}
- The usefulness and information gain of the application could be greatly increased by integrating the instance data. I.e. generate and display a variant of the graph which contains only profiles for which there is actually instance data present in the CLARIN joint metadata domain. Obviously, in such a visualization the size of data could be incorporated, in the most simple case number of records being mapped on the radius of the nodes, but there are a number of other metrics that could be applied in the visualizations.
+ The usefulness and information gain of the application could be greatly increased by integrating the instance data, i.e. generating and displaying a variant of the graph which contains only profiles for which there is actually instance data present in the CLARIN joint metadata domain. Obviously, in such a visualization the size of the data could be incorporated, in the simplest case the number of records being mapped onto the radius of the nodes, but there are a number of other metrics that could be applied in the visualizations.

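One plausible way to realize the record-count-to-radius mapping mentioned above (an assumption of one reasonable scale, not the SMC browser's actual behaviour) is to scale the node area rather than the radius, so that very large collections do not visually drown out everything else:

    import math

    def radius(record_count, r_min=4.0, px_per_record=0.02):
        # area-proportional scaling: radius grows with the square root of the count
        return r_min + math.sqrt(record_count) * px_per_record

    for profile, count in [("OLAC-DcmiTerms", 120000), ("teiHeader", 3500)]:   # made-up counts
        print(profile, round(radius(count), 1))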
Also, such a visualization could feature direct search links from individual nodes into the dataset, i.e. from a profile node a link could lead into a search interface listing the metadata records of the given profile.
[…]

%%%%%%%%%%%%%%%%%%%%%%%%%
- \section{Application of \emph{schema matching} techniques in SMC}
+ \section{Application of \emph{Schema Matching} Techniques in SMC}
\label{sec:schema-matching-app}

[…]
Or, put in terms of the schema matching methodology, the system relies on explicitly set concept equivalences as the basis for mapping between schema entities. By referencing a data category in a CMD element, the modeller binds this element to a concept, making two elements linked to the same data category trivially equivalent.

- However this is only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework the metadata modeller could very well profit from applying schema mapping techniques as pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of promising future challenges to ontology matching.) Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.
+ However, this only holds for schemas already created within the CMD framework (and even for these only to a certain degree, as will be explained later). Given the growing universe of definitions (data categories and components) in the CMD framework, the metadata modeller could very well profit from applying schema mapping techniques as a pre-processing step in the task of integrating existing external schemas into the infrastructure. (User involvement is identified by \cite{shvaiko2012ontology} as one of the promising future challenges to ontology matching.) Already now, we witness a growing proliferation of components in the Component Registry and of data categories in the Data Category Registry.

Let us restate the problem of integrating existing external schemas as an application of the \var{schema matching} method:
The data modeller starts off with an existing schema \var{$S_{x}$}. The system accommodates a set of schemas\footnote{Even though within CMDI the data models are called `profiles', we can still refer to them as `schema', because every profile has an unambiguous expression in an XML Schema.} \var{$S_{1..n}$}.
- It is very improbable, that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
+ It is very improbable that there is a \var{$S_{y} \in S_{1..n}$} that fully matches \var{$S_{x}$}.
Given the heterogeneity of the schemas present in the field of research, full alignments are not achievable at all.
- However thanks to the compositional nature of the CMD data model, data modeller can reuse just parts of any of the schemas -- the
+ However, thanks to the compositional nature of the CMD data model, the data modeller can reuse just parts of any of the schemas -- the
components \var{c}. Thus, the task is to find, for every entity $e_{x} \in S_{x}$, the set of semantically equivalent candidate components $\{c_{y}\}$, which corresponds to the definition of the mapping function for single entities as defined in \cite{EhrigSure2004}.
- Given, that the modeller does not have to reuse the components as they are, but can use existing components as base to create his own, even candidates that are not equivalent can be of interest, thus we can further relax the task and allow even candidates that are just similar to a certain degree (operationalized as threshold $t$ on the output of the \var{similarity} function).
+ Given that the modeller does not have to reuse the components as they are, but can use existing components as a base to create his own, even candidates that are not equivalent can be of interest; thus we can further relax the task and allow even candidates that are just similar to a certain degree (operationalized as threshold $t$ on the output of the \var{similarity} function).
Being only a pre-processing step meant to provide suggestions to the human modeller, the task places higher importance on recall than on precision.

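The relaxed matching task can be sketched as follows: for each entity of the external schema S_x, every known component whose similarity exceeds the threshold t is suggested to the modeller. A plain string similarity stands in here for the richer similarity function discussed in the text, and the component names are invented:

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def candidates(entity: str, components, t: float = 0.5):
        """Suggest every component scoring at least t, best matches first (recall over precision)."""
        scored = [(similarity(entity, c), c) for c in components]
        return sorted(((s, c) for s, c in scored if s >= t), reverse=True)

    registry_components = ["Actor", "Creator", "Contact", "ResourceTitle"]   # hypothetical
    print(candidates("author", registry_components, t=0.5))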
     
the mapping function could be enriched with \emph{extensional} features based on the concept clusters as delivered by the crosswalk service \ref{sec:cx}. It would also be worthwhile to test to what extent the \var{smcIndex} paths as defined in \ref{def:smcIndex} could be used as a feature (compute the longest matching subpath).

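One possible operationalization of the parenthetical above -- the exact definition is left open in the text -- is to take the length of the longest common suffix of two dot-paths (the most specific context sits at the end) and normalize it by the length of the longer path:

    def subpath_similarity(p1: str, p2: str) -> float:
        a, b = p1.split("."), p2.split(".")
        common = 0
        while common < min(len(a), len(b)) and a[-1 - common] == b[-1 - common]:
            common += 1
        return common / max(len(a), len(b))

    print(subpath_similarity("Session.Actor.Name", "Drama.Actor.Name"))   # 2/3
    print(subpath_similarity("Project.Name", "Session.Actor.Name"))       # 1/3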
- Although we examplified on the case of integration of an external schema, the described approach could be applied also to the schemas already integrated in the system. Although there is already a high baseline given thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen on the case of multiple teiHeader profiles, that though they are modelling the same existing metadata format, are completely disconnected in terms of components and data category reuse (cf. \ref{results:tei}).
-
- Note, that in the case of reuse of components, in the normal scenario, the semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with new identity, because the references to data categories are copied as well. Thus, by default, the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components scenarios are thinkable, in which the semantic linking gets broken, or is not established, even though semantic equivalency pervails.
+ Although we exemplified the approach on the case of integrating an external schema, it could also be applied to the schemas already integrated in the system. While there is already a high baseline given thanks to the mechanisms of reuse of components and data categories, there certainly still exist semantic proximities that are not explicitly expressed by these mechanisms. This deficiency is rooted in the collaborative creation of the CMD components and profiles, where individual modellers overlooked, deliberately ignored or only partially reused existing components or profiles. This can be seen in the case of multiple teiHeader profiles that, though they model the same existing metadata format, are completely disconnected in terms of components and data category reuse (cf. \ref{results:tei}).
+
+ Note that in the case of reuse of components, in the normal scenario, the semantic equivalence is ensured even though the new component (and all its subcomponents) is a copy of the old one with a new identity, because the references to data categories are copied as well. Thus, by default, the new component shares all data categories with the original one and the modeller has to deliberately change them if required. But even with reuse of components, scenarios are conceivable in which the semantic linking gets broken, or is not established, even though semantic equivalence prevails.

The question is what to do with the new correspondences that would possibly be determined when, as proposed, we apply the schema matching to the already integrated schemas. One possibility is to add a data category reference, if one element of the pair is still missing one.
- However if both already are linked to a data category, the data category pair could be added to the relation set in Relation Registry (cf. \ref{def:rr}).
+ However, if both are already linked to a data category, the data category pair could be added to the relation set in the Relation Registry (cf. \ref{def:rr}).

Once all the equivalences (and other relations) between the profiles/schemas are found, similarity ratios can be determined.
These new similarity ratios could be applied as alternative weights in the profiles-similarity graph (cf. \ref{sec:smc-cloud}).

- In contrast to the task described here, that -- restricted to matching XML schemas -- can be seen as staying in the ``XML World'',
+ In contrast to the task described here, which -- restricted to matching XML schemas -- can be seen as staying in the ``XML World'',
another aspect within this work is clearly situated in the Semantic Web domain and requires the application of ontology matching methods -- the mapping of field values to semantic entities described in \ref{sec:values2entities}.

- %This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
+ %This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.



\section{Summary}
- In this core chapter, we layed out a design for a system dealing with concept-based crosswalks on schema level.
+ In this core chapter, we laid out a design for a system dealing with concept-based crosswalks on the schema level.
The system consists of three main parts: the crosswalk service, the query expansion module and \xne{SMC Browser} -- a tool for visualizing and exploring the schemas and the corresponding crosswalks.
In addition, we elaborated on the application of schema matching methods to infer mappings between schemas.