% source: SMC4LRT/chapters/Introduction.tex
% Last change on this file was 4117, checked in by vronk, 11 years ago: minor orthographic corrections
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Introduction}
\label{ch:intro}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Motivation / Problem Statement}

While the Digital Libraries community has already consolidated and set up global federated networks of digital library repositories, the landscape in the field of Language Resources and Technology is still scattered, despite a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (Chapter \ref{ch:data} analyses the disparity in the data domain.)

The community has recognized this situation and has undertaken numerous standardization initiatives. The process has gained new momentum thanks to large framework programmes introduced by the European Commission, aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies (LRT) more easily available to scholars by means of a common harmonized architecture. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (CMDI, cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.

This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} (SMC) -- dedicated to overcoming or at least easing the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.

\section{Main Goal}

The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of LRT, henceforth referred to as \xne{semantic search}, distinguishing it from the underlying processing, referred to as \xne{semantic mapping}.

The notoriously polysemous term ``mapping'' has three different meanings within this work,
which also translate into three corresponding subgoals:

\begin{description}
\item[crosswalk] link related fields in different metadata formats
\item[interpret] translate string labels in field values to semantic entities
\item[visualize] provide appropriate means to explore the domain data.
\end{description}

The work can further be divided along the schema--instance duality. Figure \ref{fig:master_outline} lays out the conceptual space of this work and depicts the dependencies between individual subgoals.

\begin{figure*}[!ht]
\begin{center}
%\includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf}
\includegraphics{images/master_outline.png}
\end{center}
\caption{The conceptual space of this work}
\label{fig:master_outline}
\end{figure*}
%\input{images/master_outline.eps_tex}

\subsubsection*{Crosswalk service}
Semantic interoperability has been one of the main concerns addressed by the CMDI, and appropriate provisions were woven into the underlying meta-model as well as into all the modules of the infrastructure. The task of the crosswalk service -- the primary part of the SMC module -- is to collect the relevant information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata schemas that can serve as a basis for concept-based search.

Thus, the goal is not primarily to define new crosswalks but rather to develop a service that serves existing ones.

\subsubsection*{Concept-based query expansion}

Once the crosswalks are available, they can be used to rewrite user queries so that they match equivalent or similar fields across heterogeneous metadata schemas, resulting in higher recall when searching.

\paragraph{Example}
A user query searching in the notorious \concept{dublincore:title} field has to be \emph{expanded} to
all semantically close fields (the \emph{concept cluster}) that are, however, labelled (or even structured) differently in other schemas, such as:

\begin{quote}
\concept{resourceTitle, BookTitle, tei:titleStmt, Corpus/GeneralInfo/Name}
\end{quote}

The expansion cannot be solved by simple string matching, as there are other fields labelled with the same (sub)strings but with different semantics that should not be considered:

\begin{quote}
\concept{Project/Title, Organisation/Name, Country/Name, LanguageName}
\end{quote}
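
As a purely illustrative sketch, the expansion of a title query over the first group of fields above could be written as follows. The query notation and the search term are assumptions made for this illustration and do not reflect the actual query syntax of the service. Fields from the second group are deliberately not part of the expansion.

% NOTE: hypothetical query notation and search term, for illustration only
\begin{verbatim}
dublincore:title = "dictionary"
  expands to
   dublincore:title        = "dictionary"
OR resourceTitle           = "dictionary"
OR BookTitle               = "dictionary"
OR tei:titleStmt           = "dictionary"
OR Corpus/GeneralInfo/Name = "dictionary"
\end{verbatim}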

\subsubsection*{Semantic interpretation}

The problem of different labels for semantically similar or even identical entities is even more virulent on the level of individual values in the fields of the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be enumerated explicitly and exhaustively. This leads to a chronically inconsistent use of labels for referring to entities. (As the evidence in the metadata records collected within CMDI shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to propose a mechanism to map (string) values in selected fields to entities defined in corresponding vocabularies.

\subsubsection*{Ontology-driven data exploration}

Based on the results of the previous parts of the work -- crosswalks and semantic interpretation -- the discussed dataset can be expressed as one big ontology. Consequently, semantic web technologies can be applied, giving the user new means of \emph{exploring the dataset}.

\paragraph{Example}
Ontology-driven search: starting from a list of topics, the user can browse an ontology to find institutions concerned with those topics and retrieve the union of resources for the resulting cluster. Thus, in general, the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external interlinked semantic resources.

\subsubsection*{Visualization}
Given the large, heterogeneous and complex dataset, it seems indispensable to equip the user with advanced means to explore and interact with it. Hence, this subgoal aims to propose ways of visualizing the data at hand.

\section{Method}
We start by examining the existing data and describing the existing infrastructure in which this work is embedded.

Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure.

Subsequently, we explore ways of integrating this service into exploitation tools (metadata search engines) to enhance search/retrieval through the use of semantic relations between concepts or categories. This theoretical part will be accompanied by a prototypical implementation as proof of concept.

%In an evaluation phase, we apply a set of  test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures.

Note that in this work, the focus lies on the actual method to generate and apply the crosswalks -- expressed in the specification and operationalized in the (prototypical) implementation of the service -- rather than on trying to establish final, accomplished crosswalks between the schemas. In fact, given the great diversity of resources and research tasks, a ``final'' complete alignment does not seem achievable at all. Therefore, the focus is also on \emph{dynamic mapping}, i.e. enabling users to directly control how extensively the crosswalks are applied, or even to apply custom crosswalks depending on their current task or research question, so that they can actively influence the recall/precision ratio of the search results and essentially modulate the semantic search space.


Serving the second subgoal -- semantic interpretation on the instance level -- we will propose the expression of all of the domain data (from meta-model specification to instances) in RDF, linking to corresponding entities in appropriate external
semantic resources (controlled vocabularies, ontologies).
Once the dataset is expressed in RDF, it can be exposed via a semantic web application and published as another nucleus of \emph{Linked Open Data} in the global \emph{Web of Data}.
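
A minimal sketch of what such an RDF expression could look like is given below in Turtle notation. The \texttt{ex:} namespace, the record and entity identifiers, the labels and the choice of \texttt{dcterms:publisher} for the organization field are hypothetical placeholders chosen for this illustration; only the \texttt{dcterms:} and \texttt{skos:} prefixes refer to existing vocabularies.

% NOTE: illustrative sketch only; all ex: URIs and labels are hypothetical
\begin{verbatim}
@prefix ex:      <http://example.org/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .

# a metadata record: the string value of the organization field
# is replaced by a link to a vocabulary entity
ex:record1  dcterms:publisher  ex:org42 .

# the entity collects the variant labels found in the records
ex:org42  a skos:Concept ;
    skos:prefLabel  "Example Institute"@en ;
    skos:altLabel   "Ex. Inst."@en , "ExI"@en .
\end{verbatim}

In this sketch, the variant labels are attached to a single entity, so a search for any one of them could retrieve all records linked to that entity.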

A separate evaluation of the usability of the proposed semantic search solution would be warranted, examining user interaction with, and the display of, the relevant additional information in the search interface. However, this issue can only be touched upon here and has to be left to future work.

\section{Expected Results}

The main result of this work will be the \emph{specification} of the two modules \xne{concept-based search} and the underlying \xne{crosswalk service}.
This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components
and the sample results. % and findings of the \emph{evaluation}.

Another result of the work will be the original dataset expressed as RDF and interlinked with existing external resources (ontologies, knowledge bases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\furl{http://linkeddata.org/}.

\begin{description}
\item [Crosswalk service] specification and a basic implementation of the service
\item [Concept-based search] design of the query expansion and prototypical integration with a search engine
\item [Visualization tool] design of an application for interactive exploration of the concerned dataset
%\item [Evaluation] evaluation results of querying the dataset comparing simple search and semantic search
\item [LinkedData] translation of the source dataset to an RDF-based format with links into existing datasets, ontologies, knowledge bases
\end{description}

\section{Structure of the Work}
The work starts by examining the state of the art in the two fields of language resources and technology and semantic web technologies in chapter \ref{ch:lit}. In chapter \ref{ch:data} we analyse the situation in the data domain of LRT metadata, and in chapter \ref{ch:infra} we discuss the individual software components of the infrastructure underlying this work.

The main part of the work is found in chapters \ref{ch:design} and \ref{ch:design-instance}, which lay out the design of the software module and a proposal for modelling the data in RDF, respectively.

%evaluation and the
The results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and outline possible directions for its future development.

The auxiliary material accompanying the work is found in the appendix. After the administrative chapter \ref{ch:def} explaining the abbreviations and formatting conventions used throughout this work, full specifications of the data models used (\ref{ch:data-model-ref}) and data samples (\ref{ch:cmd-sample}) are listed for reference, as well as the developer and user documentation for the technical solution of this work, the SMC module (\ref{ch:smc-docs}).


\section{Keywords}

semantic interoperability -- crosswalks -- schema mapping -- metadata -- language resources and technology -- linked data -- visualization