source: SMC4LRT/chapters/Introduction.tex @ 3551

Last change on this file since 3551 was 3551, checked in by vronk, 11 years ago

intermediate version - ongoing work on introduction

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Introduction}
\label{ch:intro}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Motivation / problem statement}

While the Digital Libraries community has already undergone consolidation and set up global federated networks of digital library repositories, the landscape in the field of Language Resources and Technology (LRT) is still scattered, despite a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (Chapter \ref{ch:data} analyses the disparity in the data domain.)

The community has recognized this situation, and numerous standardization initiatives have been undertaken. The process has gained new momentum thanks to large framework programmes introduced by the European Commission, aimed at fostering the development of common large-scale international research infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars by providing a common harmonized architecture for accessing and working with LRT. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:CMDI}) -- a distributed system consisting of multiple interconnected modules aimed at creating and providing metadata for LRT in a coherent harmonized way.

This work discusses one module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcoming or at least easing the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of imposing one common description schema for all resources.

\section{Main Goal}

The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of LRT, henceforth referred to as \xne{semantic search}, as distinct from the necessary underlying preprocessing, referred to as \xne{semantic mapping}.

The -- notoriously polysemic -- term ``mapping'' has three different meanings within this work,
which also translate into three corresponding subgoals:

\begin{description}
\item[crosswalk] link related fields in different metadata formats
\item[interpret] translate string labels in field values to semantic entities
\item[visualize] provide appropriate means to explore the domain data.
\end{description}

The work can further be divided along the schema -- instance duality. Figure \ref{fig:master_outline} spans the conceptual space of this work and depicts the relations between individual subgoals.

\begin{figure*}[!ht]
\begin{center}
%\includegraphics[width=1\textwidth]{images/SMC-export_language_custom_v2c.pdf}
\includegraphics{images/master_outline.png}
\end{center}
\caption{The conceptual space of this work}
\label{fig:master_outline}
\end{figure*}
%\input{images/master_outline.eps_tex}

\subsubsection*{Crosswalk service}
Semantic interoperability has been one of the main concerns addressed by the CMDI, and appropriate provisions were woven into the underlying meta-model as well as all the modules of the infrastructure. The task of the crosswalk service -- the primary part of the SMC module -- is to collect the relevant information maintained in the registries of the infrastructure and process it to generate mappings, i.e. \emph{crosswalks} between fields in heterogeneous metadata schemas that can serve as a basis for concept-based search.

Thus, the goal is not primarily to produce the crosswalks but rather to develop a service that serves existing ones.

\subsubsection*{Concept-based query expansion}

Once the crosswalks are available, they can be used to rewrite user queries (or to generate appropriate search indexes), so that they match related fields across heterogeneous metadata schemas, resulting in higher recall when searching.

\paragraph{Example}
When a user query searches in the notorious \concept{dublincore:title} field, it has to be \emph{expanded} to
all semantically close fields (the \emph{concept cluster}), which are, however, labelled (or even structured) differently in other schemas, such as:

\begin{quote}
\concept{resourceTitle, BookTitle, tei:titleStmt, Corpus/GeneralInfo/Name}
\end{quote}

while other fields, labelled with the same (sub)strings but carrying different semantics, should not be considered:

\begin{quote}
\concept{Project/Title, Organisation/Name, Country/Name}
\end{quote}
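
The expansion in the example above can be illustrated with a minimal sketch. It assumes a hand-written crosswalk table; the concept key \texttt{title}, the function names and the \texttt{field = "term"} query syntax are hypothetical and only serve the illustration, they are not the interface of the actual service.

```python
# Hypothetical crosswalk table: concept identifier -> fields linked to it.
# The field names come from the example above; the key "title" is an assumption.
CROSSWALKS = {
    "title": {
        "dublincore:title",
        "resourceTitle",
        "BookTitle",
        "tei:titleStmt",
        "Corpus/GeneralInfo/Name",
    },
}


def expand_field(field, crosswalks=CROSSWALKS):
    """Return the concept cluster of `field`, or only the field if unmapped."""
    cluster = {field}
    for fields in crosswalks.values():
        if field in fields:
            cluster |= fields
    return sorted(cluster)


def expand_query(field, term, crosswalks=CROSSWALKS):
    """Rewrite `field = term` as a disjunction over the concept cluster."""
    clauses = [f'{name} = "{term}"' for name in expand_field(field, crosswalks)]
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(expand_query("dublincore:title", "corpus"))
```

Fields such as \concept{Project/Title} are not part of any cluster in this table, so a query on them is left unchanged, which is the behaviour that string matching on labels alone could not guarantee.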

\subsubsection*{Semantic interpretation}

The problem of different labels for semantically similar or even identical entities is even more virulent on the level of individual values in the fields of the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be explicitly and exhaustively enumerated. This leads to a chronically inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to propose a mechanism to map (string) values in selected fields to entities defined in corresponding vocabularies.
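
One simple way to approach such a mapping is a lookup of normalized labels against a vocabulary that lists alternative labels per entity. The following sketch illustrates this idea only; the vocabulary content, the identifier \texttt{org:example-university} and all function names are made up for the illustration, and a real mechanism would likely need fuzzier matching than this.

```python
import re
import unicodedata


def normalize(label):
    """Case-fold, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.casefold())
    return " ".join(text.split())


# Hypothetical vocabulary: entity identifier -> known alternative labels.
VOCABULARY = {
    "org:example-university": [
        "Example University",
        "Univ. Example",
        "EXAMPLE UNIVERSITY (EU)",
    ],
}


def build_index(vocabulary):
    """Map every normalized label to its entity identifier."""
    index = {}
    for entity_id, labels in vocabulary.items():
        for label in labels:
            index[normalize(label)] = entity_id
    return index


def interpret(value, index):
    """Return the entity for a field value, or None if no label matches."""
    return index.get(normalize(value))


if __name__ == "__main__":
    index = build_index(VOCABULARY)
    print(interpret("  univ example ", index))
```

Values that match no known label yield \texttt{None} and would have to be handled separately, e.g. by manual curation of the vocabulary.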

\subsubsection*{Ontology-driven data exploration}

Based on the results of the previous parts of the work -- crosswalks and semantic interpretation -- the discussed dataset can be expressed as one big ontology. Consequently, semantic web technologies can be applied, giving the user new means of \emph{exploring the dataset} through semantic resources.

\paragraph{Example}
Ontology-driven search -- Starting from a list of topics, the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus, in general, the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external interlinked semantic resources.

\subsubsection*{Visualization}
Given the large, heterogeneous and complex dataset, it seems indispensable to equip the user with advanced means for exploring and interacting with it. Hence this subgoal aims at exploring ways of visualizing the data at hand.

\section{Method}
We start by examining the existing data and describing the existing infrastructure in which this work is embedded.

Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure.

Subsequently, we explore ways of integrating this service into exploitation tools (metadata search engines), to enhance search/retrieval through the use of semantic relations between concepts or categories.

This theoretical part will be accompanied by a prototypical implementation as proof of concept.

In an evaluation phase, we apply a set of test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures.

In this work the focus lies on the method itself -- expressed in the specification and operationalized in the (prototypical) implementation of the module -- rather than on trying to establish a final, accomplished alignment of the schemas. Although a tentative mapping on a subset of the data will be proposed, this will be mainly used for evaluation and shall serve as a basis for discussion with domain experts aimed at creating further, more comprehensive mappings.

In fact, due to the great diversity of resources and research tasks, a ``final'' complete alignment does not seem achievable at all. Therefore, the focus shall also be on ``soft'' dynamic mapping, i.e. enabling users to adapt the mapping or apply different mappings depending on their current task or research question, essentially allowing them to actively influence the recall/precision ratio of the search results.
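
How a looser or stricter mapping shifts the recall/precision ratio can be made concrete with the standard definitions of both measures. The sketch below uses toy record identifiers that are invented purely for illustration; they are not evaluation results of this work.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a result set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data (assumption): four relevant records for some test query.
RELEVANT = {"r1", "r2", "r3", "r4"}
# A traditional search on a single field finds few, but only correct records.
STRICT_RESULT = {"r1", "r2"}
# An expanded query finds more relevant records, plus one false hit (r5).
EXPANDED_RESULT = {"r1", "r2", "r3", "r5"}

if __name__ == "__main__":
    print("strict:  ", precision_recall(STRICT_RESULT, RELEVANT))
    print("expanded:", precision_recall(EXPANDED_RESULT, RELEVANT))
```

In this toy setting the expanded query trades some precision for higher recall; choosing between different mappings amounts to choosing a point on this trade-off.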

\begin{note}
A special focus will be put on examining the feasibility of employing ontology mapping and alignment techniques and tools for the creation of the mappings,
especially the application of these techniques to the domain-specific data collection (the domain of LRT).
\end{note}

Serving the second subgoal, semantic interpretation on the instance level, we propose expressing all of the domain data (from meta-model specification to instances) in RDF, linking to corresponding entities in appropriate external
semantic resources (controlled vocabularies, ontologies).
Once the dataset is expressed in RDF, it can be exposed via a semantic web application and published as another nucleus of \emph{Linked Open Data} in the global \emph{Web of Data}.

A separate usability evaluation of the semantic search is called for, examining how users interact with the relevant additional information and how it is displayed in the search interface. However, this issue can only be touched upon here and will have to be deferred to future work.

\section{Expected Results}

The main result of this work will be the \emph{specification} of the two modules \xne{concept-based search} and the underlying \xne{crosswalk service}.
This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components
and the results and findings of the \emph{evaluation}.

Another result of the work will be the original dataset expressed as RDF interlinked with existing external resources (ontologies, knowledge bases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\furl{http://linkeddata.org/} in the \emph{Web of Data}.
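
What ``expressed as RDF interlinked with external resources'' means for a single record can be sketched by serializing two statements as N-Triples. The \texttt{example.org} URIs and the function names are placeholders chosen for this illustration, and the choice of the Dublin Core terms \texttt{title} and \texttt{publisher} as properties is an assumption, not the modelling proposed in this work. A real conversion would use an RDF library and full literal escaping; only backslashes and double quotes are escaped here.

```python
DCT = "http://purl.org/dc/terms/"  # Dublin Core terms namespace


def ntriple(subject, predicate, obj, literal=False):
    """Serialize one statement as an N-Triples line (simplified escaping)."""
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        return f'<{subject}> <{predicate}> "{escaped}" .'
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe(record_uri, title, publisher_uri):
    """Describe a record: a literal title and a link to an external entity."""
    return [
        ntriple(record_uri, DCT + "title", title, literal=True),
        ntriple(record_uri, DCT + "publisher", publisher_uri),
    ]


if __name__ == "__main__":
    # Placeholder URIs (assumption); real data would use resolvable identifiers.
    for line in describe(
        "http://example.org/record/1",
        'A "sample" corpus',
        "http://example.org/org/example-university",
    ):
        print(line)
```

The second statement is the interesting one: instead of a string label, the record points to the URI of an entity, which is what allows it to be joined with external datasets.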

\begin{description}
\item [Crosswalk service] specification and basic proof-of-concept implementation of the module
\item [Concept-based search] design of the query expansion and integration with search engines
\item [Visualization] design of an application for interactive exploration of the concerned dataset
\item [Evaluation] evaluation results of querying the dataset comparing traditional search and semantic search
\item [LinkedData] translation of the source dataset to an RDF-based format with links into existing datasets, ontologies, knowledge bases
\end{description}

\section{Structure of the work}
The work starts by examining the state of the art in the two fields of language resources and technology and semantic web technologies in chapter \ref{ch:lit}, followed by the administrative chapter \ref{ch:def}, which explains the terms and abbreviations used in the work.

In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components/modules/services of the infrastructure underlying this work.

The main part of the work is found in chapters \ref{ch:design} and \ref{ch:design-instance}, which lay out the design of the software module and the proposal for modelling the data in RDF, respectively.

The evaluation and the results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.

\section{Keywords}

Metadata interoperability, Ontology Mapping, Schema mapping, Crosswalk, LinkedData,
Fuzzy Search, Visual Search?

Language Resources and Technology, LRT/NLP/HLT

Ontology Visualization