% source: SMC4LRT/chapters/Introduction.tex @ 2697
% last change: revision 2697, checked in by vronk (various additions)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Introduction}
\label{ch:intro}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


        \begin{itemize}
                \item motivation
                \item problem statement (which problem should be solved?)
                \item aim of the work
                \item methodological approach
                \item structure of the work
        \end{itemize}


\subsection{Problem statement}

While the Digital Libraries community has largely consolidated and set up large federated networks of digital library repositories, the landscape in the field of Language Resources and Technology is still scattered, despite a decade of standardization efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, which stems from the wide range of resource types and is further complicated by their dependence on different schools of thought.

\todo{Need some numbers about the disparity in the field: number of institutes, resources, formats.}

The community has recognized this situation and has undertaken multiple standardization initiatives. This process seems to have gained new momentum thanks to large Research Infrastructure Programmes introduced by the European Commission, which aim to foster research communities developing large-scale pan-European common infrastructures. One key player in this development is the CLARIN project.


\subsection{Main Goal}

This work proposes a component that shall enhance search functionality over a \emph{large heterogeneous collection of metadata descriptions} of Language Resources and Technology (LRT). By applying Semantic Web technology, the user shall be given both better recall through \emph{query expansion} based on related categories/concepts and new means of \emph{exploring the dataset} via ontology-driven browsing.

Put differently, this work proposes a module that allows query expansion by providing mappings between search indexes. This enables semantic search, ultimately increasing recall when searching in metadata collections. The module builds on the Data Category Registry and the Component Metadata Framework, which are part of CMDI.


Two examples illustrate this. The first is a concept-based query expansion:
given the user query \texttt{Actor.Name = Sue}, and knowing that \texttt{Actor} is a synonym of \texttt{Person} and \texttt{Name} is a synonym of \texttt{FullName}, the expanded query could look like:
\begin{quote}
\texttt{Actor.Name = Sue OR Actor.FullName = Sue OR \\ Person.Name = Sue OR Person.FullName = Sue}
\end{quote}
The second is an ontology-driven search: starting from a list of topics, the user can browse an ontology to find institutions concerned with those topics and retrieve the union of resources for the resulting cluster. In general, the user can thus work with the data based on information that is not present in the original dataset but in external, linked-in semantic resources.
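
The concept-based expansion from the first example can be sketched in a few lines of Python. This is only an illustrative sketch: the synonym table \texttt{SYNONYMS} and the function \texttt{expand\_query} are hypothetical names introduced here, and the relations are hard-coded, whereas in the proposed setting they would be supplied by the mapping component.

```python
from itertools import product

# Hypothetical synonym table (assumption for illustration only); in the
# proposed setting such relations would come from the mapping component.
SYNONYMS = {
    "Actor": ["Person"],
    "Name": ["FullName"],
}


def expand_query(index, value, synonyms=SYNONYMS):
    """Expand 'A.B = value' into an OR over all synonym combinations."""
    parts = index.split(".")
    # For every path segment: the segment itself, followed by its synonyms.
    alternatives = [[part] + synonyms.get(part, []) for part in parts]
    # The Cartesian product enumerates every combination of alternatives.
    clauses = [
        ".".join(combination) + " = " + value
        for combination in product(*alternatives)
    ]
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(expand_query("Actor.Name", "Sue"))
```

Taking the Cartesian product over the per-segment alternatives reproduces the four clauses of the example above; a segment without known synonyms simply contributes itself.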

Such \textbf{semantic search} functionality requires a preprocessing step that produces the underlying linkage, both between categories/concepts and on the instance level. We refer to this task as \textbf{semantic mapping}; it shall be realized by a corresponding \texttt{Semantic Mapping Component}. In this work the focus lies on the method itself -- expressed in the specification and operationalized in the (prototypical) implementation of the component -- rather than on establishing a final, complete alignment. Although a tentative, na\"ive mapping on a subset of the data will be proposed, it will mainly be used for evaluation and shall serve as a basis for discussion with domain experts, with the aim of creating sensible mappings usable for real tasks.

In fact, due to the great diversity of resources and research tasks, a ``final'' complete alignment does not seem achievable at all. Therefore the focus shall also be on ``soft'' dynamic mapping, i.e.\ enabling users to adapt the mapping or to apply different mappings depending on their current task or research question, and thus to actively manipulate the recall/precision ratio of the search results. This entails examining user interaction with, and visualization of, the relevant additional information in the search user interface. However, this opens the door to the field of usability engineering, which is new to this work and can be treated here only marginally.

\subsection{Method}
We start by examining the existing data and describing the evolving infrastructure in which the components are to be embedded. Then we formulate the function of \textbf{Semantic Search}, distinguishing between the concept level -- using semantic relations between concepts or categories for better retrieval -- and the instance level -- allowing the user to explore the primary data collection via semantic resources (ontologies, vocabularies).

Subsequently we introduce the underlying \textbf{Semantic Mapping Component}, again distinguishing the two levels -- concepts and instances. We describe the workflow and the central methods, building upon the existing pieces of the infrastructure (see \textit{Infrastructure Components} in \ref{SotA}). A special focus will be put on examining the feasibility of employing ontology mapping and alignment techniques and tools for the creation of the mappings.

In the practical part -- processing the data -- a necessary prerequisite is that the dataset be expressed in RDF.
Independently, starting from a survey of existing semantic resources (ontologies, vocabularies), we identify an initial set of relevant
ones. These will then be used to map the literal values in the metadata descriptions, by then converted to RDF, onto externally defined entities, with the goal of interlinking the dataset with external resources (see \textit{Linked Data} in \ref{SotA}).
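
The instance-level step, mapping literal values onto externally defined entities, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: triples are modelled as plain Python tuples rather than with an RDF library, the \texttt{example.org} URIs are placeholders, and exact matching on normalized labels stands in for whatever matching method the component will eventually use.

```python
# Hypothetical vocabulary (assumption): normalized label -> URI of an
# externally defined entity. The example.org URIs are placeholders.
VOCABULARY = {
    "max planck institute for psycholinguistics": "http://example.org/org/MPI-PL",
    "german": "http://example.org/lang/deu",
}


def normalize(label):
    """Lower-case and collapse whitespace so trivial spelling variants match."""
    return " ".join(label.lower().split())


def link_literals(triples, vocabulary=VOCABULARY):
    """Replace literal objects by entity URIs where the vocabulary has a match.

    Triples are (subject, predicate, object, is_literal) tuples; literals
    without a match and non-literal objects are kept unchanged.
    """
    linked = []
    for subject, predicate, obj, is_literal in triples:
        uri = vocabulary.get(normalize(obj)) if is_literal else None
        if uri is not None:
            linked.append((subject, predicate, uri, False))
        else:
            linked.append((subject, predicate, obj, is_literal))
    return linked


if __name__ == "__main__":
    sample = [
        ("res1", "language", "  German ", True),
        ("res1", "creator", "Unknown Lab", True),
    ]
    for triple in link_literals(sample):
        print(triple)
```

Keeping unmatched literals untouched means the step can be rerun as the vocabulary grows, without losing any of the original metadata.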

Finally, with a prototypical implementation of the two components we want to deliver a proof of concept, supported by an evaluation
in which we apply a set of test queries and compare a traditional search with a semantically expanded query in terms of recall/precision indicators. A separate evaluation of the usability of the Semantic Search component would be warranted; however, this issue can only be touched upon here and will have to be deferred to future work.
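
The comparison in terms of recall and precision can be sketched as follows, using the standard definitions: precision is the share of retrieved resources that are relevant, recall the share of relevant resources that were retrieved. The function name \texttt{precision\_recall} and the resource identifiers are made up for illustration and are not evaluation results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved & relevant| / |retrieved|
    recall    = |retrieved & relevant| / |relevant|
    An empty denominator yields 0.0 by convention in this sketch.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up identifiers: the resources judged relevant for a test query,
    # and the result sets of a plain and of a semantically expanded search.
    relevant = {"r1", "r2", "r3", "r4"}
    traditional = {"r1", "r2"}
    expanded = {"r1", "r2", "r3", "r5"}
    print("traditional:", precision_recall(traditional, relevant))
    print("expanded:   ", precision_recall(expanded, relevant))
```

In this toy case the expanded query raises recall at the cost of some precision, which is the trade-off the evaluation is meant to quantify.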

\begin{itemize}
\item a) define/use semantic relations between categories (RelationRegistry)
\item b) employ ontological resources to enhance search in the dataset (SemanticSearch)
\item c) specify translation instructions for expressing the dataset in RDF (LinkedData)
\end{itemize}

\subsection{Expected Results}
The primary concern of this work is the integrative effort, i.e.\ putting together existing pieces (resources, components and methods), in particular applying techniques from ontology mapping to a domain-specific data collection (the domain of LRT). Thus the main result of this work will be the \emph{specification} of the two components \texttt{Semantic Search} and the underlying \texttt{Semantic Mapping}.
This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components and by the results and findings of the \emph{evaluation}.

One promising by-product of the work will be the original dataset expressed as RDF with links into existing external resources (ontologies, knowledge bases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\footnote{\url{http://linkeddata.org/}} in the \emph{Web of Data}.


\begin{description}
\item [Specification] definition of the mapping mechanism
\item [Prototype] proof-of-concept implementation
\item [Evaluation] results of querying the dataset, comparing traditional search and semantic search
\item [LinkedData] translation of the source dataset to an RDF-based format with links into existing datasets/ontologies/knowledge bases
\end{description}


\subsection{Keywords}

Metadata interoperability, Ontology Mapping, Schema mapping, Crosswalk, Similarity measures, LinkedData,
Fuzzy Search, Visual Search?

Language Resources and Technology, LRT/NLP/HLT

Ontology Visualization