Changeset 3234 for SMC4LRT


Timestamp:
08/05/13 13:25:21 (11 years ago)
Author:
vronk
Message:

still reformulating Introduction

Location:
SMC4LRT/chapters
Files:
2 edited

  • SMC4LRT/chapters/Infrastructure.tex

    r3204 r3234  
    11\chapter{Underlying infrastructure}
    2 \label{ch:components}
     2\label{ch:infra}
    33
    44
  • SMC4LRT/chapters/Introduction.tex

    r3204 r3234  
    44%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    55
    6 \todocode{install older python (2.5?) to be able to install dot2tex - transforming dot files to nicer pgf formatted graphs}\furl{http://dot2tex.googlecode.com/files/dot2tex-2.8.7.zip}\furl{file:/C:/Users/m/2kb/tex/dot2tex-2.8.7/}
    7 
    8 
    96\section{Motivation / problem statement}
    107
    11 While in the Digital Libraries community a consolidation generally already happened and global federated networks of digital library repositories are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardizing efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming for one from the wide range of resource types combined with project-specific needs. (See chapter \ref{ch:data} for analysis of the data disparity in the domain)
     8While in the Digital Libraries community a consolidation generally already happened and global federated networks of digital library repositories are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardization and integration efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types combined with project-specific needs. (chapter \ref{ch:data} analyses the disparity in the data domain)
    129
    13 This situation has been identified by the community and multiple standardization initiatives had been conducted/undertaken. The process has gained a new momentum thanks to large research infrastructure programmes introduced by the European Commission, aimed at supporting research communities developing common large-scale pan-european infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars, by providing a common infrastructure. One core pillar of this infrastructure is a common framework for resource descriptions (metadata) the Component Metadata Framework (cf. \ref{def:cmdi}). This work discusses the component within the Component Metadata Infrastructure concerned with /dedicated to overcoming the heterogenity of the resource descriptions.
     10This situation has been identified by the community and multiple standardization initiatives have been undertaken. The process has gained new momentum thanks to large research infrastructure programmes introduced by the European Commission, aimed at fostering the development of common large-scale international infrastructures. One key player in this development is the project CLARIN (see section \ref{def:CLARIN}). The main objective of this initiative is to make language resources and technologies more easily available to scholars by providing a common harmonized architecture for accessing and working with LRT. One core pillar of this architecture is the \emph{Component Metadata Infrastructure} (cf. \ref{def:cmdi})
     11-- a distributed system consisting of multiple interconnected applications aimed at creating and providing metadata for LRT in a coherent harmonized way.
    1412
     13This work discusses a module within the Component Metadata Infrastructure -- the \emph{Semantic Mapping Component} -- dedicated to overcoming or at least easing the semantic interoperability problem stemming from the heterogeneity of the resource descriptions, without the reductionist approach of trying to impose one common description schema for all resources.
    1514
    1615\section{Main Goal}
    1716
    18 The primary goal of this work is to enhance search functionality over a \emph{large heterogeneous collection of metadata descriptions} of Language Resources and Technology (LRT). By applying semantic web technology the user shall be given both better recall through \emph{query expansion} based on related categories/concepts and new means of \emph{exploring the dataset} via ontology-driven browsing.
     17The primary goal of this work is to \emph{\textbf{enhance search functionality} over a large heterogeneous collection of resource descriptions} in the field of Language Resources and Technology (LRT), henceforth referred to as \emph{semantic search}, distinguishing it from the necessary underlying processing, referred to as \emph{semantic mapping}.
    1918
    20 This main goal can be broken down along the three meanings of the term ``mapping'':
     19The -- notoriously polysemic -- term ``mapping'' can have three different meanings within this work,
     20that also translate into three corresponding subgoals:
    2121
    2222\begin{description}
    23 \item[crosswalk] establish links between crosswalks between metadata formats
     23\item[crosswalk] link related fields in different metadata formats
    2424\item[interpret] translate string labels in field values to semantic entities
    25 \item[visualize] provide a visualization of the data.
     25\item[visualize] provide appropriate means to explore the domain data.
    2626\end{description}
    2727
    28 Alternatively/  that allows query expansion by providing mappings between search indexes. This enables semantic search, ultimately increasing the recall when searching in metadata collections. The module builds on the Data Category Registry and Component Metadata Framework that are part of CMDI.
     28The work can further be divided along the schema / instance duality/dimension. Figure \ref{fig:master_outline} sketches the goals / conceptual space of this thesis.
    2929
     30%\includegraphics[width=\unitlength]{images/master_outline.eps}
     31\label{fig:master_outline}
     32\input{images/master_outline.eps_tex}
    3033
    31 Following two examples for better illustration. First a concept-based query expansion:
    32 Confronted with a user query: \texttt{Actor.Name = Sue} and knowing that \texttt{Actor} is synonym to \texttt{Person} and \texttt{Name} is synonym to \texttt{FullName} the expanded query could look like:
     34\subsubsection*{Crosswalks}
     35The goal is not primarily to produce the crosswalks but rather to develop the service serving them.
     36
     37???
     38
     39While this may seem a rather trivial task, it is not if we consider the heterogeneity and complexity of the dataset,
     40further complicated by the fact that this shall be a community-driven process, without a central authority defining the relations
     41and that there may be even need for different relation sets for different tasks. In fact, a number of modules of the discussed infrastructure are dedicated to overcoming the semantic interoperability problem.
     42
     43\subsubsection*{Concept-based query expansion}
     44
     45Once the crosswalks are available, they can be used to expand/translate user queries, to match related fields across heterogeneous metadata formats, resulting in higher recall.
     46
     47\paragraph{Example}
     48Confronted with a user query searching in the notorious \concept{dublincore:title} the query has to be expanded to
     49all the semantically near fields (concept cluster), that are however labelled (or even structured) differently in other formats like
     50
     51\begin{quote}
     52\concept{resourceTitle, BookTitle, tei:titleStmt, Corpus/GeneralInfo/Name}
     53\end{quote}
     54
     55but probably not to other fields, using the same (sub)strings for the field labels
     56but with different semantics, like:
     57
    3358\begin{quote}
    34 \texttt{Actor.Name = Sue OR Actor.FullName = Sue OR \\ Person.Name =  Sue OR Person.FullName = Sue}
     59\concept{Project/Title, Organisation/Name, Country/Name}
    3560\end{quote}
    36 And second, an ontology-driven search: Starting from a list of topics the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external linked-in semantic resources.
    3761
    38 Such \textbf{semantic search} functionality requires a preprocessing step, that produces the underlying linkage both between categories/concepts and on the instance level. We refer to this task as \textbf{semantic mapping}, that shall be realized by corresponding \texttt{Semantic Mapping Component}. In this work the focus lies on the method itself -- expressed in the specification and operationalized in the (prototypical) implementation of the component -- rather than trying to establish a final, accomplished alignment. Although a tentative, na\"ive mapping on a subset of the data will be proposed, this will be mainly used for evaluation and shall serve as basis for discussion with domain experts aimed at creating the actual sensible mappings usable for real tasks.
     62\subsubsection*{Semantic interpretation}
    3963
    40 In fact, due to the great diversity of resources and research tasks, a ``final'' complete alignment does not seem achievable at all. Therefore also the focus shall be on ``soft'' dynamic mapping, i.e. to enable the users to adapt the mapping or apply different mappings depending on their current task or research question essentially being able to actively manipulate the recall/precision ratio of the search results. This entails an examination of user interaction with and visualization of the relevant additional information in the user search interface. However this would open doors to a whole new (to this work) field of usability engineering and can be treated here only marginally.
     64The problem of different labels for semantically similar or even identical things is even more virulent on the level of individual values in the instance data. A number of metadata fields (like \concept{organization} or \concept{resource type}) have a constrained value domain that nevertheless cannot be exhaustively enumerated. This leads to inconsistent use of labels for referring to entities. (As the instance data shows, some organizations are referred to by more than 20 different labels.) Thus, one goal of this work is to map (string) values in selected fields to entities defined in corresponding vocabularies.
     65
     66\subsubsection*{Ontology-driven search / data exploration}
     67
     68By applying semantic web technologies, the user will be given new means of \emph{exploring the dataset} through semantic resources (ontology-driven search/browsing/exploration).
     69
     70\paragraph{Example}
     71Ontology-driven search: Starting from a list of topics the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external linked-in semantic resources.
     72
     73\subsubsection*{Visualization}
     74Given the large, heterogeneous and complex dataset, it seems indispensable to equip the user with advanced means for exploration of and interaction with it. Hence this subgoal aims at exploring ways of visualizing the data at hand.
    4175
    4276\section{Method}
    43 We start with examining the existing data and describing the evolving infrastructure in which the components are to be embedded. Then we formulate the function of \textbf{Semantic Search} distinguishing between the concept level -- using semantic relations between concepts or categories for better retrieval -- and the instances level -- allowing the user to explore the primary data collection via semantic resources (ontologies, vocabularies).
     77The primary concern of this work is the integrative effort, i.e. bringing together existing pieces (resources, components and methods). We start with examining the existing data and the description of the evolving infrastructure in which this work is embedded.
    4478
    45 Subsequently we introduce the underlying \textbf{Semantic Mapping Component} again distinguishing the two levels - concepts and instances. We describe the workflow and the central methods, building upon the existing pieces of the infrastructure (See \textit{Infrastructure Components} in \ref{SotA} ). A special focus will be put on the examination of the feasibility of employing ontology mapping and alignment techniques and tools for the creation of the mappings.
     79Building on this groundwork, in accordance with the first subgoal, we lay out the design of the service for handling crosswalks and concept-based query expansion. We describe the workflow, the central methods and the role of the module relative to other parts of the infrastructure.
    4680
    47 In the practical part - processing the data - a necessary prerequisite is the dataset being expressed in RDF.
    48 Independently,  starting from a survey of existing semantic resources (ontologies, vocabularies), we identify an intial set of relevant
    49 ones. These will then be used in the exercise of mapping the literal values in the by then RDF-converted metadata descriptions onto externally defined entities, with the goal of interlinking the dataset with external resources (see \textit{Linked Data} in \ref{SotA}).
     81Subsequently, we explore the ways of integrating this service into exploitation tools (metadata search engines), to enhance search/retrieval through the use of semantic relations between concepts or categories.
    5082
    51 Finally, in a prototypical implementation of the two components we want to deliver a proof of the concept, supported by an evaluation
    52 in which we apply a set of  test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures. A separate evaluation of the usability of the Semantic Search component  is indicated, however this issue can only be tackled marginally and will have to be outsourced into future work.
     83This theoretical part will be accompanied by a prototypical implementation as proof of concept.
    5384
    54 \begin{itemize}
    55 \item a) define/use semantic relations between categories (RelationRegistry)
    56 \item b) employ ontological resources to enhance search in the dataset (SemanticSearch)
    57 \item c) specify a translation instructions for expressing dataset in rdf  (LinkedData)
    58 \end{itemize}
     85In an evaluation phase, we apply a set of  test queries and compare a traditional search with a semantically expanded query in terms of recall/precision measures.
     86
     87In this work the focus lies on the method itself -- expressed in the specification and operationalized in the (prototypical) implementation of the module -- rather than on trying to establish a final, accomplished alignment of the schemas. Although a tentative mapping on a subset of the data will be proposed, this will be mainly used for evaluation and shall serve as a basis for discussion with domain experts aimed at creating further, more comprehensive mappings.
     88
     89In fact, due to the great diversity of resources and research tasks, a ``final'' complete alignment does not seem achievable at all. Therefore the focus shall also be on ``soft'' dynamic mapping, i.e. on enabling the users to adapt the mapping or apply different mappings depending on their current task or research question, so that they can actively influence the recall/precision ratio of the search results.
     90
     91\begin{note}
     92A special focus will be put on the examination of the feasibility of employing ontology mapping and alignment techniques and tools for the creation of the mappings.
     93
     94especially the application of techniques from ontology mapping to the domain-specific data collection (the domain of LRT).
     95\end{note}
     96
     97Serving the second subgoal, semantic interpretation on the instance level, we will propose the expression of all of the domain data (from meta-model specification to instances) in RDF, linking to corresponding entities in appropriate external
     98semantic resources (controlled vocabularies, ontologies).
     99Once the dataset is expressed in RDF, it can be exposed via a semantic web application and publicized as another nucleus of \emph{Linked Open Data} in the global \emph{Web Of Data}.
     100
     101A separate usability evaluation of the semantic search is indicated, examining the user interaction with and display of the relevant additional information in the user search interface. However, this issue can only be tackled marginally here and will have to be left to future work.
    59102
    60103\section{Expected Results}
    61 The primary concern of this work is the integrative effort, i.e. putting together existing pieces (resources, components and methods) especially the application of techniques from ontology mapping to the domain-specific data collection (the domain of LRT). Thus the main result of this work will be the \emph{specification} of the two components \texttt{Semantic Search} and the underlying \texttt{Semantic Mapping}.
    62 This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components and the results and findings of the \emph{evaluation}.
    63104
    64 One promising by-product of the work will be the original dataset expressed as RDF with links into existing external  resources (ontologies, knowledgebases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\furl{http://linkeddata.org/} in the \emph{Web of Data}.
     105The main result of this work will be the \emph{specification} of the two modules \texttt{Semantic Search} and the underlying \texttt{Semantic Mapping}.
     106This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components
     107and the results and findings of the \emph{evaluation}.
    65108
     109Another result of the work will be the original dataset expressed as RDF interlinked with existing external resources (ontologies, knowledge bases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\furl{http://linkeddata.org/} in the \emph{Web of Data}.
    66110
    67111\begin{description}
    68 \item [Specification] definition of the mapping mechanism
     112\item [Specification Semantic Mapping] design of the mapping mechanism
     113\item [Specification Semantic Search] design of the query expansion and integration with search engines
    69114\item [Prototype] proof of concept implementation
    70115\item [Evaluation] evaluation results of querying the dataset comparing traditional search and semantic search
     
    73118
    74119\section{Structure of the work}
    75 The work starts with examining the state of the art work in the two fields  language resources and technology and semantic web technologies in chapter \ref{ch:lit}. In chapter \ref{cd:data} we analyze the situation in the data domain of LRT metadata  and
    76 in chapter \ref{ch:infra} we discuss the individual software components /modules /services of the infrastructure underlying this work.
     120The work starts with examining the state of the art in the two fields of language resources and technology and semantic web technologies in chapter \ref{ch:lit}, followed by the administrative chapter \ref{ch:def} explaining the terms and abbreviations used in the work.
    77121
    78 The main part of the work is found in chapters \ref{ch:design}, \ref{ch:cmd2rdf} and \ref{ch:implementation}
     122In chapter \ref{ch:data} we analyze the situation in the data domain of LRT metadata and in chapter \ref{ch:infra} we discuss the individual software components /modules /services of the infrastructure underlying this work.
    79123
    80 The findings and results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
     124The main part of the work is found in chapters \ref{ch:design}, \ref{ch:implementation} and \ref{ch:cmd2rdf} laying out the design of the software module, the proposal for how to model the data in RDF and the possibilities of visualization respectively.
    81125
     126The evaluation and the results are discussed in chapter \ref{ch:results}. Finally, in chapter \ref{ch:conclusions} we summarize the findings of the work and lay out where it could develop in the future.
    82127
    83128\section{Keywords}