Changeset 1205 for SMC4LRT


Timestamp:
04/14/11 10:25:52 (13 years ago)
Author:
vronk
Message:

Expose intermediate (near finished) version

Location:
SMC4LRT
Files:
3 edited

  • SMC4LRT/Expose.tex

    r1200 r1205  
    3636\usepackage{graphicx} % support the \includegraphics command and options
    3737
    38 % \usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
     38%\usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
     39\setlength{\parskip}{1ex plus 0.5ex minus 0.2ex}
    3940
    4041%%% PACKAGES
     
    7879\section{Main Goal}
    7980
    80 This work proposes a component that shall enhance search functionality over a large heterogeneous collection of metadata descriptions of Language Resources and Technology (LRT). By applying semantic web technology the user shall be given both better recall through \emph{query expansion} based on related categories, concepts and new means of \emph{exploring the dataset} via ontology-driven browsing.
     81This work proposes a component that shall enhance search functionality over a large heterogeneous collection of metadata descriptions of Language Resources and Technology (LRT). By applying semantic web technology the user shall be given both better recall through \emph{query expansion} based on related categories/concepts and new means of \emph{exploring the dataset} via ontology-driven browsing.
    8182
    82 A trivial example for a concept-based query expansion:
    83 Confronted with a user query: \texttt{Actor.Name = Sue} and knowing that \texttt{Actor} is equivalent or similar to \texttt{Person} and \texttt{Name} is synonym to \texttt{FullName} the expanded query could look like:
     83Two examples illustrate this. A simple example of concept-based query expansion:
     84Confronted with a user query \texttt{Actor.Name = Sue}, and knowing that \texttt{Actor} is a synonym of \texttt{Person} and \texttt{Name} is a synonym of \texttt{FullName}, the expanded query could look like:
    8485\begin{quote}
    8586\texttt{Actor.Name = Sue OR Actor.FullName = Sue OR \\ Person.Name =  Sue OR Person.FullName = Sue}
    8687\end{quote}
     88And an example for an ontology-driven search: Starting from a list of topics the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external linked-in semantic resources.
    8789
    88 An example concerning instance mapping: Starting from a list of topics the user can browse the ontology to find institutions concerned with those topics and retrieve a union of resources for such resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external linked-in semantic resources.
     90Such \emph{semantic search} functionality requires a preprocessing step that produces the underlying linkage both between categories/concepts and on the instance level. We refer to this task as \emph{semantic mapping}, which shall be realized by a corresponding \texttt{Semantic Mapping Component}. In this work the focus lies on the method itself -- expressed in the specification and operationalized in the (prototypical) implementation of the component -- rather than on trying to establish an actual final, accomplished alignment. Although a tentative, naive mapping on a subset of the data will be proposed, this will be mainly used for evaluation and shall serve as a basis for discussion with domain experts aimed at creating the actual sensible mappings usable for real tasks.
    8991
    90 Such semantic search functionality requires a preprocessing step, that produces the underlying linkage, both between categories/concepts and between instances (mapping literal values to entities). We refer to this task as \emph{semantic mapping}, that shall be accomplished by coresponding \textsc{Semantic Mapping Component}. In this work the focus lies on the process/method operationalized in the specification and (prototypical) implementation of the component rather than trying to establish some final, accomplished mapping. Although a tentative, naive alignement on a subset of the data will be proposed, this will be mainly used for evaluation and shall serve as basis for discussion with domain experts aiming at creating the actual sensible mappings usable for real tasks.
    91 
    92 Actually due to the great diversity of resources and research tasks  such a "final" complete mapping/alignement does not seem achievable at all. Therefore also the focus shall be on "soft", dynamic mapping, investigating the possibilities/methods to enable the users to adapt the mapping or apply different mappings with respect to their current task or research question,
    93 essentially being able to actively manipulate the recall/precision ratio of their searches. This entails the examination of user interaction with and visualization of the relevant information in the user interface and enabling the user to act upon it.
     92In fact, due to the great diversity of resources and research tasks, a "final" complete alignment does not seem achievable at all. Therefore the focus shall also be on "soft" dynamic mapping, i.e. enabling users to adapt the mapping or apply different mappings depending on their current task or research question, essentially allowing them to actively manipulate the recall/precision ratio of the search results. This entails an examination of user interaction with and visualization of the relevant additional information in the user search interface. However, this would open the door to a field that is new to this work, usability engineering, and can be treated here only marginally.
    9493
    9594\section{Method}
    96 We start with examining the existing data and describing the evolving Infrastructure in which the components are to be embedded.
    97 Then we formulate the task/function of \emph{Semantic Search} distinguishing between the concept level - using semantic relations between concepts or categories for better retrieval and the individuals level - allowing the user to navigate to resources via semantic resources (ontologies, vocabularies).
    98 Subsequently we introduce the underlying \emph{Semantic Mapping Component} and the requirements within the defined context,
    99 followed by a design proposal for an appropriate component building upon the existing pieces of the infrastructure.
    100 A special focus will be put on the examination of the feasibility of employing ontology mapping and alignement techniques and tools for the creation of the mappings.
     95We start by examining the existing data and describing the evolving infrastructure in which the components are to be embedded. Then we formulate the function of \emph{Semantic Search}, distinguishing between the concept level -- using semantic relations between concepts or categories for better retrieval -- and the instance level -- allowing the user to explore the primary data collection via semantic resources (ontologies, vocabularies).
    10196
    102 Based on a survey of existing semantic resources (ontologies, vocabularies), we identify an intial set of relevant
    103 ones, which will be used in the exercise of mapping the literal values in the metadata descriptions to the externally defined entities, essentially interrelating the dataset with external resources and entities. ("Linked Data"). A necessary preprocessing step is the task of expressing the dataset in RDF.
     97Subsequently we introduce the underlying \emph{Semantic Mapping Component}, again distinguishing the two levels -- concepts and instances. We describe the workflow and the central methods, building upon the existing pieces of the infrastructure (see Infrastructure Components in \ref{SotA}). A special focus will be put on the examination of the feasibility of employing ontology mapping and alignment techniques and tools for the creation of the mappings.
    10498
    105 In a prototype we want to deliver a proof of the concept,
    106 combined with an evaluation to verify the claims of fitness for the purpose.
    107 This evaluation is twofold. It shall verify the ability of the system to support dynamic mapping based on a set of test queries
    108 and secondly the usability of the ui-controls.
     99In the practical part -- processing the data -- a necessary prerequisite is that the dataset be expressed in RDF.
     100Independently, starting from a survey of existing semantic resources (ontologies, vocabularies), we identify an initial set of relevant
     101ones. These will then be used in the exercise of mapping the literal values in the by then RDF-converted metadata descriptions onto externally defined entities, with the goal of interlinking the dataset with external resources (see \textit{Linked Data} in \ref{SotA}).
    109102
     103Finally, in a prototypical implementation of the two components we want to deliver a proof of concept, supported by an evaluation
     104in which we apply a set of test queries and compare a traditional search with a semantically expanded query in terms of recall/precision indicators. A separate evaluation of the usability of the Semantic Search component is indicated; however, this issue can only be tackled marginally here and will have to be deferred to future work.
    110105
    111106\section{Expected Results}
    112 The main contribution of this work will be the putting together of existing pieces (resources and methods), primarily applying the ontology mapping methods to a specific domain-specific data collection.
     107The primary concern of this work is the integrative effort, i.e. putting together existing pieces (resources, components and methods), especially the application of techniques from ontology mapping to the domain-specific data collection (the domain of LRT). Thus the main result of this work will be the \emph{specification} of the two components \texttt{Semantic Search} and the underlying \texttt{Semantic Mapping}.
     108This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components and the results and findings of the \emph{evaluation}.
    113109
    114 Thus the main result of this work will be a specification of the pair of components the Semantic Search and the underlying Semantic Mapping. These propositions will be supported by a proof-of-concept implementation of these components and an evaluation of querying the dataset comparing traditional search and semantic search.
    115 
    116 
    117 One important by-product of the work will be the original dataset expressed as RDF with links into existing external  datasets/ontologies/knowledgebases, effectively setting up a base for veawing this dataset into the Linked Open Data \footnote{\url{http://linkeddata.org/}}. This issue is not only top-agenda pursued across many discplines but also recognized as key matter and accordingly supported also by European Commission within the current FP7 \footnote{\url{http://cordis.europa.eu/fetch?CALLER=PROJ\_ICT&ACTION=D&CAT=PROJ&RCN=95562}}.
    118 
     110One promising by-product of the work will be the original dataset expressed as RDF with links into existing external  datasets/ontologies/knowledgebases, effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\footnote{\url{http://linkeddata.org/}} in the \emph{Web of Data}.
    119111
    120112\section{State of the Art}
     113\label{SotA}
    121114
    122 The most tightly related current work is the \texttt{VLO - Virtual Language Observatory} \footnote{\url{http://www.clarin.eu/vlo/}} \cite{VanUytvanck2010}, being developed within the CLARIN-project. This application operates on the same collection of data as is discussed in this work, however lead by the guiding principle to simplify the search for the user it employs a faceted search, mapping manually the appropriate metadata-fields from the different schemas/profiles to 10 fixed facets.
     115\paragraph{Infrastructure Components}
     116There are multiple relevant activities being carried out in the context of research infrastructure initiatives for LRT. The most relevant ongoing effort is the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on roughly the same collection of data as is discussed in this work, however it employs a faceted search, mapping manually the appropriate metadata fields from the different schemas to 8 fixed facets. Although this is a very reductionist approach it is certainly a great starting point offering a core set of categories together with an initial set of category mappings.
    123117
    124 Also from the context of CLARIN the integral part of the infrastructure is the \texttt{Component Registry} and \texttt{ISOcat}\footnote{\url{http://www.isocat.org/}}, the ISO-standardized Data Category Registry \cite{Broeder2010,ISO12620:2009}. In the context of CLARIN there also has been work on and now a prototype of the so called \texttt{Relation Registry}, meant for defining relations between data categories. All these components are running services, that this work directly builds upon.
     118\texttt{Component Registry} and \texttt{ISOcat}\footnote{\url{http://www.isocat.org/}}
     119are two integral components of the CLARIN Metadata Infrastructure, maintaining the normative information. Especially ISOcat, as the ISO-standardized Data Category Registry for registering and maintaining \texttt{Data Categories} as globally agreed upon incarnations of concepts in the domain of discourse, is the definitive primary reference vocabulary \cite{Broeder2010,ISO12620:2009}. A tightly related work is that on the so called \texttt{Relation Registry}, a separate component that allows defining arbitrary relations between data categories; however, this activity is still in an early prototypical phase.
    125120
    126 Another related intiative is that of a \texttt{Vocabulary Alignement Service} being developed and run within the Dutch program CATCH\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}. There are plans to reuse or enhance the service for the needs of CLARIN project.
     121A final related initiative is that of a \texttt{Vocabulary Alignment Service} being developed and run within the Dutch program CATCH\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}, which serves as a neutral manager and provider of controlled vocabularies. There are plans to reuse or enhance this service for the needs of the CLARIN project.
    127122
    128 The CLARIN project also delivers a valuable source of information on the normative resources in the domain in its current deliverable: \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3}. This document is relevant in two ways - first it covers ontologies as a type of Resources, but mainly it offers an exhaustive collection of references to standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology.
     123\noindent
     124All these components are running services that this work shall directly build upon.
    129125
    130 From the point of view of existing semantic resource \texttt{LT-World}, \footnote{\url{http://www.lt-world.org/}} \cite{Joerg2010} the ontology-based portal covering primarily Language Technology being developed at DFKI \footnote{\textit{Deutsches Forschungszentrum für Künstliche Intelligenz} - \url{http://www.dfki.de}} is a prominent resource providing the information about the entities (Institutions, Persons, Project, Tools, etc.) in this field of study. The underlying ontology is being restructured and cooperation was initialized.
     126\paragraph{LRT - Resources}
     127The CLARIN project also delivers a valuable source of information on the normative resources in the domain in its current deliverable on \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3}. Next to covering ontologies as one type of resources this document offers an exhaustive collection of references to standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology.
    131128
    132 As the main contribution shall be the applying of ontology mapping techniques and technology, an comprehensive overview of this field and current developments is indicated. There seems to be a plethora of work on the topic of ontology mapping and the task will be to sort out the relevant work. The starting point for the investigation of work regarding ontology mapping is the overview of the field by Kalfouglou \cite{Kalfoglou2003}, a more recent work by Shvaiko and Euzenat \cite{Shvaiko2008} and the dedicated platform OAEI\footnote{Ontology Alignment Evalution Intiative - \url{http://oaei.ontologymatching.org/}}.
     129Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}\cite{Joerg2010}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{\textit{Deutsches Forschungszentrum für Künstliche Intelligenz} - \url{http://www.dfki.de}}, is a prominent resource providing the information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study.
    133130
    134 One recent inspirative work is that of Noah et. al \cite{Noah2010} developing a semantic digital library for academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated quering and searching.
     131\paragraph{Ontology Mapping}
     132As the main contribution shall be the application of \emph{ontology mapping} techniques and technology, a comprehensive overview of this field and current developments is paramount. There seems to be a plethora of work on the topic and the difficult task will be to sort out the relevant contributions. The starting point for the investigation will be the overview of the field by Kalfoglou \cite{Kalfoglou2003} and a more recent summary of the key challenges by Shvaiko and Euzenat \cite{Shvaiko2008}.
    135133
    136 As described previously one outcome of the work will be the dataset expressed as RDF, interlinked with other semantic resources, which is follow the broad Linked Open Data effort as proposed by Berners-Lee \cite{TimBL2006}. A current comprehensive overview of the principles  of Linked Data and current applications is the work by Heath and Bizer \cite{HeathBizer2011}, that shall serve as a practical guide for this specific task.
     134In their rather theoretical work Ehrig and Sure \cite{EhrigSure2004} elaborate on the various similarity measures which are at the core of the mapping task. On the dedicated platform OAEI\footnote{Ontology Alignment Evaluation Initiative - \url{http://oaei.ontologymatching.org/}} an ongoing effort is being carried out and documented, comparing different alignment methods applied to different domains.
    137135
     136A more specific recent and inspiring work is that of Noah et al. \cite{Noah2010}, developing a semantic digital library for academic institutions. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching.
    138137
    139 
    140 \section{Keywords}
    141 Metadata interoperability, Ontology Mapping, Schema mapping, Crosswalk, Similarity measures, LinkedData
    142 Fuzzy Search, Visual Search?
    143 schema-based ontology alignment
    144 
    145 Language Resources and Technology, LRT/NLP/HLT
    146 
    147 Ontology Visualization
    148 
    149 Federated Search, Distributed Content Search
    150 (ILS - Integrated Library Systems)
     138\paragraph{Linked Open Data}
     139As described previously one outcome of the work will be the dataset expressed in RDF interlinked with other semantic resources.
     140This is very much in line with the broad \textit{Linked Open Data} effort as proposed by Berners-Lee \cite{TimBL2006} and being pursued across many disciplines. (This topic is also supported by the European Commission within the FP7.\footnote{\url{http://cordis.europa.eu/fetch?CALLER=PROJ\_ICT&ACTION=D&CAT=PROJ&RCN=95562}}) A very recent comprehensive overview of the principles of Linked Data and current applications is the book by Heath and Bizer \cite{HeathBizer2011}, which shall serve as a practical guide for this specific task.
    151141
    152142
  • SMC4LRT/Outline.tex

    r1200 r1205  
    207207ontologies conceptualizing the linguistic domain
    208208
    209 They are special in that ("ontologized") Lexicons refere to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings.
     209They are special in that ("ontologized") Lexicons refer to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings.
    210210Lexicalized Ontologies: LingInfo, lemon: LMF +  isocat/GOLD +  Domain Ontology
     211
     212a) as domain ontologies, describing aspects of the Resources\\
     213b) as linguistic ontologies enriching the Lexicalization of Concepts
     214
     215Ontology and Lexicon \cite{Hirst2009}
     216
     217LingInfo/Lemon \cite{Buitelaar2009}
     218
     219We shouldn't need linguistic ontologies (LingInfo, LEmon), they are primarily relevant in the task of ontology population from texts, where the entities can be encountered in various word-forms in the context of the text.
     220(Ontology Learning, Ontology-based Semantic Annotation of Text)
     221And we are dealing with highly structured data with referenced in their nominal(?) form.
    211222
    212223Another special case are Controlled Vocabularies or Taxonomies/Classification Systems, let alone folksonomies, in that they identify terms and concepts/meanings, i.e. there is no explicit mapping between the language representation and the concept, but rather the term is the implicit carrier of the meaning/concept.
     
    215226controlled vocabularies?
    216227
    217 Ontology and Lexicon \cite{Hirst2009}
    218 
    219 LingInfo/Lemon \cite{Buitelaar2009}
    220228
    221229
     
    353361\end{itemize}
    354362
     363A trivial example of concept-based query expansion:
     364Confronted with a user query \texttt{Actor.Name = Sue}, and knowing that \texttt{Actor} is equivalent or similar to \texttt{Person} and \texttt{Name} is a synonym of \texttt{FullName}, the expanded query could look like:
     365\texttt{Actor.Name = Sue OR Actor.FullName = Sue OR Person.Name = Sue OR Person.FullName = Sue}
     366
     367Another example concerning instance mapping: a user looking for all resources produced by or linked to a given institution does not have to guess or care about the various spellings of the institution's name used in the descriptions of the resources, but rather can browse through a controlled vocabulary of institutions and see all the resources of a given institution. While this could be achieved by simply normalizing the literal values (and indeed that definitely has to be one processing step), the linking to an ontology also enables the user to continue browsing the ontology to find institutions that are related to the original institution by being concerned with similar topics, and to retrieve a union of resources for the resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset.
     368
    355369\section{Semantic Mapping}
    356370
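
The concept-based query expansion described in both files of this changeset can be sketched in a few lines. This is only an illustration under stated assumptions, not the Semantic Mapping Component itself: the synonym table `SYNONYMS`, the function name `expand_query` and the dotted `Category.Field = value` query syntax are hypothetical and simply mirror the `Actor.Name = Sue` example from the text.

```python
from itertools import product

# Hypothetical concept mapping: each category/field name maps to the list of
# names considered equivalent (including itself). In the proposed system this
# information would come from the semantic mapping step.
SYNONYMS = {
    "Actor": ["Actor", "Person"],
    "Name": ["Name", "FullName"],
}


def expand_query(path, value, synonyms=SYNONYMS):
    """Expand a dotted path query into an OR of all equivalent paths."""
    parts = path.split(".")
    # For every path segment, collect its alternatives (itself if unmapped).
    alternatives = [synonyms.get(part, [part]) for part in parts]
    # Combine the alternatives of all segments (Cartesian product).
    clauses = [
        ".".join(combo) + " = " + value
        for combo in product(*alternatives)
    ]
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(expand_query("Actor.Name", "Sue"))
```

With the table above, `expand_query("Actor.Name", "Sue")` yields the same four-clause disjunction as the example in the text; names without an entry in the table are passed through unchanged.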