Changeset 1418


Timestamp:
06/12/11 16:44:58 (13 years ago)
Author:
vronk
Message:
 
Location:
SMC4LRT
Files:
3 edited

  • SMC4LRT/Expose.tex

    r1205 r1418  
    2828\usepackage{geometry} % to change the page dimensions
    2929\geometry{a4paper} % or letterpaper (US) or a5paper or....
    30 %\geometry{margin=1cm} % for example, change the margins to 2 inches all round
     30\geometry{margin=3.5cm} % for example, change the margins to 2 inches all round
    3131\topmargin=-0.5in
    32 \textheight=700pt
     32\textheight=680pt
    3333% \geometry{landscape} % set up the page for landscape
    3434%   read geometry.pdf for detailed page layout information
     
    7979\section{Main Goal}
    8080
    81 This work proposes a component that shall enhance search functionality over a large heterogeneous collection of metadata descriptions of Language Resources and Technology (LRT). By applying semantic web technology the user shall be given both better recall through \emph{query expansion} based on related categories/concepts and new means of \emph{exploring the dataset} via ontology-driven browsing.
     81This work proposes a component that shall enhance search functionality over a \emph{large heterogeneous collection of metadata descriptions} of Language Resources and Technology (LRT). By applying semantic web technology the user shall be given both better recall through \emph{query expansion} based on related categories/concepts and new means of \emph{exploring the dataset} via ontology-driven browsing.
    8282
    83 Following two examples for better illustration: A simple example for a concept-based query expansion:
     83Following two examples for better illustration. First a concept-based query expansion:
    8484Confronted with a user query: \texttt{Actor.Name = Sue} and knowing that \texttt{Actor} is synonym to \texttt{Person} and \texttt{Name} is synonym to \texttt{FullName} the expanded query could look like:
    8585\begin{quote}
    8686\texttt{Actor.Name = Sue OR Actor.FullName = Sue OR \\ Person.Name =  Sue OR Person.FullName = Sue}
    8787\end{quote}
    88 And an example for an ontology-driven search: Starting from a list of topics the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external linked-in semantic resources.
     88And second, an ontology-driven search: Starting from a list of topics the user can browse an ontology to find institutions concerned with those topics and retrieve a union of resources for the resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset, but rather in external linked-in semantic resources.
    8989
    90 Such \emph{semantic search} functionality requires a preprocessing step, that produces the underlying linkage both between categories/concepts and on the instance level. We refer to this task as \emph{semantic mapping}, that shall be realized by corresponding \texttt{Semantic Mapping Component}. In this work the focus lies on the method itself -- expressed in the specification and operationalized in the (prototypical) implementation of the component -- rather than trying to establish actual final, accomplished alignment. Although a tentative, naive mapping on a subset of the data will be proposed, this will be mainly used for evaluation and shall serve as basis for discussion with domain experts aimed at creating the actual sensible mappings usable for real tasks.
     90Such \textbf{semantic search} functionality requires a preprocessing step, that produces the underlying linkage both between categories/concepts and on the instance level. We refer to this task as \textbf{semantic mapping}, that shall be realized by corresponding \texttt{Semantic Mapping Component}. In this work the focus lies on the method itself -- expressed in the specification and operationalized in the (prototypical) implementation of the component -- rather than trying to establish a final, accomplished alignment. Although a tentative, na\"ive mapping on a subset of the data will be proposed, this will be mainly used for evaluation and shall serve as basis for discussion with domain experts aimed at creating the actual sensible mappings usable for real tasks.
    9191
    9292In fact, due to the great diversity of resources and research tasks, a "final" complete alignment does not seem achievable at all. Therefore also the focus shall be on "soft" dynamic mapping, i.e. to enable the users to adapt the mapping or apply different mappings depending on their current task or research question essentially being able to actively manipulate the recall/precision ratio of the search results. This entails an examination of user interaction with and visualization of the relevant additional information in the user search interface. However this would open doors to a whole new (to this work) field of usability engineering and can be treated here only marginally.
    9393
    9494\section{Method}
    95 We start with examining the existing data and describing the evolving infrastructure in which the components are to be embedded. Then we formulate the function of \emph{Semantic Search} distinguishing between the concept level -- using semantic relations between concepts or categories for better retrieval -- and the instances level -- allowing the user to explore the primary data collection via semantic resources (ontologies, vocabularies).
     95We start with examining the existing data and describing the evolving infrastructure in which the components are to be embedded. Then we formulate the function of \textbf{Semantic Search} distinguishing between the concept level -- using semantic relations between concepts or categories for better retrieval -- and the instances level -- allowing the user to explore the primary data collection via semantic resources (ontologies, vocabularies).
    9696
    97 Subsequently we introduce the underlying \emph{Semantic Mapping Component} again distinguishing the two levels - concepts and instances. We describe the workflow and the central methods, building upon the existing pieces of the infrastructure (See Infrastructure Components in \ref{SotA} ). A special focus will be put on the examination of the feasibility of employing ontology mapping and alignment techniques and tools for the creation of the mappings.
     97Subsequently we introduce the underlying \textbf{Semantic Mapping Component} again distinguishing the two levels - concepts and instances. We describe the workflow and the central methods, building upon the existing pieces of the infrastructure (See \textit{Infrastructure Components} in \ref{SotA} ). A special focus will be put on the examination of the feasibility of employing ontology mapping and alignment techniques and tools for the creation of the mappings.
    9898
    9999In the practical part - processing the data - a necessary prerequisite is the dataset being expressed in RDF.
     
    108108This theoretical part will be accompanied by a proof-of-concept \emph{implementation} of the components and the results and findings of the \emph{evaluation}.
    109109
    110 One promising by-product of the work will be the original dataset expressed as RDF with links into existing external  datasets/ontologies/knowledgebases, effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\footnote{\url{http://linkeddata.org/}} in the \emph{Web of Data}.
     110One promising by-product of the work will be the original dataset expressed as RDF with links into existing external  resources (ontologies, knowledgebases, vocabularies), effectively laying a foundation for providing this dataset as \emph{Linked Open Data}\footnote{\url{http://linkeddata.org/}} in the \emph{Web of Data}.
    111111
    112112\section{State of the Art}
    113113\label{SotA}
    114114
    115 \paragraph{Infrastructure Components}
     115\subsection*{Infrastructure Components}
    116116There are multiple relevant activities being carried out in the context of research infrastructure initiatives for LRT. The most relevant ongoing effort is the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on roughly the same collection of data as is discussed in this work, however it employs a faceted search, mapping manually the appropriate metadata fields from the different schemas to 8 fixed facets. Although this is a very reductionist approach it is certainly a great starting point offering a core set of categories together with an initial set of category mappings.
    117117
    118118\texttt{Component Registry} and \texttt{ISOcat}\footnote{\url{http://www.isocat.org/}}
    119 are two integral components of the CLARIN Metadata Infrastructure, maintaining the normative information. Especially ISOcat as the ISO-standardized Data Category Registry for registering and maintaining \texttt{Data Categories} as globally agreed upon incarnations of concepts in the domain of discourse, is the definitive primary reference vocabulary. \cite{Broeder2010,ISO12620:2009}. A tightly related work is that on the so called \texttt{Relation Registry}, a separate component that allows to define arbitrary relations between data categories, however this activity is rather in an early prototypical phase.
     119are two integral components of the \textit{CLARIN Metadata Infrastructure} maintaining the normative information. Especially \texttt{ISOcat} -- the ISO-standardized Data Category Registry for registering and maintaining \texttt{Data Categories} as globally agreed upon incarnations of concepts in the domain of discourse -- is the definitive primary reference vocabulary \cite{Broeder2010,ISO12620:2009}. A tightly related work is that on the so called \texttt{Relation Registry}, a separate component that allows to define arbitrary relations between data categories, however this activity is rather in an early prototypical phase.
    120120
    121 And a last related initiative is that of a \texttt{Vocabulary Alignment Service} being developed and run within the Dutch program CATCH\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}, which serves as a neutral manager and provider of controlled vocabularies. There are plans to reuse or enhance this service for the needs of the CLARIN project.
     121And a last relevant initiative to mention is that of a \texttt{Vocabulary Alignment Service} being developed and run within the Dutch program CATCH\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}, which serves as a neutral manager and provider of controlled vocabularies. There are plans to reuse or enhance this service for the needs of the CLARIN project.
    122122
    123123\noindent
    124124All these components are running services, that this work shall directly build upon.
    125125
    126 \paragraph{LRT - Resources}
     126\subsection*{LRT Resources}
    127127The CLARIN project also delivers a valuable source of information on the normative resources in the domain in its current deliverable on \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3}. Next to covering ontologies as one type of resources this document offers an exhaustive collection of references to standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology.
    128128
    129 Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}\cite{Joerg2010},  the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{\textit{Deutsches Forschungszentrum für Künstliche Intelligenz} - \url{http://www.dfki.de}} is a prominent resource providing the information about the entities (Institutions, Persons, Project, Tools, etc.) in this field of study.
     129Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}},  the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{\textit{Deutsches Forschungszentrum für Künstliche Intelligenz} - \url{http://www.dfki.de}},  is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study. \cite{Joerg2010}
    130130
    131 \paragraph{Ontology Mapping}
     131\subsection*{Ontology Mapping}
    132132As the main contribution shall be the application of \emph{ontology mapping} techniques and technology, a comprehensive overview of this field and current developments is paramount. There seems to be a plethora of work on the topic and the difficult task will be to sort out the relevant contributions. The starting point for the investigation will be the overview of the field by Kalfoglou \cite{Kalfoglou2003} and a more recent summary of the key challenges by Shvaiko and Euzenat \cite{Shvaiko2008}.
    133133
    134 In their rather theoretical work Ehrig and Sure \cite{EhrigSure2004} elaborate on the various similarity measures which are at the core of the mapping task. On the dedicated platform OAEI\footnote{Ontology Alignment Evaluation Initiative - \url{http://oaei.ontologymatching.org/}} an ongoing effort is being carried out and documented comparing different alignment methods applied on different domains.
     134In their rather theoretical work Ehrig and Sure \cite{EhrigSure2004} elaborate on the various similarity measures which are at the core of the mapping task. On the dedicated platform OAEI\footnote{Ontology Alignment Evaluation Initiative - \url{http://oaei.ontologymatching.org/}} an ongoing effort is being carried out and documented comparing various alignment methods applied on different domains.
    135135
    136 One more specific recent inspiring work is that of Noah et al. \cite{Noah2010} developing a semantic digital library for academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching.
     136One more specific recent inspiring work is that of Noah et al. \cite{Noah2010} developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching.
    137137
    138 \paragraph{Linked Open Data}
     138\subsection*{Linked Open Data}
    139139As described previously one outcome of the work will be the dataset expressed in RDF interlinked with other semantic resources.
    140 This is very much in line with the broad \textit{Linked Open Data} effort as proposed by Berners-Lee \cite{TimBL2006} and being pursued across many disciplines. (This topic is supported also by EU Commission within the FP7.\footnote{\url{http://cordis.europa.eu/fetch?CALLER=PROJ\_ICT&ACTION=D&CAT=PROJ&RCN=95562}}) A very recent comprehensive overview of the principles of Linked Data and current applications is the book by Heath and Bizer \cite{HeathBizer2011}, that shall serve as a practical guide for this specific task.
     140This is very much in line with the broad \textit{Linked Open Data} effort as proposed by Berners-Lee \cite{TimBL2006} and being pursued across many disciplines. (This topic is supported also by the EU Commission within the FP7.\footnote{\url{http://cordis.europa.eu/fetch?CALLER=PROJ\_ICT&ACTION=D&CAT=PROJ&RCN=95562}}) A very recent comprehensive overview of the principles of Linked Data and current applications is the book by Heath and Bizer \cite{HeathBizer2011}, that shall serve as a practical guide for this specific task.
    141141
    142142
    143 \bibliographystyle{acm}
     143%\bibliographystyle{acm}
     144\bibliographystyle{ieeetr}
    144145\bibliography{../../../2bib/lingua,../../../2bib/ontolingua,../../../2bib/semweb}
    145146
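Illustration (not part of the changeset): the concept-based query expansion described in the Main Goal section of Expose.tex could be sketched roughly as below. This is only a minimal sketch under stated assumptions. The function name `expand_query`, the dictionary-based synonym table, and the plain string output are hypothetical and do not reflect the actual Semantic Mapping Component, whose interface the exposé does not specify.

```python
from itertools import product


def expand_query(path, value, synonyms):
    """Expand `path = value` into an OR over all synonym combinations.

    `path` is a dot-separated field path such as "Actor.Name".
    `synonyms` maps a path component to a list of equivalent names
    (a hypothetical stand-in for the concept-level mappings).
    """
    parts = path.split(".")
    # For every component keep the original name first, then its synonyms.
    alternatives = [[part] + synonyms.get(part, []) for part in parts]
    # The Cartesian product yields every combination of alternative names.
    clauses = [
        ".".join(combo) + " = " + value for combo in product(*alternatives)
    ]
    return " OR ".join(clauses)


# Synonym table taken from the example in the text: Actor ~ Person, Name ~ FullName.
synonyms = {"Actor": ["Person"], "Name": ["FullName"]}
expanded = expand_query("Actor.Name", "Sue", synonyms)
print(expanded)
```

With the synonym table from the text this reproduces the four-clause query quoted in the exposé. A component without known synonyms is simply passed through unchanged.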