% !TEX TS-program = pdflatex
% !TEX encoding = UTF-8 Unicode

% This is a simple template for a LaTeX document using the "article" class.
% See "book", "report", "letter" for other types of document.

\documentclass[11pt]{article} % use larger type; default would be 10pt

\usepackage[utf8]{inputenc} % set input encoding (not needed with XeLaTeX)

\usepackage{url}
%\usepackage{svn-multi}

% Subversion Information
%\svnidlong
%{$HeadURL: $}
%{$LastChangedDate: $}
%{$LastChangedRevision: $}
%{$LastChangedBy: $}
%\svnid{$Id$}


%%% Examples of Article customizations
% These packages are optional, depending whether you want the features they provide.
% See the LaTeX Companion or other references for full information.

%%% PAGE DIMENSIONS
\usepackage{geometry} % to change the page dimensions
\geometry{a4paper} % or letterpaper (US) or a5paper or....
%\geometry{margin=1cm} % for example, change the margins to 2 inches all round
\topmargin=-0.5in
\textheight=700pt
% \geometry{landscape} % set up the page for landscape
%   read geometry.pdf for detailed page layout information

\usepackage{graphicx} % support the \includegraphics command and options

% \usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent

%%% PACKAGES
\usepackage{booktabs} % for much better looking tables
\usepackage{array} % for better arrays (eg matrices) in maths
\usepackage{paralist} % very flexible & customisable lists (eg. enumerate/itemize, etc.)
\usepackage{verbatim} % adds environment for commenting out blocks of text & for better verbatim
%\usepackage{subfig} % make it possible to include more than one captioned figure/table in a single float
% These packages are all incorporated in the memoir class to one degree or another...

%%% HEADERS & FOOTERS
\usepackage{fancyhdr} % This should be set AFTER setting up the page geometry
\pagestyle{plain} % options: empty , plain , fancy
\renewcommand{\headrulewidth}{0pt} % customise the layout...
\lhead{}\chead{}\rhead{}
\lfoot{}\cfoot{\thepage}\rfoot{}

%%% SECTION TITLE APPEARANCE
\usepackage{sectsty}
\allsectionsfont{\sffamily\mdseries\upshape} % (See the fntguide.pdf for font help)
% (This matches ConTeXt defaults)

%%% ToC (table of contents) APPEARANCE
\usepackage[nottoc,notlof,notlot]{tocbibind} % Put the bibliography in the ToC
%\usepackage[titles,subfigure]{tocloft} % Alter the style of the Table of Contents
%\renewcommand{\cftsecfont}{\rmfamily\mdseries\upshape}
%\renewcommand{\cftsecpagefont}{\rmfamily\mdseries\upshape} % No bold!

%%% END Article customizations

%%% The "real" document content comes below...

\title{SMC4LRT - Master Outline}
\author{Matej Durco}
%\date{} % Activate to display a given date or no date (if empty),
         % otherwise the current date is printed

\begin{document}
\maketitle

\tableofcontents

\section{Introduction}

Title: Semantic Mapping (Component) for Language Resources

\subsection{Main Goal}

We propose a component that shall enhance search functionality over a large heterogeneous collection of metadata descriptions of Language Resources and Technology (LRT). By applying semantic web technology, the user shall be given both better recall, through query expansion based on related categories/concepts, and new means of exploring the dataset/knowledge base via ontology-driven browsing.

A trivial example of concept-based query expansion:
confronted with the user query \texttt{Actor.Name = Sue}, and knowing that \texttt{Actor} is equivalent or similar to \texttt{Person} and that \texttt{Name} is a synonym of \texttt{FullName}, the expanded query could look like:
\texttt{Actor.Name = Sue OR Actor.FullName = Sue OR Person.Name = Sue OR Person.FullName = Sue}
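
The expansion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table \texttt{RELATED} and the function \texttt{expand\_query} are hypothetical names introduced here, and in the envisaged system the equivalences would be supplied by the semantic mapping rather than hard-coded.

\begin{verbatim}
```python
from itertools import product

# Hypothetical equivalence table (assumption, for illustration only);
# in practice these relations would come from the semantic mapping.
RELATED = {
    "Actor": ["Actor", "Person"],
    "Name": ["Name", "FullName"],
}


def expand_query(category, attribute, value, related=RELATED):
    """Return an OR-joined query over all related category/attribute pairs."""
    categories = related.get(category, [category])
    attributes = related.get(attribute, [attribute])
    clauses = [f"{c}.{a} = {value}" for c, a in product(categories, attributes)]
    return " OR ".join(clauses)


print(expand_query("Actor", "Name", "Sue"))
```
\end{verbatim}

A term without an entry in the table is left untouched, so the query degrades to the plain, unexpanded search.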

Another example concerns instance mapping: a user looking for all resources produced by or linked to a given institution does not have to guess or care about the various spellings of the institution's name used in the resource descriptions, but can instead browse a controlled vocabulary of institutions and see all the resources of a given institution. While this could be achieved by simply normalizing the literal values (and indeed that definitely has to be one processing step), linking to an ontology also enables the user to continue browsing the ontology to find institutions that are related to the original institution by being concerned with similar topics, and to retrieve the union of resources for the resulting cluster. Thus, in general, the user is enabled to work with the data based on information that is not present in the original dataset.

All these scenarios require a preprocessing step that produces the underlying linkage, both between categories/concepts and between instances (mapping literal values to entities). We refer to this task as semantic mapping, which shall be accomplished by a corresponding ``Semantic Mapping Component''. In this work the focus lies on the process/method, i.e. on the specification and (prototypical) implementation of the component, rather than on trying to establish some final/accomplished mapping. Although a tentative/naive alignment on a subset of the data will be proposed, it will mainly be used for evaluation and shall serve as a basis for discussion with domain experts, aiming at creating the actual sensible mappings usable for real tasks.

Indeed, due to the great diversity of resources and research tasks, such a ``final'' complete mapping/alignment does not seem achievable at all. Therefore the focus shall also be on ``soft'', dynamic mapping, investigating the possibilities/methods to enable users to adapt the mapping or apply different mappings with respect to their current task or research question,
essentially being able to actively manipulate the recall/precision ratio of their searches. This entails examining user interaction with, and visualization of, the relevant information in the user interface, and enabling the user to act upon it.

\subsection{Method}
We start by examining the existing data and describing the evolving infrastructure in which the components are to be embedded.
Then we formulate the task/function of Semantic Search on the level of concepts and of individuals,
together with the underlying Semantic Mapping and the requirements within the defined context,
followed by a design proposal for an appropriate component fitting within the infrastructure,
with particular focus on the feasibility of employing ontology mapping and alignment techniques and tools for the creation of mappings.

In a prototype we want to deliver a proof of concept,
combined with an evaluation to verify the claims of fitness for purpose.
This evaluation is twofold: first, it shall verify the ability of the system to support dynamic mapping based on a set of test queries,
and second, the usability of the UI controls.


+? Identify hooks into LOD?


\begin{itemize}
\item[a)] define/use semantic relations between categories (RelationRegistry)
\item[b)] employ ontological resources to enhance search in the dataset (SemanticSearch)
\item[c)] specify translation instructions for expressing the dataset in RDF (LinkedData)
\end{itemize}


\subsection{Expected Results}

The main result of this work will be a specification of a pair of components: the Semantic Search and the underlying Semantic Mapping. These propositions will be supported by a proof-of-concept implementation of the components and by an evaluation of querying the dataset that compares traditional search with semantic search.

One important by-product of the work will be the original dataset expressed as RDF with links into existing datasets/ontologies/knowledge bases, building a base for another nucleus of Linked Open Data.

\begin{itemize}
\item [Specification] definition of a mapping mechanism
\item [Prototype] proof-of-concept implementation
\item [Evaluation] evaluation results of querying the dataset, comparing traditional search and semantic search
\item [LinkedData] translation of the source dataset to an RDF-based format with links into existing datasets/ontologies/knowledge bases

\end{itemize}


\subsection{State of the Art}

\begin{itemize}
\item VLO - Virtual Language Observatory  \url{http://www.clarin.eu/vlo/}, \cite{VanUytvanck2010}
\item LT-World ontology-based \url{http://www.lt-world.org/}, \cite{Joerg2010}
\item VAS - Catch Plus
\item OAEI
\end{itemize}

\subsection{Keywords}

Metadata interoperability, Ontology Mapping, Schema mapping, Crosswalk, Similarity measures, LinkedData,
Fuzzy Search, Visual Search?

Language Resources and Technology, LRT/NLP/HLT

Ontology Visualization

Federated Search, Distributed Content Search
(ILS - Integrated Library Systems)


\section{Related Work}

\subsection{Language Resources and Technology}

While in the Digital Libraries community a consolidation has generally already happened and big federated networks of digital library repositories are set up, in the field of Language Resources and Technology the landscape is still scattered, although it can meanwhile look back at a decade of standardization efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming on the one hand from the wide range of resource types, and additionally complicated by the dependence on different schools of thought.

Need some number about the disparity in the field, number of institutes, resources, formats.

This situation has been identified by the community and multiple standardization initiatives have been undertaken. This process seems to have gained new momentum thanks to the large Research Infrastructure Programmes introduced by the European Commission, aimed at fostering research communities developing large-scale pan-European common infrastructures. One key player in this development is the project CLARIN.

\subsubsection{CLARIN}

CLARIN - Common Language Resource and Technology Infrastructure - is constituted by over 180 members from around 38 countries. The mission of this project is

\begin{quote}
to create a research infrastructure that makes language resources and technologies (LRT) available to scholars of all disciplines, especially SSH; a large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily usable
\end{quote}

This shall be accomplished by setting up a federated network of centers (with federated identity management), but mainly by providing resources and services in an agreed upon / coherent / uniform / consistent / standardized manner. The foundation for this goal shall be the Common or Component Metadata infrastructure, a model that caters for flexible metadata profiles, making it possible to accommodate existing schemas.

The embedding in the CLARIN project brings about the context of Language Resources and HLT (Human Language Technology, aka NLP - Natural Language Processing), with SSH (Social Sciences and Humanities) as the primary target user group of CLARIN.
CLARIN/NLP for SSH

\subsubsection{Standards}

\begin{description}
\item[ISO12620] Data Category Registry
\item[LAF] Linguistic Annotation Framework
\item[CMDI] - (DC, OLAC, IMDI, TEI)
\end{description}

\subsubsection{NLP MD Catalogues}

\begin{description}
\item[LAT, TLA] - Language Archiving Technology, now The Language Archive - provided by Max Planck Institute for Psycholinguistics \url{http://www.mpi.nl/research/research-projects/language-archiving-technology}
\item[OTA LR] Archiving Service provided by Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/}
\item[OLAC]
\item[ELRA]
\item[LDC]
\item[DFKI/LT-World]
\end{description}

\subsection{Ontologies}

\subsubsection{Word, Sense, Concept}

Lexicon vs. Ontology:
a lexicon is a linguistic object, an ontology is not~\cite{Hirst2009}. We do not need to be that strict, but it shall be a guiding principle in this work to consider things (Datasets, Vocabularies, Resources) also along this dichotomy/polarity: Conceptual vs. Lexical.
And while every Ontology has to have a lexical representation (canonically: rdfs:label, rdfs:comment, skos:*label), if we do not try to force the observed objects into a binary classification, but rather consider a bias spectrum, we should be able to locate them along this spectrum.
So the main focus of a typical ontology is on the concepts (``conceptualization''), which are primarily language-independent.

A special case are Linguistic Ontologies (isocat, GOLD, WALS.info):
ontologies conceptualizing the linguistic domain.

They are special in that (``ontologized'') Lexicons refer to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings.
Lexicalized Ontologies: LingInfo, lemon: LMF +  isocat/GOLD +  Domain Ontology

Another special case are Controlled Vocabularies or Taxonomies/Classification Systems, let alone folksonomies, in that they identify terms with concepts/meanings, i.e. there is no explicit mapping between the language representation and the concept, but rather the term is the implicit carrier of the meaning/concept.
So, for example, in the LCSH the surface realization of each subject heading at the same time identifies the Concept.

controlled vocabularies?

Ontology and Lexicon \cite{Hirst2009}

LingInfo/Lemon \cite{Buitelaar2009}


\subsubsection{Semantic Web - Linked Data}

\begin{description}
\item[RDF/OWL]
\item[SKOS]
\end{description}

\subsubsection{OntologyMapping}


\subsection{Visualization}


\subsection{FederatedSearch}

\subsubsection{Standards}

\begin{description}
\item[Z39.50/SRU/SRW/CQL] LoC
\item[OAI-PMH]
\end{description}


\subsubsection{(Digital) Libraries}


General (Libraries, Federations):

\begin{description}
\item[OCLC] \url{http://www.oclc.org}
    world's biggest Library Federation
\item[LoC] Library of Congress \url{http://www.loc.gov}
\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections_en.htm}
\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
\end{description}

\subsubsection{Content Repositories}

\begin{description}
\item[PHAIDRA] Permanent Hosting, Archiving and Indexing of Digital Resources and Assets, provided by Vienna University \url{https://phaidra.univie.ac.at/}
\item[eSciDoc]  provided by MPG + FIZ Karlsruhe \url{https://www.escidoc.org/}
\item[DRIVER] pan-European infrastructure of Digital Repositories \url{http://www.driver-repository.eu/}
\item[OpenAIRE] - Open Access Infrastructure for Research in Europe \url{http://www.openaire.eu/}
\end{description}


\subsubsection{(MD)search frameworks:}

\begin{description}
\item[Zebra/Z39.50] JZKit
\item[Lucene/Solr]
\item[eXist] - XML DB
\end{description}

\subsubsection{Content/Corpus Search}
Corpus Search Systems
\begin{description}
\item[DDC]  - text-corpus
\item[manatee] - text-corpus
\item[CQP] - text-corpus
\item[TROVA] - MM annotated resources
\item[ELAN] - MM annotated resources (editor + search)
\end{description}

\subsection{Summary}

\section{Definitions}
We want to clarify or lay down a few terms and definitions, i.e. explain our understanding of them.

\begin{description}
\item[Concept]  sense, idea; a philosophical problem which we do not need to discuss here. For our purposes we say: the basic ``entity'' in an ontology? That of which an ontology is built.
\item[Ontology]  ``an explicit specification of a conceptualization'' [cite!], but for us mainly a collection of concepts, as opposed to a lexicon, which is a collection of words.
\item[Word]  a lexical unit, a word in a language, something that has a surface realization (writtenForm) and is a carrier of sense. So a relation holds: hasSense(Word, Concept)
\item[Lexicon]  a collection of words, a (lexical) vocabulary
\item[Vocabulary] an index providing a mapping from Word (string) to Concept (URI)
\item[(Data)Category] (almost) the same as Concept; things like ``Topic'', ``Genre'', ``Organization'', ``ResourceType'' are instantiations of Category
\item[ConceptualDomain] the class of entities a Concept/Category denotes. For Organization it would be all (existing) organizations; CD(ResourceType)=\{Corpus, Lexicon, Document, Image, Video, ...\}. Entities of the domain can themselves be Categories (ResourceType:Image), but they can also be individuals (Organization: University of Vienna)
\item[Entity] 
\item[Resource] informational resource; in the context of the CLARIN project mainly Language Resources (Corpus, Lexicon, Multimedia)
\item[Metadata Description] description of some properties of a resource.  MD-Record
\item[Schema] - CMD-Profile
\item[Annotation] 


\end{description}

\section{Analysis}

\subsection{Data landscape}

Describe situation regarding the datasets and formats

collections, profiles/Terms, ResourceTypes!

DC, OLAC,
ISLE/IMDI, CHILDES, TEI, EAF!
(CES/XCES)

\subsection{Infrastructure}

CMDI \cite{Broeder2010}


\subsection{Ontologies, Controlled Vocabularies, Knowledge Organizing Systems}


\subsubsection{Classification Schemes, Taxonomies}
LCSH, DDC


\subsubsection{Other controlled Vocabularies}
Tagsets: STTS

Language codes ISO-639-1

\subsubsection{Domain Ontologies, Vocabularies}
Organization-Lists

LT-World !?


\subsection{Use Cases}

\begin{itemize}

\item MD Search employing Semantic Mapping
\item MD Search employing Fuzzy Search
\item Content Search
\item Combined Metadata/Content Search
\item Visualization of the Results - charts on facets/dimensions

\item  Create and publish Virtual Collection based on complex Search (intensional/extensional)
\item  Create ad-hoc corpus
\end{itemize}

\section{Semantic Mapping}


\subsection{Profiles to Data Categories}
CMD:Profile.Comp.Elem $\rightarrow$ DatCat


\subsection{Semantic Relations between (Data)Categories}

Relation Registry

!check DCR-RR/Odijk2010 -follow up

!Cf. Erhard Hinrichs 2009


\subsection{Mapping from strings to Entities}

Based on the textual values in the metadata descriptions, find matching entities in selected ontologies.

Identify related ontologies:
LT-World \cite{Joerg2010}

task:
\begin{enumerate}
\item  express MDRecords in RDF
\item  identify related ontologies/vocabularies (category $\rightarrow$ vocabulary)
\item  implement (reuse) a lookup/mapping function (Vocabulary Alignment Service? CATCH-PLUS?)

\fbox{function lookup: Category $\times$ String $\rightarrow$ ConceptualDomain}

Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc.
\end{enumerate}
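
A minimal sketch of such a lookup function, including a simple string-normalizing step, could look as follows. Everything in it is an assumption made for illustration: the vocabulary \texttt{VOCABULARIES}, the \texttt{example.org} URIs and the helper \texttt{normalize} are hypothetical, and a real implementation would delegate to dedicated controlled vocabularies or an alignment service.

\begin{verbatim}
```python
import unicodedata

# Hypothetical controlled vocabulary (assumption, for illustration only):
# category -> normalized label -> entity URI.
VOCABULARIES = {
    "Organization": {
        "university of vienna": "http://example.org/org/univie",
        "universitat wien": "http://example.org/org/univie",
    },
}


def normalize(value):
    """Lower-case the string, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def lookup(category, value, vocabularies=VOCABULARIES):
    """Map (category, string) to an entity URI, or None if nothing matches."""
    return vocabularies.get(category, {}).get(normalize(value))


print(lookup("Organization", "  Universit\u00e4t   Wien "))
```
\end{verbatim}

Returning \texttt{None} for unknown strings keeps unmapped literals visible, so that they can be handed to domain experts instead of being silently dropped.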



\subsection{Semantic Search}

The main purpose of the undertaking described in the previous subsections (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the metadata/resource data, namely by employing ontological resources.
Mainly this enhancement shall mean that the user can access the data indirectly by browsing one or multiple ontologies
with which the data will then be linked. These could be, for example, ontologies of Organizations and Projects.

In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
A crucial aspect is the question of how to deal with the even greater amount of information in a user-friendly way, i.e. how to avoid overwhelming, intimidating or frustrating the user.

Semi-transparently means that the semantic mapping shall primarily integrate seamlessly into the interaction with the service, but that it shall ``explain'' itself on demand, i.e. offer enough information for the user to understand its role and also to be able to manipulate it easily.

?

Facets

Controlled Vocabularies

Synonym Expansion (via TermExtraction(ContentSet))

\subsection{Linked Data - Express dataset in RDF}

Partly as a by-product of the entity-mapping effort we will get the metadata descriptions rendered in RDF, linked with existing datasets/ontologies/knowledge bases.
So theoretically we then only need to provide them ``on the web'' to make them a nucleus of the LinkedData cloud.

Practically this will not be that straightforward, as the mapping to entities will require a considerable amount of work.
But once that is solved, or for the subsets for which it is solved, the publication of the data on the ``Semantic Web'' should be easy.

Technical aspects (RDF-store?) / interface (ontology browser?)

defining the Mapping:
\begin{enumerate}
\item convert to RDF
translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal]
\item map: \#mdrecord \#property literal  $\rightarrow$ [\#mdrecord \#property \#entity]
\end{enumerate}
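
The two steps above can be sketched with plain Python tuples standing in for RDF triples. This is a simplified illustration, not a proposal for the actual implementation: the functions \texttt{to\_triples} and \texttt{map\_entities}, the sample record and the \texttt{example.org} URI are all hypothetical, and a real system would use an RDF library and proper URIs for records and properties.

\begin{verbatim}
```python
def to_triples(record_id, record):
    """Step 1: turn a flat metadata record into (subject, property, literal)."""
    return [(record_id, prop, value) for prop, value in record.items()]


def map_entities(triples, resolve):
    """Step 2: replace a literal by an entity wherever resolve() finds one."""
    mapped = []
    for subject, prop, literal in triples:
        entity = resolve(prop, literal)
        mapped.append((subject, prop, entity if entity is not None else literal))
    return mapped


# Hypothetical record and entity table (assumptions, for illustration only).
record = {"Organization": "University of Vienna", "Title": "Sample corpus"}
entities = {
    ("Organization", "University of Vienna"): "http://example.org/org/univie",
}

triples = to_triples("#mdrecord1", record)
linked = map_entities(triples, lambda prop, lit: entities.get((prop, lit)))
for triple in linked:
    print(triple)
```
\end{verbatim}

Literals without a matching entity are kept unchanged, so the first step alone already yields a publishable, if unlinked, dataset.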

\subsection{Content/Annotation}
AF + DCR + RR


\subsection{Visualization}
Landscape, Treemap, SOM

Ontology Mapping and Alignment / saiks/Ontology4 4auf1.pdf


\section{System Design}
SOA

\subsection{Architecture}

Makes use of multiple components of the established infrastructure (CLARIN) \cite{Varadi2008}, \cite{Broeder2010}:

\begin{itemize}
\item Data Category Registry
\item Relation Registry
\item Component Registry
\item Vocabulary Alignment Service
\end{itemize}
merging the pieces of information provided by those,
offering them semi-transparently to the user (or application) on the consumption side.


\subsection{CMDI}

MDBrowser

MDService

\subsection{Query Language}
CQL?

\subsection{User Interface}

\subsubsection{Query Input}

\subsubsection{Columns}

\subsubsection{Summaries}

\subsubsection{Differential Views}
Visualize the impact of a given mapping in terms of the covered dataset (number of matched records).
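
The figures behind such a view could be computed as in the following sketch, which compares the number of records matched by a bare value with the number matched once a mapping is applied. The functions \texttt{matching\_records} and \texttt{mapping\_impact} and the sample data are hypothetical and serve only to illustrate the idea.

\begin{verbatim}
```python
def matching_records(records, prop, accepted_values):
    """Return the ids of records whose value for prop is in accepted_values."""
    return {
        rid for rid, rec in records.items() if rec.get(prop) in accepted_values
    }


def mapping_impact(records, prop, value, mapping):
    """Compare hits for the bare value with hits for value plus mapped variants."""
    baseline = matching_records(records, prop, {value})
    variants = {value} | set(mapping.get(value, ()))
    expanded = matching_records(records, prop, variants)
    return {
        "baseline": len(baseline),
        "expanded": len(expanded),
        "gained": sorted(expanded - baseline),
    }


# Hypothetical sample data (assumption, for illustration only).
records = {
    "r1": {"ResourceType": "Corpus"},
    "r2": {"ResourceType": "TextCorpus"},
    "r3": {"ResourceType": "Lexicon"},
}
mapping = {"Corpus": ["TextCorpus"]}

impact = mapping_impact(records, "ResourceType", "Corpus", mapping)
print(impact)
```
\end{verbatim}

Listing the gained records, and not only their count, would let the user inspect what a mapping adds before accepting it.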

\section{Evaluation}

\subsection{Research Questions}


\subsection{Sample Queries}

candidate Categories:
ResourceType, Format,
Genre, Topic,
Project, Institution, Person, Publisher

\subsection{Usability}

\section{Conclusions and Future Work}

\section{Questions, Remarks}

\begin{itemize}
\item How does this relate to federated search?
\item ontological vs. semasiological (semantic features: categorial/archisemes, differentiating, specifying)
\item ``controlled vocabularies''
\end{itemize}


\bibliographystyle{ieee}
\bibliography{../../../2bib/lingua,../../../2bib/ontolingua}


\end{document}