source: SMC4LRT/Outline.tex @ 1201

Last change on this file since 1201 was 1200, checked in by vronk, 13 years ago

File size: 21.8 KB

1	% !TEX TS-program = pdflatex
2	% !TEX encoding = UTF-8 Unicode
3
4	% This is a simple template for a LaTeX document using the "article" class.
5	% See "book", "report", "letter" for other types of document.
6
7	\documentclass[11pt]{article} % use larger type; default would be 10pt
8
9	\usepackage[utf8]{inputenc} % set input encoding (not needed with XeLaTeX)
10
11	\usepackage{url}
12	%\usepackage{svn-multi}
13
14	% Subversion Information
15	%\svnidlong
16	%{$HeadURL: $}
17	%{$LastChangedDate: $}
18	%{$LastChangedRevision: $}
19	%{$LastChangedBy: $}
20	%\svnid{$Id$}
21
22
23	%%% Examples of Article customizations
24	% These packages are optional, depending whether you want the features they provide.
25	% See the LaTeX Companion or other references for full information.
26
27	%%% PAGE DIMENSIONS
28	\usepackage{geometry} % to change the page dimensions
29	\geometry{a4paper} % or letterpaper (US) or a5paper or....
%\geometry{margin=1cm} % for example, change the margins to 1 cm all round
31	\topmargin=-0.5in
32	\textheight=700pt
33	% \geometry{landscape} % set up the page for landscape
34	% read geometry.pdf for detailed page layout information
35
36	\usepackage{graphicx} % support the \includegraphics command and options
37
38	% \usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
39
40	%%% PACKAGES
41	\usepackage{booktabs} % for much better looking tables
42	\usepackage{array} % for better arrays (eg matrices) in maths
43	\usepackage{paralist} % very flexible & customisable lists (eg. enumerate/itemize, etc.)
44	\usepackage{verbatim} % adds environment for commenting out blocks of text & for better verbatim
45	%\usepackage{subfig} % make it possible to include more than one captioned figure/table in a single float
46	% These packages are all incorporated in the memoir class to one degree or another...
47
48	%%% HEADERS & FOOTERS
49	\usepackage{fancyhdr} % This should be set AFTER setting up the page geometry
50	\pagestyle{plain} % options: empty , plain , fancy
51	\renewcommand{\headrulewidth}{0pt} % customise the layout...
52	\lhead{}\chead{}\rhead{}
53	\lfoot{}\cfoot{\thepage}\rfoot{}
54
55	%%% SECTION TITLE APPEARANCE
56	\usepackage{sectsty}
57	\allsectionsfont{\sffamily\mdseries\upshape} % (See the fntguide.pdf for font help)
58	% (This matches ConTeXt defaults)
59
60	%%% ToC (table of contents) APPEARANCE
61	\usepackage[nottoc,notlof,notlot]{tocbibind} % Put the bibliography in the ToC
62	%\usepackage[titles,subfigure]{tocloft} % Alter the style of the Table of Contents
63	%\renewcommand{\cftsecfont}{\rmfamily\mdseries\upshape}
64	%\renewcommand{\cftsecpagefont}{\rmfamily\mdseries\upshape} % No bold!
65
66	%%% END Article customizations
67
68	%%% The "real" document content comes below...
69
70	\title{SMC4LRT - Master Outline}
71	\author{Matej Durco}
72	%\date{} % Activate to display a given date or no date (if empty),
73	% otherwise the current date is printed
74
75	\begin{document}
76	\maketitle
77
78	\tableofcontents
79
80	\section{Introduction}
81
82	Title: Semantic Mapping (Component) for Language Resources
83
84	\subsection{Main Goal}
85
We propose a component that shall enhance search functionality over a large heterogeneous collection of metadata descriptions of Language Resources and Technology (LRT). By applying Semantic Web technology, the component shall give the user both better recall, through query expansion based on related categories/concepts, and new means of exploring the dataset/knowledge base via ontology-driven browsing.
87
A trivial example of concept-based query expansion:
given the user query \texttt{Actor.Name = Sue}, and knowing that \texttt{Actor} is equivalent or similar to \texttt{Person} and that \texttt{Name} is a synonym of \texttt{FullName}, the expanded query could look like this:
\texttt{Actor.Name = Sue OR Actor.FullName = Sue OR Person.Name = Sue OR Person.FullName = Sue}
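
The expansion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table \texttt{EQUIVALENTS} and the function \texttt{expand\_query} are hypothetical names, and the relations between categories are hard-coded here, whereas the proposed component would obtain them from the infrastructure.

\begin{verbatim}
```python
from itertools import product

# Hypothetical, hard-coded equivalence relations between categories.
# In the proposed component these would be retrieved from a registry.
EQUIVALENTS = {
    "Actor": ["Person"],
    "Name": ["FullName"],
}


def expand_query(path, value, equivalents=EQUIVALENTS):
    """Expand a query such as Actor.Name = Sue into an OR-joined disjunction.

    Every step of the dotted path is replaced by itself and by all of its
    known equivalents; all combinations are joined with OR.
    """
    alternatives = [[step] + equivalents.get(step, []) for step in path.split(".")]
    clauses = [f"{'.'.join(combo)} = {value}" for combo in product(*alternatives)]
    return " OR ".join(clauses)


print(expand_query("Actor.Name", "Sue"))
```
\end{verbatim}

A path element without known equivalents is simply passed through unchanged, so the function degrades to the original query when no relations are available.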
91
Another example concerns instance mapping: a user looking for all resources produced by or linked to a given institution does not have to guess or care about the various spellings of the institution's name used in the resource descriptions, but can instead browse a controlled vocabulary of institutions and see all resources of a given institution. While this could be achieved by simply normalizing the literal values (and that certainly has to be one processing step), linking to an ontology additionally enables the user to continue browsing the ontology, to find institutions that are related to the original one because they are concerned with similar topics, and to retrieve the union of the resources of the resulting cluster. Thus, in general, the user is enabled to work with the data based on information that is not present in the original dataset.
93
All these scenarios require a preprocessing step that produces the underlying linkage, both between categories/concepts and between instances (mapping literal values to entities). We refer to this task as semantic mapping; it shall be accomplished by a corresponding ``Semantic Mapping Component''. In this work the focus lies on the process/method, i.e. on the specification and (prototypical) implementation of the component, rather than on establishing a final, complete mapping. Although a tentative/naive alignment on a subset of the data will be proposed, it will mainly be used for evaluation and shall serve as a basis for discussion with domain experts, aiming at creating sensible mappings usable for real tasks.
95
In fact, due to the great diversity of resources and research tasks, such a ``final'' complete mapping/alignment does not seem achievable at all. Therefore the focus shall also be on ``soft'', dynamic mapping: investigating methods that enable users to adapt the mapping, or to apply different mappings, with respect to their current task or research question,
essentially allowing them to actively manipulate the recall/precision ratio of their searches. This entails examining how the relevant information is visualized in the user interface, how the user interacts with it, and how the user can act upon it.
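
One conceivable way to let the user trade precision against recall is to attach a weight to each relation and expose a threshold in the user interface. The following Python sketch illustrates only this idea; the weights, the additional categories \texttt{Participant} and \texttt{Creator}, and the function \texttt{related\_categories} are invented for illustration and are not part of any existing registry.

\begin{verbatim}
```python
# Hypothetical weighted relations: (category, related category, similarity).
# The weights are invented for illustration only.
RELATIONS = [
    ("Actor", "Person", 0.9),
    ("Actor", "Participant", 0.6),
    ("Actor", "Creator", 0.3),
]


def related_categories(category, threshold, relations=RELATIONS):
    """Return the category plus all related categories at or above threshold."""
    selected = [category]
    for source, target, weight in relations:
        if source == category and weight >= threshold:
            selected.append(target)
    return selected


# A strict threshold favours precision, a loose one favours recall.
print(related_categories("Actor", 0.8))
print(related_categories("Actor", 0.2))
```
\end{verbatim}

Lowering the threshold admits weaker relations into the expansion and thus widens the result set; raising it narrows the search back towards the literal query.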
98
99	\subsection{Method}
We start by examining the existing data and describing the evolving infrastructure in which the components are to be embedded.
Then we formulate the task/function of Semantic Search on the level of concepts and on the level of individuals,
together with the underlying Semantic Mapping and the requirements within the defined context.
This is followed by a design proposal for an appropriate component fitting within the infrastructure,
with a special focus on the feasibility of employing ontology mapping and alignment techniques and tools for the creation of mappings.
105
In a prototype we want to deliver a proof of concept,
combined with an evaluation to verify the claims of fitness for purpose.
This evaluation is twofold: it shall verify, first, the ability of the system to support dynamic mapping, based on a set of test queries,
and second, the usability of the UI controls.
110
111
112	+? Identify hooks into LOD?
113
114
\begin{itemize}
\item [a)] define/use semantic relations between categories (RelationRegistry)
\item [b)] employ ontological resources to enhance search in the dataset (SemanticSearch)
\item [c)] specify translation instructions for expressing the dataset in RDF (LinkedData)
\end{itemize}
118
119
120	\subsection{Expected Results}
121
The main result of this work will be a specification of a pair of components: the Semantic Search and the underlying Semantic Mapping. These propositions will be supported by a proof-of-concept implementation of the components and by an evaluation that compares traditional search and semantic search when querying the dataset.
123
124	One important by-product of the work will be the original dataset expressed as RDF with links into existing datasets/ontologies/knowledgebases, building a base for another nucleus of Linked Open Data.
125
126	\begin{itemize}
127	\item [Specification] definition of a mapping mechanism
128	\item [Prototype] proof of concept implementation
129	\item [Evaluation] evaluation results of querying the dataset comparing traditional search and semantic search
130	\item [LinkedData] translation of the source dataset to RDF-based format with links into existing datasets/ontologies/knowledgebases
131
132	\end{itemize}
133
134
135	\subsection{State of the Art}
136
137	\begin{itemize}
138	\item VLO - Virtual Language Observatory \url{http://www.clarin.eu/vlo/}, \cite{VanUytvanck2010}
139	\item LT-World ontology-based \url{http://www.lt-world.org/}, \cite{Joerg2010}
140	\item VAS - Catch Plus
141	\item OAEI
142	\end{itemize}
143
144	\subsection{Keywords}
145
146	Metadata interoperability, Ontology Mapping, Schema mapping, Crosswalk, Similarity measures, LinkedData
147	Fuzzy Search, Visual Search?
148
149	Language Resources and Technology, LRT/NLP/HLT
150
151	Ontology Visualization
152
153	Federated Search, Distributed Content Search
154	(ILS - Integrated Library Systems)
155
156
157	\section{Related Work}
158
159	\subsection{Language Resources and Technology}
160
While a consolidation has generally already taken place in the Digital Libraries community, and big federated networks of digital library repositories have been set up, the landscape in the field of Language Resources and Technology is still scattered, even though it can meanwhile look back at a decade of standardization efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming on the one hand from the wide range of resource types and additionally complicated by the dependence on different schools of thought.
162
Need some numbers about the disparity in the field: number of institutes, resources, formats.
164
This situation has been recognized by the community, and multiple standardization initiatives have been undertaken. This process seems to have gained new momentum thanks to the large Research Infrastructure Programmes introduced by the European Commission, aimed at fostering research communities that develop large-scale pan-European common infrastructures. One key player in this development is the CLARIN project.
166
167	\subsubsection{CLARIN}
168
CLARIN (Common Language Resources and Technology Infrastructure) is constituted by over 180 members from around 38 countries. The mission of this project is to create a research infrastructure that makes language resources and technologies (LRT) available to scholars of all disciplines, especially SSH. It is a large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily usable.
172
This shall be accomplished by setting up a federated network of centers (with federated identity management), but mainly by providing resources and services in an agreed-upon, coherent and standardized manner. The foundation for this goal shall be the Component Metadata Infrastructure (CMDI), a model that caters for flexible metadata profiles and thus allows existing schemas to be accommodated.
174
Being embedded in the CLARIN project, this work is set in the context of Language Resources and HLT (Human Language Technology, also known as NLP, Natural Language Processing), with SSH (Social Sciences and Humanities) as the primary target user group of CLARIN.
176	CLARIN/NLP for SSH
177
178	\subsubsection{Standards}
179
180	\begin{description}
181	\item[ISO12620] Data Category Registry
182	\item[LAF] Linguistic Annotation Framework
183	\item[CMDI] - (DC, OLAC, IMDI, TEI)
184	\end{description}
185
186	\subsubsection{NLP MD Catalogues}
187
188	\begin{description}
\item[LAT, TLA] - Language Archiving Technology, now The Language Archive - provided by Max Planck Institute for Psycholinguistics \url{http://www.mpi.nl/research/research-projects/language-archiving-technology}
190	\item[OTA LR] Archiving Service provided by Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/}
191	\item[OLAC]
192	\item[ELRA]
193	\item[LDC]
194	\item[DFKI/LT-World]
195	\end{description}
196
197	\subsection{Ontologies}
198
199	\subsubsection{Word, Sense, Concept}
200
Lexicon vs. Ontology:
A lexicon is a linguistic object; an ontology is not \cite{Hirst2009}. We don't need to be that strict, but it shall be a guiding principle in this work to consider things (datasets, vocabularies, resources) along this dichotomy/polarity as well: conceptual vs. lexical.
And while every ontology has to have a lexical representation (canonically: rdfs:label, rdfs:comment, skos:*label), if we do not try to force the observed objects into a binary classification, but rather consider a spectrum, we should be able to locate them along this spectrum.
The main focus of a typical ontology thus lies on the concepts (``conceptualization''), which are primarily language-independent.
205
A special case are Linguistic Ontologies (isocat, GOLD, WALS.info),
i.e. ontologies conceptualizing the linguistic domain.

They are special in that (``ontologized'') lexicons refer to them to describe linguistic properties of the lexical entries, as opposed to linking to domain ontologies to anchor senses/meanings.
Lexicalized Ontologies: LingInfo, lemon: LMF + isocat/GOLD + Domain Ontology
211
Another special case are Controlled Vocabularies or Taxonomies/Classification Systems, let alone folksonomies, in that they identify terms with concepts/meanings, i.e. there is no explicit mapping between the linguistic representation and the concept; rather, the term is the implicit carrier of the meaning/concept.
For example, in the LCSH the surface realization of each subject heading at the same time identifies the concept.
214
215	controlled vocabularies?
216
217	Ontology and Lexicon \cite{Hirst2009}
218
219	LingInfo/Lemon \cite{Buitelaar2009}
220
221
222	\subsubsection{Semantic Web - Linked Data}
223
224	\begin{description}
225	\item[RDF/OWL]
226	\item[SKOS]
227	\end{description}
228
229	\subsubsection{OntologyMapping}
230
231
232	\subsection{Visualization}
233
234
235	\subsection{FederatedSearch}
236
237	\subsubsection{Standards}
238
239	\begin{description}
240	\item[Z39.50/SRU/SRW/CQL] LoC
241	\item[OAI-PMH]
242	\end{description}
243
244
245	\subsubsection{(Digital) Libraries}
246
247
248	General (Libraries, Federations):
249
250	\begin{description}
251	\item[OCLC] \url{http://www.oclc.org}
252	world's biggest Library Federation
253	\item[LoC] Library of Congress \url{http://www.loc.gov}
\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections_en.htm}
255	\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
256	\end{description}
257
258	\subsubsection{Content Repositories}
259
260	\begin{description}
261	\item[PHAIDRA] Permanent Hosting, Archiving and Indexing of Digital Resources and Assets, provided by Vienna University \url{https://phaidra.univie.ac.at/}
262	\item[eSciDoc] provided by MPG + FIZ Karlsruhe \url{https://www.escidoc.org/}
263	\item[DRIVER] pan-European infrastructure of Digital Repositories \url{http://www.driver-repository.eu/}
\item[OpenAIRE] - Open Access Infrastructure for Research in Europe \url{http://www.openaire.eu/}
265	\end{description}
266
267
268	\subsubsection{(MD)search frameworks:}
269
270	\begin{description}
271	\item[Zebra/Z39.50] JZKit
272	\item[Lucene/Solr]
273	\item[eXist] - xml DB
274	\end{description}
275
276	\subsubsection{Content/Corpus Search}
277	Corpus Search Systems
278	\begin{description}
279	\item[DDC] - text-corpus
280	\item[manatee] - text-corpus
\item[CQP] - text-corpus
282	\item[TROVA] - MM annotated resources
283	\item[ELAN] - MM annotated resources (editor + search)
284	\end{description}
285
286	\subsection{Summary}
287
288	\section{Definitions}
We want to clarify or lay down a few terms and definitions, i.e. explain our understanding of them:
290
291	\begin{description}
\item[Concept] sense, idea; a philosophical problem which we don't need to discuss here. For our purposes we say: the basic ``entity'' in an ontology (?), that of which an ontology is built.
\item[Ontology] ``an explicit specification of a conceptualization'' [cite!], but for us mainly a collection of concepts, as opposed to a lexicon, which is a collection of words.
\item[Word] a lexical unit, a word in a language; something that has a surface realization (writtenForm) and is a carrier of sense, so that the relation hasSense(Word, Concept) holds.
\item[Lexicon] a collection of words, a (lexical) vocabulary
\item[Vocabulary] an index providing a mapping from Word (string) to Concept (URI)
\item[(Data)Category] (almost) the same as Concept; things like ``Topic'', ``Genre'', ``Organization'', ``ResourceType'' are instantiations of Category
\item[ConceptualDomain] the class of entities a Concept/Category denotes. For Organization it would be all (existing) organizations; CD(ResourceType) = \{Corpus, Lexicon, Document, Image, Video, ...\}. Entities of the domain can themselves be Categories (ResourceType: Image), but they can also be individuals (Organization: University of Vienna).
299	\item[Entity]
300	\item[Resource] informational resource, in the context of CLARIN-Project mainly Language Resources (Corpus, Lexicon, Multimedia)
301	\item[Metadata Description] description of some properties of a resource. MD-Record
302	\item[Schema] - CMD-Profile
303	\item[Annotation]
304
305
306	\end{description}
307
308	\section{Analysis}
309
310	\subsection{Data landscape}
311
312	Describe situation regarding the datasets and formats
313
314	collections, profiles/Terms, ResourceTypes!
315
316	DC, OLAC,
317	ISLE/IMDI, CHILDES, TEI, EAF!
318	(CES/XCES)
319
320	\subsection{Infrastructure}
321
322	CMDI \cite{Broeder2010}
323
324
325	\subsection{Ontologies, Controlled Vocabularies, Knowledge Organizing Systems}
326
327
328	\subsubsection{Classification Schemes, Taxonomies }
329	LCSH, DDC
330
331
332	\subsubsection{Other controlled Vocabularies}
333	Tagsets: STTS
334	Language codes ISO-639-1
335
336	\subsubsection{Domain Ontologies, Vocabularies}
337	Organization-Lists
338	LT-World !?
339
340
341	\subsection{Use Cases}
342
343	\begin{itemize}
344
345	\item MD Search employing Semantic Mapping
346	\item MD Search employing Fuzzy Search
347	\item Content Search
\item Combined Metadata and Content Search
349	\item Visualization of the Results - charts on facets/dimensions
350
351	\item Create and publish Virtual Collection based on complex Search (intensional/extensional)
\item Let the user create an ad-hoc corpus
353	\end{itemize}
354
355	\section{Semantic Mapping}
356
357
358	\subsection{Profiles to Data Categories}
359	CMD:Profile.Comp.Elem -> DatCat
360
361
362	\subsection{Semantic Relations between (Data)Categories}
363
364	Relation Registry
365
366	!check DCR-RR/Odijk2010 -follow up
367	!Cf. Erhard Hinrichs 2009
368
369
370	\subsection{Mapping from strings to Entities}
371
Based on the textual values in the metadata descriptions, find matching entities in selected ontologies.
373
374	Identify related ontologies:
375	LT-World \cite{Joerg2010}
376
377	task:
378	\begin{enumerate}
379	\item express MDRecords in RDF
380	\item identify related ontologies/vocabularies (category -> vocabulary)
\item implement (reuse) a lookup/mapping function (Vocabulary Alignment Service? CATCH-PLUS?)

\fbox{function lookup: Category $\times$ String $\rightarrow$ ConceptualDomain}

Normally this would be served by dedicated controlled vocabularies, but some string-normalizing preprocessing etc. is to be expected as well.
386	\end{enumerate}
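
A minimal Python sketch of such a lookup function, combining string normalization with a per-category vocabulary, might look as follows. The vocabulary content, the \texttt{example.org} URIs and the helper \texttt{normalize} are assumptions made for illustration; a real implementation would query a dedicated vocabulary service instead of an in-memory table.

\begin{verbatim}
```python
import re
import unicodedata

# Hypothetical controlled vocabulary per category: normalized label -> entity URI.
# The URIs use the reserved example.org domain and do not denote real entities.
VOCABULARIES = {
    "Organization": {
        "university of vienna": "http://example.org/org/univie",
        "universitat wien": "http://example.org/org/univie",
    },
}


def normalize(text):
    """Lower-case, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", without_marks.lower())
    return " ".join(cleaned.split())


def lookup(category, literal, vocabularies=VOCABULARIES):
    """Map (category, string) to an entity URI, or None if nothing matches."""
    return vocabularies.get(category, {}).get(normalize(literal))
```
\end{verbatim}

Returning \texttt{None} for unmatched strings keeps the literal value available to the caller, so that unmapped records are not lost.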
387
388
389
390	\subsection{Semantic Search}
391
The main purpose of the undertaking described in the previous subsections (mapping of concepts and of entities) is to enhance the search capabilities of the MDService serving the metadata/resource data, namely by employing ontological resources.
Mainly, this enhancement shall mean that the user can access the data indirectly by browsing one or multiple ontologies
with which the data will then be linked. These could be, for example, ontologies of organizations and projects.

In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
A crucial aspect is the question of how to deal with the even greater amount of information in a user-friendly way, i.e. how to avoid overwhelming, intimidating or frustrating the user.

The semantic mapping shall be offered semi-transparently: primarily it shall integrate seamlessly into the interaction with the service, but on demand it shall ``explain'' itself, i.e. offer enough information for the user to understand its role and to manipulate it easily.
400
401	?
402	Facets
403	Controlled Vocabularies
404	Synonym Expansion (via TermExtraction(ContentSet))
405
406	\subsection{Linked Data - Express dataset in RDF}
407
Partly as a by-product of the entity-mapping effort we will get the metadata descriptions rendered in RDF, linked with existing datasets/ontologies/knowledgebases.
So theoretically we then only need to provide them ``on the web'' to make them a nucleus of the LinkedData cloud.

Practically this won't be that straightforward, as the mapping to entities will require a considerable amount of work.
But once that is solved, or for the subsets for which it is solved, publishing the data on the ``Semantic Web'' should be easy.
413
414	Technical aspects (RDF-store?) / interface (ontology browser?)
415
416	defining the Mapping:
417	\begin{enumerate}
418	\item convert to RDF
translate: MDRecord -> [\#mdrecord \#property literal]
420	\item map: \#mdrecord \#property literal -> [\#mdrecord \#property \#entity]
421	\end{enumerate}
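
The two steps can be sketched in Python as follows. Triples are modelled as plain tuples to keep the example free of dependencies; the record content, the identifiers and the table \texttt{ENTITIES} are hypothetical and merely stand in for the real metadata records and the string-to-entity mapping.

\begin{verbatim}
```python
# Triples are modelled as plain (subject, property, object) tuples to keep the
# sketch dependency-free; a real implementation would likely use an RDF library.

def translate(record_id, record):
    """Step 1: turn a flat metadata record into triples with literal objects."""
    return [(record_id, prop, value) for prop, value in record.items()]


def map_entities(triples, lookup):
    """Step 2: replace a literal by an entity wherever the lookup finds one."""
    mapped = []
    for subject, prop, literal in triples:
        entity = lookup(prop, literal)
        mapped.append((subject, prop, entity if entity is not None else literal))
    return mapped


# Hypothetical lookup table standing in for the string-to-entity mapping.
ENTITIES = {("Organization", "Univ. of Vienna"): "#UniversityOfVienna"}

triples = translate(
    "#mdrecord1",
    {"Organization": "Univ. of Vienna", "Title": "Sample corpus"},
)
linked = map_entities(triples, lambda prop, literal: ENTITIES.get((prop, literal)))
print(linked)
```
\end{verbatim}

Literals without a matching entity are kept as they are, so the second step can be applied incrementally to those subsets of the data for which a mapping already exists.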
422
423	\subsection{Content/Annotation}
424	AF + DCR + RR
425
426
427	\subsection{Visualization}
428	Landscape, Treemap, SOM
429
Ontology Mapping and Alignment / saiks/Ontology4 4auf1.pdf
431
432
433	\section{System Design}
434	SOA
435
436	\subsection{Architecture}
437
Makes use of multiple components of the established infrastructure (CLARIN) \cite{Varadi2008}, \cite{Broeder2010}:

\begin{itemize}
\item Data Category Registry
\item Relation Registry
\item Component Registry
\item Vocabulary Alignment Service
\end{itemize}
merging the pieces of information provided by those and
offering them semi-transparently to the user (or application) on the consumption side.
448
449
450	\subsection{CMDI}
451
452	MDBrowser
453	MDService
454
455	\subsection{Query Language}
456	CQL?
457
458	\subsection{User Interface}
459
460	\subsubsection{Query Input}
461
462	\subsubsection{Columns}
463
464	\subsubsection{Summaries}
465
466	\subsubsection{Differential Views}
467	Visualize impact of given mapping in terms of covered dataset (number of matched records).
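
As a rough Python sketch of what such a differential view has to compute, the following compares the number of records matched by a plain query with the number matched once a mapping is applied. The sample records, the mapping and the function names are hypothetical and only serve the illustration.

\begin{verbatim}
```python
def matched_records(records, field, accepted_values):
    """Return the ids of records whose field holds one of the accepted values."""
    return {rid for rid, rec in records.items() if rec.get(field) in accepted_values}


def differential_view(records, field, value, mapping):
    """Compare the hits of a plain query with those of the mapped query."""
    plain = matched_records(records, field, {value})
    expanded = matched_records(records, field, {value} | set(mapping.get(value, [])))
    return {
        "plain": len(plain),
        "expanded": len(expanded),
        "gained": sorted(expanded - plain),
    }


# Hypothetical sample data: three records and one mapping of spelling variants.
RECORDS = {
    "r1": {"Organization": "University of Vienna"},
    "r2": {"Organization": "Universitaet Wien"},
    "r3": {"Organization": "Oxford Text Archive"},
}
MAPPING = {"University of Vienna": ["Universitaet Wien"]}

print(differential_view(RECORDS, "Organization", "University of Vienna", MAPPING))
```
\end{verbatim}

The list of gained records is what a user interface could highlight, so that the user sees which hits are due to the mapping rather than to the literal query.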
468
469	\section{Evaluation}
470
471	\subsection{Research Questions }
472
473
474	\subsection{Sample Queries}
475
476	candidate Categories:
477	ResourceType, Format
478	Genre, Topic
479	Project, Institution, Person, Publisher
480
481	\subsection{Usability}
482
\section{Conclusions and Future Work}
484
485	\section{Questions, Remarks}
486
487	\begin{itemize}
488	\item How does this relate to federated search?
\item ontological vs. semasiological (semantic features: categorial/archisemes, differentiating, specifying)
490	\item "controlled vocabularies"
491	\end{itemize}
492
493
494	\bibliographystyle{ieee}
495	\bibliography{../../../2bib/lingua,../../../2bib/ontolingua}
496
497
498	\end{document}
