% !TEX TS-program = pdflatex
% !TEX encoding = UTF-8 Unicode

% This is a simple template for a LaTeX document using the "article" class.
% See "book", "report", "letter" for other types of document.

\documentclass[11pt]{article} % use larger type; default would be 10pt

\usepackage[utf8]{inputenc} % set input encoding (not needed with XeLaTeX)

\usepackage{url}
%\usepackage{svn-multi}

% Subversion Information
%\svnidlong
%{$HeadURL: $}
%{$LastChangedDate: $}
%{$LastChangedRevision: $}
%{$LastChangedBy: $}
%\svnid{$Id$}


%%% Examples of Article customizations
% These packages are optional, depending whether you want the features they provide.
% See the LaTeX Companion or other references for full information.

%%% PAGE DIMENSIONS
\usepackage{geometry} % to change the page dimensions
\geometry{a4paper} % or letterpaper (US) or a5paper or....
%\geometry{margin=1cm} % for example, change the margins to 2 inches all round
\topmargin=-0.5in
\textheight=700pt
% \geometry{landscape} % set up the page for landscape
%   read geometry.pdf for detailed page layout information

\usepackage{graphicx} % support the \includegraphics command and options

% \usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent

%%% PACKAGES
\usepackage{booktabs} % for much better looking tables
\usepackage{array} % for better arrays (eg matrices) in maths
\usepackage{paralist} % very flexible & customisable lists (eg. enumerate/itemize, etc.)
\usepackage{verbatim} % adds environment for commenting out blocks of text & for better verbatim
%\usepackage{subfig} % make it possible to include more than one captioned figure/table in a single float
% These packages are all incorporated in the memoir class to one degree or another...

%%% HEADERS & FOOTERS
\usepackage{fancyhdr} % This should be set AFTER setting up the page geometry
\pagestyle{plain} % options: empty , plain , fancy
\renewcommand{\headrulewidth}{0pt} % customise the layout...
\lhead{}\chead{}\rhead{}
\lfoot{}\cfoot{\thepage}\rfoot{}

%%% SECTION TITLE APPEARANCE
\usepackage{sectsty}
\allsectionsfont{\sffamily\mdseries\upshape} % (See the fntguide.pdf for font help)
% (This matches ConTeXt defaults)

%%% ToC (table of contents) APPEARANCE
\usepackage[nottoc,notlof,notlot]{tocbibind} % Put the bibliography in the ToC
%\usepackage[titles,subfigure]{tocloft} % Alter the style of the Table of Contents
%\renewcommand{\cftsecfont}{\rmfamily\mdseries\upshape}
%\renewcommand{\cftsecpagefont}{\rmfamily\mdseries\upshape} % No bold!

%%% END Article customizations

%%% The "real" document content comes below...

\title{SMC4LRT - Master Outline}
\author{Matej Durco}
%\date{} % Activate to display a given date or no date (if empty),
         % otherwise the current date is printed

\begin{document}
\maketitle

\tableofcontents

\section{Introduction}

Title: Semantic Mapping (Component) for Language Resources

\subsection{Main Goal}

\begin{enumerate}[a)]
\item define/use semantic relations between categories (RelationRegistry)
\item employ ontological resources to enhance search in the dataset (SemanticSearch)
\item specify translation instructions for expressing the dataset in RDF (LinkedData)
\end{enumerate}

We propose a semantic mapping component for Language Resources and Technology within the context of a federated infrastructure (being constructed in the CLARIN project).
Due to the great diversity of resources and research tasks, a full alignment is not achievable. Rather, the focus shall be on ``soft'', dynamic mapping: investigating methods that enable users to control the mapping with respect to their current task,
essentially allowing them to actively manipulate the recall/precision ratio of their searches. This entails examining how the relevant information is visualized in the user interface, how the user interacts with it, and how the user can act upon it.
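
For reference, the two measures can be stated as follows. This is the common textbook definition, not something specific to this proposal; the symbols $R$ (the set of records relevant to a query) and $A$ (the set of records actually returned) are introduced here only for illustration.
\[
\mathit{precision} = \frac{|R \cap A|}{|A|}, \qquad \mathit{recall} = \frac{|R \cap A|}{|R|}
\]
Assuming that a relaxed mapping returns a superset of the records returned by a stricter one, relaxing the mapping cannot lower recall but will typically lower precision; tightening it has the opposite tendency.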

Example


\subsection{Method}
We start by examining the existing data and describing the evolving infrastructure.
Then we formulate the task/function of Semantic Mapping and the requirements within the defined context,
followed by a design proposal for an appropriate component fitting within the infrastructure.
With a prototype we want to deliver a proof of concept,
combined with an evaluation to verify the claims of fitness for purpose.
This evaluation is twofold: it shall verify, first, the ability of the system to support dynamic mapping, based on a set of test queries,
and, second, the usability of the UI controls.

+? Identify hooks into LOD?

\subsection{Expected Results}

\begin{itemize}
\item [Specification] definition of a mapping mechanism
\item [Prototype] proof-of-concept implementation
\item [Evaluation] evaluation results of querying the dataset, comparing traditional search and semantic search
\item [LinkedData] translation of the source dataset to an RDF-based format with links into existing datasets/ontologies/knowledge bases
\end{itemize}


\subsection{State of the Art}

\begin{itemize}
\item VLO - Virtual Language Observatory \url{http://www.clarin.eu/vlo/}, \cite{VanUytvanck2010}
\item Ontology and Lexicon \cite{Hirst2009}
\item LingInfo/Lemon \cite{Buitelaar2009}
\end{itemize}

\subsection{Keywords}

Metadata interoperability, Ontology Mapping, Schema mapping, Crosswalk, Similarity measures, LinkedData,
Fuzzy Search, Visual Search?

Language Resources and Technology, LRT/NLP/HLT

Ontology Visualization

Federated Search, Distributed Content Search
(ILS - Integrated Library Systems)


\section{Related Work}

\subsection{Language Resources and Technology}

While the Digital Libraries community has largely consolidated and large federated networks of digital library repositories have been set up, the landscape in the field of Language Resources and Technology is still scattered, despite a decade of standardization efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, which stems from the wide range of resource types and is further complicated by its dependence on different schools of thought.

Need some numbers about the disparity in the field: number of institutes, resources, formats.

The community has recognized this situation, and multiple standardization initiatives have been undertaken. This process seems to have gained new momentum thanks to large Research Infrastructure Programmes introduced by the European Commission, aimed at fostering research communities that develop large-scale pan-European common infrastructures. One key player in this development is the CLARIN project.

\subsubsection{CLARIN}

CLARIN (Common Language Resources and Technology Infrastructure) comprises over 180 members from around 38 countries. The mission of this project is:

\begin{quote}
create a research infrastructure that makes language resources and technologies (LRT) available to scholars of all disciplines, especially SSH

large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily useable
\end{quote}

This shall be accomplished by setting up a federated network of centers (with federated identity management), but mainly by providing resources and services in an agreed-upon, consistent, standardized manner. The foundation for this goal shall be the Component Metadata Infrastructure (CMDI), a model that caters for flexible metadata profiles and thus allows existing schemas to be accommodated.

Being embedded in the CLARIN project, this work is set in the context of Language Resources and HLT (Human Language Technology, also known as NLP, Natural Language Processing), with SSH (Social Sciences and Humanities) as the primary target user group of CLARIN.

CLARIN/NLP for SSH

\subsubsection{Standards}

\begin{description}
\item[ISO12620] Data Category Registry
\item[LAF] Linguistic Annotation Framework
\item[CMDI] - (DC, OLAC, IMDI, TEI)
\end{description}

\subsubsection{NLP MD Catalogues}

\begin{description}
\item[LAT, TLA] - Language Archiving Technology, now The Language Archive - provided by the Max Planck Institute for Psycholinguistics \url{http://www.mpi.nl/research/research-projects/language-archiving-technology}
\item[OTA LR] Archiving Service provided by the Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/}
\item[OLAC]
\item[ELRA]
\item[LDC]
\item[DFKI/LT-World]
\end{description}

\subsection{Ontologies}

\subsubsection{Word, Sense, Concept}

Lexicon vs. Ontology:
a lexicon is a linguistic object, an ontology is not~\cite{Hirst2009}. We do not need to be that strict, but it shall be a guiding principle in this work to consider things (datasets, vocabularies, resources) along this dichotomy as well: conceptual vs. lexical.
Every ontology has to have a lexical representation (canonically: rdfs:label, rdfs:comment, skos:*label). Still, if we do not force the observed objects into a binary classification but instead consider a spectrum between the two poles, we should be able to locate each of them along this spectrum.
The main focus of a typical ontology is on the concepts (``conceptualization''), which are primarily language-independent.

A special case are linguistic ontologies (ISOcat, GOLD, WALS.info),
i.e. ontologies conceptualizing the linguistic domain.

They are special in that (``ontologized'') lexicons refer to them to describe linguistic properties of the lexical entries, as opposed to linking to domain ontologies to anchor senses/meanings.
Lexicalized ontologies: LingInfo, lemon: LMF + ISOcat/GOLD + Domain Ontology

Another special case are controlled vocabularies and taxonomies/classification systems, let alone folksonomies, in that they identify terms with concepts/meanings: there is no explicit mapping between the language representation and the concept; rather, the term is the implicit carrier of the meaning/concept.
For example, in the LCSH the surface realization of each subject heading at the same time identifies the concept.

controlled vocabularies?


\subsubsection{Semantic Web - Linked Data}

\begin{description}
\item[RDF/OWL]
\item[SKOS]
\end{description}

\subsubsection{OntologyMapping}


\subsection{Visualization}


\subsection{FederatedSearch}

\subsubsection{Standards}

\begin{description}
\item[Z39.50/SRU/SRW/CQL] LoC
\item[OAI-PMH]
\end{description}


\subsubsection{(Digital) Libraries}


General (Libraries, Federations):

\begin{description}
\item[OCLC] \url{http://www.oclc.org}
    world's biggest Library Federation
\item[LoC] Library of Congress \url{http://www.loc.gov}
\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections_en.htm}
\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
\end{description}

\subsubsection{Content Repositories}

\begin{description}
\item[PHAIDRA] Permanent Hosting, Archiving and Indexing of Digital Resources and Assets, provided by Vienna University \url{https://phaidra.univie.ac.at/}
\item[eSciDoc] provided by MPG + FIZ Karlsruhe \url{https://www.escidoc.org/}
\item[DRIVER] pan-European infrastructure of Digital Repositories \url{http://www.driver-repository.eu/}
\item[OpenAIRE] - Open Access Infrastructure for Research in Europe \url{http://www.openaire.eu/}
\end{description}


\subsubsection{(MD) Search Frameworks}

\begin{description}
\item[Zebra/Z39.50] JZKit
\item[Lucene/Solr]
\item[eXist] - XML DB
\end{description}

\subsubsection{Content/Corpus Search}
Corpus Search Systems
\begin{description}
\item[DDC] - text-corpus
\item[manatee] - text-corpus
\item[CQP] - text-corpus
\item[TROVA] - MM annotated resources
\item[ELAN] - MM annotated resources (editor + search)
\end{description}

\subsection{Summary}

\section{Definitions}
We want to clarify or lay down a few terms and definitions, i.e. explain our understanding of them.

\begin{description}
\item[Concept] sense, idea; a philosophical problem which we do not need to discuss here. For our purposes we say: the basic ``entity'' in an ontology, i.e. that of which an ontology is built
\item[Ontology] ``an explicit specification of a conceptualization'' [cite!], but for us mainly a collection of concepts, as opposed to a lexicon, which is a collection of words
\item[Word] a lexical unit, a word in a language, something that has a surface realization (writtenForm) and is a carrier of sense; so a relation holds: hasSense(Word, Concept)
\item[Lexicon] a collection of words, a (lexical) vocabulary
\item[Vocabulary] an index providing a mapping from Word (string) to Concept (URI)
\item[(Data)Category] (almost) the same as Concept; things like ``Topic'', ``Genre'', ``Organization'', ``ResourceType'' are instantiations of Category
\item[ConceptualDomain] the class of entities a Concept/Category denotes. For Organization it would be all (existing) organizations; CD(ResourceType) = \{Corpus, Lexicon, Document, Image, Video, \ldots\}. Entities of the domain can themselves be Categories (ResourceType: Image), but they can also be individuals (Organization: University of Vienna)
\item[Entity]
\item[Resource] informational resource; in the context of the CLARIN project mainly Language Resources (Corpus, Lexicon, Multimedia)
\item[Metadata Description] description of some properties of a resource; MD-Record
\item[Schema] - CMD-Profile
\item[Annotation]
\end{description}

\section{Analysis}

\subsection{Data landscape}

Describe the situation regarding the datasets and formats

collections, profiles/Terms, ResourceTypes!

DC, OLAC,
ISLE/IMDI, CHILDES, TEI, EAF!
(CES/XCES)

\subsection{Infrastructure}

CMDI \cite{Broeder2010}


\subsection{Ontologies, Controlled Vocabularies, Knowledge Organizing Systems}


\subsubsection{Classification Schemes, Taxonomies}
LCSH, DDC


\subsubsection{Other Controlled Vocabularies}
Tagsets: STTS

Language codes: ISO-639-1

\subsubsection{Domain Ontologies, Vocabularies}
Organization-Lists

LT-World !?


\subsection{Use Cases}

\begin{itemize}

\item MD Search employing Semantic Mapping
\item MD Search employing Fuzzy Search
\item Content Search
\item Combined Metadata and Content Search
\item Visualization of the Results - charts on facets/dimensions

\item Create and publish a Virtual Collection based on a complex search (intensional/extensional)
\item Let the user create an ad-hoc corpus
\end{itemize}

\section{Semantic Mapping}


\subsection{Profiles to Data Categories}
CMD:Profile.Comp.Elem $\rightarrow$ DatCat


\subsection{Semantic Relations between (Data)Categories}

Relation Registry

!check DCR-RR/Odijk2010 -follow up

!Cf. Erhard Hinrichs 2009


\subsection{Mapping from strings to Entities}

Based on the textual values in the metadata descriptions, find matching entities in selected ontologies.

Identify related ontologies:
LT-World \cite{Joerg2010}

Task:
\begin{enumerate}
\item express MDRecords in RDF
\item identify related ontologies/vocabularies (category $\rightarrow$ vocabulary)
\item implement (reuse) a lookup/mapping function (Vocabulary Alignment Service? CATCH-PLUS?)

\fbox{function lookup: Category $\times$ String $\rightarrow$ ConceptualDomain}

Normally this would be served by dedicated controlled vocabularies, but some string-normalizing preprocessing etc. is to be expected as well.
\end{enumerate}
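
The lookup function above could be decomposed as follows. This is a minimal sketch, assuming one controlled vocabulary per category and a separate string-normalization step; the names $\mathit{norm}$ and $V_c$ are hypothetical and introduced only for illustration.
\[
\mathit{lookup}(c, s) = V_c(\mathit{norm}(s)), \qquad V_c : \mathit{String} \rightarrow \mathit{CD}(c)
\]
Here $c$ is a category, $s$ is the textual value found in a metadata record, and $\mathit{CD}(c)$ is the conceptual domain of $c$. If, for instance, the vocabulary for Organization contained the individual University of Vienna, then ``University of Vienna'' and a differently capitalized or spaced variant of it should resolve to the same entity after normalization. Strings without a match would have to be handled separately, for example by keeping the literal unchanged.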



\subsection{Semantic Search}

The main purpose of the undertaking described in the previous subsections (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the metadata/resource data, namely by employing ontological resources.
Primarily, this enhancement shall mean that the user can access the data indirectly by browsing one or multiple ontologies
with which the data will then be linked. These could be, for example, ontologies of organizations and projects.

In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
A crucial aspect is the question of how to deal with the even greater amount of information in a user-friendly way, i.e. how to avoid overwhelming, intimidating or frustrating the user.

The mapping shall be offered semi-transparently. This means that the semantic mapping shall primarily integrate seamlessly into the interaction with the service, but it shall ``explain'' itself on demand, i.e. offer enough information for the user to understand its role and to manipulate it easily.

?
Facets,
Controlled Vocabularies,
Synonym Expansion (via TermExtraction(ContentSet))

\subsection{Linked Data - Express dataset in RDF}

Partly as a by-product of the entity-mapping effort we will get the metadata descriptions rendered in RDF, linked with the matched entities.
So theoretically we then only need to provide them ``on the web'' to make them a nucleus of the LinkedData cloud.

In practice this will not be that straightforward, as the mapping to entities will require considerable effort.
But once that is solved, or for the subsets for which it is solved, publishing that data on the ``Semantic Web'' should be easy.

Technical aspects (RDF-store?) / interface (ontology browser?)

Defining the mapping:
\begin{enumerate}
\item convert to RDF;
translate: MDRecord $\rightarrow$ [\#mdrecord \#property literal]
\item map: \#mdrecord \#property literal $\rightarrow$ [\#mdrecord \#property \#entity]
\end{enumerate}
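
The two steps can also be written as functions on sets of triples. This is only a sketch, under the assumption that each metadata record is flattened into (record, property, value) triples and that the lookup function from the subsection on mapping strings to entities is available; $r$, $p$, $s$, $T$ and $c_p$ are illustrative symbols.
\[
\mathit{translate}(r) = \{ (r, p, s) \mid \mbox{record } r \mbox{ has the literal value } s \mbox{ for property } p \}
\]
\[
\mathit{map}(T) = \{ (r, p, \mathit{lookup}(c_p, s)) \mid (r, p, s) \in T \}
\]
Here $c_p$ denotes the category associated with property $p$. Triples whose literal cannot be resolved to an entity would stay in their literal form, so the result of the second step may be a mixture of literal-valued and entity-valued triples.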

\subsection{Content/Annotation}
AF + DCR + RR


\subsection{Visualization}
Landscape, Treemap, SOM

Ontology Mapping and Alignment / saiks/Ontology4 4auf1.pdf


\section{System Design}
SOA

\subsection{Architecture}

The system makes use of multiple components of the established infrastructure (CLARIN) \cite{Varadi2008}, \cite{Broeder2010}:

\begin{itemize}
\item Data Category Registry
\item Relation Registry
\item Component Registry
\item Vocabulary Alignment Service
\end{itemize}
It merges the pieces of information provided by those components
and offers them semi-transparently to the user (or application) on the consumption side.


\subsection{CMDI}

MDBrowser

MDService

\subsection{Query Language}
CQL?

\subsection{User Interface}

\subsubsection{Query Input}

\subsubsection{Columns}

\subsubsection{Summaries}

\subsubsection{Differential Views}
Visualize the impact of a given mapping in terms of the covered dataset (number of matched records).
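
One way to quantify this impact is sketched below with illustrative symbols: let $D$ be the dataset, $q$ a query, and $\mathit{match}_M(q) \subseteq D$ the set of records matched by $q$ under a mapping $M$.
\[
\mathit{cov}_M(q) = \frac{|\mathit{match}_M(q)|}{|D|}, \qquad \Delta_q(M, M') = |\mathit{match}_{M'}(q)| - |\mathit{match}_M(q)|
\]
A differential view could then show $\Delta_q(M, M')$ when the user switches from mapping $M$ to mapping $M'$, i.e. how many records the change adds to or removes from the result.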

\section{Evaluation}

\subsection{Research Questions}


\subsection{Sample Queries}

Candidate categories:
ResourceType, Format,
Genre, Topic,
Project, Institution, Person, Publisher

\subsection{Usability}

\section{Conclusions and Future Work}

\section{Questions, Remarks}

\begin{itemize}
\item How does this relate to federated search?
\item ontological vs. semasiological (semantic features: categorial/archisemes, differentiating, specifying)
\end{itemize}


\bibliographystyle{ieee}
\bibliography{../../../2bib/lingua,../../../2bib/ontolingua}


\end{document}