Changeset 1188 for SMC4LRT


Timestamp:
04/01/11 21:55:53 (13 years ago)
Author:
vronk
Message:

started seriously, but just chaotic intermediate version

File:
1 edited

  • SMC4LRT/Outline.tex

    r1186 r1188  
    1010
    1111\usepackage{url}
     12%\usepackage{svn-multi}
     13
     14% Subversion Information
     15%\svnidlong
     16%{$HeadURL: $}
     17%{$LastChangedDate: $}
     18%{$LastChangedRevision: $}
     19%{$LastChangedBy: $}
     20%\svnid{$Id$}
     21
    1222
    1323%%% Examples of Article customizations
     
    7383
    7484\subsection{Main Goal}
     85
     86
     87a) define/use semantic relations between categories (RelationRegistry)
     88b) employ ontological resources to enhance search in the dataset (SemanticSearch)
     89c) specify translation instructions for expressing the dataset in RDF (LinkedData)
     90
    7591Propose a semantic mapping component for Language Resources and Technology within the context of a federated infrastructure (being constructed in the project CLARIN).
    7692Due to the great diversity of resources and research tasks, a full alignment is not achievable. Rather, the focus shall be on "soft", dynamic mapping, investigating the possibilities/methods to enable the users to control the mapping with respect to their current task,
    7793essentially being able to actively manipulate the recall/precision ratio of their searches. This entails the examination of user interaction with and visualization of the relevant information in the user interface and enabling the user to act upon it.
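
The effect of a user-controlled "soft" mapping on recall and precision can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the field names, the mapping weights, the toy index and the set of relevant records are hypothetical and not taken from the CLARIN data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved record ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy index: metadata field -> ids of records matching the query term there.
INDEX = {"dc:title": {1, 2}, "olac:subject": {2, 3}, "imdi:Description": {4, 5}}

# Hypothetical mapping weights between the queried category and each field.
MAPPING = {"dc:title": 1.0, "olac:subject": 0.8, "imdi:Description": 0.4}

# Hypothetical ground truth: the records a user would judge relevant.
RELEVANT = {1, 2, 3, 4}


def search(threshold):
    """Union of the matches in all fields whose mapping weight reaches the threshold."""
    result = set()
    for field, weight in MAPPING.items():
        if weight >= threshold:
            result |= INDEX[field]
    return result


strict = precision_recall(search(1.0), RELEVANT)  # exact mapping only
loose = precision_recall(search(0.4), RELEVANT)   # all soft mappings included
print(strict, loose)
```

Lowering the threshold admits weaker mappings, which in this toy setting raises recall at the cost of precision; the threshold stands in for whatever control the user interface would eventually expose.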
     94
     95Example
    7896
    7997
     
    87105and secondly the usability of the ui-controls.
    88106
     107+? Identify hooks into LOD?
     108
    89109\subsection{Expected Results}
    90110
    91111\begin{itemize}
    92 
    93 \item Specification
    94 \item Prototype
    95 \item Evaluation
     112\item [Specification] definition of a mapping mechanism
     113\item [Prototype] proof of concept implementation
     114\item [Evaluation] evaluation results of querying the dataset comparing traditional search and semantic search
     115\item [LinkedData] translation of the source dataset to RDF-based format with links into existing datasets/ontologies/knowledgebases
    96116
    97117\end{itemize}
     
    101121
    102122\begin{itemize}
    103 \item OLAC - Open Language Archives Community \ url{http://www.language-archives.org/}
    104 \item VLO - Virtual Language Observatory  \url{http://www.clarin.eu/vlo/}
    105 \item nature.com OpenSearch: A Case Study in OpenSearch and SRU Integration \cite{hammond2010}
    106 \item Kowalski (2011): Information Retrieval  \cite{kowalski2011}
     123\item VLO - Virtual Language Observatory  \url{http://www.clarin.eu/vlo/}, \cite{VanUytvanck2010}
     124\item Ontology and Lexicon \cite{Hirst2009}
     125\item LingInfo/Lemon \cite{Buitelaar2009}
    107126\end{itemize}
    108127
    109128\subsection{Keywords}
    110129
    111 Information retrieval, Information Discovery, IR-Systems
    112 ILS - Integrated Library Systems
    113 
    114 Metadata interoperability - schema mapping /semantic mapping repository, crosswalk,
    115 
    116 Distributed content search, federated search
    117 
    118 Fuzzy Search / Similarity measures
    119 schema mapping / semantic mapping repository profiling?
    120 
    121 Visualization (Treemap, SOM)
    122 
    123 \section{Context}
     130Metadata interoperability, Ontology Mapping, Schema mapping, Crosswalk, Similarity measures, LinkedData
     131Fuzzy Search, Visual Search?
     132
     133Language Resources and Technology, LRT/NLP/HLT
     134
     135Ontology Visualization
     136
     137Federated Search, Distributed Content Search
     138(ILS - Integrated Library Systems)
     139
     140
     141\section{Related Work}
    124142
    125143\subsection{Language Resources and Technology}
     
    129147Need some number about the disparity in the field, number of institutes, resources, formats.
    130148
    131 This situation has been identified by the community and multiple standardization initiatives had been conducted. This process seems to have gained a new momentum thanks to large Research Infrastructure Programmes introduced by European Commission, aimed at fostering Research communities developing large-scale pan-european common infrastructures. One key player in this development is the project CLARIN
    132 
    133 \subsection{CLARIN}
     149This situation has been identified by the community and multiple standardization initiatives have been undertaken. This process seems to have gained new momentum thanks to the large Research Infrastructure Programmes introduced by the European Commission, aimed at fostering research communities that develop large-scale pan-European common infrastructures. One key player in this development is the project CLARIN.
     150
     151\subsubsection{CLARIN}
    134152
    135153CLARIN - Common Language Resource and Technology Infrastructure - constituted by over 180 members from around 38 countries. The mission of this project is
     
    142160CLARIN/NLP for SSH
    143161
    144 \subsection{Standards}
    145 
    146 \begin{description}
    147 \item[ISO12620]
    148 \item[Z39.50/SRU/SRW/CQL] LoC
     162\subsubsection{Standards}
     163
     164\begin{description}
     165\item[ISO12620] Data Category Registry
     166\item[LAF] Linguistic Annotation Framework
    149167\item[CMDI] - (DC, OLAC, IMDI, TEI)
    150 \item[RDF/OWL]
    151 \end{description}
    152 
    153 \subsection{MD Catalogues}
    154 
    155 \subsubsection{NLP}
     168\end{description}
     169
     170\subsubsection{NLP MD Catalogues}
    156171
    157172\begin{description}
     
    164179\end{description}
    165180
     181\subsection{Ontologies}
     182
     183\subsubsection{Word, Sense, Concept}
     184
     185Lexicon vs. Ontology
     186A lexicon is a linguistic object, an ontology is not \cite{Hirst2009}. We don't need to be that strict, but it shall be a guiding principle in this work to consider things (Datasets, Vocabularies, Resources) also along this dichotomy/polarity: Conceptual vs. Lexical.
     187And while every Ontology has to have a lexical representation (canonically: rdfs:label, rdfs:comment, skos:*label), if we don't try to force observed objects into a binary classification, but consider a bias spectrum, we should be able to locate these along this spectrum.
     188So the main focus of a typical ontology is on the concepts ("conceptualization"), which are primarily language-independent.
     189
     190A special case are Linguistic Ontologies: isocat, GOLD, WALS.info
     191ontologies conceptualizing the linguistic domain
     192
     193They are special in that ("ontologized") Lexicons refer to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings.
     194Lexicalized Ontologies: LingInfo, lemon: LMF +  isocat/GOLD +  Domain Ontology
     195
     196Another special case are Controlled Vocabularies or Taxonomies/Classification Systems, let alone folksonomies, in that they identify terms with concepts/meanings, i.e. there is no explicit mapping between the language representation and the concept, but rather the term is the implicit carrier of the meaning/concept.
     197So for example in the LCSH the surface realization of each subject-heading at the same time identifies the Concept ~.
     198
     199controlled vocabularies?
     200
     201
     202\subsubsection{Semantic Web - Linked Data}
     203
     204\begin{description}
     205\item[RDF/OWL]
     206\item[SKOS]
     207\end{description}
     208
     209\subsubsection{OntologyMapping}
     210
     211
     212\subsection{Visualization}
     213
     214
     215\subsection{FederatedSearch}
     216
     217\subsubsection{Standards}
     218
     219\begin{description}
     220\item[Z39.50/SRU/SRW/CQL] LoC
     221\item[OAI-PMH]
     222\end{description}
     223
     224
    166225\subsubsection{(Digital) Libraries}
    167226
    168 Digital Libraries
    169227
    170228General (Libraries, Federations):
     
    174232    world's biggest Library Federation
    175233\item[LoC] Library of Congress \url{http://www.loc.gov}
    176 \item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections_ en.htm}
     234\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_ en.htm}
    177235\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
    178236\end{description}
     
    187245\end{description}
    188246
    189 
    190 \subsection{Technologies}
    191247
    192248\subsubsection{(MD)search frameworks:}
     
    208264\end{description}
    209265
     266\subsection{Summary}
     267
     268\section{Definitions}
     269We want to clarify or lay down a few terms and definitions, i.e. explain our understanding of them.
     270
     271\begin{description}
     272\item[Concept]  sense, idea, a philosophical problem which we don't need to discuss here. For our purposes we say: the basic "entity" in an ontology? That of which an ontology is built.
     273\item[Ontology]  "an explicit specification of a conceptualization" [cite!], but for us mainly a collection of concepts, as opposed to a lexicon, which is a collection of words.
     274\item[Word]  a lexical unit, a word in a language, something that has a surface realization (writtenForm) and is a carrier of sense. So a relation holds: hasSense(Word, Concept)
     275\item[Lexicon]  a collection of words, a (lexical) vocabulary
     276\item[Vocabulary] an index providing mapping from Word (string) to Concept (uri)
     277\item[(Data)Category] (almost) the same as Concept; Things like "Topic", "Genre", "Organization", "ResourceType" are instantiations of Category
     278\item[ConceptualDomain] the Class of entities a Concept/Category denotes. For Organization it would be all (existing) organizations,  CD(ResourceType)={Corpus, Lexicon, Document, Image, Video, ...}. Entities of the domain can themselves be Categories (ResourceType:Image), but they can also be individuals (Organization: University of Vienna)
     279\item[Entity]
     280\item[Resource] informational resource, in the context of CLARIN-Project  mainly Language Resources (Corpus, Lexicon, Multimedia)
     281\item[Metadata Description] description of some properties of a resource.  MD-Record
     282\item[Schema] - CMD-Profile
     283\item[Annotation]
     284
     285
     286\end{description}
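
The relation hasSense(Word, Concept) and the reading of a Vocabulary as an index from strings to concept URIs can be sketched as a small data model. This is only one possible reading of the definitions above; the class and method names and the example.org URIs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A concept, identified by a URI (language-independent)."""
    uri: str


@dataclass(frozen=True)
class Word:
    """A lexical unit, identified here only by its surface realization."""
    written_form: str


@dataclass
class Vocabulary:
    """Index from written forms (strings) to the concepts they carry."""
    entries: dict = field(default_factory=dict)  # written form -> set of Concept

    def add_sense(self, word, concept):
        self.entries.setdefault(word.written_form, set()).add(concept)

    def has_sense(self, word, concept):
        """The relation hasSense(Word, Concept)."""
        return concept in self.entries.get(word.written_form, set())

    def concepts_for(self, written_form):
        """All concepts reachable from a string; empty set if the string is unknown."""
        return set(self.entries.get(written_form, set()))


# Hypothetical example: two words in different languages share one concept.
vocab = Vocabulary()
corpus = Concept("http://example.org/concept/Corpus")
vocab.add_sense(Word("corpus"), corpus)
vocab.add_sense(Word("Korpus"), corpus)
print(vocab.concepts_for("Korpus"))
```

Keeping Word and Concept as separate types mirrors the Lexical vs. Conceptual polarity: several written forms may point to the same URI, and one written form may carry several senses.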
    210287
    211288\section{Analysis}
     
    215292Describe situation regarding the datasets and formats
    216293
    217 collections, profiles/Terms
     294collections, profiles/Terms, ResourceTypes!
    218295
    219296DC, OLAC,
     
    223300\subsection{Infrastructure}
    224301
    225 CMDI
     302CMDI \cite{Broeder2010}
     303
     304
     305\subsection{Ontologies, Controlled Vocabularies, Knowledge Organizing Systems}
     306
     307
     308\subsubsection{Classification Schemes, Taxonomies }
     309LCSH, DDC
     310
     311
     312\subsubsection{Other controlled Vocabularies}
     313Tagsets: STTS
     314Language codes ISO-639-1
     315
     316\subsubsection{Domain Ontologies, Vocabularies}
     317Organization-Lists
     318LT-World !?
     319
    226320
    227321\subsection{Use Cases}
     
    242336
    243337
     338\subsection{Profiles to Data Categories}
     339CMD:Profile.Comp.Elem -> DatCat
     340
     341
     342\subsection{Semantic Relations between (Data)Categories}
     343
     344Relation Registry
     345
     346!check DCR-RR/Odijk2010 -follow up
     347!Cf. Erhard Hinrichs 2009
     348
     349
     350\subsection{Mapping from strings to Entities}
     351
     352Based on the textual values in the Metadata-descriptions find matching entities in selected Ontologies.
     353
     354Identify related ontologies:
     355LT-World \cite{Joerg2010}
     356
     357task:
     358\begin{enumerate}
     359\item  express MDRecords in RDF
     360\item  identify related ontologies/vocabularies (category -> vocabulary)
     361\item  implement (reuse) a lookup/mapping function (Vocabulary Alignment Service? CATCH-PLUS?)
     362
     363\fbox{  function lookup: Category x String -> ConceptualDomain}
     364
     365Normally this would be served by dedicated controlled vocabularies, but expect also some string-normalizing preprocessing etc.
     366\end{enumerate}
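
A minimal sketch of the function lookup: Category x String -> ConceptualDomain, including a simple string-normalizing preprocessing step. The vocabularies and the example.org URIs are hypothetical placeholders; a real implementation would presumably query dedicated controlled vocabularies or an external alignment service instead of an in-memory dictionary.

```python
import unicodedata


def normalize(value):
    """Case-fold, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())


# Hypothetical vocabularies: category -> {normalized string -> entity URI}.
VOCABULARIES = {
    "Organization": {
        "university of vienna": "http://example.org/org/univie",
        "universitat wien": "http://example.org/org/univie",
    },
    "ResourceType": {
        "corpus": "http://example.org/type/Corpus",
        "lexicon": "http://example.org/type/Lexicon",
    },
}


def lookup(category, value):
    """Map a string found under a category to an entity of its conceptual domain.

    Returns None when the category has no vocabulary or the string is unknown.
    """
    vocabulary = VOCABULARIES.get(category, {})
    return vocabulary.get(normalize(value))


print(lookup("Organization", "  Universität  Wien "))
print(lookup("ResourceType", "Video"))
```

Returning None for unknown strings keeps the literal value available to the caller, so unmapped metadata values are not lost.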
     367
     368
     369
     370\subsection{Semantic Search}
     371
     372The main purpose of the undertaking described in the previous two sections (mapping of concepts and entities) is to enhance the search capabilities of the MDService serving the metadata and resource data, namely by employing ontological resources.
     373Mainly, this enhancement shall mean that the user can access the data indirectly by browsing one or multiple ontologies,
     374with which the data will then be linked. These could be for example ontologies of Organizations and Projects.
     375
     376In this section we want to explore how this shall be accomplished, i.e. how to bring the enhanced capabilities to the user.
     377A crucial aspect is the question of how to deal with the even greater amount of information in a user-friendly way, i.e. how to avoid overwhelming, intimidating or frustrating the user.
    244378
    245379Semi-transparently means that the semantic mapping shall primarily integrate seamlessly into the interaction with the service, but it shall "explain" itself on demand, offering enough information for the user to understand its role and to manipulate it easily.
    246380
    247 
     381?
    248382Facets
    249383Controlled Vocabularies
    250384Synonym Expansion (via TermExtraction(ContentSet))
    251385
    252 Defining the Mapping - SKOS, Owl?
    253 
    254 Distinction between Metadata and content blurry
    255 
    256 \subsection{Metadata}
    257 CMD:Profile.Comp.Elem -> DatCat
    258 
     386\subsection{Linked Data - Express dataset in RDF}
     387
     388Partly as a by-product of the entity-mapping effort we will get the metadata descriptions rendered in RDF, linked with
     389So theoretically we then only need to provide them "on the web", to make them a nucleus of the LinkedData-Cloud.
     390
     391In practice this won't be that straightforward, as the mapping to entities will require a great deal of work.
     392But once that is solved, or for the subsets for which it is solved, the publication of that data on the "SemanticWeb" should be easy.
     393
     394Technical aspects (RDF-store?) / interface (ontology browser?)
     395
     396defining the Mapping:
     397\begin{enumerate}
     398\item convert to RDF
     399translate: MDRecord -> [\#mdrecord \#property literal]
     400\item map: \#mdrecord \#property literal  -> [\#mdrecord \#property \#entity]
     401\end{enumerate}
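
The two steps can be sketched with plain tuples standing in for RDF triples. This is an illustration under stated assumptions, not a serialization to an actual RDF syntax: the record layout, the identifiers and the toy lookup table are all hypothetical.

```python
def to_literal_triples(record_id, record):
    """Step 1: translate an MDRecord into (subject, property, literal) triples."""
    subject = f"#{record_id}"
    return [(subject, f"#{prop}", value) for prop, value in record.items()]


def map_to_entities(triples, lookup):
    """Step 2: replace a literal by an entity wherever the lookup finds one.

    Literals without a matching entity are kept unchanged.
    """
    mapped = []
    for subject, prop, literal in triples:
        entity = lookup(prop, literal)
        mapped.append((subject, prop, entity if entity is not None else literal))
    return mapped


# Hypothetical lookup table: (property, literal) -> entity identifier.
ENTITIES = {("#organization", "University of Vienna"): "#org-univie"}


def toy_lookup(prop, literal):
    return ENTITIES.get((prop, literal))


record = {"organization": "University of Vienna", "title": "Sample corpus"}
literal_triples = to_literal_triples("mdrecord1", record)
entity_triples = map_to_entities(literal_triples, toy_lookup)
print(entity_triples)
```

Separating the two steps means the literal triples can already be published for the whole dataset, while the entity mapping is applied only to those subsets for which it has been solved.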
    259402
    260403\subsection{Content/Annotation}
    261404AF + DCR + RR
    262405
     406
    263407\subsection{Visualization}
    264408Landscape, Treemap, SOM
     
    266410Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf
    267411
     412
    268413\section{System Design}
    269414SOA
     
    271416\subsection{Architecture}
    272417
    273 Makes use of mulitple Components of the established infrastructure (CLARIN ):
     418Makes use of multiple components of the established infrastructure (CLARIN) \cite{Varadi2008}, \cite{Broeder2010}:
    274419
    275420\begin{itemize}
     
    309454\subsection{Sample Queries}
    310455
     456candidate Categories:
     457ResourceType, Format
     458Genre, Topic
     459Project, Institution, Person, Publisher
     460
    311461\subsection{Usability}
    312462
    313 \section{Literature}
    314 
    315 \subsection{Standards}
    316 
    317 ISO 12620 - Data Category Registry
    318 
    319 LoC - SRU / CQL
    320 
    321 OAI-PMH
    322 
    323 LAF - Linguistic Annotation Framework
    324 
    325 \subsection{Books, Papers}
    326 A Formal Framework for Linguistic Annotation Steven Bird and Mark Liberman 2000
    327 
    328 Gerald Kowalski (2011): Information Retrieval Architecture and Algorithms
    329 
    330 Chowdhury, Gobinda G. : Introduction to modern information retrieval. - London : Facet Publ., 2010 (trocken)
    331 
    332 \subsection{DigLibs}
    333 
    334 http://publik.tuwien.ac.at/searchdb.php
    335 
    336 ISIWebOfKnowledge (/ + http://scientific.thomsonwebplus.com)
    337 
    338 \subsection{Journals, Conferences}
    339 
    340 ACM
    341 http://www.sigir.org/
    342 
    343 ECIR
    344 
    345 Cambridge Journals
    346 
    347 
    348 
    349 
    350 \bibliographystyle{plain}
    351 \bibliography{../lit/ir}
     463\section{Conclusions and Future Work}
     464
     465\section{Questions, Remarks}
     466
     467\begin{itemize}
     468\item How does this relate to federated search?
     469\item ontological vs. semasiological (semantic features: categorial/archisemes, differentiating, specifying)
     470\end{itemize}
     471
     472
     473\bibliographystyle{ieee}
     474\bibliography{../../../2bib/lingua,../../../2bib/ontolingua}
     475
    352476
    353477\end{document}