Changeset 2669


Timestamp:
03/09/13 21:34:43 (11 years ago)
Author:
vronk
Message:

sections in separate files

Location:
SMC4LRT
Files:
4 added
1 edited

  • SMC4LRT/Outline.tex

    r1205 r2669  
    7878\tableofcontents
    7979
    80 \section{Introduction}
    81 
    82 Title: Semantic Mapping (Component) for Language Resources
    83 
    84 \subsection{Main Goal}
    85 
    86 We propose a component that shall enhance search functionality over a large heterogeneous collection of metadata descriptions of Language Resources and Technology (LRT). By applying semantic web technology the user shall be given both better recall through query expansion based on related categories/concepts and new means of exploring the dataset/knowledge-base via ontology-driven browsing.
    87 
    88 A trivial example for a concept-based query expansion:
    89 Confronted with a user query \texttt{Actor.Name = Sue}, and knowing that \texttt{Actor} is equivalent or similar to \texttt{Person} and that \texttt{Name} is a synonym of \texttt{FullName}, the expanded query could look like:
    90 \texttt{Actor.Name = Sue OR Actor.FullName = Sue OR Person.Name = Sue OR Person.FullName = Sue}
    91 
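The expansion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the component's actual implementation: the equivalence table \texttt{EQUIVALENTS} and the function \texttt{expand\_query} are hypothetical names, and the table is hard-coded here, whereas in the proposed setting it would be derived from the semantic mapping.

```python
from itertools import product

# Hypothetical equivalence table (an assumption for illustration):
# each category maps to the list of names treated as interchangeable.
EQUIVALENTS = {
    "Actor": ["Actor", "Person"],
    "Name": ["Name", "FullName"],
}


def expand_query(path, value, equivalents=EQUIVALENTS):
    """Expand a dotted path such as 'Actor.Name' into an OR of all equivalent paths."""
    parts = path.split(".")
    # For every path step, collect its alternatives (or keep the step itself).
    alternatives = [equivalents.get(part, [part]) for part in parts]
    # Build one clause per combination of alternatives.
    clauses = [f"{'.'.join(combo)} = {value}" for combo in product(*alternatives)]
    return " OR ".join(clauses)


print(expand_query("Actor.Name", "Sue"))
# Actor.Name = Sue OR Actor.FullName = Sue OR Person.Name = Sue OR Person.FullName = Sue
```

A path step without an entry in the table is left unchanged, so a query over unmapped categories degrades to the original query.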
    92 Another example concerning instance mapping: a user looking for all resources produced by or linked to a given institution does not have to guess or care about the various spellings of the institution's name used in the resource descriptions, but can instead browse a controlled vocabulary of institutions and see all the resources of a given institution. While this could be achieved by simply normalizing the literal values (and indeed that has to be one processing step), linking to an ontology also enables the user to continue browsing the ontology to find institutions related to the original one by being concerned with similar topics, and to retrieve the union of resources for the resulting cluster. Thus, in general, the user is enabled to work with the data based on information that is not present in the original dataset.
    93 
    94 All these scenarios require a preprocessing step that produces the underlying linkage, both between categories/concepts and between instances (mapping literal values to entities). We refer to this task as semantic mapping, which shall be accomplished by a corresponding "Semantic Mapping Component". In this work the focus lies on the process/method, i.e. on the specification and (prototypical) implementation of the component, rather than on trying to establish some final/accomplished mapping. Although a tentative/naive alignment on a subset of the data will be proposed, it will mainly be used for evaluation and shall serve as a basis for discussion with domain experts, aiming at creating the actual sensible mappings usable for real tasks.
    95 
    96 In fact, due to the great diversity of resources and research tasks, such a "final" complete mapping/alignment does not seem achievable at all. Therefore the focus shall also be on "soft", dynamic mapping, investigating the possibilities/methods to enable users to adapt the mapping or apply different mappings with respect to their current task or research question,
    97 essentially being able to actively manipulate the recall/precision ratio of their searches. This entails examining user interaction with, and visualization of, the relevant information in the user interface, and enabling the user to act upon it.
    98 
    99 \subsection{Method}
    100 We start by examining the existing data and describing the evolving infrastructure in which the components are to be embedded.
    101 Then we formulate the task/function of Semantic Search on the concept level and on the level of individuals,
    102 and the underlying Semantic Mapping and the requirements within the defined context,
    103 followed by a design proposal for an appropriate component fitting within the infrastructure,
    104 with a particular focus on the feasibility of employing ontology mapping and alignment techniques and tools for the creation of mappings.
    105 
    106 In a prototype we want to deliver a proof of the concept,
    107 combined with an evaluation to verify the claims of fitness for the purpose.
    108 This evaluation is twofold. It shall verify the ability of the system to support dynamic mapping based on a set of test queries
    109 and secondly the usability of the ui-controls.
    110 
    111 
    112 +? Identify hooks into LOD?
    113 
    114 
    115 a) define/use semantic relations between categories (RelationRegistry)
    116 b) employ ontological resources to enhance search in the dataset (SemanticSearch)
    117 c) specify translation instructions for expressing the dataset in RDF (LinkedData)
    118 
    119 
    120 \subsection{Expected Results}
    121 
    122 The main result of this work will be a specification of a pair of components: the Semantic Search and the underlying Semantic Mapping. These propositions will be supported by a proof-of-concept implementation of these components and an evaluation of querying the dataset, comparing traditional search and semantic search.
    123 
    124 One important by-product of the work will be the original dataset expressed as RDF with links into existing datasets/ontologies/knowledgebases, building a base for another nucleus of Linked Open Data.
    125 
    126 \begin{itemize}
    127 \item [Specification] definition of a mapping mechanism
    128 \item [Prototype] proof of concept implementation
    129 \item [Evaluation] evaluation results of querying the dataset comparing traditional search and semantic search
    130 \item [LinkedData] translation of the source dataset to RDF-based format with links into existing datasets/ontologies/knowledgebases
    131 
    132 \end{itemize}
    133 
    134 
    135 \subsection{State of the Art}
    136 
    137 \begin{itemize}
    138 \item VLO - Virtual Language Observatory  \url{http://www.clarin.eu/vlo/}, \cite{VanUytvanck2010}
    139 \item LT-World ontology-based \url{http://www.lt-world.org/}, \cite{Joerg2010}
    140 \item VAS - Catch Plus
    141 \item OAEI
    142 \end{itemize}
    143 
    144 \subsection{Keywords}
    145 
    146 Metadata interoperability, Ontology Mapping, Schema mapping, Crosswalk, Similarity measures, LinkedData
    147 Fuzzy Search, Visual Search?
    148 
    149 Language Resources and Technology, LRT/NLP/HLT
    150 
    151 Ontology Visualization
    152 
    153 Federated Search, Distributed Content Search
    154 (ILS - Integrated Library Systems)
    155 
    156 
    157 \section{Related Work}
    158 
    159 \subsection{Language Resources and Technology}
    160 
    161 While a consolidation has generally already happened in the Digital Libraries community, and big federated networks of digital library repositories are set up, the landscape in the field of Language Resources and Technology is still scattered, despite a decade of standardization efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types and additionally complicated by the dependence on different schools of thought.
    162 
    163 Need some number about the disparity in the field, number of institutes, resources, formats.
    164 
    165 This situation has been identified by the community and multiple standardization initiatives have been undertaken. This process seems to have gained new momentum thanks to large Research Infrastructure Programmes introduced by the European Commission, aimed at fostering research communities developing large-scale pan-European common infrastructures. One key player in this development is the project CLARIN.
    166 
    167 \subsubsection{CLARIN}
    168 
    169 CLARIN (Common Language Resource and Technology Infrastructure) is constituted by over 180 members from around 38 countries. The mission of this project is
    170 
    171     create a research infrastructure that makes language resources and technologies (LRT) available to scholars of all disciplines, especially SSH large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily useable
    172 
    173 This shall be accomplished by setting up a federated network of centers (with federated identity management), but mainly by providing resources and services in an agreed upon / coherent / uniform / consistent / standardized manner. The foundation for this goal shall be the Common or Component Metadata infrastructure, a model that caters for flexible metadata profiles, allowing it to accommodate existing schemas.
    174 
    175 The embedment in the CLARIN project brings about the context of Language Resources and HLT (Human Language Technology, aka NLP - Natural Language Processing) and SSH (Social Sciences and Humanities) as the primary target user-group of CLARIN.
    176 CLARIN/NLP for SSH
    177 
    178 \subsubsection{Standards}
    179 
    180 \begin{description}
    181 \item[ISO12620] Data Category Registry
    182 \item[LAF] Linguistic Annotation Framework
    183 \item[CMDI] - (DC, OLAC, IMDI, TEI)
    184 \end{description}
    185 
    186 \subsubsection{NLP MD Catalogues}
    187 
    188 \begin{description}
    189 \item[LAT, TLA] - Language Archiving Technology, now The Language Archive - provided by Max Planck Institute for Psycholinguistics \url{http://www.mpi.nl/research/research-projects/language-archiving-technology}
    190 \item[OTA LR] Archiving Service provided by Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/}
    191 \item[OLAC]
    192 \item[ELRA]
    193 \item[LDC]
    194 \item[DFKI/LT-World]
    195 \end{description}
    196 
    197 \subsection{Ontologies}
    198 
    199 \subsubsection{Word, Sense, Concept}
    200 
    201 Lexicon vs. Ontology
    202 A lexicon is a linguistic object; an ontology is not \cite{Hirst2009}. We don't need to be that strict, but it shall be a guiding principle in this work to consider things (Datasets, Vocabularies, Resources) also along this dichotomy/polarity: Conceptual vs. Lexical.
    203 And while every Ontology has to have a lexical representation (canonically: rdfs:label, rdfs:comment, skos:*label), if we don't try to force observed objects into a binary classification, but consider a bias spectrum, we should be able to locate these along this spectrum.
    204 So the main focus of a typical ontology are the concepts ("conceptualization"), primarily language-independent.
    205 
    206 A special case are Linguistic Ontologies: isocat, GOLD, WALS.info
    207 ontologies conceptualizing the linguistic domain
    208 
    209 They are special in that ("ontologized") Lexicons refer to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings.
    210 Lexicalized Ontologies: LingInfo, lemon: LMF +  isocat/GOLD +  Domain Ontology
    211 
    212 a) as domain ontologies, describing aspects of the Resources\\
    213 b) as linguistic ontologies enriching the Lexicalization of Concepts
    214 
    215 Ontology and Lexicon \cite{Hirst2009}
    216 
    217 LingInfo/Lemon \cite{Buitelaar2009}
    218 
    219 We shouldn't need linguistic ontologies (LingInfo, lemon); they are primarily relevant for the task of ontology population from texts, where the entities can be encountered in various word-forms in the context of the text
    220 (Ontology Learning, Ontology-based Semantic Annotation of Text).
    221 In contrast, we are dealing with highly structured data, with entities referenced in their nominal(?) form.
    222 
    223 Another special case are Controlled Vocabularies or Taxonomies/Classification Systems, let alone folksonomies, in that they identify terms and concepts/meanings, i.e. there is no explicit mapping between the language representation and the concept; rather, the term is the implicit carrier of the meaning/concept.
    224 So for example in the LCSH the surface realization of each subject-heading at the same time identifies the Concept ~.
    225 
    226 controlled vocabularies?
    227 
    228 
    229 
    230 \subsubsection{Semantic Web - Linked Data}
    231 
    232 \begin{description}
    233 \item[RDF/OWL]
    234 \item[SKOS]
    235 \end{description}
    236 
    237 \subsubsection{OntologyMapping}
    238 
    239 
    240 \subsection{Visualization}
    241 
    242 
    243 \subsection{FederatedSearch}
    244 
    245 \subsubsection{Standards}
    246 
    247 \begin{description}
    248 \item[Z39.50/SRU/SRW/CQL] LoC
    249 \item[OAI-PMH]
    250 \end{description}
    251 
    252 
    253 \subsubsection{(Digital) Libraries}
    254 
    255 
    256 General (Libraries, Federations):
    257 
    258 \begin{description}
    259 \item[OCLC] \url{http://www.oclc.org}
    260     world's biggest Library Federation
    261 \item[LoC] Library of Congress \url{http://www.loc.gov}
    262 \item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_en.htm}
    263 \item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
    264 \end{description}
    265 
    266 \subsubsection{Content Repositories}
    267 
    268 \begin{description}
    269 \item[PHAIDRA] Permanent Hosting, Archiving and Indexing of Digital Resources and Assets, provided by Vienna University \url{https://phaidra.univie.ac.at/}
    270 \item[eSciDoc]  provided by MPG + FIZ Karlsruhe \url{https://www.escidoc.org/}
    271 \item[DRIVER] pan-European infrastructure of Digital Repositories \url{http://www.driver-repository.eu/}
    272 \item[OpenAIRE] - Open Access Infrastructure for Research in Europe \url{http://www.openaire.eu/}
    273 \end{description}
    274 
    275 
    276 \subsubsection{(MD)search frameworks:}
    277 
    278 \begin{description}
    279 \item[Zebra/Z39.50] JZKit
    280 \item[Lucene/Solr]
    281 \item[eXist] - xml DB
    282 \end{description}
    283 
    284 \subsubsection{Content/Corpus Search}
    285 Corpus Search Systems
    286 \begin{description}
    287 \item[DDC]  - text-corpus
    288 \item[manatee] - text-corpus
    289 \item[CQP] - text-corpus
    290 \item[TROVA] - MM annotated resources
    291 \item[ELAN] - MM annotated resources (editor + search)
    292 \end{description}
    293 
    294 \subsection{Summary}
     80\include{Introduction}
     81
     82
     83\include{Literature}
     84
    29585
    29686\section{Definitions}
    297 We want to clarify or lay dowhn a few terms and definition, ie explanation of our understanding
     87We want to clarify or lay down a few terms and definition, ie explanation of our understanding
    29888
    29989\begin{description}
     
    445235
    446236
    447 \section{System Design}
    448 SOA
    449 
    450 \subsection{Architecture}
    451 
    452 Makes use of multiple components of the established infrastructure (CLARIN) \cite{Varadi2008}, \cite{Broeder2010}:
    453 
    454 \begin{itemize}
    455 \item Data Category Registry
    456 \item Relation Registry
    457 \item Component Registry
    458 \item Vocabulary Alignment Service
    459 \end{itemize}
    460 merging the pieces of information provided by those,
    461 offering them semi-transparently to the user (or application) on the consumption side.
    462 
    463 
    464 \subsection{CMDI}
    465 
    466 MDBrowser
    467 MDService
    468 
    469 \subsection{Query Language}
    470 CQL?
    471 
    472 \subsection{User Interface}
    473 
    474 \subsubsection{Query Input}
    475 
    476 \subsubsection{Columns}
    477 
    478 \subsubsection{Summaries}
    479 
    480 \subsubsection{Differential Views}
    481 Visualize impact of given mapping in terms of covered dataset (number of matched records).
    482 
    483 \section{Evaluation}
    484 
    485 \subsection{Research Questions }
    486 
    487 
    488 \subsection{Sample Queries}
    489 
    490 candidate Categories:
    491 ResourceType, Format
    492 Genre, Topic
    493 Project, Institution, Person, Publisher
    494 
    495 \subsection{Usability}
    496 
    497 \section{Conclusions and Futur Work}
     237
     238\include{System}
     239
     240\include{Evaluation}
     241
     242
     243\section{Conclusions and Future Work}
    498244
    499245\section{Questions, Remarks}