Changeset 2669
- Timestamp: 03/09/13 21:34:43 (11 years ago)
- Location: SMC4LRT
- Files: 4 added, 1 edited
SMC4LRT/Outline.tex
r1205 r2669 78 78 \tableofcontents 79 79 80 \section{Introduction} 81 82 Title: Semantic Mapping (Component) for Language Resources 83 84 \subsection{Main Goal} 85 86 We propose a component that shall enhance search functionality over a large heterogeneous collection of metadata descriptions of Language Resources and Technology (LRT). By applying semantic web technology the user shall be given both better recall through query expansion based on related categories/concepts and new means of exploring the dataset/knowledge-base via ontology-driven browsing. 87 88 A trivial example for a concept-based query expansion: 89 Confronted with a user query: \texttt{Actor.Name = Sue} and knowing that \texttt{Actor} is equivalent or similar to \texttt{Person} and \texttt{Name} is synonym to \texttt{FullName} the expanded query could look like: 90 \texttt{Actor.Name = Sue OR Actor.FullName = Sue OR Person.Name = Sue OR Person.FullName= is Sue} 91 92 Another example concerning instance mapping: the user looking for all resource produced by or linked to a given institution, does not have to guess or care for various spellings of the name of the institution used in the description of the resources, but rather can browse through a controlled vocabulary of institutions and see all the resources of given institution. While this could be achieved by simple normalizing of the literal-values (and indeed that definitely has to be one processing step), the linking to an ontology, enables to user to also continue browsing the ontology to find institutions that are related to the original institution by means of being concerned with similar topics and retrieve a union of resources for such resulting cluster. Thus in general the user is enabled to work with the data based on information that is not present in the original dataset. 
93 94 All these scenarios require a preprocessing step, that would produce the underlying linkage, both between categories/concepts and between instances (mapping literal values to entities). We refer to this task as semantic mapping, that shall be accomplished by coresponding "Semantic Mapping Component". In this work the focus lies on the process/method, i.e. on the specification and (prototypical) implementation of the component rather than trying to establish some final/accomplished mapping. Although a tentative/naive alignement on a subset of the data will be proposed, this will be mainly used for evaluation and shall serve as basis for discussion with domain experts aiming at creating the actual sensible mappings usable for real tasks. 95 96 Actually due to the great diversity of resources and research tasks such a "final" complete mapping/alignement does not seem achievable at all. Therefore also the focus shall be on "soft", dynamic mapping, investigating the possibilities/methods to enable the users to adapt the mapping or apply different mapping with respect to their current task or research question, 97 essentially being able to actively manipulate the recall/precision ratio of their searches. This entails the examination of user interaction with and visualization of the relevant information in the user interface and enabling the user to act upon it. 98 99 \subsection{Method} 100 We start with examining the existing Data and describing the evolving Infrastructure in which the components are to be embedded. 101 Then we formulate the task/function of Semantic Search on concept and on individuals level 102 and the underlying Semantic Mapping and the requirements within the defined context, 103 followed by a design proposal for an appropriate component fitting within the infrastructure. 104 especially with focus on the feasibility of employing ontology mapping and alignement techniques and tools for the creation of mappings. 
105 106 In a prototype we want to deliver a proof of the concept, 107 combined with an evaluation to verify the claims of fitness for the purpose. 108 This evaluation is twofold. It shall verify the ability of the system to support dynamic mapping based on a set of test queries 109 and secondly the usability of the ui-controls. 110 111 112 +? Identify hooks into LOD? 113 114 115 a) define/use semantic relations between categories (RelationRegistry) 116 b) employ ontological resources to enhance search in the dataset (SemanticSearch) 117 c) specify a translation instructions for expressing dataset in rdf (LinkedData) 118 119 120 \subsection{Expected Results} 121 122 The main result of this work will be a specification of the pair of components the Semantic Search and the underlying Semantic Mapping. This propositions will be supported by a proof-of-concept implementation of these components and an evaluation of querying the dataset comparing traditional search and semantic search. 123 124 One important by-product of the work will be the original dataset expressed as RDF with links into existing datasets/ontologies/knowledgebases, building a base for another nucleus of Linked Open Data. 
125 126 \begin{itemize} 127 \item [Specification] definition of a mapping mechanism 128 \item [Prototype] proof of concept implementation 129 \item [Evaluation] evaluation results of querying the dataset comparing traditional search and semantic search 130 \item [LinkedData] translation of the source dataset to RDF-based format with links into existing datasets/ontologies/knowledgebases 131 132 \end{itemize} 133 134 135 \subsection{State of the Art} 136 137 \begin{itemize} 138 \item VLO - Virtual Language Observatory \url{http://www.clarin.eu/vlo/}, \cite{VanUytvanck2010} 139 \item LT-World ontology-based \url{http://www.lt-world.org/}, \cite{Joerg2010} 140 \item VAS - Catch Plus 141 \item OAEI 142 \end{itemize} 143 144 \subsection{Keywords} 145 146 Metadata interoperability, Ontology Mapping, Schema mapping, Crosswalk, Similarity measures, LinkedData 147 Fuzzy Search, Visual Search? 148 149 Language Resources and Technology, LRT/NLP/HLT 150 151 Ontology Visualization 152 153 Federated Search, Distributed Content Search 154 (ILS - Integrated Library Systems) 155 156 157 \section{Related Work} 158 159 \subsection{Language Resources and Technology} 160 161 While in the Digital Libraries community a consolidation generally already happened and big federated networks of digital libary repository are set up, in the field of Language Resource and Technology the landscape is still scattered, although meanwhile looking back at a decade of standardizing efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming for one from the wide range of resource types additionally complicated by dependence of different schools of thought. 162 163 Need some number about the disparity in the field, number of institutes, resources, formats. 164 165 This situation has been identified by the community and multiple standardization initiatives had been conducted/undertaken. 
This process seems to have gained a new momentum thanks to large Research Infrastructure Programmes introduced by European Commission, aimed at fostering Research communities developing large-scale pan-european common infrastructures. One key player in this development is the project CLARIN. 166 167 \subsubsection{CLARIN} 168 169 CLARIN - Common Language Resource and Technology Infrastructure - constituted by over 180 members from round 38 countries. The mission of this project is 170 171 create a research infrastructure that makes language resources and technologies (LRT) available to scholars of all disciplines, especially SSH large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily useable 172 173 This shall be accomplished by setting up a federated network of centers (with federated identity management) but mainly providing resources and services in an agreed upon / coherent / uniform / consistent /standardized manner. The foundation for this goal shall be the Common or Component Metadata infrastructure, a model that caters for flexible metadata profiles, allowing to accomodate existing schemas. 174 175 The embedment in the CLARIN project brings about the context of Language Resources and HLT (Human Language Technology, aka NLP - Natural Language Processing) and SSH (Social Sciences and Humanities) as the primary target user-group of CLARIN. 
176 CLARIN/NLP for SSH 177 178 \subsubsection{Standards} 179 180 \begin{description} 181 \item[ISO12620] Data Category Registry 182 \item[LAF] Linguistic Annotation Framework 183 \item[CMDI] - (DC, OLAC, IMDI, TEI) 184 \end{description} 185 186 \subsubsection{NLP MD Catalogues} 187 188 \begin{description} 189 \item[LAT, TLA] - Language Archiving Technology, now The Language Archive - provided by Max Planck Insitute for Psycholinguistics \url{http://www.mpi.nl/research/research-projects/language-archiving-technology} 190 \item[OTA LR] Archiving Service provided by Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/} 191 \item[OLAC] 192 \item[ELRA] 193 \item[LDC] 194 \item[DFKI/LT-World] 195 \end{description} 196 197 \subsection{Ontologies} 198 199 \subsubsection{Word, Sense, Concept} 200 201 Lexicon vs. Ontology 202 Lexicon is a linguistic object an ontology is not.\cite{Hirst2009} We don't need to be that strict, but it shall be a guiding principle in this work to consider things (Datasets, Vocabularies, Resources) also along this dichotomy/polarity: Conceptual vs. Lexical. 203 And while every Ontology has to have a lexical representation (canonically: rdfs:label, rdfs:comment, skos:*label), if we don't try to force observed objects into a binary classification, but consider a bias spectrum, we should be able to locate these along this spectrum. 204 So the main focus of a typical ontology are the concepts ("conceptualization"), primarily language-independent. 205 206 A special case are Linguistic Ontologies: isocat, GOLD, WALS.info 207 ontologies conceptualizing the linguistic domain 208 209 They are special in that ("ontologized") Lexicons refer to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings. 
210 Lexicalized Ontologies: LingInfo, lemon: LMF + isocat/GOLD + Domain Ontology 211 212 a) as domain ontologies, describing aspects of the Resources\\ 213 b) as linguistic ontologies enriching the Lexicalization of Concepts 214 215 Ontology and Lexicon \cite{Hirst2009} 216 217 LingInfo/Lemon \cite{Buitelaar2009} 218 219 We shouldn't need linguistic ontologies (LingInfo, LEmon), they are primarily relevant in the task of ontology population from texts, where the entities can be encountered in various word-forms in the context of the text. 220 (Ontology Learning, Ontology-based Semantic Annotation of Text) 221 And we are dealing with highly structured data with referenced in their nominal(?) form. 222 223 Another special case are Controlled Vocabularies or Taxonomies/Classification Systems, let alone folksonomies, in that they identify terms and concepts/meanings, ie there is no explicit mapping between the language represenation and the concept, but rather the term is implicit carrier of the meaning/concept. 224 So for example in the LCSH the surface realization of each subject-heading at the same time identifies the Concept ~. 225 226 controlled vocabularies? 
227 228 229 230 \subsubsection{Semantic Web - Linked Data} 231 232 \begin{description} 233 \item[RDF/OWL] 234 \item[SKOS] 235 \end{description} 236 237 \subsubsection{OntologyMapping} 238 239 240 \subsection{Visualization} 241 242 243 \subsection{FederatedSearch} 244 245 \subsubsection{Standards} 246 247 \begin{description} 248 \item[Z39.50/SRU/SRW/CQL] LoC 249 \item[OAI-PMH] 250 \end{description} 251 252 253 \subsubsection{(Digital) Libraries} 254 255 256 General (Libraries, Federations): 257 258 \begin{description} 259 \item[OCLC] \url{http://www.oclc.org} 260 world's biggest Library Federation 261 \item[LoC] Library of Congress \url{http://www.loc.gov} 262 \item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_ en.htm} 263 \item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/} 264 \end{description} 265 266 \subsubsection{Content Repositories} 267 268 \begin{description} 269 \item[PHAIDRA] Permanent Hosting, Archiving and Indexing of Digital Resources and Assets, provided by Vienna University \url{https://phaidra.univie.ac.at/} 270 \item[eSciDoc] provided by MPG + FIZ Karlsruhe \url{https://www.escidoc.org/} 271 \item[DRIVER] pan-European infrastructure of Digital Repositories \url{http://www.driver-repository.eu/} 272 \item[OpenAIRE] - Open Acces Infrastructure for Research in Europe \url{http://www.openaire.eu/} 273 \end{description} 274 275 276 \subsubsection{(MD)search frameworks:} 277 278 \begin{description} 279 \item[Zebra/Z39.50] JZKit 280 \item[Lucene/Solr] 281 \item[eXist] - xml DB 282 \end{description} 283 284 \subsubsection{Content/Corpus Search} 285 Corpus Search Systems 286 \begin{description} 287 \item[DDC] - text-corpus 288 \item[manatee] - text-corpus 289 \item[CQP] - text-corps 290 \item[TROVA] - MM annotated resources 291 \item[ELAN] - MM annotated resources (editor + search) 292 \end{description} 293 294 \subsection{Summary} 80 
\include{Introduction} 81 82 83 \include{Literature} 84 295 85 296 86 \section{Definitions} 297 We want to clarify or lay dow hn a few terms and definition, ie explanation of our understanding 87 We want to clarify or lay down a few terms and definition, ie explanation of our understanding 298 88 299 89 \begin{description} … … 445 235 446 236 447 \section{System Design} 448 SOA 449 450 \subsection{Architecture} 451 452 Makes use of mulitple Components of the established infrastructure (CLARIN ) \cite{Varadi2008}, \cite{Broeder2010}: 453 454 \begin{itemize} 455 \item Data Category REgistry, 456 \item Relation Registry 457 \item Component Registry 458 \item Vocabulary Alignement Service 459 \end{itemize} 460 merging the pieces of information provided by those, 461 offering them semi-transaprently to the user (or application) on the consumption side. 462 463 464 \subsection{CMDI} 465 466 MDBrowser 467 MDService 468 469 \subsection{Query Language} 470 CQL? 471 472 \subsection{User Interface} 473 474 \subsubsection{Query Input} 475 476 \subsubsection{Columns} 477 478 \subsubsection{Summaries} 479 480 \subsubsection{Differential Views} 481 Visualize impact of given mapping in terms of covered dataset (number of matched records). 482 483 \section{Evaluation} 484 485 \subsection{Research Questions } 486 487 488 \subsection{Sample Queries} 489 490 candidate Categories: 491 ResourceType, Format 492 Genre, Topic 493 Project, Institution, Person, Publisher 494 495 \subsection{Usability} 496 497 \section{Conclusions and Futur Work} 237 238 \include{System} 239 240 \include{Evaluation} 241 242 243 \section{Conclusions and Future Work} 498 244 499 245 \section{Questions, Remarks}
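Illustration (not part of the changeset): the introduction text moved out of Outline.tex describes a concept-based query expansion, where \texttt{Actor.Name = Sue} is expanded using the knowledge that \texttt{Actor} is equivalent or similar to \texttt{Person} and \texttt{Name} is a synonym of \texttt{FullName}. A minimal Python sketch of that idea follows. The \texttt{EQUIVALENTS} table and the \texttt{expand\_query} function are hypothetical names introduced here for illustration only; the outline does not specify how the component stores or applies its relations.

```python
from itertools import product

# Hypothetical relation table. The outline only states that Actor ~ Person
# and Name ~ FullName; a real component would obtain such relations from
# its mapping source rather than from a hard-coded dictionary.
EQUIVALENTS = {
    "Actor": ["Person"],
    "Name": ["FullName"],
}


def expand_query(path, value, equivalents=EQUIVALENTS):
    """Return an OR-joined query covering every combination of equivalent categories."""
    # Each path segment expands to itself followed by its known equivalents.
    alternatives = [[seg] + equivalents.get(seg, []) for seg in path.split(".")]
    # The cartesian product enumerates all combinations of segment alternatives.
    clauses = [f"{'.'.join(combo)} = {value}" for combo in product(*alternatives)]
    return " OR ".join(clauses)


print(expand_query("Actor.Name", "Sue"))
```

With the table above, the call yields the four clauses given in the outline's example, in the same order: the original \texttt{Actor.Name}, then \texttt{Actor.FullName}, \texttt{Person.Name} and \texttt{Person.FullName}. A category without known equivalents is passed through unchanged.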