source: SMC4LRT/Literature.tex @ 2672

Last change on this file since 2672 was 2671, checked in by vronk, 11 years ago

mostly outsourcing individual chapters to separate tex-files

\section{Previous Work}
\label{SotA}

\subsection*{Infrastructure Components}
In recent years, multiple large-scale initiatives have set out to combat the fragmented nature of the language resources landscape in general and the metadata interoperability problems in particular. A comprehensive architecture for harmonized handling of metadata -- the Component Metadata Infrastructure (CMDI)\footnote{\url{http://www.clarin.eu/cmdi}} \cite{Broeder+2011} -- is being implemented within the CLARIN project\footnote{\url{http://clarin.eu}}. This service-oriented architecture, consisting of a number of interacting software modules, allows metadata creation and provision based on a flexible meta model, the \emph{Component Metadata Framework}. The framework facilitates the creation of customized metadata schemas, acknowledging that no single metadata schema can cover the large variety of language resources and usage scenarios. At the same time, it is equipped with well-defined methods to ground the semantic interpretation of these schemas in a community-wide controlled vocabulary, the data category registry \cite{Kemps-Snijders+2009,Broeder+2010}.

Individual components of this infrastructure will be described in more detail in section \ref{components}.


\subsection*{LRT Resources}
The CLARIN project also delivers a valuable source of information on the normative resources in the domain in its current deliverable on \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3}. Besides covering ontologies as one type of resource, this document offers an exhaustive collection of references to standards, vocabularies and other normative or standardization work in the field of Language Resources and Technology.

Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology and being developed at DFKI\footnote{\textit{Deutsches Forschungszentrum für Künstliche Intelligenz} - \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (institutions, persons, projects, tools, etc.) in this field of study \cite{Joerg2010}.

\subsection*{Ontology Mapping}
As the main contribution shall be the application of \emph{ontology mapping} techniques and technology, a comprehensive overview of this field and its current developments is paramount. There seems to be a plethora of work on the topic, and the difficult task will be to single out the relevant contributions. The starting point for the investigation will be the overview of the field by Kalfoglou \cite{Kalfoglou2003} and a more recent summary of the key challenges by Shvaiko and Euzenat \cite{Shvaiko2008}.

In their rather theoretical work, Ehrig and Sure \cite{EhrigSure2004} elaborate on the various similarity measures that are at the core of the mapping task. On the dedicated platform OAEI\footnote{Ontology Alignment Evaluation Initiative - \url{http://oaei.ontologymatching.org/}}, an ongoing effort to compare various alignment methods applied to different domains is being carried out and documented.
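As a rough illustration of how such similarity measures are typically combined (a generic sketch, not a formula taken from the cited work), assume $n$ individual measures $\mathit{sim}_k$, e.g. on labels, properties or taxonomic position, each yielding a value in $[0,1]$ for a pair of entities $e_1$ and $e_2$ from the two ontologies. With assumed weights $w_k$, an aggregated similarity could be computed as
\[
  \mathit{sim}(e_1, e_2) = \sum_{k=1}^{n} w_k \, \mathit{sim}_k(e_1, e_2),
  \qquad w_k \geq 0, \quad \sum_{k=1}^{n} w_k = 1 ,
\]
and the pair would be proposed as a mapping candidate if $\mathit{sim}(e_1, e_2) \geq t$ for some threshold $t \in [0,1]$. The symbols $n$, $w_k$ and $t$ are introduced here only for illustration; suitable weights and thresholds depend on the ontologies at hand.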

Another, more specific, recent source of inspiration is the work of Noah et al. \cite{Noah2010}, who developed a semantic digital library for an academic institution. Its scope is limited to document collections, but many aspects nevertheless seem very relevant for this work, such as operating on document metadata, ontology population, and sophisticated querying and searching.

\subsection*{Linked Open Data}
As described previously, one outcome of the work will be a dataset expressed in RDF and interlinked with other semantic resources.
This is very much in line with the broad \textit{Linked Open Data} effort as proposed by Berners-Lee \cite{TimBL2006} and being pursued across many disciplines. (This topic is also supported by the EU Commission within FP7.\footnote{\url{http://cordis.europa.eu/fetch?CALLER=PROJ\_ICT&ACTION=D&CAT=PROJ&RCN=95562}}) A very recent comprehensive overview of the principles of Linked Data and its current applications is the book by Heath and Bizer \cite{HeathBizer2011}, which shall serve as a practical guide for this specific task.

----------------------------

\subsection{Language Resources and Technology}

While the Digital Libraries community has by and large already consolidated, and large federated networks of digital library repositories have been set up, the landscape in the field of Language Resources and Technology is still scattered, even though it can by now look back on a decade of standardization efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming on the one hand from the wide range of resource types, and further complicated by the dependence on different schools of thought.

Need some numbers about the disparity in the field: number of institutes, resources, formats.

This situation has been recognized by the community, and multiple standardization initiatives have been undertaken. This process seems to have gained new momentum thanks to the large Research Infrastructure Programmes introduced by the European Commission, aimed at fostering research communities that develop large-scale pan-European common infrastructures. One key player in this development is the project CLARIN.

\subsubsection{CLARIN}

CLARIN (Common Language Resources and Technology Infrastructure) is constituted by over 180 members from around 38 countries. The mission of this project is to

\begin{quote}
create a research infrastructure that makes language resources and technologies (LRT) available to scholars of all disciplines, especially SSH; a large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily usable
\end{quote}

This shall be accomplished by setting up a federated network of centers (with federated identity management), but mainly by providing resources and services in an agreed-upon, consistent and standardized manner. The foundation for this goal shall be the Component Metadata Infrastructure, a model that caters for flexible metadata profiles and thus allows existing schemas to be accommodated.

The embedding in the CLARIN project brings with it the context of Language Resources and HLT (Human Language Technology, also known as NLP, Natural Language Processing), with SSH (Social Sciences and Humanities) as the primary target user group of CLARIN.
CLARIN/NLP for SSH

\subsubsection{Standards}

\begin{description}
\item[ISO 12620] Data Category Registry
\item[LAF] Linguistic Annotation Framework
\item[CMDI] - (DC, OLAC, IMDI, TEI)
\end{description}

\subsubsection{NLP MD Catalogues}

\begin{description}
\item[LAT, TLA] - Language Archiving Technology, now The Language Archive - provided by Max Planck Institute for Psycholinguistics \url{http://www.mpi.nl/research/research-projects/language-archiving-technology}
\item[OTA LR] Archiving Service provided by Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/}
\item[OLAC]
\item[ELRA]
\item[LDC]
\item[DFKI/LT-World]
\end{description}

\subsection{Ontologies}

\subsubsection{Word, Sense, Concept}

Lexicon vs. Ontology
A lexicon is a linguistic object; an ontology is not \cite{Hirst2009}. We need not be that strict, but it shall be a guiding principle in this work to consider things (datasets, vocabularies, resources) also along this dichotomy: conceptual vs. lexical.
And while every ontology has to have a lexical representation (canonically: rdfs:label, rdfs:comment, skos:*label), if we do not try to force the observed objects into a binary classification but rather consider a spectrum of bias, we should be able to locate them along this spectrum.
So the main focus of a typical ontology is on the concepts (``conceptualization''), which are primarily language-independent.

A special case are linguistic ontologies (ISOcat, GOLD, WALS.info):
ontologies conceptualizing the linguistic domain.

They are special in that (``ontologized'') lexicons refer to them to describe linguistic properties of the lexical entries, as opposed to linking to domain ontologies to anchor senses/meanings.
Lexicalized Ontologies: LingInfo, lemon: LMF + ISOcat/GOLD + Domain Ontology

a) as domain ontologies, describing aspects of the resources\\
b) as linguistic ontologies, enriching the lexicalization of concepts

Ontology and Lexicon \cite{Hirst2009}

LingInfo/lemon \cite{Buitelaar2009}

We should not need linguistic ontologies (LingInfo, lemon); they are primarily relevant for the task of ontology population from texts, where the entities can be encountered in various word forms in the context of the text
(Ontology Learning, Ontology-based Semantic Annotation of Text).
We, in contrast, are dealing with highly structured data, with entities referenced in their nominal(?) form.

Another special case are controlled vocabularies or taxonomies/classification systems, let alone folksonomies, in that they equate terms and concepts/meanings, i.e. there is no explicit mapping between the language representation and the concept; rather, the term is the implicit carrier of the meaning/concept.
So, for example, in the LCSH the surface realization of each subject heading at the same time identifies the concept.

controlled vocabularies?



\subsubsection{Semantic Web - Linked Data}

\begin{description}
\item[RDF/OWL]
\item[SKOS]
\end{description}

\subsubsection{Ontology Mapping}


\subsection{Visualization}


\subsection{Federated Search}

\subsubsection{Standards}

\begin{description}
\item[Z39.50/SRU/SRW/CQL] LoC
\item[OAI-PMH]
\end{description}


\subsubsection{(Digital) Libraries}


General (Libraries, Federations):

\begin{description}
\item[OCLC] \url{http://www.oclc.org}
    world's biggest library federation
\item[LoC] Library of Congress \url{http://www.loc.gov}
\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_en.htm}
\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
\end{description}

\subsubsection{Content Repositories}

\begin{description}
\item[PHAIDRA] Permanent Hosting, Archiving and Indexing of Digital Resources and Assets, provided by the University of Vienna \url{https://phaidra.univie.ac.at/}
\item[eSciDoc] provided by MPG + FIZ Karlsruhe \url{https://www.escidoc.org/}
\item[DRIVER] pan-European infrastructure of Digital Repositories \url{http://www.driver-repository.eu/}
\item[OpenAIRE] - Open Access Infrastructure for Research in Europe \url{http://www.openaire.eu/}
\end{description}


\subsubsection{(MD) Search Frameworks}

\begin{description}
\item[Zebra/Z39.50] JZKit
\item[Lucene/Solr]
\item[eXist] - XML DB
\end{description}

\subsubsection{Content/Corpus Search}
Corpus Search Systems
\begin{description}
\item[DDC] - text-corpus
\item[manatee] - text-corpus
\item[CQP] - text-corpus
\item[TROVA] - MM annotated resources
\item[ELAN] - MM annotated resources (editor + search)
\end{description}

\subsection{Summary}
