% source: SMC4LRT/Literature.tex @ 2669
% Last change on this file since 2669 was 2669, checked in by vronk, 11 years ago:
% sections in separate files

\section{Previous Work}
\label{SotA}

\subsection*{Infrastructure Components}
Multiple relevant activities are being carried out in the context of research infrastructure initiatives for LRT. The most relevant ongoing effort is the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}} \cite{VanUytvanck2010}, being developed within the CLARIN project. This application operates on roughly the same collection of data as is discussed in this work; however, it employs a faceted search, manually mapping the appropriate metadata fields from the different schemas to 8 fixed facets. Although this is a very reductionist approach, it is certainly a great starting point, offering a core set of categories together with an initial set of category mappings.

\texttt{Component Registry} and \texttt{ISOcat}\footnote{\url{http://www.isocat.org/}}
are two integral components of the \textit{CLARIN Metadata Infrastructure} that maintain the normative information. \texttt{ISOcat} in particular (the ISO-standardized Data Category Registry for registering and maintaining \texttt{Data Categories} as globally agreed-upon incarnations of concepts in the domain of discourse) is the definitive primary reference vocabulary \cite{Broeder2010,ISO12620:2009}. Closely related is the work on the so-called \texttt{Relation Registry}, a separate component that allows arbitrary relations between data categories to be defined; however, this activity is still in an early prototype phase.

A last relevant initiative to mention is the \texttt{Vocabulary Alignment Service} being developed and run within the Dutch program CATCH\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}}, which serves as a neutral manager and provider of controlled vocabularies. There are plans to reuse or enhance this service for the needs of the CLARIN project.

\noindent
All these components are running services that this work shall directly build upon.

\subsection*{LRT Resources}
The CLARIN project also delivers a valuable source of information on the normative resources in the domain in its current deliverable on \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3}. In addition to covering ontologies as one type of resource, this document offers an exhaustive collection of references to standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology.

Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{\textit{Deutsches Forschungszentrum für Künstliche Intelligenz} - \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study \cite{Joerg2010}.

\subsection*{Ontology Mapping}
As the main contribution shall be the application of \emph{ontology mapping} techniques and technology, a comprehensive overview of this field and its current developments is paramount. There is a plethora of work on the topic, and the difficult task will be to sort out the relevant contributions. The starting point for the investigation will be the overview of the field by Kalfoglou \cite{Kalfoglou2003} and a more recent summary of the key challenges by Shvaiko and Euzenat \cite{Shvaiko2008}.

In their rather theoretical work, Ehrig and Sure \cite{EhrigSure2004} elaborate on the various similarity measures that are at the core of the mapping task. On the dedicated platform OAEI\footnote{Ontology Alignment Evaluation Initiative - \url{http://oaei.ontologymatching.org/}}, an ongoing effort compares various alignment methods applied to different domains and documents the results.

A more specific, recent and inspiring work is that of Noah et al. \cite{Noah2010}, who develop a semantic digital library for an academic institution. The scope is limited to document collections, but many aspects nevertheless seem very relevant for this work, such as operating on document metadata, ontology population, and sophisticated querying and searching.

\subsection*{Linked Open Data}
As described previously, one outcome of the work will be the dataset expressed in RDF and interlinked with other semantic resources.
This is very much in line with the broad \textit{Linked Open Data} effort as proposed by Berners-Lee \cite{TimBL2006} and being pursued across many disciplines. (This topic is also supported by the EU Commission within the FP7.\footnote{\url{http://cordis.europa.eu/fetch?CALLER=PROJ\_ICT&ACTION=D&CAT=PROJ&RCN=95562}}) A very recent comprehensive overview of the principles of Linked Data and current applications is the book by Heath and Bizer \cite{HeathBizer2011}, which shall serve as a practical guide for this specific task.

----------------------------

\subsection{Language Resources and Technology}

While the Digital Libraries community has generally already consolidated and large federated networks of digital library repositories have been set up, the landscape in the field of Language Resources and Technology is still scattered, despite a decade of standardization efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, stemming from the wide range of resource types and further complicated by the dependence on different schools of thought.

Need some numbers about the disparity in the field: number of institutes, resources, formats.

This situation has been identified by the community, and multiple standardization initiatives have been undertaken. This process seems to have gained new momentum thanks to large Research Infrastructure Programmes introduced by the European Commission, aimed at fostering research communities developing large-scale pan-European common infrastructures. One key player in this development is the project CLARIN.

\subsubsection{CLARIN}

CLARIN (Common Language Resources and Technology Infrastructure) is constituted by over 180 members from around 38 countries. The mission of this project is:

\begin{quote}
create a research infrastructure that makes language resources and technologies (LRT) available to scholars of all disciplines, especially SSH\\
large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily usable
\end{quote}

This shall be accomplished by setting up a federated network of centers (with federated identity management), but mainly by providing resources and services in an agreed-upon, consistent, standardized manner. The foundation for this goal shall be the Component Metadata Infrastructure (CMDI), a model that caters for flexible metadata profiles and thus allows existing schemas to be accommodated.

The embedding in the CLARIN project brings with it the context of Language Resources and HLT (Human Language Technology, also known as NLP, Natural Language Processing), with SSH (Social Sciences and Humanities) as the primary target user group of CLARIN.
CLARIN/NLP for SSH

\subsubsection{Standards}

\begin{description}
\item[ISO12620] Data Category Registry
\item[LAF] Linguistic Annotation Framework
\item[CMDI] - (DC, OLAC, IMDI, TEI)
\end{description}

\subsubsection{NLP MD Catalogues}

\begin{description}
\item[LAT, TLA] - Language Archiving Technology, now The Language Archive - provided by the Max Planck Institute for Psycholinguistics \url{http://www.mpi.nl/research/research-projects/language-archiving-technology}
\item[OTA LR] Archiving Service provided by the Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/}
\item[OLAC]
\item[ELRA]
\item[LDC]
\item[DFKI/LT-World]
\end{description}

\subsection{Ontologies}

\subsubsection{Word, Sense, Concept}

Lexicon vs. Ontology:
A lexicon is a linguistic object; an ontology is not \cite{Hirst2009}. We don't need to be that strict, but it shall be a guiding principle in this work to consider things (Datasets, Vocabularies, Resources) also along this dichotomy/polarity: Conceptual vs. Lexical.
And while every Ontology has to have a lexical representation (canonically: rdfs:label, rdfs:comment, skos:*label), if we do not force the observed objects into a binary classification but consider a spectrum of bias instead, we should be able to locate them along this spectrum.
So the main focus of a typical ontology is the concepts (``conceptualization''), which are primarily language-independent.

A special case are Linguistic Ontologies (ISOcat, GOLD, WALS.info):
ontologies conceptualizing the linguistic domain.

They are special in that (``ontologized'') Lexicons refer to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings.
Lexicalized Ontologies: LingInfo, lemon: LMF + ISOcat/GOLD + Domain Ontology

a) as domain ontologies, describing aspects of the Resources\\
b) as linguistic ontologies enriching the Lexicalization of Concepts

Ontology and Lexicon \cite{Hirst2009}

LingInfo/lemon \cite{Buitelaar2009}

We shouldn't need linguistic ontologies (LingInfo, lemon); they are primarily relevant in the task of ontology population from texts, where the entities can be encountered in various word forms in the context of the text
(Ontology Learning, Ontology-based Semantic Annotation of Text).
And we are dealing with highly structured data, with entities referenced in their nominal(?) form.

Another special case are Controlled Vocabularies or Taxonomies/Classification Systems, let alone folksonomies, in that they equate terms with concepts/meanings, i.e., there is no explicit mapping between the language representation and the concept; rather, the term is the implicit carrier of the meaning/concept.
So, for example, in the LCSH the surface realization of each subject heading at the same time identifies the Concept.

controlled vocabularies?

\subsubsection{Semantic Web - Linked Data}

\begin{description}
\item[RDF/OWL]
\item[SKOS]
\end{description}

\subsubsection{Ontology Mapping}

\subsection{Visualization}


\subsection{Federated Search}

\subsubsection{Standards}

\begin{description}
\item[Z39.50/SRU/SRW/CQL] LoC
\item[OAI-PMH]
\end{description}

\subsubsection{(Digital) Libraries}

General (Libraries, Federations):

\begin{description}
\item[OCLC] \url{http://www.oclc.org}
    world's biggest Library Federation
\item[LoC] Library of Congress \url{http://www.loc.gov}
\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_en.htm}
\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
\end{description}

\subsubsection{Content Repositories}

\begin{description}
\item[PHAIDRA] Permanent Hosting, Archiving and Indexing of Digital Resources and Assets, provided by Vienna University \url{https://phaidra.univie.ac.at/}
\item[eSciDoc] provided by MPG + FIZ Karlsruhe \url{https://www.escidoc.org/}
\item[DRIVER] pan-European infrastructure of Digital Repositories \url{http://www.driver-repository.eu/}
\item[OpenAIRE] - Open Access Infrastructure for Research in Europe \url{http://www.openaire.eu/}
\end{description}

\subsubsection{(MD) Search Frameworks}

\begin{description}
\item[Zebra/Z39.50] JZKit
\item[Lucene/Solr]
\item[eXist] - XML DB
\end{description}

\subsubsection{Content/Corpus Search}
Corpus Search Systems
\begin{description}
\item[DDC] - text-corpus
\item[manatee] - text-corpus
\item[CQP] - text-corpus
\item[TROVA] - MM annotated resources
\item[ELAN] - MM annotated resources (editor + search)
\end{description}

\subsection{Summary}
