\section{Previous Work}
\label{SotA}

\subsection*{Infrastructure Components}
Several relevant activities are being carried out in the context of research infrastructure initiatives for LRT. The most relevant ongoing effort is the \texttt{VLO - Virtual Language Observatory}\footnote{\url{http://www.clarin.eu/vlo/}}\cite{VanUytvanck2010}, which is being developed within the CLARIN project. This application operates on roughly the same collection of data as is discussed in this work. However, it employs a faceted search, manually mapping the appropriate metadata fields from the different schemas to 8 fixed facets. Although this is a very reductionist approach, it is a good starting point, offering a core set of categories together with an initial set of category mappings.
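
The manual mapping of schema fields to fixed facets described above can be pictured with a minimal Python sketch. It is not the VLO implementation: the schema names, field paths, facet names and the function \texttt{to\_facets} are assumptions chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical hand-maintained mapping: (schema, field) -> facet.
# Schema names, field paths and facet names are illustrative assumptions,
# not the actual VLO configuration.
FACET_MAP = {
    ("OLAC", "dc:language"): "language",
    ("IMDI", "Content/Languages/Language/Name"): "language",
    ("OLAC", "dc:publisher"): "organisation",
    ("IMDI", "Project/Contact/Organisation"): "organisation",
}


def to_facets(schema, record):
    """Project one flat metadata record onto the fixed facets.

    Fields without an entry in FACET_MAP are dropped, which is the
    reductionist aspect of such an approach.
    """
    facets = defaultdict(list)
    for field, value in record.items():
        facet = FACET_MAP.get((schema, field))
        if facet is not None:
            facets[facet].append(value)
    return dict(facets)


imdi_record = {
    "Content/Languages/Language/Name": "Dutch",
    "Project/Contact/Organisation": "Example Institute",
    "Session/Title": "Interview 12",  # unmapped, therefore dropped
}
print(to_facets("IMDI", imdi_record))
```

In this sketch every schema needs its own hand-written entries, which illustrates why such a mapping yields a small but useful initial set of category correspondences.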

\texttt{Component Registry} and \texttt{ISOcat}\footnote{\url{http://www.isocat.org/}}
are two integral components of the \textit{CLARIN Metadata Infrastructure} that maintain the normative information. \texttt{ISOcat} in particular, the ISO-standardized Data Category Registry for registering and maintaining \texttt{Data Categories} as globally agreed-upon incarnations of concepts in the domain of discourse, is the definitive primary reference vocabulary \cite{Broeder2010,ISO12620:2009}. Closely related is the work on the so-called \texttt{Relation Registry}, a separate component that allows arbitrary relations between data categories to be defined; however, this activity is still in an early prototype phase.

A last relevant initiative to mention is the \texttt{Vocabulary Alignment Service}, which is being developed and run within the Dutch program CATCH\footnote{\textit{Continuous Access To Cultural Heritage} - \url{http://www.catchplus.nl/en/}} and serves as a neutral manager and provider of controlled vocabularies. There are plans to reuse or enhance this service for the needs of the CLARIN project.

\noindent
All these components are running services that this work shall build upon directly.

\subsection*{LRT Resources}
The CLARIN project also delivers a valuable source of information on the normative resources in the domain in its current deliverable on \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3}. Besides covering ontologies as one type of resource, this document offers an exhaustive collection of references to standards, vocabularies and other normative or standardization work in the field of Language Resources and Technology.

Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology that is being developed at DFKI\footnote{\textit{Deutsches Forschungszentrum für Künstliche Intelligenz} - \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study \cite{Joerg2010}.

\subsection*{Ontology Mapping}
As the main contribution shall be the application of \emph{ontology mapping} techniques and technology, a comprehensive overview of this field and its current developments is paramount. There is a plethora of work on the topic, and the difficult task will be to sort out the relevant contributions. The starting point for the investigation will be the overview of the field by Kalfoglou \cite{Kalfoglou2003} and a more recent summary of the key challenges by Shvaiko and Euzenat \cite{Shvaiko2008}.

In their rather theoretical work, Ehrig and Sure \cite{EhrigSure2004} elaborate on the various similarity measures that are at the core of the mapping task. The dedicated platform OAEI\footnote{Ontology Alignment Evaluation Initiative - \url{http://oaei.ontologymatching.org/}} hosts and documents an ongoing effort to compare various alignment methods applied to different domains.
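
To make the notion of a similarity measure concrete, the following Python sketch combines a string similarity on labels with a set overlap on properties into one weighted score. It is only an illustration under stated assumptions: the entity layout (a dictionary with \texttt{label} and \texttt{properties}), the function names and the weights are hypothetical, and these are not the specific measures defined in the cited work.

```python
from difflib import SequenceMatcher


def label_similarity(a, b):
    """Case-insensitive string similarity of two labels, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def set_similarity(xs, ys):
    """Jaccard overlap of two collections, in [0, 1]."""
    xs, ys = set(xs), set(ys)
    if not xs and not ys:
        return 0.0
    return len(xs & ys) / len(xs | ys)


def combined_similarity(e1, e2, w_label=0.6, w_props=0.4):
    """Weighted sum of the two measures; the weights are arbitrary assumptions."""
    return (w_label * label_similarity(e1["label"], e2["label"])
            + w_props * set_similarity(e1["properties"], e2["properties"]))


# Two hypothetical metadata categories from different schemas.
a = {"label": "Language", "properties": {"name", "iso-code"}}
b = {"label": "languageName", "properties": {"name"}}
print(round(combined_similarity(a, b), 3))
```

A candidate mapping would then be accepted when the combined score exceeds some threshold, which itself has to be tuned per task.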

Another, more specific, recent and inspiring work is that of Noah et al.\ \cite{Noah2010}, who develop a semantic digital library for an academic institution. The scope is limited to document collections, but many aspects nevertheless seem very relevant for this work, such as operating on document metadata, ontology population, and sophisticated querying and searching.

\subsection*{Linked Open Data}
As described previously, one outcome of the work will be the dataset expressed in RDF and interlinked with other semantic resources.
This is very much in line with the broad \textit{Linked Open Data} effort as proposed by Berners-Lee \cite{TimBL2006} and being pursued across many disciplines. (This topic is also supported by the EU Commission within FP7.\footnote{\url{http://cordis.europa.eu/fetch?CALLER=PROJ\_ICT&ACTION=D&CAT=PROJ&RCN=95562}}) A very recent comprehensive overview of the principles of Linked Data and its current applications is the book by Heath and Bizer \cite{HeathBizer2011}, which shall serve as a practical guide for this specific task.

----------------------------

\subsection{Language Resources and Technology}

While the Digital Libraries community has by and large already consolidated and set up big federated networks of digital library repositories, the landscape in the field of Language Resources and Technology is still scattered, even though it can meanwhile look back on a decade of standardization efforts. One main reason seems to be the complexity and diversity of the metadata associated with the resources, which stems from the wide range of resource types and is further complicated by the dependence on different schools of thought.

Need some numbers about the disparity in the field: number of institutes, resources, formats.

This situation has been recognized by the community, and multiple standardization initiatives have been undertaken. The process seems to have gained new momentum thanks to the large Research Infrastructure Programmes introduced by the European Commission, which aim to foster research communities developing large-scale pan-European common infrastructures. One key player in this development is the CLARIN project.

\subsubsection{CLARIN}

CLARIN (Common Language Resource and Technology Infrastructure) is constituted by over 180 members from around 38 countries. The mission of this project is to

create a research infrastructure that makes language resources and technologies (LRT) available to scholars of all disciplines, especially SSH; a large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily usable.

This shall be accomplished by setting up a federated network of centers (with federated identity management), but mainly by providing resources and services in an agreed-upon, coherent and standardized manner. The foundation for this goal shall be the Component Metadata Infrastructure (CMDI), a model that caters for flexible metadata profiles and allows existing schemas to be accommodated.

The embedding in the CLARIN project brings with it the context of Language Resources and HLT (Human Language Technology, also known as NLP, Natural Language Processing), with SSH (Social Sciences and Humanities) as the primary target user group of CLARIN.
CLARIN/NLP for SSH

\subsubsection{Standards}

\begin{description}
\item[ISO12620] Data Category Registry
\item[LAF] Linguistic Annotation Framework
\item[CMDI] - (DC, OLAC, IMDI, TEI)
\end{description}

\subsubsection{NLP MD Catalogues}

\begin{description}
\item[LAT, TLA] - Language Archiving Technology, now The Language Archive - provided by the Max Planck Institute for Psycholinguistics \url{http://www.mpi.nl/research/research-projects/language-archiving-technology}
\item[OTA LR] Archiving Service provided by the Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/}
\item[OLAC]
\item[ELRA]
\item[LDC]
\item[DFKI/LT-World]
\end{description}

\subsection{Ontologies}

\subsubsection{Word, Sense, Concept}

Lexicon vs. Ontology
A lexicon is a linguistic object; an ontology is not \cite{Hirst2009}. We do not need to be that strict, but it shall be a guiding principle in this work to consider things (datasets, vocabularies, resources) along this dichotomy as well: conceptual vs. lexical.
And while every ontology has to have a lexical representation (canonically: rdfs:label, rdfs:comment, skos:*label), if we do not try to force observed objects into a binary classification but consider a spectrum of bias instead, we should be able to locate them along this spectrum.
So the main focus of a typical ontology is the concepts (``conceptualization''), which are primarily language-independent.

A special case is linguistic ontologies (ISOcat, GOLD, WALS.info):
ontologies conceptualizing the linguistic domain.

They are special in that (``ontologized'') lexicons refer to them to describe linguistic properties of the lexical entries, as opposed to linking to domain ontologies to anchor senses/meanings.
Lexicalized Ontologies: LingInfo, lemon: LMF + ISOcat/GOLD + Domain Ontology

a) as domain ontologies, describing aspects of the Resources\\
b) as linguistic ontologies enriching the Lexicalization of Concepts

Ontology and Lexicon \cite{Hirst2009}

LingInfo/Lemon \cite{Buitelaar2009}

We should not need linguistic ontologies (LingInfo, lemon); they are primarily relevant to the task of ontology population from texts, where the entities can be encountered in various word forms in the context of the text
(Ontology Learning, Ontology-based Semantic Annotation of Text).
We, in contrast, are dealing with highly structured data in which entities are referenced in their nominal(?) form.

Another special case is controlled vocabularies or taxonomies/classification systems, let alone folksonomies, in that they identify terms with concepts/meanings, i.e.\ there is no explicit mapping between the language representation and the concept; rather, the term is the implicit carrier of the meaning/concept.
So, for example, in the LCSH the surface realization of each subject heading at the same time identifies the concept.

controlled vocabularies?

\subsubsection{Semantic Web - Linked Data}

\begin{description}
\item[RDF/OWL]
\item[SKOS]
\end{description}

\subsubsection{OntologyMapping}


\subsection{Visualization}


\subsection{FederatedSearch}

\subsubsection{Standards}

\begin{description}
\item[Z39.50/SRU/SRW/CQL] LoC
\item[OAI-PMH]
\end{description}


\subsubsection{(Digital) Libraries}


General (Libraries, Federations):

\begin{description}
\item[OCLC] \url{http://www.oclc.org}
 world's biggest Library Federation
\item[LoC] Library of Congress \url{http://www.loc.gov}
\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_ en.htm}
\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
\end{description}

\subsubsection{Content Repositories}

\begin{description}
\item[PHAIDRA] Permanent Hosting, Archiving and Indexing of Digital Resources and Assets, provided by the University of Vienna \url{https://phaidra.univie.ac.at/}
\item[eSciDoc] provided by MPG + FIZ Karlsruhe \url{https://www.escidoc.org/}
\item[DRIVER] pan-European infrastructure of Digital Repositories \url{http://www.driver-repository.eu/}
\item[OpenAIRE] - Open Access Infrastructure for Research in Europe \url{http://www.openaire.eu/}
\end{description}


\subsubsection{(MD)search frameworks:}

\begin{description}
\item[Zebra/Z39.50] JZKit
\item[Lucene/Solr]
\item[eXist] - XML DB
\end{description}

\subsubsection{Content/Corpus Search}
Corpus Search Systems
\begin{description}
\item[DDC] - text-corpus
\item[manatee] - text-corpus
\item[CQP] - text-corpus
\item[TROVA] - MM annotated resources
\item[ELAN] - MM annotated resources (editor + search)
\end{description}

\subsection{Summary}
