% source: SMC4LRT/chapters/Literature.tex @ 3671, last checked in by vronk
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{State of the Art}
\label{ch:lit}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In this chapter we give a short overview of the development of large research infrastructures (with a focus on those for language resources and technology), then examine in more detail the host of work (methods and systems) on schema/ontology matching,
and finally review Semantic Web principles and technologies.

Note, though, that substantial parts of the state-of-the-art coverage are moved into separate chapters: a broad analysis of the data is provided in chapter \ref{ch:data}, and a detailed description of the underlying infrastructure is found in chapter \ref{ch:infra}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Research Infrastructures (for Language Resources and Technology)}
In recent years, multiple large-scale initiatives have set out to combat the fragmented nature of the language resources landscape in general and the metadata interoperability problems in particular.

\xne{EAGLES/ISLE Meta Data Initiative} (IMDI) \cite{wittenburg2000eagles}, running from 2000 to 2003, proposed a standard for metadata descriptions of Multi-Media/Multi-Modal Language Resources, aiming to ease access to Language Resources and thus increase their reusability.

\xne{FLaReNet}\furl{http://www.flarenet.eu/} -- Fostering Language Resources Network -- running from 2007 to 2010, concentrated rather on ``community and consensus building'', developing a common vision and mapping the field of LRT via a survey.

\xne{CLARIN} -- Common Language Resources and Technology Infrastructure -- a large research infrastructure providing sustainable access for scholars in the humanities and social sciences to digital language data, and especially its technical core, the Component Metadata Infrastructure (CMDI) -- a comprehensive architecture for harmonized handling of metadata \cite{Broeder2011} --
are the primary context of this work; the description of this underlying infrastructure is therefore detailed in the separate chapter \ref{ch:infra}.
Both above-mentioned projects can be seen as predecessors to CLARIN, the IMDI metadata model being one starting point for the development of CMDI.

More of a sister project is the initiative \xne{DARIAH} -- Digital Research Infrastructure for the Arts and Humanities\furl{http://dariah.eu}. It has a broader scope, but has many personal ties to CLARIN as well as similar problems and similar solutions. Therefore there are efforts to intensify the cooperation between these two research infrastructures for digital humanities.

\xne{META-SHARE} is another multinational project aiming to build an infrastructure for language resources \cite{Piperidis2012meta}, however focusing more on the Human Language Technologies domain.\furl{http://meta-share.eu}

\begin{quotation}
META-NET is designing and implementing META-SHARE, a sustainable network of repositories of language data, tools and related web services documented with high-quality metadata, aggregated in central inventories allowing for uniform search and access to resources. Data and tools can be both open and with restricted access rights, free and for-a-fee.
\end{quotation}

See \ref{def:META-SHARE} for more details about META-SHARE's catalog and metadata format.


\subsubsection{Digital Libraries}

In a broader view we should also regard the activities in the world of libraries.
Having started connecting, exchanging and harmonizing their bibliographic catalogs as early as the 1970s, libraries certainly have a long tradition, a wealth of experience and stable solutions.

Mainly driven by national libraries, ever bigger aggregations of bibliographic data are being set up.
The biggest one is \xne{WorldCat}\furl{http://www.worldcat.org/} (totalling 273.7 million records \cite{OCLCAnnualReport2012}),
powered by OCLC, a cooperative of over 72,000 libraries worldwide.

In Europe, more recent initiatives have pursued similar goals:
\xne{The European Library}\furl{http://www.theeuropeanlibrary.org/tel4/} offers a search interface over more than 18 million digital items and almost 120 million bibliographic records from 48 National Libraries and leading European Research Libraries.

\xne{Europeana}\furl{http://www.europeana.eu/} \cite{purday2009think} has an even broader scope, serving as a meta-aggregator and portal for European digitised works, encompassing material not just from libraries, but also from museums, archives and all other kinds of collections (in fact, The European Library is the \emph{library aggregator} for Europeana). The auxiliary project \xne{EuropeanaConnect}\furl{http://www.europeanaconnect.eu/} (2009--2011) delivered the core technical components for Europeana as well as further services reusable in other contexts, e.g. the spatio-temporal browser \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo} \cite{janicke2013geotemco}.

Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015), a follow-up to \xne{Europeana} was established: a Best Practice Network, coordinated by The European Library, designed to establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform -- Europeana Research.

A number of catalogs and formats are further described in section \ref{sec:other-md-catalogs}.


\section{Schema / Ontology Mapping/Matching}

Schema or ontology matching provides the methodical foundation for the problem at hand, the \emph{semantic mapping}.
As Shvaiko \cite{shvaiko2012ontology} states, it is ``a solution to the semantic heterogeneity problem. It finds correspondences between semantically related entities of ontologies.''

One starting point for the plethora of work in the field of \emph{schema and ontology mapping} techniques and technology
is the overview of the field by Kalfoglou \cite{Kalfoglou2003}.
Shvaiko and Euzenat provide a summary of the key challenges \cite{Shvaiko2008} as well as a comprehensive survey of approaches for schema and ontology matching based on a proposed new classification of schema-based matching techniques \cite{Shvaiko2005_classification}.
Noy \cite{Noy2005_ontologyalignment,Noy2004_semanticintegration}
and, more recently, \cite{shvaiko2012ontology} and \cite{amrouch2012survey} provide surveys of the methods and systems in the field.

\paragraph{Methods}
Semantic and extensional methods are still rarely
employed by the matching systems. In fact, most of
the approaches are quite often based only on
terminological and structural methods.

Algergawy et al. \cite{Algergawy2010} classify, review, and experimentally compare major methods of element similarity measures and their combinations.
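
Terminological methods compare the names or labels of schema elements as strings. As a minimal illustration (one common string measure chosen here for concreteness, not taken from the cited surveys), the similarity of two non-empty element names $s$ and $t$ can be derived from their edit distance:
\[
  \mathrm{sim}_{\mathrm{edit}}(s,t) = 1 - \frac{\mathrm{lev}(s,t)}{\max(|s|,|t|)}
\]
where $\mathrm{lev}(s,t)$ denotes the Levenshtein distance (the minimal number of character insertions, deletions and substitutions turning $s$ into $t$) and $|s|$ the length of $s$. For the hypothetical element names \texttt{title} and \texttt{titel}, $\mathrm{lev}(s,t) = 2$ and $\max(|s|,|t|) = 5$, hence $\mathrm{sim}_{\mathrm{edit}} = 1 - 2/5 = 0.6$.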

\subsubsection{Systems}
A number of existing systems for schema/ontology matching/alignment are mentioned in these overview publications:

The majority of tools for ontology mapping use some sort of structural or
definitional information to discover new mappings. This information includes
such elements as subclass--superclass relationships, domains and ranges of
properties, analysis of the graph structure of the ontology, and so on. Some of
the tools in this category include
IF-Map \cite{kalfoglou2003if},
QOM \cite{ehrig2004qom},
Similarity Flooding \cite{melnik} and
the Prompt tools \cite{Noy2003_theprompt}, which integrate with Prot\'eg\'e.


\xne{COMA++} \cite{Aumueller2005} follows a composite approach to combine different match algorithms, offers user interaction via a graphical interface, and supports W3C XML Schema and OWL.

\xne{FOAM} \cite{EhrigSure2005}


The ontology matching system \xne{LogMap 2} \cite{jimenez2012large} supports user interaction and implements scalable reasoning and diagnosis algorithms, which minimise any logical inconsistencies introduced by the matching process.
The process is divided into two main logical phases: computation of mapping candidates (maximise recall) and assessment of the candidates (maximise precision).
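
The two phases can be related to the standard evaluation measures for alignments. As a sketch in notation assumed here (not taken from \cite{jimenez2012large}), let $A$ be the set of correspondences produced by a matcher and $R$ a reference alignment considered correct:
\[
  \mathrm{precision}(A,R) = \frac{|A \cap R|}{|A|}, \qquad
  \mathrm{recall}(A,R) = \frac{|A \cap R|}{|R|}
\]
A generous candidate phase enlarges $A$ and thereby tends to raise recall, while the assessment phase discards doubtful candidates to raise precision. For instance, if a matcher proposes $|A| = 50$ correspondences, of which $40$ belong to a reference alignment with $|R| = 80$, precision is $40/50 = 0.8$ and recall is $40/80 = 0.5$.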


s which are at the core of the mapping task.


On the dedicated platform OAEI\footnote{Ontology Alignment Evaluation Initiative -- \url{http://oaei.ontologymatching.org/}} an ongoing effort is being carried out and documented, comparing various alignment methods applied to different domains.


One more specific recent inspirational work is that of Noah et al. \cite{Noah2010}, developing a semantic digital library for an academic institution. The scope is limited to document collections, but nevertheless many aspects seem very relevant for this work, like operating on document metadata, ontology population or sophisticated querying and searching.

Matching is a laborious and error-prone process, and once ontology
mappings are discovered, i

\subsection{MOVEOUT: Application of Schema Matching on the CMD domain}
Notice that the semantic interoperability layer built into the core of the CMD Infrastructure integrates the
task of identifying semantic correspondences directly into the process of schema creation,
largely removing the need for complex schema matching/mapping techniques in post-processing.
However, this only holds for schemas already created within the CMD framework.
Given the growing universe of definitions (data categories and components) in the CMD framework, the metadata modeller could very well profit from applying schema mapping techniques as a pre-processing step in the task of integrating existing external schemas into the infrastructure. User involvement is identified by \cite{shvaiko2012ontology} as one of the promising future challenges for ontology matching.

Such a procedure pays tribute to the fact that the mapping techniques are mostly error-prone and can deliver reliable 1:1 alignments only in trivial cases. This lies in the nature of the problem: given the heterogeneity of the schemas present in the data collection, full alignments are not achievable at all; only parts of individual schemas actually correspond semantically.

Once all the equivalences (and other relations) between the profiles/schemas have been found, similarity ratios can be determined.
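
How such a ratio is computed is not fixed here; one possible definition, given purely as an illustrative assumption, relates the matched elements to all elements of both schemas. Let $E_1$ and $E_2$ be the sets of elements of two profiles/schemas $S_1$ and $S_2$, and let $M_1 \subseteq E_1$ and $M_2 \subseteq E_2$ be the elements taking part in at least one correspondence:
\[
  \mathrm{sim}(S_1,S_2) = \frac{|M_1| + |M_2|}{|E_1| + |E_2|}
\]
The ratio is $0$ if no element is matched and $1$ if every element of both schemas has a counterpart. For example, with $|E_1| = 10$, $|E_2| = 20$ and six matched elements on each side, $\mathrm{sim}(S_1,S_2) = (6+6)/(10+20) = 0.4$.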

The task can also be seen as building a bridge between XML resources and semantic resources expressed in RDF or OWL.
This speaks for a tool like COMA++, which supports both W3C standards: XML Schema and OWL.
Concentration on existing systems with user interface?

The process of expressing the whole of the data as one semantic resource can also be understood as a schema or ontology merging task, with data categories being the primary mapping elements.


In the end
It is also not the goal to merge

Being only a pre-processing step meant to provide suggestions to the human modeller implies a higher importance of recall than of precision.
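
The preference for recall can be made explicit with the weighted F-measure. As a sketch (the choice of the measure and of $\beta$ is an assumption for illustration, not prescribed by this work), with precision $p$ and recall $r$ of the suggested correspondences:
\[
  F_\beta = (1+\beta^2)\,\frac{p \cdot r}{\beta^2 p + r}
\]
Values $\beta > 1$ weight recall higher than precision. With $\beta = 2$, a matcher with $p = 0.5$ and $r = 0.9$ scores $F_2 = 5 \cdot 0.45 / (2 + 0.9) \approx 0.78$, whereas one with $p = 0.9$ and $r = 0.5$ scores only $F_2 = 5 \cdot 0.45 / (3.6 + 0.5) \approx 0.55$.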



infrastructure un

This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching, which try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit, manually defined crosswalks \cite{Shvaiko2005}.


Application of ontology/schema matching/mapping techniques
is reduced or outsourced


\subsection{Existing Crosswalk services}


\url{http://www.oclc.org/developer/services/metadata-crosswalk-service}


VoID (``Vocabulary of Interlinked Datasets'') is an RDF-based schema to describe linked datasets.\furl{http://semanticweb.org/wiki/VoID}

\url{http://www.dnb.de/rdf}


The entire WorldCat cataloging collection made publicly
available using Schema.org mark-up with library extensions for use by developers and
search partners such as Bing, Google, Yahoo! and Yandex.

OCLC begins adding linked data to WorldCat by appending
Schema.org descriptive mark-up to WorldCat.org pages, thereby
making OCLC member library data available for use by intelligent
Web crawlers such as Google and Bing.

%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Semantic Web -- Linked Open Data}

The Linked Data paradigm \cite{TimBL2006} for publishing data on the web is increasingly being taken up by data providers across many disciplines \cite{bizer2009linked}. \cite{HeathBizer2011} gives a comprehensive overview of the principles of Linked Data with practical examples and current applications.

\subsection{Semantic Web -- Technical solutions / Server applications}

The provision of the produced semantic resources on the web requires technical solutions to store the RDF triples, query them efficiently
and ideally expose them via a web interface to the users.

Meanwhile, a number of RDF triple store solutions relying on native, DBMS-backed or hybrid persistence layers are available: open-source solutions like \xne{Jena, Sesame} or \xne{BigData} as well as a number of commercial solutions such as \xne{AllegroGraph, OWLIM, Virtuoso}.

A qualitative and quantitative study \cite{Haslhofer2011europeana} in the context of Europeana evaluated a number of RDF stores (using the whole Europeana EDM data set of 382,629,063 triples as data load) and came to the conclusion that ``certain RDF stores, notably OpenLink Virtuoso and 4Store'' can handle the large test dataset.

\xne{OpenLink Virtuoso Universal Server}\furl{http://virtuoso.openlinksw.com} is a hybrid storage solution for a range of data models, including relational data, RDF and XML, and free text documents \cite{Erling2009Virtuoso, Haslhofer2011europeana}.
Virtuoso is used to host many important Linked Data sets (e.g., DBpedia\furl{http://dbpedia.org} \cite{auer2007dbpedia}).
Virtuoso is offered under both commercial and open-source license models.

Another solution worth examining is the \xne{Linked Media Framework}\furl{http://code.google.com/p/lmf/} -- an ``easy-to-setup server application that bundles together three Apache open source projects to offer some advanced services for linked media management'': publishing legacy data as linked data, semantic search by enriching data with content from the Linked Data Cloud, and using a SKOS thesaurus for information extraction.

\begin{comment}
LDpath\furl{http://code.google.com/p/ldpath/}
``a simple path-based query language similar to XPath or SPARQL Property Paths that is particularly well-suited for querying and retrieving resources from the Linked Data Cloud by following RDF links between resources and servers.''

Linked Data browser

Haystack\furl{http://en.wikipedia.org/wiki/Haystack_(PIM)}
\end{comment}

\subsection{Ontology Visualization}

Landscape, Treemap, SOM

\todoin{check Ontology Mapping and Alignement / saiks/Ontology4 4auf1.pdf}


\section{Language and Ontologies}

There are two different kinds of relation between language or linguistics and ontologies: a) `linguistic ontologies', domain ontologies conceptualizing the linguistic domain and capturing aspects of linguistic resources; b) `lexicalized' ontologies, where ontology entities are enriched with linguistic, lexical information.

\subsubsection{Linguistic ontologies}

One prominent instance of a linguistic ontology is the \xne{General Ontology for Linguistic Description}, or GOLD \cite{Farrar2003}\furl{http://linguistics-ontology.org},
which ``gives a formalized account of the most basic categories and relations (the `atoms') used in the scientific description of human language, attempting to codify the general knowledge of the field. The motivation is to facilitate automated reasoning over linguistic data and help establish the basic concepts through which intelligent search can be carried out''.

In line with the aspiration ``to be compatible with the general goals of the Semantic Web'', the dataset is provided via a web application as well as a dump in OWL format\furl{http://linguistics-ontology.org/gold-2010.owl} \cite{GOLD2010}.


Founded in 1934, SIL International\furl{http://www.sil.org/about-sil} (originally known as the Summer Institute of Linguistics, Inc.) is a leader in the identification and documentation of the world's languages. Results of this research are published in Ethnologue: Languages of the World\furl{http://www.ethnologue.com/} \cite{grimes2000ethnologue}, a comprehensive catalog of the world's nearly 7,000 living languages. SIL also maintains the Language \& Culture Archives, a large collection of all kinds of resources in the ethnolinguistic domain\furl{http://www.sil.org/resources/language-culture-archives}.

The World Atlas of Language Structures (WALS)\furl{http://WALS.info} \cite{wals2011}
is ``a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars)''. First published in 2005, the current online version, published in 2011, provides a compendium of detailed expert definitions of individual linguistic features, accompanied by a sophisticated web interface integrating the information on linguistic features with their occurrence in the world's languages and their geographical distribution.

Simons \cite{Simons2003developing} developed a Semantic Interpretation Language (SIL) that is used to define the meaning of the elements and attributes in an XML markup schema in terms of abstract concepts defined in a formal semantic schema.
Extending this work, Simons et al. \cite{Simons2004semantics} propose a method for mapping linguistic descriptions in plain XML into semantically rich RDF/OWL, employing the GOLD ontology as the target semantic schema.

These ontologies can be used by (``ontologized'') lexicons, which refer to them to describe linguistic properties of the Lexical Entries, as opposed to linking to Domain Ontologies to anchor Senses/Meanings.


Work on the Semantic Interpretation Language as well as the GOLD ontology can be seen as a conceptual predecessor of the Data Category Registry, an ISO-standardized procedure for defining and standardizing ``widely accepted linguistic concepts'' that is at the core of CLARIN's metadata infrastructure (cf. \ref{def:DCR}).
Although this registry is not exactly an ontology in the common sense and (by design) does not contain any relations between concepts,
the central entities are concepts and not lexical items, thus it can be seen as a proto-ontology.
Another indication of the heritage is the fact that concepts of the GOLD ontology were migrated into ISOcat (495 items) in 2010.

Notice that although this work is concerned with language resources, it operates primarily on the metadata level; thus the overlap with linguistic ontologies codifying the terminology of the discipline of linguistics is rather marginal (perhaps on the level of description of specific linguistic aspects of given resources).

\subsubsection{Lexicalised ontologies, ``ontologized'' lexicons}


The other type of relation between ontologies and linguistics or language is represented by lexicalised ontologies. Hirst \cite{Hirst2009} elaborates on the differences between ontology and lexicon and the possibility of reusing lexicons for the development of ontologies.

In a number of works, Buitelaar, McCrae et al. \cite{Buitelaar2009, buitelaar2010ontology, McCrae2010c, buitelaar2011ontology, Mccrae2012interchanging} argue for ``associating linguistic information with ontologies'' or ``ontology lexicalisation'' and draw attention to lexical and linguistic issues in knowledge representation in general. This basic idea lies behind the series of proposed models \xne{LingInfo}, \xne{LexOnto}, \xne{LexInfo} and, most recently, \xne{lemon}, aimed at allowing complex lexical information for such ontologies and at describing the relationship between the lexicon and the ontology.
The most recent in this line, \xne{lemon} or \xne{lexicon model for ontologies}, defines ``a formal model for the proper representation of the continuum between: i) ontology semantics; ii) terminology that is used to convey this in natural
language; and iii) linguistic information on these terms and their constituent lexical units'', in essence enabling the creation of a lexicon for a given ontology. Adopting the principle of ``semantics by reference'', no complex semantic information needs to be stated in the lexicon,
yielding a clear separation of the lexical layer and the ontological layer.

Lemon builds on existing work: besides the LexInfo and LIR ontology-lexicon models,
in particular on global standards, namely the W3C standard SKOS (Simple Knowledge Organization System) \cite{SKOS2009} and the ISO standards Lexical Markup Framework (ISO 24613:2008 \cite{ISO24613:2008})
and Specification of Data Categories, Data Category Registry (ISO 12620:2009 \cite{ISO12620:2009}).

The Lexical Markup Framework (LMF) \cite{Francopoulo2006LMF, ISO24613:2008} defines a metamodel for representing data in lexical databases used with monolingual and multilingual computer applications.

An overview of current developments in the application of the linked data paradigm to linguistic data collections was given at the workshop Linked Data in Linguistics\furl{http://ldl2012.lod2.eu/} 2012 \cite{ldl2012}.


The primary motivation for linguistic ontologies like \xne{lemon} are the tasks of ontology-based information extraction and of ontology learning and population from text, where the entities are often referred to by non-nominal word forms and with ambiguous semantics. Given that the discussed collection contains mainly highly structured data referencing entities in their nominal form, linguistic ontologies are not directly relevant for this work.


\section{Summary}
This chapter concentrated on the current developments regarding the infrastructures for Language Resources and Technology and, on the other hand, gave an overview of the state of the art regarding methods to be applied in this work: Semantic Web technologies, Ontology Mapping and Ontology Visualization.
