source: SMC4LRT/chapters/Data.tex @ 3776

Last change on this file since 3776 was 3776, checked in by vronk, 11 years ago
final layout cleaning; backup
1
2	\chapter{Analysis of the data landscape}
3	\label{ch:data}
4	This chapter gives an overview of existing metadata standards and formats in the field of Language Resources and Technology, together with a description of their characteristics and their usage in initiatives and data collections. Special attention is paid to the Component Metadata Framework, the base data model for the infrastructure this work is part of.
5
6
7	\section{Component Metadata Framework}
8	\label{def:CMD}
9
10	The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML-schema defining CMD -- the \xne{general-component-schema} -- is featured in appendix \ref{lst:cmd-schema}.)
11	CMD is used to define so-called \var{profiles}, which are constructed out of reusable \var{components} -- collections of metadata fields. Components can contain other components and can be reused in multiple profiles. A profile is itself just a special kind of component (a subclass) with some additional administrative information.
12	The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category (cf. \ref{def:DCR}\footnote{in short: persistently referenceable concept definition}), thus
13	indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.
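
A minimal sketch of how such a data category reference may look in a component specification is given below. It assumes the element and attribute names of the \xne{general-component-schema} (cf. appendix \ref{lst:cmd-schema}); the header is reduced to the name, and the profile name, the element name and the PID are hypothetical placeholders, not actual registry entries.

\begin{lstlisting}[language=XML, caption={Sketch of a CMD profile specification with a data category reference (placeholder values)}]
<!-- Sketch only: names and the PID are placeholders, not registry content -->
<CMD_ComponentSpec isProfile="true">
  <Header>
    <Name>ExampleProfile</Name>
  </Header>
  <CMD_Component name="ExampleProfile" CardinalityMin="1" CardinalityMax="1">
    <!-- ConceptLink carries the PID of exactly one data category -->
    <CMD_Element name="title" ValueScheme="string"
                 ConceptLink="http://www.isocat.org/datcat/DC-0000"
                 CardinalityMin="1" CardinalityMax="1"/>
  </CMD_Component>
</CMD_ComponentSpec>
\end{lstlisting}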
14
15	%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
16
17	While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.
18
19	Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records.
20	The generated schema also conveys the information about the referenced data categories as annotations.
21
22
23	\subsection{CMD Profiles }
24	In the Component Registry (CR), 124\footnote{All numbers are as of 2013-06 if not stated otherwise.} public profiles and 696 components are defined. Table \ref{table:dev_profiles} shows the development of the CR and DCR population over time.
25
26	Next to the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles demonstrate the flexibility and expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ considerably in their structure: next to flat profiles with just one level of components or elements and 5 to 20 fields (\concept{dublincore}, \concept{collection}, the set of \concept{Bamdes}-profiles), there are complex profiles with up to 10 levels (\concept{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 117 distinct components and 337 elements (or 419 components and 1587 elements when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \concept{Contact}) included by three other components (\concept{Project}, \concept{Institution}, \concept{Access}) will appear three times in the instantiated record.}).
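
The expansion described in the footnote can be illustrated with a schematic instance fragment. The component names \concept{Project}, \concept{Institution}, \concept{Access} and \concept{Contact} are taken from the footnote; the enclosing \concept{ExampleProfile} and the element \concept{Name} are hypothetical and serve only as illustration.

\begin{lstlisting}[language=XML, caption={Schematic sketch of component expansion (hypothetical profile)}]
<!-- Schematic sketch; ExampleProfile and Name are hypothetical -->
<ExampleProfile>
  <Project>
    <Contact><Name>contact of the project</Name></Contact>
  </Project>
  <Institution>
    <Contact><Name>contact of the institution</Name></Contact>
  </Institution>
  <Access>
    <Contact><Name>contact for access</Name></Contact>
  </Access>
</ExampleProfile>
\end{lstlisting}

Not counting the profile itself, this sketch has 4 distinct components and 1 distinct element, but 6 expanded components and 3 expanded elements, because \concept{Contact} is defined once and instantiated three times.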
27
28
29	\begin{table}
30	\caption{The development of defined profiles and DCs over time}
31	\label{table:dev_profiles}
32	% \begin{tabular}{ l \| r \| r \| r \| r }
33	\begin{tabular}{ l r r r r }
34
35	\hline
36	date & 2011-01 & 2012-06 & 2013-01 & 2013-06 \\
37	\hline
38	Profiles & 40 & 53 & 87 & 124 \\
39	Distinct Components & 164 & 298 & 542 & 828 \\
40	Expanded Components & 1055 & 1536 & 2904 & 5757 \\
41	Distinct Elements & 511 & 893 & 1505 & 2399 \\
42	Expanded Elements & 1971 & 3030 & 5754 & 13232 \\
43	Distinct data categories & 203 & 266 & 436 & 499 \\
44	Data categories in the Metadata profile & 277 & 712 & 774 & 791 \\
45	Ratio of elements without DCs & 24,7\% & 17,6\% & 21,5\% & 26,5\% \\
46	Components with DCs & 28 & 67 & 115 & 140 \\
47
48	\hline
49	\end{tabular}
50	\end{table}
51
52
53	\subsection{Instance Data}
54
55	%\todoin{ add historical perspective on data - list overall}
56
57	The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
58	collects records from 69 providers on a daily basis. The complete dataset amounts to 540.065 records.
59	16 of the providers offer CMDI records; the other 53 provide OLAC/DC records\label{info:olac-records}, which are converted into the corresponding CMD profile after harvesting. Next to these 81.226 original OLAC records, there are a few providers offering their OLAC or DCMI-terms records already converted into CMDI, so that all in all the OLAC and DCMI-terms records amount to 139.152.
60	On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles.
61
62
63	\begin{table}
64	\caption{Top 20 CMD profiles, with the respective number of records}
65	\label{tab:cmd-profiles}
66	\begin{center}
67	\begin{tabu}{ r l }
68	\hline
69	\rowfont{\itshape\small} \# records & profile \\
70	\hline
71	155.403 & Song \\
72	138.257 & Session \\
73	92.996 & OLAC-DcmiTerms \\
74	46.156 & DcmiTerms \\
75	28.448 & SongScan \\
76	21.256 & SourceScan \\
77	19.059 & LiteraryCorpusProfile \\
78	16.519 & Source \\
79	13.626 & imdi-corpus \\
80	10.610 & media-session-profile \\
81	7.961 & SongAudio \\
82	7.557 & SymbolicMusicNotation \\
83	4.485 & LCC DataProviderProfile \\
84	4.485 & SourceProfile \\
85	4.417 & Text \\
86	1.982 & Soundbites-recording \\
87	1.530 & Performer \\
88	1.475 & ArthurianFiction \\
89	939 & LrtInventoryResource \\
90	873 & teiHeader \\
91	\hline
92	\end{tabu}
93	\end{center}
94	\end{table}
95
96	\begin{table}
97	\caption{Top 20 CMD collections, with the respective number of records}
98	\begin{center}
99	\begin{tabu}{ r l }
100	\hline
101	\rowfont{\itshape\small} \# records & collection \\
102	\hline
103	243.129 & Meertens collection: Liederenbank \\
104	46.658 & DK-CLARIN Repository \\
105	46.156 & Nederlands Instituut voor Beeld en Geluid Academia collectie \\
106	29.266 & childes \\
107	24.583 & DoBeS archive \\
108	23.185 & Language and Cognition \\
109	14.593 & talkbank \\
110	14.363 & Acquisition \\
111	14.320 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
112	12.893 & MPI CGN \\
113	10.628 & Bavarian Archive for Speech Signals (BAS) \\
114	7.964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures\\
115	7.348 & WALS RefDB \\
116	5.689 & Lund Corpora \\
117	4.640 & Oxford Text Archive \\
118	4.492 & Leipzig Corpora Collection \\
119	3.539 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
120	3.280 & A Digital Archive of Research Papers in Computational Linguistics \\
121	3.147 & CLARIN NL \\
122	3.081 & MPI für Bildungsforschung \\
123	\hline
124	\end{tabu}
125	\end{center}
126	\end{table}
127
128	We can also observe a large disparity in the number of records between individual providers and profiles. Almost half of all records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by the MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from \textit{The Language Archive}). On the other hand, there are 25 profiles that have fewer than 10 instances. This may be due both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource).
129
130
131
132	\section{Other LRT Metadata Formats and Collections }
133	\label{sec:lrt-md-catalogs}
134
135	Next to CLARIN and CMDI, there is a host of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, also sketch the ties with CMDI and existing integration efforts.
136
137	For a comprehensive overview of formats and standards, the CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} surveys standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} puts the overwhelming number of existing metadata standards into a systematic visual overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. However, despite its aspiration to comprehensiveness, it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE.
138
139
140	\subsection{Dublin Core metadata terms}
141	Work on this metadata format started in 1995 at the Metadata Workshop\furl{http://dublincore.org/workshops/dc1/} organized by OCLC/NCSA in Dublin, Ohio, USA. It is now maintained by the Dublin Core Metadata Initiative.
142
143	It is a fixed set of terms for a basic generic description of a range of resources (both virtual and physical), coming in two versions\furl{http://dublincore.org/documents/dcmi-terms/}:
144	\begin{description}
145	\item[Dublin Core Metadata Element Set (DCMES) ] namespace: \code{/elements/1.1/}\\
146	the original set of 15 terms, standardized as IETF RFC 5013, ISO Standard 15836-2009 and NISO Standard Z39.85-2007
147	\item[Dublin Core metadata terms ] namespace: \code{/terms/} \\
148	the extended `Qualified' set of 55 terms, extending the original 15 ones (replicating them in the new namespace for consistency)
149	\end{description}
150
151	The DCMI terms format is very widely used nowadays. Thanks to its simplicity, it serves as the common denominator in many applications: content management systems integrate Dublin Core for use in \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description in content repositories (Fedora Commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
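
As an illustration of this base format, a record in the unqualified Dublin Core serialization used by OAI-PMH might look as follows. This is a sketch that reuses the bibliographic values of the sample record in listing \ref{lst:sampleolac}; the namespace URIs are assumed to be those published by the Open Archives Initiative and DCMI, and the surrounding OAI-PMH response envelope is omitted.

\begin{lstlisting}[language=XML, caption={Sketch of a Dublin Core record as exchanged via OAI-PMH}]
<!-- Sketch only: values reused from the sample OLAC record -->
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Language</dc:title>
  <dc:creator>Bloomfield, Leonard</dc:creator>
  <dc:date>1933</dc:date>
  <dc:publisher>New York: Holt</dc:publisher>
</oai_dc:dc>
\end{lstlisting}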
152
153	There are multiple possible serializations; in particular, a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
154	Worth noting is Dublin Core's take on classification of resources\furl{http://dublincore.org/documents/resource-typelist/}.
155
156	The simplicity of the format is also its main drawback when considered as a metadata format for research communities. It is too general to capture all the specific details that individual research groups need when describing different kinds of resources.
157
158	\subsection{OLAC}
159	\label{def:OLAC}
160
161	The \xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, which provides a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.
162
163	The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field}, \code{role}, \code{linguistic-type}, \code{language}, \code{discourse-type}).
164
165	\begin{quotation}
166	Uniform description across archives is ensured by limiting the values of certain metadata elements to the use of terms from agreed-upon controlled vocabularies. [\dots] OLAC adds encoding schemes that are designed specifically for describing language resources, such as subject language and linguistic data type.
167	\end{quotation}
168
169	\lstset{language=XML}
170	\begin{lstlisting}[label=lst:sampleolac, caption=Sample OLAC record]
171	<olac:olac>
172	<creator>Bloomfield, Leonard</creator>
173	<date>1933</date>
174	<title>Language</title>
175	<publisher>New York: Holt</publisher>
176	</olac:olac>
177	\end{lstlisting}
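
The sample above shows only plain Dublin Core elements without namespace declarations. A sketch of how the OLAC extensions attach controlled vocabulary codes to such elements is given below; it assumes the namespaces and the \code{xsi:type} / \code{olac:code} mechanism of the OLAC 1.1 schema, and the chosen codes are illustrative.

\begin{lstlisting}[language=XML, caption={Sketch of an OLAC record using controlled vocabulary attributes}]
<!-- Sketch only: the codes are illustrative values -->
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Language</dc:title>
  <dc:creator>Bloomfield, Leonard</dc:creator>
  <!-- language of the resource, given as an ISO 639-3 code -->
  <dc:language xsi:type="olac:language" olac:code="eng"/>
  <!-- linguistic data type from the OLAC controlled vocabulary -->
  <dc:type xsi:type="olac:linguistic-type" olac:code="language_description"/>
</olac:olac>
\end{lstlisting}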
178
179	OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''.
180
181	Note that OLAC archives are harvested by the CLARIN harvester, and OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).
182
183
184
185	\subsection{TEI / teiHeader}
186	\label{def:tei}
187
188	\begin{quotation}
189	The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots [Next to] its chief deliverable[, which] is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abridged]
190	\end{quotation}
191
192	TEI is a de-facto standard for encoding any kind of digital textual resource and has been developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For text description and metadata encoding (our main concern), the complex top-level element \code{teiHeader} is provided. TEI is descriptive rather than prescriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and their inner structure, so that custom schemas adapted to a project's needs can be generated. Thus there is also not just one fixed \code{teiHeader}.
193
194	Some of the data collections encoded in TEI are the corpora of the DWDS\furl{http://www.dwds.de}, the Deutsches Textarchiv\furl{http://www.dwds.de/dta} \cite{Geyken2011deutsches} and the Oxford Text Archive\furl{http://ota.oucs.ox.ac.uk/}.
195
196	There has been intense cooperation between the TEI and CMDI communities on the issue of interoperability, and multiple efforts to express teiHeader in CMDI were undertaken (cf. \ref{results:tei}) as a starting point for integrating TEI-based data into the CLARIN infrastructure.
197
198
199	\subsection{ISLE/IMDI -- The Language Archive}
200
201	\xne{IMDI}\furl{http://www.mpi.nl/imdi/} (\xne{EAGLES/ISLE Meta Data Initiative}) is an elaborate format for detailed descriptions of multi-media/multi-modal language resources, developed within the corresponding project \cite{wittenburg2000eagles} from 2000 to 2003.
202
203	To serve the main goal of the project, easing access to language resources and thereby fostering their reuse, resource descriptions in this new format were created for a number of collections and made available via a dedicated \xne{IMDI browser}\furl{http://corpus1.mpi.nl/ds/imdi_browser/}, which allowed browsing the collection structure as well as complex advanced search over the deeply structured metadata. A metadata editor was also developed for generating records in this format, with provisions for offline fieldwork and synchronization with the repository.
204
205	The project was led by the Technical Group at the MPI for Psycholinguistics, which was also responsible for running the repository and the whole infrastructure. Since its foundation (together with the Institute itself) in the 1970s\furl{http://tla.mpi.nl/home/history/}, the group has engaged in a number of projects aimed at building a stable technical infrastructure for long-term archiving of and work with language resources. Recently, the group and the established infrastructure have been renamed to \xne{TLA -- The Language Archive}\furl{http://tla.mpi.nl/} (``Your partner for language data, tools and archiving''), a single platform on which the host of language resources and their descriptions are preserved and provided, and tools for working with this data are offered. The archive is also an aggregator itself, offering various collections from different (also external) projects (like DOBES, CGN, RELISH, etc.).
206
207	IMDI can be seen as the predecessor of CMDI, with the team of the Technical Group being the driving force behind the development of both. An \xne{imdi-session} profile, the corresponding IMDI to CMDI conversion
208	as well as the transformed records were among the first to be added to the new CMD Infrastructure in 2010. The statistics
209	of CMDI records list around 138.000 \xne{Session} records and around 13.000 \xne{imdi-corpus} records, modelling the collections for the sessions. Also, the metadata editor \xne{Arbil} was refactored to work with the new data model.
210
211
212	\subsection{META-SHARE}
213	\label{def:META-SHARE}
214
215	META-SHARE was the subproject (2010--2013) of META-NET, a Network of Excellence consisting of 60 research centres from 34 countries; the subproject covered the technical aspects.
216
217
218	\begin{quotation}
219	META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
220
221	\end{quotation}
222
223	Within the META-SHARE project, a new metadata format was developed \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types, with a subset of core obligatory elements and many optional components.
224	%In cooperation between metadata teams from CLARIN and META-SHARE
225
226	The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as 4 CMD profiles, each for a distinct resource type, with all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements. (See \ref{reports-meta-share} for more details about the format based on its integration into CMDI.)
227
228	The technical infrastructure of META-SHARE is a distributed network of member repositories, each of which offers its own subset of resources\furl{http://www.meta-share.eu/}.
229
230	Selected member repositories\footnote{7 as of 2013-07} play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
231	The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes).
232
233	One point of criticism from the community was the fact that the META-SHARE infrastructure does not provide any interface to the outside world, such as an OAI-PMH endpoint.
234
235	%? MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
236
237
238	\subsection{ELRA}
239
240	The European Language Resources Association\furl{http://elra.info} (ELRA) offers a large collection of language resources (over 1.100) with a focus on spoken resources, but also written, terminological and multimodal resources, mostly under license for a fee (although selected resources are available for free as well).
241	The available datasets can be searched via the ELRA Catalog\furl{http://catalog.elra.info/}.
242	Additionally, ELRA runs the so-called \xne{Universal Catalog} -- a repository comprising information regarding Language Resources (LRs) identified all over the world.
243
244	\begin{quotation}
245	ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies.
246
247	ELDA\furl{http://www.elda.org/} - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community.
248
249	ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and
250	drafts and concludes distribution agreements on behalf of ELRA.
251	\end{quotation}
252
253	\subsection{LDC}
254
255	The Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/}, hosted by the University of Pennsylvania, is another provider/aggregator of high-quality curated language resources. The data is provided for a fee; more than 650 resources have been made available since 1993. The catalog is freely accessible. The metadata is additionally aggregated by OLAC archives.
256
257	\section{Formats and Collections in the World of Libraries}
258	\label{sec:lib-formats}
259
260	There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience, and the fact that almost all of the resources in libraries are language resources. This argument becomes even more relevant in the light of the efforts, pursued in many (national) libraries in recent years, to digitize large portions of the material (cf. the discussion on libraries partnering with Google). And given the amounts of data, even the bibliographic records alone constitute sizable language resources in their own right.
261
262	%\item[LoC] Library of Congress \url{http://www.loc.gov}
263	%\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_ en.htm}
264	%\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
265	%\end{description}
266
267	\subsection{Formats -- MARC, METS, MODS}
268
269	There is a long tradition of standardized metadata formats in the world of Libraries and Information Systems (LIS), with the Library of Congress\furl{http://www.loc.gov/standards/} having played a major role in standardization for decades.
270
271	The \xne{MARC}\furl{www.loc.gov/marc/} set of formats (in use since the 1970s) ``are standards for the representation and communication of bibliographic and related information in machine-readable form''. A number of variants have developed over the years; the most widespread, \xne{MARC 21} (since 1999), is the standard format used for communication among libraries around the world.
272
273	MARC 21 consists of 5 ``communication formats'' for specific types of data (Bibliographic, Authority Data, Holdings Data, Classification, and Community Information), which are widely used standards for the representation and exchange of these data in machine-readable form. In 2002, the Library of Congress developed the \xne{MARCXML} schema for representing MARC records in XML.
274
275	\xne{METS -- Metadata Encoding and Transmission Standard} is a format from the family of Library of Congress standards (since 2001) for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
276	It is dedicated primarily to capture the structure of the digital objects, ``record the various relationships that exist between pieces of content, and between the content and metadata that compose a digital library object'' \cite{mets2010manual}.
277	A METS record acts as a flexible container that accommodates other pieces of data (different levels of metadata and encoded objects themselves or references to those) in external formats\furl{http://www.loc.gov/standards/mets/mets-extenders.html}.
278
279	A number of tools have been developed to author and process the \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html}, and numerous projects (online editions, DAM systems) use METS for structuring and recording their data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html}, although this registry seems rather outdated}, among others \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}.
280
281	\xne{Metadata Object Description Schema} (MODS) ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21 that uses language-based tags rather than numeric ones
282	and is richer than Dublin Core. It is one of the schemas endorsed for use inside METS.
283
284	There have been efforts to create a conceptually sounder base for bibliographic data. In 1998, \xne{Functional Requirements for Bibliographic Records} (FRBR) \cite{FRBR1998} was published, an abstract model for the data expressed as an Entity Relationship Model. A standard based on FRBR, \xne{Resource Description and Access} (RDA), has been proposed as a comprehensive standard for resource description and discovery; it was, however, confronted with opposition from the LIS community, which questioned the need to abandon established cataloging practices \cite{gorman2007rda}.
285	And although work on RDA continues, among others at the Library of Congress, there has been no wider adoption of the standard by the LIS community until now.
286
287	\subsection{ESE, Europeana Data Model - EDM}
288
289	Within the large European initiative \xne{Europeana} (cf. \ref{lit:digi-lib}), information about digitised objects is collected from a great number of cultural institutions from all over Europe; it currently hosts information about 29 million objects from 2.200 institutions from 36 countries\furl{http://www.pro.europeana.eu/web/guest/content}.
290
291	For collecting metadata from the content providers, Europeana originally developed and recommended the common format \xne{ESE -- Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation}, a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. It soon became obvious that this format was too limiting, and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model (EDM)\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana,haslhofer2011data,doerr2010europeana}.
292	EDM is fully compatible with ESE, which is (and will be) accepted from the providers. There is also already a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the Europeana data in the new format.
293	%https://github.com/europeana
294
295	%%%%%%%%%%%%%%%%%%
296	\section{Controlled Vocabularies, Reference Data, Ontologies}
297	\label{refdata}
298
299	Since one goal of this work is to lay the groundwork for exposing the discussed dataset in the Semantic Web,
300	one preparatory task is to identify external semantic resources like controlled vocabularies or ontologies that the dataset could be linked with\footnote{A similar activity of inventorying vocabularies and thesauri was conducted in the context of the \xne{Europeana} initiative:
301	\url{http://europeanalabs.eu/wiki/WP12Vocabularies}, \url{https://europeanalabs.eu/wiki/DesignSemanticThesauri}}.
302
303	Conceptually, we want to partition these resources into two types: on the one hand, abstract concepts constituting all kinds of classifications, typologies and taxonomies; on the other hand, named entities that exist(ed) in the real world, like persons, organizations or geographical places. The main motivation for this distinction is the insight that, while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (sameAs), for concepts we need to accept a plurality of existing conceptualizations, and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.
304
305	In the following we inventory such resources (cf. tables \ref{table:data-ne}, \ref{table:data-concepts}) covering the domains expected to be needed for linking the original dataset. (Information about the size of a dataset is meant as a rough indication of its ``general weight'', not necessarily as precise, up-to-date information.) The acronyms in the tables are resolved in the glossary \ref{table:vocab-glossary}.
306	How these resources will be employed is discussed in \ref{sec:values2entities}.
307	Additionally, some verbose commentary follows.
308
309	%\subsubsection{Named entities}
310
311	The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called Virtual International Authority File, a huge resource, with entries from different authority files referring to the same entity being merged. This resource can be explored via a search interface and there is also a search service for applications.
312	Other general large-scale resources are the vocabularies curated and provided by the Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}; however, free access is limited, and full access requires a license and a fee. Recently, though, work was announced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}.
313
314	Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology and developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study \cite{Joerg2010}.
315
316	Also worth mentioning is \xne{Yago}, a large knowledge base created by MPI Informatik that integrates DBpedia, GeoNames and WordNet\furl{http://www.mpi-inf.mpg.de/yago-naga/yago/} \cite{Suchanek2007yago}.
317
318	So we witness a strong general trend towards Semantic Web and Linked Open Data.
319
320	%Next to these ``global big players'' there are a number of other initiatives on different scale dedicated to a more specific domain.
321
322	%Resources that contain different types of data (e.g. persons, places and classifications like GND or Yago) are divided and mentioned in individual tables by type.
323
324	%\subsection{Concepts -- Classifications, Taxonomies, \dots}
325
326
327	\begin{comment}
328
329	VoID "Vocabulary of Interlinked Datasets") is an RDF based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID}
330
331	\subsection{schema.org}
332	http://schema.org/docs/datamodel.html
333	http://www.w3.org/wiki/WebSchemas/ExternalEnumerations
334
335	microdata or
336	http://www.w3.org/TR/rdfa-lite/
337	Resource Description Framework in attributes
338
339	the entire WorldCat cataloging collection made publicly
340	available using Schema.org mark-up with library extensions for use by developers and
341	search partners such as Bing, Google, Yahoo! and Yandex
342
343	OCLC begins adding linked data to WorldCat by appending
344	Schema.org descriptive mark-up to WorldCat.org pages, thereby
345	making OCLC member library data available for use by intelligent
346	Web crawlers such as Google and Bing
347
348	\end{comment}
349
350	\section{Summary}
351
352	In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications), needed as input in section \ref{sec:values2entities} about mapping values to entities.
354
355
356
357	\begin{landscape}
358	\begin{table}
359	\caption{Controlled vocabularies of named entities -- Persons, Organizations, Works, Language Names, Geographica}
360	\label{table:data-ne}
361	% \begin{tabu}{ p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} }
362	\begin{tabu}{ >{\sffamily}l l r X X}
363	\hline
364	\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\
365	\hline
366	VIAF & OCLC + NatLibs & $\gg$ 1E7 & union of national authority files & search service, search app \\
367	GND/p & DNB & 4.6E6 & Persons, universal, lang:de & \href{http://d-nb.info/standards/elementset/gnd}{GND ontology}\\
368	GND/k & '' & 1.2E6 & Organizations, universal, lang:de & \\
369	GND/w & '' & 193,000 & Works, lang:de & \\
GND/g & '' & 293,000 & Geographica, lang:de & \\
371	ULAN & Getty & 202,720 / 638,900 & persons, artists & \\
TGN & Getty & 992,310 / 1.7E6 & also historical place names & \href{http://www.getty.edu/research/tools/vocabularies/index.html}{web search} \\
373	%CONA & Getty & & records for cultural works & \\
374	dbpedia & Wikipedia & $\sim$ 4E6 & all kinds of entities in up to 111 langs & \href{http://wiki.dbpedia.org/Downloads}{data dumps}, \href{http://dbpedia-live.openlinksw.com/sparql}{live SPARQL endpoint} \\
375	& & \multicolumn{3}{l}{764,000 persons; 333,000 works; 192,000 organizations; 639,000 geographica } \\
376	Yago \cite{Suchanek2007yago} & MPI Informatik & 1E7 / 1.2E8 & huge semantic KB (aggregated from Wikipedia, Wordnet, Geonames) & \href{http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html}{data dumps} \\
\href{http://lt-world.de}{LT-World} & DFKI & 3,300 persons & ontology-based portal for LRT & \href{http://www.lt-world.org/kb/}{portal} \\
 & & 4,600 organizations & & \\
Geonames & Geonames & \textgreater 1E7 (2.8E6 / 5.5E6) & ``modern'' place names & data dump + web service \\
380	PKND & prometheus & \textgreater 37,000 & persons, artists & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{XML dump} \\
381	\href{http://gazetteer.dainst.org/}{iDAI.gazetteer} & DAI & & archaeologically relevant places & search interface \\
382	%Pelagios & AIT & 25 datasets & search over 25 datasets of archeologically relevant places & API\furl{https://github.com/pelagios/pelagios-cookbook/wiki/Using-the-Pelagios-API} \\
\href{http://pleiades.stoa.org}{Pleiades} & & 34,000 & a community-built gazetteer and graph of ancient places & CSV, KML and RDF data dumps \\
384	LCCN & LoC & \textgreater 1.2E7 & identifier for bibliographic records & \href{http://authorities.loc.gov/}{search service}, search app \\
385	ISO 3166 & ISO & 249 & Official country codes, lang: en, fr & \\
386	ISO-639-1& ISO & 185 & basic language codes & \href{http://www.loc.gov/standards/iso639-2/php/English_list.php}{static list} \\
ISO-639-3 & SIL & $\sim$ 7,679 & 3-letter code for every human language & \href{http://www-01.sil.org/iso639-3/}{view/download} \\
CLAVAS & CLARIN & 2,500 & organization names extracted from CMD records & \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\
389	\hline
390	\end{tabu}
391	\end{table}
392
393	\begin{comment}
394	\hline
395	\end{tabu}
396	\end{table}
397
398	\begin{table}
399	\caption{Controlled vocabularies of named entities -- Geographica}
400	\label{table:data-ne-places}
401
402	% \begin{tabu}{ p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} }
403	\begin{tabu}{ >{\sffamily}l l r X X}
404	\hline
405	\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\
406
407	\end{comment}
408
409
410	\begin{table}
411	\caption{Taxonomies, Classifications, Thesauri}
412	\label{table:data-concepts}
413	\begin{tabu}{ >{\sffamily}l l r X X}
414	\hline
415	\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\
416	\hline
417	AAT & Getty & \href{http://www.getty.edu/research/tools/vocabularies/aat/aat_faq.html}{34,880 / 245,530} & subjects in art and architecture & \\
418	LCSH & LoC & & subjects, universal & \href{http://fast.oclc.org/searchfast/}{FAST} (Faceted Application of Subject Terminology), \href{http://experimental.worldcat.org/fast/}{Linked Data FAST} \\
419	LCC & LoC & & universal hierarchical classification & web app: \href{http://classificationweb.net/}{classification web} \\
GND/s & DNB & 202,000 & subjects (Schlagwörter), universal, lang:de & \\
GTAA & NISL & 3,800 & Subjects, describing TV programs & \href{http://datahub.io/de/dataset/gemeenschappelijke-thesaurus-audiovisuele-archieven}{(RDF) data dumps}, \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\
422	DDC & OCLC & & universal classification by field of study, multi langs & \href{http://dewey.info/}{dewey.info} \\
423	UDC & & & & \\
424	Wiki Categories & Wikipedia & 995,911& classification of Wiki articles as skos:Concepts & SKOS Vocabulary, SPARQL \\
425	DBpedia Ontology & Wikipedia & 529 / 2333 & general classification of Wiki articles as ontology & \href{http://wiki.dbpedia.org/Ontology39?v=g9b}{RDF data}, SPARQL\\
426	ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts & \href{http://www.isocat.org}{web-app}, service \\
427	Object Names Thes. & British Museum & & classification of objects in the collection & \\
428	Material Thes. & British Museum & & classification of material & \\
429	Thes. Monument Types & British Museum & & types of monuments & \\
430	Hornbostel-Sachs-Systematik & & 300 categories & classification of musical instruments & \href{http://www.music.vt.edu/musicdictionary/texth/Hornbostel-Sachs.html}{web page} \\
431	Oberbegriffsdatei & DMB & & a set of vocabularies for museums, lang:de & \url{museumsvokabular.de}, PDF, XML dumps\\
432	Iconclass & RKD & 28,000 & taxonomy of subject of an image & \href{http://iconclass.org/data/iconclass.20121019.nt.gz}{RDF dump} \\
433	\href{http://dirt.projectbamboo.org/}{DiRT} & Project Bamboo & 32 categories & taxonomy of research tools (1,200 tools) & \\
434	%Scholarly Methods Taxonomy & DARIAH & 100 & research activities in a 2-level hierarchy and brief scope notes & in preparation \\
435	\hline
436	\end{tabu}
437	\end{table}
438
439	\end{landscape}
440
441
442
443	\begin{table}
444	\caption{Glossary of acronyms used in the overview of controlled vocabularies (tables \ref{table:data-ne}, \ref{table:data-concepts}) }
445	\label{table:vocab-glossary}
446
447	% \begin{tabu}{ >{\sffamily}l p{0.8\textwidth}
448	\begin{tabular}{ >{\sffamily}l p{0.8\textwidth}}
449	% \hline
450	%\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\
451	% \hline
452
AAT & Art \& Architecture Thesaurus, Getty \\
454	CONA & Cultural Objects Name Authority \\
DAI & Deutsches Archäologisches Institut \\
456	DDC & Dewey Decimal Classification \\
DFKI & Deutsches Forschungszentrum für Künstliche Intelligenz \\
458	DMB & Deutscher Museumsbund \\
DNB & Deutsche Nationalbibliothek \\
460	FAST & Faceted Application of Subject Terminology \\
461	Getty & Getty Research Institute curating the \href{http://www.getty.edu/research/tools/vocabularies/index.html}{vocabularies}, part of Getty Trust \\
GND & \emph{Gemeinsame Normdatei} -- Integrated Authority File of the German National Library \\
GTAA & Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus for Audiovisual Archives) \\
464	% {quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation} \\
ISO & International Organization for Standardization \\
466	LCCN & Library of Congress Control Number \\
467	LCC & Library of Congress Classification \\
468	LCSH & Library of Congress Subject Headings \\
469	LoC & Library of Congress\furl{http://loc.gov} \\
470	OCLC & Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation \\
PKND & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{prometheus} KünstlerNamensansetzungsDatei\\
472	RKD & Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History \\
473	TGN & Getty Thesaurus of Geographic Names \\
474	UDC & Universal Decimal Classification \\
475	ULAN & Union List of Artist Names \\
476	VIAF & Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries \\
477	\end{tabular}
478	\end{table}
479
