source: SMC4LRT/chapters/Data.tex @ 3681

2	\chapter{Analysis of the data landscape}
3	\label{ch:data}
This chapter gives an overview of existing metadata standards and formats in the field of Language Resources and Technology, together with a description of their characteristics and their usage in the relevant initiatives and data collections. Special attention is paid to the Component Metadata Framework, which is the base data model of the infrastructure this work is part of.
5
6
7	\section{Component Metadata Framework}
8	\label{def:CMD}
9
10	The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML-schema defining CMD -- the \xne{general-component-schema} -- is featured in appendix \ref{lst:cmd-schema}.)
CMD is used to define so-called \var{profiles}, which are constructed from reusable \var{components} -- collections of metadata fields. Components can contain other components and can be reused in multiple profiles. A profile is itself just a special kind of component (a subclass) with some additional administrative information.
The core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category\footnote{persistently referenceable concept definition} (cf. \ref{def:DCR}), thus
indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.
14
15	%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
16
17	While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.
18
Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records.
The generated schema also carries, as annotations, the information about the referenced data categories.
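
As a minimal sketch of this transformation, the following listing contrasts an element as it might be declared in a component specification with the corresponding declaration in the generated schema. The attribute names \code{ConceptLink} and \code{dcr:datcat}, the element name \code{Title} and the data category identifier \code{DC-0000} are illustrative assumptions; they should be checked against the \xne{general-component-schema} (appendix \ref{lst:cmd-schema}).

\begin{lstlisting}[language=XML, caption={Sketch of a CMD element and its generated schema declaration (names are assumptions)}]
<!-- component specification: the element points to exactly one
     data category (hypothetical identifier DC-0000) -->
<CMD_Element name="Title" ValueScheme="string"
    ConceptLink="http://www.isocat.org/datcat/DC-0000"
    CardinalityMin="1" CardinalityMax="1"/>

<!-- generated XML Schema: the reference is kept as an annotation
     on the element declaration (attribute name assumed) -->
<xs:element name="Title" type="xs:string"
    dcr:datcat="http://www.isocat.org/datcat/DC-0000"
    minOccurs="1" maxOccurs="1"/>
\end{lstlisting}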
21
22
23	\subsection{CMD Profiles }
The Component Registry (CR) currently holds 124\footnote{All numbers are as of 2013-06 unless stated otherwise.} public profiles and 696 components. Table \ref{table:dev_profiles} shows the development of the CR and DCR population over time.
25
Next to the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, such as OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles demonstrate the flexibility and expressivity of the CMD metamodel. The individual profiles also differ considerably in their structure: next to flat profiles with just one level of components or elements and 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes} profiles), there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements
(when expanded\footnote{The reusability of components results in an element expansion, i.e., the elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).
28
29
30	\begin{table}
31	\caption{The development of defined profiles and DCs over time}
32	\label{table:dev_profiles}
33	% \begin{tabular}{ l \| r \| r \| r \| r }
34	\begin{tabular}{ l r r r r }
35
36	\hline
37	date & 2011-01 & 2012-06 & 2013-01 & 2013-06 \\
38	\hline
39	Profiles & 40 & 53 & 87 & 124 \\
40	Distinct Components & 164 & 298 & 542 & 828 \\
41	Expanded Components & 1055 & 1536 & 2904 & 5757 \\
42	Distinct Elements & 511 & 893 & 1505 & 2399 \\
43	Expanded Elements & 1971 & 3030 & 5754 & 13232 \\
44	Distinct data categories & 203 & 266 & 436 & 499 \\
45	Data categories in the Metadata profile & 277 & 712 & 774 & 791 \\
Ratio of elements without DCs & 24.7\% & 17.6\% & 21.5\% & 26.5\% \\
47	Components with DCs & 28 & 67 & 115 & 140 \\
48
49	\hline
50	\end{tabular}
51	\end{table}
52
53
54	\subsection{Instance Data}
55
56	%\todoin{ add historical perspective on data - list overall}
57
58	The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
collects records from 69 providers on a daily basis. The complete dataset amounts to 540,065 records.
Sixteen of the providers offer CMDI records; the other 53 provide OLAC/DC records\label{info:olac-records}, which are converted into the corresponding CMD profile after harvesting. Next to these 81,226 original OLAC records, a few providers offer their OLAC or DCMI-terms records already converted into CMDI, so that OLAC and DCMI-terms records together amount to 139,152.
On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different profiles). We thus encounter both situations: one profile being used by many providers, and one provider using many profiles.
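
To illustrate the harvesting step, the following sketch shows the skeleton of an OAI-PMH \code{ListRecords} response as a harvester would receive it from a provider of Dublin Core records. The endpoint \code{http://example.org/oai}, the identifier, the timestamps and the record content are hypothetical; only the envelope follows the OAI-PMH 2.0 protocol.

\begin{lstlisting}[language=XML, caption={Sketch of an OAI-PMH ListRecords response (hypothetical provider)}]
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2013-06-01T00:00:00Z</responseDate>
  <!-- echoes the request: verb and requested metadata format -->
  <request verb="ListRecords"
           metadataPrefix="oai_dc">http://example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:record-1</identifier>
        <datestamp>2013-05-31</datestamp>
      </header>
      <metadata>
        <!-- the obligatory Dublin Core base format -->
        <oai_dc:dc
            xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hypothetical resource</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
\end{lstlisting}

A provider of `native' CMD records would answer the same kind of request with a different \code{metadataPrefix}; the concrete prefix name depends on the provider's configuration and is not assumed here.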
62
63
64	\begin{table}
65	\caption{Top 20 CMD profiles, with the respective number of records}
66	\label{tab:cmd-profiles}
67	\begin{center}
68	\begin{tabu}{ r l }
69	\hline
70	\rowfont{\itshape\small} \# records & profile \\
71	\hline
155,403 & Song \\
138,257 & Session \\
92,996 & OLAC-DcmiTerms \\
46,156 & DcmiTerms \\
28,448 & SongScan \\
21,256 & SourceScan \\
19,059 & LiteraryCorpusProfile \\
16,519 & Source \\
13,626 & imdi-corpus \\
10,610 & media-session-profile \\
7,961 & SongAudio \\
7,557 & SymbolicMusicNotation \\
4,485 & LCC DataProviderProfile \\
4,485 & SourceProfile \\
4,417 & Text \\
1,982 & Soundbites-recording \\
1,530 & Performer \\
1,475 & ArthurianFiction \\
939 & LrtInventoryResource \\
873 & teiHeader \\
92	\hline
93	\end{tabu}
94	\end{center}
95	\end{table}
96
97	\begin{table}
98	\caption{Top 20 CMD collections, with the respective number of records}
99	\begin{center}
100	\begin{tabu}{ r l }
101	\hline
\rowfont{\itshape\small} \# records & collection \\
\hline
243,129 & Meertens collection: Liederenbank \\
46,658 & DK-CLARIN Repository \\
46,156 & Nederlands Instituut voor Beeld en Geluid Academia collectie \\
29,266 & childes \\
24,583 & DoBeS archive \\
23,185 & Language and Cognition \\
14,593 & talkbank \\
14,363 & Acquisition \\
14,320 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
12,893 & MPI CGN \\
10,628 & Bavarian Archive for Speech Signals (BAS) \\
7,964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures\\
7,348 & WALS RefDB \\
5,689 & Lund Corpora \\
4,640 & Oxford Text Archive \\
4,492 & Leipzig Corpora Collection \\
3,539 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
3,280 & A Digital Archive of Research Papers in Computational Linguistics \\
3,147 & CLARIN NL \\
3,081 & MPI für Bildungsforschung \\
124	\hline
125	\end{tabu}
126	\end{center}
127	\end{table}
128
We can also observe a large disparity in the number of records between individual providers and profiles. Almost half of all records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by the MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from \textit{The Language Archive}). On the other hand, there are 25 profiles that have fewer than 10 instances. This may be due both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource).
130
131
132
133	\section{Other LRT Metadata Formats and Collections }
134	\label{sec:lrt-md-catalogs}
135
Next to CLARIN and CMDI, there is a host of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, sketch their ties to CMDI and existing integration efforts.
137
Two works provide an overview of existing formats. The CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} arranges the overwhelming number of existing metadata standards into a systematic overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. Despite its aspiration to comprehensiveness, it leaves out some of the formats relevant in the context of this work, such as IMDI, EDM and ESE. % TODO (open in the draft): check whether TEI is covered
139
140
141	\subsection{Dublin Core metadata terms}
Work on this metadata format started in 1995 at the Metadata Workshop\furl{http://dublincore.org/workshops/dc1/} organized by OCLC/NCSA in Dublin, Ohio, USA. It is now maintained by the Dublin Core Metadata Initiative.

It is a fixed set of terms for a basic generic description of a range of resources (both virtual and physical), and it comes in two versions\furl{http://dublincore.org/documents/dcmi-terms/}:
145	\begin{description}
\item[Dublin Core Metadata Element Set (DCMES) ] namespace: \code{/elements/1.1/}\\
the original set of 15 terms, standardized as IETF RFC 5013, ISO Standard 15836:2009 and NISO Standard Z39.85-2007
\item[Dublin Core metadata terms ] namespace: \code{/terms/} \\
the extended `Qualified' set of 55 terms, which extends the original 15 (replicating them in the new namespace for consistency)
150	\end{description}
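
The difference between the two sets shows up mainly in the namespace of the elements. The following sketch describes a hypothetical resource with elements from both namespaces; the wrapper element \code{record} and all values are invented for illustration, only the two namespace URIs are the ones defined by DCMI.

\begin{lstlisting}[language=XML, caption={Sketch of a description mixing both Dublin Core namespaces (hypothetical record)}]
<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:dcterms="http://purl.org/dc/terms/">
  <!-- two of the 15 original elements -->
  <dc:title>Hypothetical corpus of spoken language</dc:title>
  <dc:date>2013</dc:date>
  <!-- terms that exist only in the extended set -->
  <dcterms:created>2013-06-01</dcterms:created>
  <dcterms:isPartOf>Hypothetical collection</dcterms:isPartOf>
</record>
\end{lstlisting}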
151
Today, Dublin Core is very widely used. Thanks to its simplicity, it serves as the common denominator in many applications: content management systems use Dublin Core in the \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description in content repositories (Fedora Commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
153
There are multiple possible serializations; in particular, a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
155	Worth noting is Dublin Core's take on classification of resources\furl{http://dublincore.org/documents/resource-typelist/}.
156
The simplicity of the format is also its main drawback when it is considered as a metadata format for the research communities. It is too general to capture all the specific details that individual research groups need for describing different kinds of resources.
158
159	\subsection{OLAC}
160	\label{def:OLAC}
161
The \xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, which provides a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.

The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field, linguistic-type, language, role, discourse-type}):
165
166	\begin{quotation}
167	Uniform description across archives is ensured by limiting the values of certain metadata elements to the use of terms from agreed-upon controlled vocabularies. [\dots] OLAC adds encoding schemes that are designed specifically for describing language resources, such as subject language and linguistic data type.
168	\end{quotation}
169
170	\lstset{language=XML}
\begin{lstlisting}[label=lst:sampleolac, caption=Sample OLAC record]
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
           xmlns="http://purl.org/dc/elements/1.1/">
  <!-- unprefixed elements are in the Dublin Core namespace -->
  <creator>Bloomfield, Leonard</creator>
  <date>1933</date>
  <title>Language</title>
  <publisher>New York: Holt</publisher>
</olac:olac>
\end{lstlisting}
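
The following sketch shows how the OLAC-specific attributes refine plain Dublin Core elements with values from the controlled vocabularies. The described resource and the person name are hypothetical, and the vocabulary values (\code{lexicon}, \code{author}, the language code \code{deu}) are given as assumptions that should be checked against the OLAC schema.

\begin{lstlisting}[language=XML, caption={Sketch of an OLAC record using controlled vocabularies (hypothetical resource)}]
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Hypothetical lexicon of German</dc:title>
  <!-- subject language, identified by a language code -->
  <dc:subject xsi:type="olac:language" olac:code="deu"/>
  <!-- linguistic data type of the resource -->
  <dc:type xsi:type="olac:linguistic-type" olac:code="lexicon"/>
  <!-- role of the contributor -->
  <dc:contributor xsi:type="olac:role"
                  olac:code="author">Doe, Jane</dc:contributor>
</olac:olac>
\end{lstlisting}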
179
180	OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''.
181
Note that OLAC archives are harvested by the CLARIN harvester and that OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).
183
184
185
186	\subsection{TEI / teiHeader}
187	\label{def:tei}
188
189	\begin{quotation}
The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. \dots Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. \dots [T]he Consortium provides a variety of TEI-related resources, training events and software. [abridged]
191	\end{quotation}
192
TEI is a de facto standard for encoding any kind of digital textual resource and has been developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For describing the text, i.e. for encoding metadata (our main concern here), the complex top-level element \code{teiHeader} is provided. TEI is descriptive rather than prescriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and their inner structure, so that custom schemas adapted to the needs of individual projects can be generated. Thus there is also not just one fixed \code{teiHeader}.
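
As a rough orientation, the following sketch shows about the smallest header that the TEI P5 Guidelines allow: a file description with a title statement, a publication statement and a source description. The textual content is invented for illustration; real headers in the collections discussed here are considerably richer.

\begin{lstlisting}[language=XML, caption={Sketch of a minimal teiHeader (hypothetical content)}]
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <!-- bibliographic title of the electronic text -->
    <titleStmt>
      <title>Hypothetical text: a digital edition</title>
    </titleStmt>
    <!-- who makes the text available, and how -->
    <publicationStmt>
      <p>Published for illustration only.</p>
    </publicationStmt>
    <!-- the source the electronic text derives from -->
    <sourceDesc>
      <p>Born digital, no prior source.</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>
\end{lstlisting}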
194
Some of the data collections encoded in TEI are the corpora of the DWDS\furl{http://www.dwds.de}, the Deutsches Textarchiv\furl{http://www.dwds.de/dta} \cite{Geyken2011deutsches} and the Oxford Text Archive\furl{http://ota.oucs.ox.ac.uk/}.
196
There has been intense cooperation between the TEI and CMDI communities on the issue of interoperability, and multiple efforts to express teiHeader in CMDI have been undertaken (cf. \ref{results:tei}) as a starting point for integrating TEI-based data into the CLARIN infrastructure.
198
199
200	\subsection{ISLE/IMDI -- The Language Archive}
201
\xne{IMDI}\furl{http://www.mpi.nl/imdi/} (\xne{EAGLES/ISLE Meta Data Initiative}) is an elaborate format for detailed descriptions of multi-media/multi-modal language resources, developed within the corresponding project \cite{wittenburg2000eagles} from 2000 to 2003.
203
To serve the main goal of the project, easing access to language resources and thereby fostering their reuse, resource descriptions in this new format were created for a number of collections and made available via a dedicated \xne{IMDI browser}\furl{http://corpus1.mpi.nl/ds/imdi_browser/}, which allowed browsing the collection structure as well as complex advanced searches over the deeply structured metadata. A metadata editor was also developed for generating records in this format, with provisions for offline fieldwork and synchronization with the repository.
205
The project was led by the Technical Group (TG) at the MPI for Psycholinguistics, which was also responsible for running the repository and the whole infrastructure. Since its foundation (together with the Institute itself) in the 1970s\furl{http://tla.mpi.nl/home/history/}, the group has engaged in a number of projects aimed at building a stable technical infrastructure for the long-term archiving of and work with language resources. Recently, the group and the established infrastructure have been renamed \xne{TLA -- The Language Archive}\furl{http://tla.mpi.nl/} (``Your partner for language data, tools and archiving''), a single platform on which the host of language resources and their descriptions are preserved and provided, and on which tools for working with this data are offered. The archive is also an aggregator itself, offering various collections from different (also external) projects (such as DOBES, CGN, RELISH, etc.).
207
IMDI can be seen as the predecessor of CMDI, the team of the TG being the driving force behind the development of both. An \xne{imdi-session} profile, the corresponding IMDI to CMDI conversion
as well as the transformed records were among the first to be added to the new CMD Infrastructure in 2010. The statistics
of CMDI records list around 138,000 \xne{Session} records and around 13,000 \xne{imdi-corpus} records, the latter modelling the collections for the sessions. The metadata editor \xne{Arbil} was also refactored to work with the new data model.
211
212
213	\subsection{META-SHARE}
214	\label{def:META-SHARE}
215
META-SHARE (2010--2013) was the subproject covering the technical aspects of META-NET, a Network of Excellence consisting of 60 research centres from 34 countries.
217
218
219	\begin{quotation}
220	META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
221
222	\end{quotation}
223
Within the META-SHARE project, a new metadata format was developed \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types, with a subset of core obligatory elements and many optional components.
225	%In cooperation between metadata teams from CLARIN and META-SHARE
226
The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as four CMD profiles, one for each resource type, with all four sharing most of their components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), so a specific instance will never use all the possible elements. (See \ref{reports-meta-share} for more details about the format based on its integration into CMDI.)
228
The technical infrastructure of META-SHARE is a distributed network consisting of a number of member repositories, each of which offers its own subset of resources\furl{http://www.meta-share.eu/}.
230
231	Selected member repositories\footnote{7 as of 2013-07} play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
The whole network offers approximately 2,000 resources (the numbers differ even across individual managing nodes).
233
One point of criticism from the community was that the META-SHARE infrastructure does not provide any interface to the outside world, such as an OAI-PMH endpoint.
235
See also the MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}.
237
238
239	\subsection{ELRA}
240
The European Language Resources Association (ELRA)\furl{http://elra.info} offers a large collection of language resources, mostly under license for a fee, although some resources are available for free as well.
The available datasets can be searched via the ELRA Catalog\furl{http://catalog.elra.info/}.
Additionally, ELRA runs the so-called \xne{Universal Catalog} -- a repository comprising information regarding Language Resources (LRs) identified all over the world.
244
245	\begin{quotation}
246	ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies.
247
248	ELDA\furl{http://www.elda.org/} - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community.
249
250	ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and
251	drafts and concludes distribution agreements on behalf of ELRA.
252	\end{quotation}
253
254	\subsection{LDC}
255
The Linguistic Data Consortium (LDC)\furl{http://www.ldc.upenn.edu/} is another provider of high-quality curated language resources.
257
258
259	\section{Formats and Collections in the World of Libraries}
260
There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition, which implies rich experience, and the fact that almost all of the resources in libraries are language resources. This argument becomes even more relevant in the light of the efforts, pursued in many (national) libraries in recent years, to digitize large portions of their material (cf. the discussion on libraries partnering with Google). And given the amounts of data, even the bibliographic records alone constitute sizable language resources in their own right.
262
263	%\item[LoC] Library of Congress \url{http://www.loc.gov}
264	%\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_ en.htm}
265	%\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
266	%\end{description}
267
268	\subsection{Formats -- MARC, METS, MODS}
269
There is a long tradition of standardized metadata formats in the world of Libraries and Information Systems (LIS), with the Library of Congress\furl{http://www.loc.gov/standards/} having played a major role in the standardization for decades.

The \xne{MARC}\furl{http://www.loc.gov/marc/} set of formats (in use since the 1970s) ``are standards for the representation and communication of bibliographic and related information in machine-readable form''. A number of variants have developed over the years; the most widespread is \xne{MARC 21} (since 1999), the standard format used for communication among libraries around the world.

MARC 21 consists of five ``communication formats'' for specific types of data (Bibliographic, Authority Data, Holdings Data, Classification, and Community Information), which are widely used for the representation and exchange of the respective data in machine-readable form. In 2002, the Library of Congress developed the \xne{MARCXML} schema for representing MARC records in XML.
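
To give an impression of the numeric tagging, the following sketch encodes the bibliographic data of listing \ref{lst:sampleolac} as a MARCXML record, using the standard bibliographic fields 100 (main entry, personal name), 245 (title statement) and 260 (publication). The leader and the control number are placeholder values chosen for illustration, not taken from an actual catalogue.

\begin{lstlisting}[language=XML, caption={Sketch of a MARCXML bibliographic record (placeholder leader and control number)}]
<record xmlns="http://www.loc.gov/MARC21/slim">
  <!-- fixed-length leader (24 characters), placeholder values -->
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">hypothetical-0001</controlfield>
  <!-- 100: main entry, personal name -->
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Bloomfield, Leonard</subfield>
  </datafield>
  <!-- 245: title statement -->
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Language</subfield>
  </datafield>
  <!-- 260: place, publisher and date of publication -->
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">New York :</subfield>
    <subfield code="b">Holt,</subfield>
    <subfield code="c">1933</subfield>
  </datafield>
</record>
\end{lstlisting}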
275
\xne{METS -- Metadata Encoding and Transmission Standard} is a format from the family of Library of Congress standards (since 2001) for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
It is dedicated primarily to capturing the structure of digital objects, to ``record the various relationships that exist between pieces of content, and between the content and metadata that compose a digital library object'' \cite{mets2010manual}.
A METS record acts as a flexible container that accommodates other pieces of data (different levels of metadata and the encoded objects themselves or references to them) in external formats\furl{http://www.loc.gov/standards/mets/mets-extenders.html}.
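
The container character of METS can be sketched as follows: a descriptive metadata section wraps a record in an external format, a file section lists the files of the object, and a structural map relates the two. All identifiers and the file location are hypothetical.

\begin{lstlisting}[language=XML, caption={Sketch of a METS record as a container (hypothetical identifiers)}]
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <!-- descriptive metadata, wrapped in an external format -->
  <dmdSec ID="DMD1">
    <mdWrap MDTYPE="MODS">
      <xmlData><!-- a MODS record would go here --></xmlData>
    </mdWrap>
  </dmdSec>
  <!-- the files that make up the digital object -->
  <fileSec>
    <fileGrp USE="master">
      <file ID="FILE1" MIMETYPE="image/tiff">
        <FLocat LOCTYPE="URL"
                xlink:href="http://example.org/page1.tif"/>
      </file>
    </fileGrp>
  </fileSec>
  <!-- the structure relating content and metadata -->
  <structMap TYPE="physical">
    <div TYPE="book" DMDID="DMD1">
      <div TYPE="page" ORDER="1">
        <fptr FILEID="FILE1"/>
      </div>
    </div>
  </structMap>
</mets>
\end{lstlisting}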
279
A number of tools have been developed to author and process the \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html}, and numerous projects (online editions, DAM systems) use METS for structuring and recording their data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html}; the registry seems rather outdated, though.}, among them \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}.
281
\xne{MODS -- Metadata Object Description Schema} ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21 that uses language-based tags rather than numeric ones
and is richer than Dublin Core. It is one of the schemas endorsed for extending (i.e. being used inside) METS.
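
For comparison with the numeric MARC tags, the following sketch expresses the bibliographic data of listing \ref{lst:sampleolac} in MODS. It is an illustration under the assumption of MODS version 3, not a record taken from an existing catalogue.

\begin{lstlisting}[language=XML, caption={Sketch of a MODS record with language-based tags}]
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Language</title>
  </titleInfo>
  <name type="personal">
    <namePart>Bloomfield, Leonard</namePart>
    <role>
      <roleTerm type="text">author</roleTerm>
    </role>
  </name>
  <originInfo>
    <place>
      <placeTerm type="text">New York</placeTerm>
    </place>
    <publisher>Holt</publisher>
    <dateIssued>1933</dateIssued>
  </originInfo>
</mods>
\end{lstlisting}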
284
In 1998, a new entity-relationship model was introduced: FRBR -- Functional Requirements for Bibliographic Records 2002 \cite{FRBR1998},
and since ?? RDA -- Resource Description and Access.
287
288	\subsection{ESE, Europeana Data Model - EDM}
289
Within the large European initiative \xne{Europeana} (cf. \ref{lit:digi-lib}), information about digitised objects is collected from a great number of cultural institutions from all over Europe.

Europeana originally developed and recommended the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation}, a Dublin Core-based application profile\furl{http://www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. It soon became obvious that this format is very limiting, and work started on a Semantic Web compatible, RDF-based format -- the Europeana Data Model (EDM)\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana, haslhofer2011data,doerr2010europeana}.
EDM is fully compatible with ESE, which is (and will continue to be) accepted from the providers. There is a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the semantic data of Europeana.
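
The RDF-based modelling of EDM can be sketched as follows: the described cultural heritage object and the aggregation delivered by a provider are separate resources that refer to each other. All URIs, the title and the provider name are hypothetical; the classes and properties are those of the EDM and ORE namespaces.

\begin{lstlisting}[language=XML, caption={Sketch of an EDM description in RDF/XML (hypothetical object)}]
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:edm="http://www.europeana.eu/schemas/edm/"
         xmlns:ore="http://www.openarchives.org/ore/terms/">
  <!-- the cultural heritage object itself -->
  <edm:ProvidedCHO rdf:about="http://example.org/object/1">
    <dc:title>Hypothetical digitised manuscript</dc:title>
    <edm:type>TEXT</edm:type>
  </edm:ProvidedCHO>
  <!-- what the provider contributes about the object -->
  <ore:Aggregation rdf:about="http://example.org/aggregation/1">
    <edm:aggregatedCHO rdf:resource="http://example.org/object/1"/>
    <edm:dataProvider>Hypothetical Library</edm:dataProvider>
    <edm:isShownAt rdf:resource="http://example.org/view/1"/>
  </ore:Aggregation>
</rdf:RDF>
\end{lstlisting}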
294	%https://github.com/europeana
295
296	%%%%%%%%%%%%%%%%%%
297	\section{Controlled Vocabularies, Reference Data, Ontologies}
298	\label{refdata}
299
Since one goal of this work is to lay the groundwork for exposing the discussed dataset in the Semantic Web,
one preparatory task is to identify external semantic resources, such as controlled vocabularies or ontologies, that the dataset could be linked with\footnote{A similar inventory of vocabularies and thesauri was compiled in the context of the \xne{Europeana} initiative:
\url{http://europeanalabs.eu/wiki/WP12Vocabularies}, \url{https://europeanalabs.eu/wiki/DesignSemanticThesauri}}.
303
Conceptually, we want to partition these resources into two types: on the one hand, abstract concepts constituting all kinds of classifications, typologies and taxonomies; on the other hand, named entities that exist(ed) in the real world, such as persons, organizations or geographical places. The main motivation for this distinction is the insight that for named entities there is (mostly) ``something'' in the (physical) world that gives solid ground for equivalence relations between references from different sources (sameAs), whereas for concepts we need to accept a plurality of existing conceptualizations: while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach. Simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.
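
The difference in the strength of the two kinds of relations can be sketched in RDF/XML. Expressing the weaker relation between concepts with the SKOS mapping property \code{skos:closeMatch} is an assumption made for illustration, not a decision taken in this work; all URIs are hypothetical.

\begin{lstlisting}[language=XML, caption={Sketch contrasting identity of entities with proximity of concepts (hypothetical URIs)}]
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <!-- named entity: both identifiers denote the same person -->
  <rdf:Description
      rdf:about="http://example.org/authority-a/person/42">
    <owl:sameAs
        rdf:resource="http://example.org/authority-b/person/0815"/>
  </rdf:Description>
  <!-- concepts: two conceptualizations that are merely close -->
  <skos:Concept rdf:about="http://example.org/taxonomy-a/lexicon">
    <skos:closeMatch
        rdf:resource="http://example.org/taxonomy-b/dictionary"/>
  </skos:Concept>
</rdf:RDF>
\end{lstlisting}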
305
In the following we take stock of such resources, covering the domains expected in the dataset. (Information about the size of a resource is meant as a rough indication of its ``general weight'', not necessarily as precise, up-to-date information.) The acronyms in the tables are resolved in the subsequent glossary.
How these resources will be employed is discussed in \ref{sec:values2entities}.
308
309	%\subsubsection{Named entities}
310
The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called Virtual International Authority File (VIAF), a huge resource in which entries from different authority files referring to the same entity are merged. This resource can be explored via a search interface, and there is also a search service for applications.
Other general large-scale resources are the vocabularies curated and provided by the Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}; however, free access is limited, and full access is licensed for a fee. Recently, though, work was announced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}.
313
Yago is a large knowledge base aggregating information from Wikipedia, WordNet and Geonames (cf. table \ref{table:data-ne}).
315
Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology that is being developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (institutions, persons, projects, tools, etc.) in this field of study \cite{Joerg2010}.
317
Overall, we witness a strong general trend towards the Semantic Web and Linked Open Data.
319
320	%Next to these ``global big players'' there are a number of other initiatives on different scale dedicated to a more specific domain.
321
322	%Resources that contain different types of data (e.g. persons, places and classifications like GND or Yago) are divided and mentioned in individual tables by type.
323
324	%\subsection{Concepts -- Classifications, Taxonomies, \dots}
325
326
327	\begin{landscape}
328	\begin{table}
329	\caption{Controlled vocabularies of named entities -- Persons, Organizations, Works, Language Names, Geographica}
330	\label{table:data-ne}
331	% \begin{tabu}{ p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} }
332	\begin{tabu}{ >{\sffamily}l l r X X}
333	\hline
334	\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\
335	\hline
336	VIAF & OCLC + NatLibs & $\gg$ 1E7 & union of national authority files & search service, search app \\
337	GND/p & DNB & 4.6E6 & Persons, universal, lang:de & \href{http://d-nb.info/standards/elementset/gnd}{GND ontology}\\
338	GND/k & '' & 1.2E6 & Organizations, universal, lang:de & \\
339	GND/w & '' & 193,000 & Works, lang:de & \\
GND/g & '' & 293,000 & Geographica, lang:de & \\
341	ULAN & Getty & 202,720 / 638,900 & persons, artists & \\
TGN & Getty & 992,310 / 1.7E6 & also historical place names & \href{http://www.getty.edu/research/tools/vocabularies/index.html}{web search} \\
343	%CONA & Getty & & records for cultural works & \\
344	dbpedia & Wikipedia & $\sim$ 4E6 & all kinds of entities in up to 111 langs & \href{http://wiki.dbpedia.org/Downloads}{data dumps}, \href{http://dbpedia-live.openlinksw.com/sparql}{live SPARQL endpoint} \\
345	& & \multicolumn{3}{l}{764,000 persons; 333,000 works; 192,000 organizations; 639,000 geographica } \\
346	Yago \cite{Suchanek2007yago} & MPI Informatik & 1E7 / 1.2E8 & huge semantic KB (aggregated from Wikipedia, Wordnet, Geonames) & \href{http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html}{data dumps} \\
\href{http://lt-world.de}{LT-World} & DFKI & 3,300 persons, 4,600 organizations & ontology-based portal for Language Technology & \href{http://www.lt-world.org/kb/}{portal} \\
Geonames & Geonames & \textgreater 1E7 (2.8E6 / 5.5E6) & ``modern'' place names & data dump + web service \\
349	PKND & prometheus & \textgreater 37,000 & persons, artists & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{XML dump} \\
350	\href{http://gazetteer.dainst.org/}{iDAI.gazetteer} & DAI & & archaeologically relevant places & search interface \\
351	%Pelagios & AIT & 25 datasets & search over 25 datasets of archeologically relevant places & API\furl{https://github.com/pelagios/pelagios-cookbook/wiki/Using-the-Pelagios-API} \\
352	\href{http://pleiades.stoa.org}{Pleiades} & & 34,000 & a community-built gazetteer and graph of ancient places & CSV, KML and RDF data dumps \\
353	LCCN & LoC & \textgreater 1.2E7 & identifier for bibliographic records & \href{http://authorities.loc.gov/}{search service}, search app \\
354	ISO 3166 & ISO & 249 & Official country codes, lang: en, fr & \\
355	ISO-639-1& ISO & 185 & basic language codes & \href{http://www.loc.gov/standards/iso639-2/php/English_list.php}{static list} \\
356	ISO-639-3 & SIL & $\sim$ 7,679 & 3-letter code for every human language & \href{http://www-01.sil.org/iso639-3/}{view/download} \\
357	CLAVAS & CLARIN & 2,500 & organization names extracted from CMD records & \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\
358	\hline
359	\end{tabu}
360	\end{table}
361
362	\begin{comment}
363	\hline
364	\end{tabu}
365	\end{table}
366
367	\begin{table}
368	\caption{Controlled vocabularies of named entities -- Geographica}
369	\label{table:data-ne-places}
370
371	% \begin{tabu}{ p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} }
372	\begin{tabu}{ >{\sffamily}l l r X X}
373	\hline
374	\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\
375
376	\end{comment}
377
378
379	\begin{table}
380	\caption{Taxonomies, Classifications, Thesauri}
381	\label{table:data-concepts}
382	\begin{tabu}{ >{\sffamily}l l r X X}
383	\hline
384	\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\
385	\hline
386	AAT & Getty & \href{http://www.getty.edu/research/tools/vocabularies/aat/aat_faq.html}{34,880 / 245,530} & subjects in art and architecture & \\
387	LCSH & LoC & & subjects, universal & \href{http://fast.oclc.org/searchfast/}{FAST} (Faceted Application of Subject Terminology), \href{http://experimental.worldcat.org/fast/}{Linked Data FAST} \\
388	LCC & LoC & & universal hierarchical classification & web app: \href{http://classificationweb.net/}{classification web} \\
389	GND/s & DNB & 202,000 & subjects (Schlagwörter), universal, lang:de & \\
390	GTAA & NISL & 3,800 & Subjects, describing TV programs & \href{http://datahub.io/de/dataset/gemeenschappelijke-thesaurus-audiovisuele-archieven}{(RDF) data dumps}, \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\
391	DDC & OCLC & & universal classification by field of study, translated in multiple languages & \href{http://dewey.info/}{dewey.info} \\
392	UDC & & & & \\
393	Wiki Categories & Wikipedia & 995,911& classification of Wiki articles as skos:Concepts & SKOS Vocabulary, SPARQL \\
394	DBpedia Ontology & Wikipedia & 529 / 2333 & general classification of Wiki articles as ontology & \href{http://wiki.dbpedia.org/Ontology39?v=g9b}{RDF data}, SPARQL\\
395	ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts in a number of thematic groups (Metadata, Lexical Resources, ...) & \href{http://www.isocat.org}{web-app}, service \\
396	Object Names Thesaurus & British Museum & & classification of objects in the collection & \\
397	Material Thesaurus & British Museum & & classification of material & \\
398	Thesaurus of Monument Types & British Museum & & types of monuments & \\
399	Hornbostel-Sachs-Systematik & & 300 categories & classification of musical instruments & \href{http://www.music.vt.edu/musicdictionary/texth/Hornbostel-Sachs.html}{web page} \\
400	Oberbegriffsdatei & DMB & & a set of vocabularies for museums, lang:de & \url{museumsvokabular.de}, PDF, XML dumps\\
401	Iconclass & RKD & 28,000 & taxonomy of subject of an image & \href{http://iconclass.org/data/iconclass.20121019.nt.gz}{RDF dump} \\
402	\href{http://dirt.projectbamboo.org/}{DiRT} & Project Bamboo & 32 categories & taxonomy of research tools (1,200 tools) & \\
403	%Scholarly Methods Taxonomy & DARIAH & 100 & research activities in a 2-level hierarchy and brief scope notes & in preparation \\
404	\hline
405	\end{tabu}
406	\end{table}
407
408	\end{landscape}
409
410	\begin{description}
411	\item[AAT] Art \& Architecture Thesaurus, Getty
412	\item[CONA] Cultural Objects Name Authority
413	\item[DAI] Deutsches Archäologisches Institut
414	\item[DDC] Dewey Decimal Classification
415	\item[DFKI] Deutsches Forschungszentrum für Künstliche Intelligenz
416	\item[DMB] Deutscher Museumsbund
417	\item[DNB] Deutsche Nationalbibliothek
418	\item[FAST] Faceted Application of Subject Terminology
419	\item[Getty] Getty Research Institute curating the vocabularies\furl{http://www.getty.edu/research/tools/vocabularies/index.html}, part of Getty Trust
420	\item[GND] \emph{Gemeinsame Normdatei} -- Integrated Authority File of the German National Library
421	\item[GTAA] Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
422	\begin{quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation}
423	\item[ISO] International Organization for Standardization
424	\item[LCCN] Library of Congress Control Number
425	\item[LCC] Library of Congress Classification
426	\item[LCSH] Library of Congress Subject Headings
427	\item[LoC] Library of Congress\furl{http://loc.gov}
428	\item[OCLC] Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation
429	\item[PKND] prometheus KünstlerNamensansetzungsDatei\furl{http://prometheus-bildarchiv.de/de/tools/pknd}
430	\item[RKD] Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History
431	\item[TGN] Getty Thesaurus of Geographic Names
432	\item[UDC] Universal Decimal Classification
433	\item[ULAN] Union List of Artist Names
434	\item[VIAF] Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries
435	\end{description}
436
437
438	\begin{comment}
439
440	VoID "Vocabulary of Interlinked Datasets") is an RDF based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID}
441
442	\subsection{schema.org}
443	http://schema.org/docs/datamodel.html
444	http://www.w3.org/wiki/WebSchemas/ExternalEnumerations
445
446	microdata or
447	http://www.w3.org/TR/rdfa-lite/
448	Resource Description Framework in attributes
449
450	the entire WorldCat cataloging collection made publicly
451	available using Schema.org mark-up with library extensions for use by developers and
452	search partners such as Bing, Google, Yahoo! and Yandex
453
454	OCLC begins adding linked data to WorldCat by appending
455	Schema.org descriptive mark-up to WorldCat.org pages, thereby
456	making OCLC member library data available for use by intelligent
457	Web crawlers such as Google and Bing
458
459	\end{comment}
460
461	\section{Summary}
462
463	In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
464	We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications).
465
