\chapter{Analysis of the data landscape}
\label{ch:data}
This chapter gives an overview of existing standards and formats for metadata in the field of Language Resources and Technology, together with a description of their characteristics and their respective usage in the initiatives and data collections. Special attention is paid to the Component Metadata Framework, which represents the base data model for the infrastructure this work is part of.


\section{Component Metadata Framework}
\label{def:CMD}

The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML schema defining CMD -- the \xne{general-component-schema} -- is featured in appendix \ref{lst:cmd-schema}.)
CMD is used to define so-called \var{profiles}, which are constructed out of reusable \var{components} -- collections of metadata fields. Components can contain other components and can be reused in multiple profiles. A profile itself is just a special kind of component (a subclass) with some additional administrative information.
The core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category\footnote{persistently referenceable concept definition} (cf. \ref{def:DCR}), thus
indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.

%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks \cite{Shvaiko2005}.

While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.

Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records.
The generated schema also conveys, as annotations, the information about the referenced data categories.
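
As a rough illustration of the last point, an element declaration in a generated schema might look as follows. This is only a sketch under assumptions: the element name \code{Title}, the use of a \code{dcr:datcat} attribute with the namespace shown, and the data category identifier \code{DC-0000} are illustrative placeholders and are not taken from an actual profile.

\begin{lstlisting}[language=XML, caption={Sketch of a schema element annotated with a data category reference (illustrative, names are placeholders)}]
<!-- Illustrative fragment only; element name and identifier are hypothetical. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr">
  <!-- The annotation attribute points to the PID of exactly one data category. -->
  <xs:element name="Title" type="xs:string"
              dcr:datcat="http://www.isocat.org/datcat/DC-0000"/>
</xs:schema>
\end{lstlisting}

A consumer of instance records can thus look up the data category of a field in the schema, without depending on the field name chosen by the profile author.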


\subsection{CMD Profiles}
In the Component Registry (CR), 124\footnote{All numbers are as of 2013-06 if not stated otherwise.} public profiles and 696 components are defined. Table \ref{table:dev_profiles} shows the development of the CR and DCR population over time.

Next to the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles prove the flexibility/expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ considerably in their structure: next to flat profiles with just one level of components or elements and 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes} profiles), there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements
(when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).


\begin{table}
\caption{The development of defined profiles and DCs over time}
\label{table:dev_profiles}
%  \begin{tabular}{ l | r | r | r | r }
  \begin{tabular}{ l  r  r  r  r }
    \hline
date     & 2011-01 & 2012-06 & 2013-01 & 2013-06  \\
    \hline
Profiles & 40 & 53 & 87 & 124 \\
Distinct Components & 164 & 298 & 542 & 828 \\
Expanded Components & 1055 & 1536 & 2904 & 5757 \\
Distinct Elements & 511 & 893 & 1505 & 2399 \\
Expanded Elements & 1971 & 3030 & 5754 & 13232 \\
Distinct data categories & 203 & 266 & 436 & 499 \\
Data categories in the Metadata profile & 277 & 712 & 774 & 791 \\
Ratio of elements without DCs & 24,7\% & 17,6\% & 21,5\% & 26,5\% \\
Components with DCs & 28 & 67 & 115 & 140 \\
    \hline
  \end{tabular}
\end{table}


\subsection{Instance Data}

%\todoin{ add historical perspective on data - list overall}

The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
collects records from 69 providers on a daily basis. The complete dataset amounts to 540.065 records.
16 of the providers offer CMDI records; the other 53 provide OLAC/DC records\label{info:olac-records}, which are converted into the corresponding CMD profile after harvesting. Next to these 81.226 original OLAC records, there are a few providers offering their OLAC or DCMI-terms records already converted into CMDI, so that OLAC and DCMI-terms records amount to 139.152 in total.
On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles.


\begin{table}
\caption{Top 20 CMD profiles, with the respective number of records}
\label{tab:cmd-profiles}
\begin{center}
  \begin{tabu}{ r l }
    \hline
\rowfont{\itshape\small} \# records & profile \\
    \hline
155.403 & Song \\
138.257 & Session \\
92.996 & OLAC-DcmiTerms \\
46.156 & DcmiTerms \\
28.448 & SongScan \\
21.256 & SourceScan \\
19.059 & LiteraryCorpusProfile \\
16.519 & Source \\
13.626 & imdi-corpus \\
10.610 & media-session-profile \\
7.961 & SongAudio \\
7.557 & SymbolicMusicNotation \\
4.485 & LCC DataProviderProfile \\
4.485 & SourceProfile \\
4.417 & Text \\
1.982 & Soundbites-recording \\
1.530 & Performer \\
1.475 & ArthurianFiction \\
939 & LrtInventoryResource \\
873 & teiHeader \\
    \hline
  \end{tabu}
\end{center}
\end{table}

\begin{table}
\caption{Top 20 CMD collections, with the respective number of records}
\begin{center}
  \begin{tabu}{ r l }
    \hline
\rowfont{\itshape\small} \# records & collection \\
    \hline
243.129 & Meertens collection: Liederenbank \\
46.658 & DK-CLARIN Repository \\
46.156 & Nederlands Instituut voor Beeld en Geluid Academia collectie \\
29.266 & childes \\
24.583 & DoBeS archive \\
23.185 & Language and Cognition \\
14.593 & talkbank \\
14.363 & Acquisition \\
14.320 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
12.893 & MPI CGN \\
10.628 & Bavarian Archive for Speech Signals (BAS) \\
7.964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures\\
7.348 & WALS RefDB \\
5.689 & Lund Corpora \\
4.640 & Oxford Text Archive \\
4.492 & Leipzig Corpora Collection \\
3.539 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
3.280 & A Digital Archive of Research Papers in Computational Linguistics \\
3.147 & CLARIN NL \\
3.081 & MPI für Bildungsforschung \\
\hline
  \end{tabu}
\end{center}
\end{table}

We can also observe a large disparity in the number of records between individual providers and profiles. Almost half of all records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by the MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from \textit{The Language Archive}). On the other hand, there are 25 profiles that have fewer than 10 instances. This can be due both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource).



\section{Other LRT Metadata Formats and Collections}
\label{sec:lrt-md-catalogs}

Next to CLARIN and CMDI, there is a host of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, we also sketch the ties with CMDI and existing integration efforts.

Two survey works regarding existing formats are worth mentioning. The CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} puts the overwhelming number of existing metadata standards into a systematic, comprehensive overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. Despite its aspiration to comprehensiveness, it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE.% TEI??? (left open in the draft)


\subsection{Dublin Core metadata terms}
The work on this metadata format started in 1995 at the Metadata Workshop\furl{http://dublincore.org/workshops/dc1/} organized by OCLC/NCSA in Dublin, Ohio, USA. Nowadays it is maintained by the Dublin Core Metadata Initiative.

It is a fixed set of terms for a basic generic description of a range of resources (both virtual and physical), coming in two versions\furl{http://dublincore.org/documents/dcmi-terms/}:
\begin{description}
\item[Dublin Core Metadata Element Set (DCMES)] namespace: \code{/elements/1.1/}\\
the original set of 15 terms, standardized as IETF RFC 5013, ISO Standard 15836:2009 and NISO Standard Z39.85-2007
\item[Dublin Core metadata terms] namespace: \code{/terms/} \\
the extended `Qualified' set of 55 terms, extending the original 15 ones (replicating them in the new namespace for consistency)
\end{description}

Today, Dublin Core metadata terms are very widespread. Thanks to its simplicity, the format is used as the common denominator in many applications: content management systems integrate Dublin Core for use in \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description in content repositories (Fedora Commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
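
For illustration, a minimal Dublin Core description in the container format used by OAI-PMH could look roughly as follows. This is a sketch only: the bibliographic values are illustrative and are not taken from the harvested data.

\begin{lstlisting}[language=XML, caption={Sketch of a minimal Dublin Core record in the OAI-PMH container (illustrative values)}]
<!-- Minimal sketch; the values are illustrative. -->
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Each child is one of the 15 elements of the original element set. -->
  <dc:title>Language</dc:title>
  <dc:creator>Bloomfield, Leonard</dc:creator>
  <dc:date>1933</dc:date>
  <dc:publisher>New York: Holt</dc:publisher>
</oai_dc:dc>
\end{lstlisting}

All elements are optional and repeatable, which makes such records easy to produce but also leaves much of the interpretation to the consumer.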

There are multiple possible serializations; in particular, a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
Worth noting is Dublin Core's take on the classification of resources\furl{http://dublincore.org/documents/resource-typelist/}.

The simplicity of the format is also its main drawback when considered as a metadata format in the research communities. It is too general to capture all the specific details that individual research groups need to describe different kinds of resources.

\subsection{OLAC}
\label{def:OLAC}

The \xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, which provides a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.

The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field, linguistic-type, language, role, discourse-type}):

\begin{quotation}
 Uniform description across archives is ensured by limiting the values of certain metadata elements to the use of terms from agreed-upon controlled vocabularies. [\dots] OLAC adds encoding schemes that are designed specifically for describing language resources, such as subject language and linguistic data type.
\end{quotation}

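\lstset{language=XML}
\begin{lstlisting}[label=lst:sampleolac, caption=Sample OLAC record]
<!-- OLAC container; child elements are in the default Dublin Core namespace. -->
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
           xmlns="http://purl.org/dc/elements/1.1/">
   <creator>Bloomfield, Leonard</creator>
   <date>1933</date>
   <title>Language</title>
   <publisher>New York: Holt</publisher>
</olac:olac>
\end{lstlisting}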

OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''.

Note that OLAC archives are harvested by the CLARIN harvester and OLAC records are part of the CMDI joint metadata domain (cf. table \ref{tab:cmd-profiles} and \ref{reports:OLAC}).



\subsection{TEI / teiHeader}
\label{def:tei}

\begin{quotation}
 The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots  [Next to] its chief deliverable[,] a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abridged]
\end{quotation}

TEI is a de-facto standard for encoding any kind of digital textual resources and has been developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description, i.e. metadata encoding (our main concern), the complex top-level element \code{teiHeader} is provided. TEI is descriptive rather than prescriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and the inner structure, making it possible to generate custom schemas adapted to a project's needs. Thus there is also not just one fixed \code{teiHeader}.
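
To give an impression of the structure, a very small header could be sketched as follows. It is meant as an illustration under assumptions: it contains only the file description with its three basic parts, the bibliographic values are illustrative, and project-specific customizations may look quite different.

\begin{lstlisting}[language=XML, caption={Sketch of a minimal teiHeader (illustrative values)}]
<!-- Minimal sketch; real headers are usually far more detailed. -->
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <!-- Title and responsibility for the electronic text -->
    <titleStmt>
      <title>Language</title>
      <author>Bloomfield, Leonard</author>
    </titleStmt>
    <!-- Information about publication or distribution -->
    <publicationStmt>
      <p>Illustrative record, not published.</p>
    </publicationStmt>
    <!-- Description of the source the text was derived from -->
    <sourceDesc>
      <bibl>Bloomfield, Leonard. Language. New York: Holt, 1933.</bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>
\end{lstlisting}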

Some of the data collections encoded in TEI are the corpora of the DWDS\furl{http://www.dwds.de}, the Deutsches Textarchiv\furl{http://www.dwds.de/dta} \cite{Geyken2011deutsches}, and the Oxford Text Archive\furl{http://ota.oucs.ox.ac.uk/}.

There has been intense cooperation between the TEI and CMDI communities on the issue of interoperability, and multiple efforts to express teiHeader in CMDI were undertaken (cf. \ref{results:tei}) as a starting point for integrating TEI-based data into the CLARIN infrastructure.


\subsection{ISLE/IMDI -- The Language Archive}

\xne{IMDI}\furl{http://www.mpi.nl/imdi/} (\xne{EAGLES/ISLE Meta Data Initiative}) is an elaborate format for detailed descriptions of multi-media/multi-modal language resources, developed within the corresponding project \cite{wittenburg2000eagles} from 2000 to 2003.

To serve the main goal of the project, easing access to language resources and fostering their reuse, resource descriptions in this new format were created for a number of collections and made available via a dedicated \xne{IMDI browser}\furl{http://corpus1.mpi.nl/ds/imdi_browser/}, which allowed browsing the collection structure as well as complex advanced search over the deeply structured metadata. A metadata editor was also developed for generating records in this format, with provisions for offline fieldwork and synchronization with the repository.

The project was led by the Technical Group (TG) at the MPI for Psycholinguistics, which was also responsible for running the repository and the whole infrastructure. Since its foundation (together with the Institute itself) in the 1970s\furl{http://tla.mpi.nl/home/history/}, the group has engaged in a number of projects aimed at building a stable technical infrastructure for long-term archiving of and work with language resources. Recently, the group and the established infrastructure have been renamed to \xne{TLA -- The Language Archive}\furl{http://tla.mpi.nl/} (``Your partner for language data, tools and archiving''), a single platform on which the host of language resources and their descriptions are preserved and provided, and tools for working with this data are offered. The archive is also an aggregator itself, offering various collections from different (also external) projects (like DOBES, CGN, RELISH, etc.).

IMDI can be seen as the predecessor of CMDI, the team of the TG being the driving force behind the development of both. An \xne{imdi-session} profile, the corresponding IMDI-to-CMDI conversion
as well as the transformed records were among the first to be added to the new CMD Infrastructure in 2010. The statistics
of CMDI records list around 138.000 \xne{Session} records and around 13.000 \xne{imdi-corpus} records, the latter modelling the collections for the sessions. Also, the metadata editor \xne{Arbil} was refactored to work with the new data model.


\subsection{META-SHARE}
\label{def:META-SHARE}

META-SHARE (2010--2013) was the subproject covering the technical aspects of META-NET, a Network of Excellence consisting of 60 research centres from 34 countries.


\begin{quotation}
META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
\end{quotation}

Within the META-SHARE project, a new metadata format was developed \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata Framework, META-SHARE metadata imposes a single large schema for all resource types, with a subset of core obligatory elements and many optional components.
%In cooperation between metadata teams from CLARIN and META-SHARE

The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as four CMD profiles, one for each resource type, with all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), so a specific instance will never use all the possible elements. (See \ref{reports-meta-share} for more details about the format based on its integration into CMDI.)

The technical infrastructure of META-SHARE is a distributed network consisting of a number of member repositories, each offering its own subset of resources\furl{http://www.meta-share.eu/}.

Selected member repositories\footnote{7 as of 2013-07} play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network'' \cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes).

One point of criticism from the community was the fact that the META-SHARE infrastructure does not provide any interface to the outside world, such as an OAI-PMH endpoint.

See also the MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}.


\subsection{ELRA}

The European Language Resources Association\furl{http://elra.info} (ELRA) offers a large collection of language resources, mostly licensed for a fee, although some resources are available for free as well.
The available datasets can be searched via the ELRA Catalog\furl{http://catalog.elra.info/}.
Additionally, ELRA runs the so-called \xne{Universal Catalog} -- a repository comprising information regarding Language Resources (LRs) identified all over the world.

\begin{quotation}
ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies.

ELDA\furl{http://www.elda.org/} -- Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community.

ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and
drafts and concludes distribution agreements on behalf of ELRA.
\end{quotation}

\subsection{LDC}

The Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/} (LDC) is another provider of high-quality curated language resources.


\section{Formats and Collections in the World of Libraries}

There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition, implying rich experience, and the fact that almost all of the resources in libraries are language resources. This argument becomes even more relevant in the light of the efforts to digitize large portions of the material, pursued in many (national) libraries in recent years (cf. the discussion on libraries partnering with Google). And given the amounts of data, even the bibliographic records alone constitute sizable language resources in their own right.

%\item[LoC] Library of Congress \url{http://www.loc.gov}
%\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_ en.htm}
%\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
%\end{description}

\subsection{Formats -- MARC, METS, MODS}

There is a long tradition of standardized metadata formats in the world of Libraries and Information Systems (LIS); for decades, a major role in the standardization has been played by the Library of Congress\furl{http://www.loc.gov/standards/}.

The \xne{MARC}\furl{www.loc.gov/marc/} set of formats (in use since the 1970s) ``are standards for the representation and communication of bibliographic and related information in machine-readable form''. A number of variants have developed over the years; the most widespread is \xne{MARC 21} (since 1999), the standard format used for communication among libraries around the world.

MARC 21 consists of five ``communication formats'' for specific types of data (Bibliographic, Authority Data, Holdings Data, Classification, and Community Information). In 2002, the Library of Congress developed the \xne{MARCXML} schema for representing MARC records in XML.

\xne{METS -- Metadata Encoding and Transmission Standard} is a format from the family of Library of Congress standards (since 2001) for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
It is dedicated primarily to capturing the structure of digital objects, to ``record the various relationships that exist between pieces of content, and between the content and metadata that compose a digital library object'' \cite{mets2010manual}.
A METS record acts as a flexible container that accommodates other pieces of data (different levels of metadata and the encoded objects themselves or references to those) in external formats\furl{http://www.loc.gov/standards/mets/mets-extenders.html}.

A number of tools have been developed to author and process the \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html}, and numerous projects (online editions, DAM systems) use METS for structuring and recording their data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html}, though this registry seems rather outdated}, among others \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}.

\xne{MODS -- Metadata Object Description Schema} ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21, using language-based tags rather than numeric ones,
and it is richer than Dublin Core. It is one of the schemas endorsed to extend (i.e. to be used inside) METS.
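
The difference between numeric and language-based tags can be illustrated with two separate, heavily abbreviated fragments that encode the same title. This is only a sketch: the title value is illustrative, and the MARCXML fragment omits the leader and control fields that a complete record would carry.

\begin{lstlisting}[language=XML, caption={Sketch contrasting a title statement in MARCXML and in MODS (two separate, abbreviated fragments)}]
<!-- MARCXML: the title is identified by the numeric tag 245, subfield a. -->
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Language</subfield>
  </datafield>
</record>

<!-- MODS: the same information with a language-based tag. -->
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Language</title>
  </titleInfo>
</mods>
\end{lstlisting}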

In 1998, a new entity-relationship model was introduced with \xne{FRBR -- Functional Requirements for Bibliographic Records} \cite{FRBR1998}, followed later by \xne{RDA -- Resource Description and Access}.% draft note: the year `2002' and the starting year of RDA are unresolved

\subsection{ESE, Europeana Data Model - EDM}

Within the large European initiative \xne{Europeana} (cf. \ref{lit:digi-lib}), information about digitised objects is collected from a great number of cultural institutions from all over Europe.

The initiative originally developed and recommended the common format \xne{ESE -- Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation}, a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. It soon became obvious that this format is very limiting, and work started on a Semantic Web compatible, RDF-based format: the Europeana Data Model (EDM)\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana, haslhofer2011data,doerr2010europeana}.
EDM is fully compatible with ESE, which is (and will continue to be) accepted from the providers. There is a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the semantic data of Europeana.
%https://github.com/europeana

%%%%%%%%%%%%%%%%%%
\section{Controlled Vocabularies, Reference Data, Ontologies}
\label{refdata}

One goal of this work is to lay the groundwork for exposing the discussed dataset in the Semantic Web.
A preparatory task is therefore to identify external semantic resources, like controlled vocabularies or ontologies, that the dataset could be linked with\footnote{A similar activity of inventorying vocabularies and thesauri was conducted in the context of the \xne{Europeana} initiative:
\url{http://europeanalabs.eu/wiki/WP12Vocabularies}, \url{https://europeanalabs.eu/wiki/DesignSemanticThesauri}}.

Conceptually, we want to partition these resources into two types: on the one hand, abstract concepts constituting all kinds of classifications, typologies and taxonomies; on the other hand, named entities that exist(ed) in the real world, like persons, organizations or geographical places. The main motivation for this distinction is the insight that, while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (sameAs), for concepts we need to accept a plurality of existing conceptualizations. While we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach: simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.

In the following, we inventory such resources, covering the domains expected in the dataset. (Information about the size of a dataset is meant as a rough indication of its ``general weight'', not necessarily as precise, up-to-date information.) The acronyms in the tables are resolved in the subsequent glossary.
How these resources will be employed is discussed in \ref{sec:values2entities}.

%\subsubsection{Named entities}

The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called Virtual International Authority File (VIAF), a huge resource in which entries from different authority files referring to the same entity are merged. This resource can be explored via a search interface, and there is also a search service for applications.
Other large-scale general resources are the vocabularies curated and provided by the Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}; however, only limited access is free, while full access is licensed for a fee. Recently, though, work was announced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}.

Yago is a large knowledge base integrating information from Wikipedia, WordNet and GeoNames (cf. table \ref{table:data-ne}).

Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology, developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (institutions, persons, projects, tools, etc.) in this field of study \cite{Joerg2010}.

Overall, we witness a strong general trend towards the Semantic Web and Linked Open Data.

%Next to these ``global big players'' there are a number of other initiatives on different scale dedicated to a more specific domain.

%Resources that contain different types of data (e.g. persons, places and classifications like GND or Yago) are divided and mentioned in individual tables by type.

%\subsection{Concepts -- Classifications, Taxonomies, \dots}

326
327\begin{landscape}
328\begin{table}
329\caption{Controlled vocabularies of named entities -- Persons, Organizations, Works, Language Names, Geographica}
330\label{table:data-ne}
331%  \begin{tabu}{  p{0.2\textwidth}  p{0.2\textwidth}  p{0.2\textwidth}   p{0.2\textwidth}   p{0.2\textwidth} }
332  \begin{tabu}{  >{\sffamily}l l r X X}
333    \hline
334\rowfont{\itshape\small} name & provider & size (items / facts)  & description & access \\
335    \hline
VIAF & OCLC + NatLibs & $\gg$ 1E7 & union of national authority files & search service, search app \\
GND/p & DNB & 4.6E6 & Persons, universal, lang:de  & \href{http://d-nb.info/standards/elementset/gnd}{GND ontology}\\
GND/k & '' & 1.2E6 & Organizations, universal, lang:de  & \\
GND/w & '' & 193,000 & Works, lang:de  & \\
GND/g & '' & 293,000 & Geographica, lang:de & \\
ULAN & Getty & 202,720 / 638,900 & persons, artists     & \\
TGN & Getty & 992,310 / 1.7E6 & also historical place names & \href{http://www.getty.edu/research/tools/vocabularies/index.html}{web search} \\
%CONA & Getty & & records for cultural works & \\
DBpedia & Wikipedia & $\sim$ 4E6 & all kinds of entities in up to 111 languages & \href{http://wiki.dbpedia.org/Downloads}{data dumps}, \href{http://dbpedia-live.openlinksw.com/sparql}{live SPARQL endpoint} \\
& & \multicolumn{3}{l}{764,000 persons; 333,000 works; 192,000 organizations; 639,000 geographica } \\
Yago \cite{Suchanek2007yago} & MPI Informatik & 1E7 / 1.2E8 & huge semantic KB (aggregated from Wikipedia, Wordnet, Geonames) & \href{http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html}{data dumps} \\
\href{http://lt-world.de}{LT-World} & DFKI & 3,300 persons, 4,600 organizations & ontology-based portal for Language Technology & \href{http://www.lt-world.org/kb/}{portal} \\
Geonames & Geonames & \textgreater 1E7 (2.8E6 / 5.5E6) & ``modern'' place names & data dump + web service \\
PKND     & prometheus & \textgreater 37,000 & persons, artists & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{XML dump} \\
\href{http://gazetteer.dainst.org/}{iDAI.gazetteer} & DAI &  & archaeologically relevant places & search interface \\
%Pelagios & AIT & 25 datasets & search over 25 datasets of archeologically relevant places & API\furl{https://github.com/pelagios/pelagios-cookbook/wiki/Using-the-Pelagios-API} \\
\href{http://pleiades.stoa.org}{Pleiades} & & 34,000 & a community-built gazetteer and graph of ancient places & CSV, KML and RDF data dumps \\
LCCN & LoC & \textgreater 1.2E7 & identifier for bibliographic records & \href{http://authorities.loc.gov/}{search service}, search app \\
ISO 3166 & ISO & 249 & official country codes, lang: en, fr &   \\
ISO-639-1& ISO & 185 & basic language codes & \href{http://www.loc.gov/standards/iso639-2/php/English_list.php}{static list} \\
ISO-639-3 & SIL & $\sim$ 7,679 & 3-letter code for every human language & \href{http://www-01.sil.org/iso639-3/}{view/download} \\
CLAVAS & CLARIN & 2,500  & organization names extracted from CMD records & \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\
\hline
\end{tabu}
\end{table}

\begin{comment}
\hline
  \end{tabu}
\end{table}

\begin{table}
\caption{Controlled vocabularies of named entities -- Geographica}
\label{table:data-ne-places}

%  \begin{tabu}{  p{0.2\textwidth}  p{0.2\textwidth}  p{0.2\textwidth}   p{0.2\textwidth}   p{0.2\textwidth} }
  \begin{tabu}{  >{\sffamily}l l r X X}
    \hline
\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\

\end{comment}


\begin{table}
\caption{Taxonomies, Classifications, Thesauri}
\label{table:data-concepts}
  \begin{tabu}{  >{\sffamily}l l r X X}
    \hline
\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\
    \hline
AAT & Getty & \href{http://www.getty.edu/research/tools/vocabularies/aat/aat_faq.html}{34,880 / 245,530} & subjects in art and architecture &  \\
LCSH & LoC &  & subjects, universal & \href{http://fast.oclc.org/searchfast/}{FAST} (Faceted Application of Subject Terminology), \href{http://experimental.worldcat.org/fast/}{Linked Data FAST} \\
LCC  & LoC & & universal hierarchical classification & web app: \href{http://classificationweb.net/}{classification web} \\
GND/s & DNB & 202,000 & subjects (Schlagwörter), universal, lang:de & \\
GTAA & NISL & 3,800 & subjects, describing TV programs & \href{http://datahub.io/de/dataset/gemeenschappelijke-thesaurus-audiovisuele-archieven}{(RDF) data dumps}, \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\
DDC & OCLC & & universal classification by field of study, translated into multiple languages & \href{http://dewey.info/}{dewey.info} \\
UDC & & & & \\
Wiki Categories & Wikipedia & 995,911 & classification of Wiki articles as skos:Concepts & SKOS Vocabulary, SPARQL \\
DBpedia Ontology & Wikipedia & 529 / 2,333 & general classification of Wiki articles as ontology & \href{http://wiki.dbpedia.org/Ontology39?v=g9b}{RDF data}, SPARQL\\
ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts in a number of thematic groups (Metadata, Lexical Resources, ...) & \href{http://www.isocat.org}{web-app}, service \\
Object Names Thesaurus & British Museum & &  classification of objects in the collection & \\
Material Thesaurus & British Museum & & classification of material & \\
Thesaurus of Monument Types & British Museum & & types of monuments & \\
Hornbostel-Sachs-Systematik & & 300 categories & classification of musical instruments & \href{http://www.music.vt.edu/musicdictionary/texth/Hornbostel-Sachs.html}{web page} \\
Oberbegriffsdatei  & DMB & & a set of vocabularies for museums, lang:de  & \url{museumsvokabular.de}, PDF, XML dumps\\
Iconclass & RKD & 28,000 & taxonomy of image subjects &  \href{http://iconclass.org/data/iconclass.20121019.nt.gz}{RDF dump} \\
\href{http://dirt.projectbamboo.org/}{DiRT} & Project Bamboo & 32 categories & taxonomy of research tools (1,200 tools)  &  \\
%Scholarly Methods Taxonomy & DARIAH & 100 & research activities in a 2-level hierarchy and brief scope notes & in preparation \\
\hline
\end{tabu}
\end{table}

\end{landscape}

\begin{description}
\item[AAT] Art \& Architecture Thesaurus, Getty
\item[CONA] Cultural Objects Name Authority
\item[DAI] Deutsches Archäologisches Institut
\item[DDC] Dewey Decimal Classification
\item[DFKI] Deutsches Forschungszentrum für Künstliche Intelligenz
\item[DMB] Deutscher Museumsbund
\item[DNB] Deutsche Nationalbibliothek
\item[FAST] Faceted Application of Subject Terminology
\item[Getty] Getty Research Institute, which curates the vocabularies\furl{http://www.getty.edu/research/tools/vocabularies/index.html}; part of the Getty Trust
\item[GND] \emph{Gemeinsame Normdatei} -- Integrated Authority File of the German National Library
\item[GTAA] Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
\begin{quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation}
\item[ISO] International Organization for Standardization
\item[LCCN] Library of Congress Control Number
\item[LCC] Library of Congress Classification
\item[LCSH] Library of Congress Subject Headings
\item[LoC] Library of Congress\furl{http://loc.gov}
\item[OCLC] Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation
\item[PKND] prometheus KünstlerNamensansetzungsDatei\furl{http://prometheus-bildarchiv.de/de/tools/pknd}
\item[RKD] Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History
\item[TGN] Getty Thesaurus of Geographic Names
\item[UDC] Universal Decimal Classification
\item[ULAN] Union List of Artist Names
\item[VIAF] Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries
\end{description}


\begin{comment}

VoID (``Vocabulary of Interlinked Datasets'') is an RDF based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID}

\subsection{schema.org}
http://schema.org/docs/datamodel.html
http://www.w3.org/wiki/WebSchemas/ExternalEnumerations

microdata or
http://www.w3.org/TR/rdfa-lite/
 Resource Description Framework in attributes

the entire WorldCat cataloging collection made publicly
available using Schema.org mark-up with library extensions for use by developers and
search partners such as Bing, Google, Yahoo! and Yandex

OCLC begins adding linked data to WorldCat by appending
Schema.org descriptive mark-up to WorldCat.org pages, thereby
making OCLC member library data available for use by intelligent
Web crawlers such as Google and Bing

\end{comment}

\section{Summary}

In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications).