source: SMC4LRT/chapters/Data.tex @ 3671

Last change on this file since 3671 was 3671, checked in by vronk, 11 years ago

1
2	\chapter{Analysis of the data landscape}
3	\label{ch:data}
4	This chapter gives an overview of existing metadata standards and formats in the field of Language Resources and Technology, together with a description of their characteristics and their usage in the relevant initiatives and data collections. Special attention is paid to the Component Metadata Framework, which is the base data model of the infrastructure this work is part of.
5
6
7	\section{Component Metadata Framework}
8	\label{def:CMD}
9
10	The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML-schema defining CMD -- the \xne{general-component-schema} -- is featured in appendix \ref{lst:cmd-schema}.)
11	CMD is used to define so-called \var{profiles}, which are constructed out of reusable \var{components} -- collections of metadata fields. Components can contain other components and can be reused in multiple profiles. A profile is itself just a special kind of component (a subclass) with some additional administrative information.
12	The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category\footnote{persistently referenceable concept definition} (cf. \ref{def:DCR}), thus
13	indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.
14
15	%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.
16
17	While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.
18
19	Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records.
20	The generated schema also carries, as annotations, the references to the data categories used.
21
22
23	\subsection{CMD Profiles }
24	The Component Registry (CR) currently holds 124\footnote{All numbers are as of 2013-06 unless stated otherwise.} public profiles and 696 components. Table \ref{table:dev_profiles} shows the development of the CR and DCR population over time.
25
26	In addition to the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, such as OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles demonstrate the flexibility and expressi\-vi\-ty of the CMD metamodel. The individual profiles also differ considerably in their structure: alongside flat profiles with just one level of components or elements and 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes} profiles), there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements
27	(when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).
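
As a minimal illustration of this expansion, with purely hypothetical numbers: assume that \textit{Contact} defines $4$ elements, that it is included once by each of \textit{Project}, \textit{Institution} and \textit{Access}, and that these three components define no elements of their own and each occur once in the profile. The element counts are then
\[
  n_{\mathrm{distinct}} = 4, \qquad n_{\mathrm{expanded}} = 3 \times 4 = 12 .
\]
The distinct count reflects what has to be defined and maintained in the registry, while the expanded count reflects what can appear in an instance record.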
28
29
30	\begin{table}
31	\caption{The development of defined profiles and DCs over time}
32	\label{table:dev_profiles}
33	% \begin{tabular}{ l \| r \| r \| r \| r }
34	\begin{tabular}{ l r r r r }
35
36	\hline
37	date & 2011-01 & 2012-06 & 2013-01 & 2013-06 \\
38	\hline
39	Profiles & 40 & 53 & 87 & 124 \\
40	Distinct Components & 164 & 298 & 542 & 828 \\
41	Expanded Components & 1055 & 1536 & 2904 & 5757 \\
42	Distinct Elements & 511 & 893 & 1505 & 2399 \\
43	Expanded Elements & 1971 & 3030 & 5754 & 13232 \\
44	Distinct data categories & 203 & 266 & 436 & 499 \\
45	Data categories in the Metadata profile & 277 & 712 & 774 & 791 \\
46	Ratio of elements without DCs & 24.7\% & 17.6\% & 21.5\% & 26.5\% \\
47	Components with DCs & 28 & 67 & 115 & 140 \\
48
49	\hline
50	\end{tabular}
51	\end{table}
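
One way to read Table \ref{table:dev_profiles} is through the ratio of expanded to distinct counts, which can be taken as a rough indicator of reuse (this interpretation is an assumption of the example, not a measure defined by CMD). For the first and last dates in the table:
\[
  \frac{1971}{511} \approx 3.86 \;\rightarrow\; \frac{13232}{2399} \approx 5.52 \quad \mbox{(elements)}, \qquad
  \frac{1055}{164} \approx 6.43 \;\rightarrow\; \frac{5757}{828} \approx 6.95 \quad \mbox{(components)}.
\]
Both ratios grow between 2011-01 and 2013-06, i.e. on average each distinct element and component appears in more places at the end of the period than at its beginning.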
52
53
54	\subsection{Instance Data}
55
56	%\todoin{ add historical perspective on data - list overall}
57
58	The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
59	collects records from 69 providers on a daily basis. The complete dataset amounts to 540,065 records.
60	16 of the providers offer CMDI records; the other 53 provide OLAC/DC records\label{info:olac-records}, which are converted into the corresponding CMD profile after harvesting. In addition to these 81,226 original OLAC records, there are a few providers offering their OLAC or DCMI-terms records already converted into CMDI, so that OLAC and DCMI-terms records amount to 139,152 in total.
61	On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles.
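
The total for OLAC and DCMI-terms records can be cross-checked against the per-profile counts in Table \ref{tab:cmd-profiles} (profiles \textit{OLAC-DcmiTerms} and \textit{DcmiTerms}). The following arithmetic only recombines the figures given in this section:
\[
  92{,}996 + 46{,}156 = 139{,}152, \qquad
  139{,}152 - 81{,}226 = 57{,}926, \qquad
  \frac{139{,}152}{540{,}065} \approx 25.8\% .
\]
That is, 57,926 records reach the harvester already converted by their providers, and OLAC and DCMI-terms records together make up roughly a quarter of the complete dataset.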
62
63
64	\begin{table}
65	\caption{Top 20 CMD profiles, with the respective number of records}
66	\label{tab:cmd-profiles}
67	\begin{center}
68	\begin{tabular}{ r l }
69	\hline
70	\# records & profile \\
71	\hline
72	155,403 & Song \\
73	138,257 & Session \\
74	92,996 & OLAC-DcmiTerms \\
75	46,156 & DcmiTerms \\
76	28,448 & SongScan \\
77	21,256 & SourceScan \\
78	19,059 & LiteraryCorpusProfile \\
79	16,519 & Source \\
80	13,626 & imdi-corpus \\
81	10,610 & media-session-profile \\
82	7,961 & SongAudio \\
83	7,557 & SymbolicMusicNotation \\
84	4,485 & LCC DataProviderProfile \\
85	4,485 & SourceProfile \\
86	4,417 & Text \\
87	1,982 & Soundbites-recording \\
88	1,530 & Performer \\
89	1,475 & ArthurianFiction \\
90	939 & LrtInventoryResource \\
91	873 & teiHeader \\
92	\hline
93	\end{tabular}
94	\end{center}
95	\end{table}
96
97	\begin{table}
98	\caption{Top 20 CMD collections, with the respective number of records}
99	\begin{center}
100	\begin{tabular}{ r l }
101	\hline
102	\# records & collection \\
103	\hline
104	243,129 & Meertens collection: Liederenbank \\
105	46,658 & DK-CLARIN Repository \\
106	46,156 & Nederlands Instituut voor Beeld en Geluid Academia collectie \\
107	29,266 & childes \\
108	24,583 & DoBeS archive \\
109	23,185 & Language and Cognition \\
110	14,593 & talkbank \\
111	14,363 & Acquisition \\
112	14,320 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
113	12,893 & MPI CGN \\
114	10,628 & Bavarian Archive for Speech Signals (BAS) \\
115	7,964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures \\
116	7,348 & WALS RefDB \\
117	5,689 & Lund Corpora \\
118	4,640 & Oxford Text Archive \\
119	4,492 & Leipzig Corpora Collection \\
120	3,539 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
121	3,280 & A Digital Archive of Research Papers in Computational Linguistics \\
122	3,147 & CLARIN NL \\
123	3,081 & MPI für Bildungsforschung \\
124	\hline
125	\end{tabular}
126	\end{center}
127	\end{table}
128
129	We can also observe a large disparity in the number of records between individual providers and profiles. Almost half of all records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by the MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from \textit{The Language Archive}). On the other hand, there are 25 profiles that have fewer than 10 instances. This may be due both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource).
130
131
132
133	\section{Other Metadata Formats and Collections }
134
135
136	Riley and Becker \cite{Riley2010seeing} arrange the overwhelming number of existing metadata standards into a systematic, comprehensive overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. Despite its aspiration to comprehensiveness, it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE, TEI?
137
138	The CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology.
139
140
141	\subsection{Dublin Core metadata terms + OLAC}
142	The Dublin Core metadata terms (DC) have been maintained by the Dublin Core Metadata Initiative (DCMI) since 1995; OLAC builds on them.
145
146	``Dublin'' refers to Dublin, Ohio, USA, where the work originated during the 1995 invitational OCLC/NCSA Metadata Workshop, hosted by the Online Computer Library Center (OCLC), a library consortium based in Dublin, and the National Center for Supercomputing Applications (NCSA).
147
148	It comes in two versions: 15 core elements and 55 qualified terms?
149
150	\begin{quotation}
151	Early Dublin Core workshops popularized the idea of "core metadata" for simple and generic resource descriptions. The fifteen-element "Dublin Core" achieved wide dissemination as part of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and has been ratified as IETF RFC 5013, ANSI/NISO Standard Z39.85-2007, and ISO Standard 15836:2009.
152	\end{quotation}
153
154
155
156	Given its simplicity, it serves as the common denominator in many applications; among others, it is the base format of the OAI-PMH protocol.
157
158	It is required/expected as the base
159	openarchives register: \url{http://www.openarchives.org/Register/BrowseSites}
160	2006 OAI-repositories
161
162	DublinCore Resource Types\furl{http://dublincore.org/documents/resource-typelist/}
163
164	DublinCore to RDF mapping\furl{http://dublincore.org/documents/dcq-rdf-xml/}
165
166	\label{def:OLAC}
167
168	The \xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001,Simons2003OLAC} is a more specialized version of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community:
169
170	\begin{quotation}
171	Uniform description across archives is ensured by limiting the values of certain metadata elements to the use of terms from agreed-upon controlled vocabularies. [\dots] OLAC adds encoding schemes that are designed specifically for describing language resources, such as subject language and linguistic data type.
172	\end{quotation}
173
174	The \xne{OLAC Metadata} is the set of metadata elements that archives participating in OLAC have agreed to use for describing language resources.
175
176	\todoin{check http://www.language-archives.org/OLAC/metadata.html}
177
178	OLAC archives contain over 100,000 records, covering resources in half of the world's living languages.\furl{http://www.language-archives.org/}
180
181	Most of the OLAC records are integrated into CMDI (cf. Table \ref{tab:cmd-profiles}, \ref{reports:OLAC}).
182
183
184	\subsection{TEI / teiHeader}
185	\label{def:tei}
186
187	\begin{quotation}
188	The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form.
189	\end{quotation}
190
191	\url{http://www.tei-c.org/}
192
193	TEI is a de facto standard for encoding any kind of digital textual resource and has been developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For describing the text and encoding its metadata, the complex top-level element \code{teiHeader} is provided. TEI is descriptive rather than prescriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and their inner structure, so that custom schemas adapted to a project's needs can be generated.
194
195	Consequently, there is no single fixed \xne{teiHeader} either.
196
197	TEI/teiHeader/ODD,
198
199
200
201	\subsection{ISLE/IMDI}
202
203	IMDI = ISLE Metadata
204	http://www.mpi.nl/imdi/
205
206	The ISLE Meta Data Initiative (IMDI) is a proposed metadata standard to describe multi-media and multi-modal language resources. The standard provides interoperability for browsable and searchable corpus structures and resource descriptions with help of specific tools.
207
208	IMDI is the predecessor of CMDI.
209
210	\subsection{MODS/METS}
211
212	The Metadata Encoding and Transmission Standard (METS) is an XML schema for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
213
214	The Metadata Object Description Schema (MODS) is a schema for a bibliographic element set that may be used for a variety of purposes, particularly for library applications.
215
216	\subsection{ESE, Europeana Data Model - EDM}
217
218	ESE Europeana Semantic Elements-
219
220	EDM\furl{http://europeana.ontotext.com/resource/edm/hasType?role=all} \cite{doerr2010europeana}
221
222
223	The Linked Data approach will play a major role in the European Digital Library (\url{http://europeana.eu}) and solutions that can handle data expressed in the newly created, RDF-based Europeana Data Model (EDM) are currently being investigated. This report summarizes the results of a study we performed on existing RDF stores, in the context of Europeana and encompasses the following contributions [\dots]
231
232
233	data.europeana.eu: The Europeana Linked Open Data Pilot\cite{haslhofer2011data}
234
235	\subsection{META-SHARE}
236	\label{def:META-SHARE}
237	Within the project META-SHARE format
238
239	META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata Framework, the META-SHARE model imposes a single large schema for all resource types, with a minimal core subset of obligatory metadata elements and many optional components.
240	%In cooperation between metadata teams from CLARIN and META-SHARE
241
242	The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as four CMD profiles, one for each resource type, with all four sharing most of their components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), so a specific instance will never use all the possible elements.
243
244	MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
245
246
247	\subsection{Other}
248
249	OAI-ORE - is this a schema?
250
251
252
253	\section{Ontologies, Controlled Vocabularies, Reference Data, Authority Files}
254	\label{refdata}
255
256	Based on popular demand, the work on reference data for the SSH-community should cover at least the following dimensions (with tentative denominations of corresponding existing vocabularies):
257
258	\begin{itemize}
259	\item Data Categories / Concepts - ISOcat
260	\item Languages - ISO-639
261	\item Countries - country codes
262	\item Persons - GND, VIAF
263	\item Organizations - GND, VIAF
264	\item Schlagwörter/Subjects - GND, LCSH
265	\item Resource Typology -
266	\end{itemize}
267
268	AAT - Art \& Architecture Thesaurus
269	GND - Gemeinsame Normdatei (GND ontology\furl{http://d-nb.info/standards/elementset/gnd})
270	GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
271	VIAF - Virtual International Authority File
272
273
274	Other related relevant activities and initiatives
275
276	A broader collection of related initiatives can be found at the German National Library website:
277	\furl{http://www.dnb.de/DE/Standardisierung/LinksAFS/linksafs_node.html}
278	FRBR - Functional Requirements for Bibliographic Records 2002 \cite{FRBR1998}
279
280	RDA - Resource Description and Access
281	http://metadaten-twr.org/ - Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)
282	At MPDL, within the escidoc publication platform there seems to be (work on) a service (since 2009 !) for controlled vocabularies: \furl{http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities}
283	Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities -- developed at the New Zealand Electronic Text Centre (NZETC).
284	http://eats.readthedocs.org/en/latest/
285
286
287	\subsection{ISOcat - Data Category Registry}
288
289	ISO12620
290
291	\subsection{Classification Schemes, Taxonomies }
292	LCSH, DDC
293
294
295	\subsection{Other controlled Vocabularies}
296
297	Language codes ISO-639-1
298
299	\subsection{Domain Ontologies, Vocabularies}
300	Organization-Lists
301	LT-World !?
302
303
304	\subsubsection{LT-World}
305	Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology that is being developed at DFKI\footnote{\textit{Deutsches Forschungszentrum für Künstliche Intelligenz} - \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study \cite{Joerg2010}.
306
307
308
309	\section{LRT Metadata Catalogs/Collections}
310	\label{sec:lrt-md-catalogs}
311	\todoin{Overview of catalogs, name, since, \#providers, \#resources}
312
313	\todoin{[DFKI/LT-World] - collection or ontology}
314
315	\subsection{CMDI}
316	collections, profiles/Terms, ResourceTypes!
317
318	\subsection{OLAC}
319
320	\subsection{LAT, TLA}
321	Language Archiving Technology, now The Language Archive - provided by the Max Planck Institute for Psycholinguistics\footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}}
322
323	\subsection{META-NET}
324
325
326
327	\begin{quotation}
328	META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
329
330	META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).
331
332	\end{quotation}
333
334	The distributed network of repositories consists of a number of member repositories, each of which offers its own subset of resources.
335
336	A few\footnote{7 as of 2013-07} of the member repositories play the role of managing nodes, providing ``a core set of services critical to the whole of the META-SHARE network'' \cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
337	The whole network offers approximately 2,000 resources (the numbers differ even across individual managing nodes).
338
339
340	MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
341
342
343
344	\subsection{ELRA}
345
346	European Language Resources Association
347
348	\furl{http://elra.info}
349
350
351	ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, we offer a range of services, listed below and described in the ``Services around Language Resources'' section:
352
353
354	http://www.elda.org/
355	Evaluations and Language resources Distribution Agency
356
357	ELDA -- Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community. Besides, ELDA is involved in HLT evaluation campaigns.
358
359	ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.
360
361	ELRA Catalog
362
363	http://catalog.elra.info/
364
365
366	Universal Catalog+
367	Universal Catalogue is a repository comprising information regarding Language Resources (LRs) identified all over the world.
368
369
370	\subsection{Other}
371
372
373	\begin{description}
374	\item[LDC] Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/}
375	\item[OTA LR] Archiving Service provided by Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/}
376	\end{description}
377
378	\section{Other Metadata Catalogs/Collections}
379	\label{sec:other-md-catalogs}
380
381	\subsection{(Digital) Libraries}
382
383
384	General (Libraries, Federations):
385
386	\begin{description}
387	\item[OCLC] \url{http://www.oclc.org}
388	world's biggest Library Federation
389	\item[LoC] Library of Congress \url{http://www.loc.gov}
390	\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections_en.htm}
391	\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
392	\end{description}
393
394
395
396
397	\section{Summary}
398
399	In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
400
