
\chapter{Analysis of the data landscape}
\label{ch:data}
This chapter gives an overview of existing metadata standards and formats in the field of Language Resources and Technology (LRT), together with a description of their characteristics and their use in the relevant initiatives and data collections. Special attention is paid to the Component Metadata Framework, the base data model of the infrastructure this work is part of.


\section{Component Metadata Framework}
\label{def:CMD}

The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN Component Metadata Infrastructure. (See \ref{def:CMDI} for information about the infrastructure. The XML schema defining CMD -- the \xne{general-component-schema} -- is given in appendix \ref{lst:cmd-schema}.)
CMD is used to define so-called \var{profiles}, which are constructed out of reusable \var{components} -- collections of metadata fields. Components can contain other components, and they can be reused in multiple profiles. A profile itself is just a special kind of component (a subclass) with some additional administrative information.
The core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category (cf. \ref{def:DCR}\footnote{In short: a persistently referenceable concept definition.}), thus
indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.
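
To make the model more tangible, the following listing sketches what a minimal profile specification might look like. It is an illustration only: we assume the element and attribute names of the \xne{general-component-schema} (\code{CMD\_ComponentSpec}, \code{CMD\_Component}, \code{CMD\_Element}, \code{ConceptLink}), while the profile and component names, both identifiers and the ISOcat data category are hypothetical.

\begin{lstlisting}[language=XML, caption={Illustrative sketch of a CMD profile specification (hypothetical names and identifiers)}]
<CMD_ComponentSpec isProfile="true">
  <Header>
    <!-- hypothetical identifier, not a registered profile -->
    <ID>clarin.eu:cr1:p_0000000000000</ID>
    <Name>ExampleProfile</Name>
  </Header>
  <CMD_Component name="ExampleProfile" CardinalityMin="1" CardinalityMax="1">
    <!-- an element (metadata field) pointing to exactly one data category -->
    <CMD_Element name="Title" ValueScheme="string"
                 ConceptLink="http://purl.org/dc/terms/title"
                 CardinalityMin="1" CardinalityMax="1"/>
    <!-- a reusable component; the same ComponentId may be included elsewhere -->
    <CMD_Component name="Contact" ComponentId="clarin.eu:cr1:c_0000000000000"
                   CardinalityMin="0" CardinalityMax="unbounded">
      <!-- DC-0000 is a placeholder, not an actual ISOcat data category -->
      <CMD_Element name="Email" ValueScheme="string"
                   ConceptLink="http://www.isocat.org/datcat/DC-0000"
                   CardinalityMin="0" CardinalityMax="1"/>
    </CMD_Component>
  </CMD_Component>
</CMD_ComponentSpec>
\end{lstlisting}

The sketch shows the two building blocks named above: a nested, reusable component and elements that each carry a single concept link.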

%This approach of integrating prerequisites for semantic interoperability directly into the process of metadata creation is fundamentally different from the traditional methods of schema matching that try to establish pairwise alignments between already existing schemas -- be it algorithm-based or by means of explicit manually defined crosswalks\cite{Shvaiko2005}.

While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted as well (so-called ``trusted registries''), most notably the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.

Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records.
The generated schema also conveys, as annotations, the information about the referenced data categories.
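
As a rough sketch of this step, an element of a profile could surface in the generated schema as shown below. We assume here that the data category reference is carried by a \code{dcr:datcat} attribute on the element declaration; the fragment is heavily simplified (the envelope of an actual CMD record is omitted), and the element names and the data category identifier are placeholders rather than actual registry entries.

\begin{lstlisting}[language=XML, caption={Simplified sketch of a generated schema fragment with a data category annotation (placeholder names)}]
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dcr="http://www.isocat.org/ns/dcr">
  <xs:element name="ExampleProfile">
    <xs:complexType>
      <xs:sequence>
        <!-- DC-0000 is a placeholder, not an actual data category -->
        <xs:element name="Title" type="xs:string"
                    minOccurs="1" maxOccurs="1"
                    dcr:datcat="http://www.isocat.org/datcat/DC-0000"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
\end{lstlisting}

Because the annotation travels with the schema, a consumer of instance records can resolve the semantics of a field from the schema alone.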


\subsection{CMD Profiles}
The Component Registry (CR) currently contains 124\footnote{All numbers are as of 2013-06 unless stated otherwise.} public profiles and 696 components. Table \ref{table:dev_profiles} shows the development of the CR and DCR population over time.

Next to the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, like OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles demonstrate the flexibility and expressivity of the CMD metamodel. The individual profiles also differ considerably in their structure: next to flat profiles with just one level of components or elements and 5 to 20 fields (\concept{dublincore}, \concept{collection}, the set of \concept{Bamdes} profiles), there are complex profiles with up to 10 levels (\concept{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 117 distinct components and 337 elements (or 419 components and 1,587 elements when expanded\footnote{The reusability of components results in an element expansion, i.e., the elements of a component (e.g. \concept{Contact}) included by three other components (\concept{Project}, \concept{Institution}, \concept{Access}) will appear three times in the instantiated record.}).


\begin{table}
\caption{Development of defined profiles and data categories (DCs) over time}
\label{table:dev_profiles}
%	  \begin{tabular}{ l | r | r | r | r }
  \begin{tabular}{ l r r r r }
    \hline
 Date & 2011-01 & 2012-06 & 2013-01 & 2013-06 \\
    \hline
 Profiles & 40 & 53 & 87 & 124 \\
 Distinct Components & 164 & 298 & 542 & 828 \\
 Expanded Components & 1055 & 1536 & 2904 & 5757 \\
 Distinct Elements & 511 & 893 & 1505 & 2399 \\
 Expanded Elements & 1971 & 3030 & 5754 & 13232 \\
 Distinct data categories & 203 & 266 & 436 & 499 \\
 Data categories in the Metadata profile & 277 & 712 & 774 & 791 \\
 Ratio of elements without DCs & 24.7\% & 17.6\% & 21.5\% & 26.5\% \\
 Components with DCs & 28 & 67 & 115 & 140 \\
    \hline
  \end{tabular}
\end{table}


\subsection{Instance Data}

%\todoin{ add historical perspective on data - list overall}

The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
collects records from 69 providers on a daily basis. The complete dataset amounts to 540,065 records.
Of these providers, 16 offer CMDI records; the other 53 provide OLAC/DC records\label{info:olac-records}, which are converted into the corresponding CMD profile after harvesting. In addition to these 81,226 original OLAC records, a few providers offer their OLAC or DCMI-terms records already converted into CMDI, so that OLAC and DCMI-terms records amount to 139,152 in total.
On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different profiles). We thus encounter both situations: one profile being used by many providers, and one provider using many profiles.


\begin{table}
\caption{Top 20 CMD profiles, with the respective number of records}
\label{tab:cmd-profiles}
\begin{center}
  \begin{tabu}{ r l }
\hline
\rowfont{\itshape\small} \# records & profile \\
\hline
155,403 & Song \\
138,257 & Session \\
92,996 & OLAC-DcmiTerms \\
46,156 & DcmiTerms \\
28,448 & SongScan \\
21,256 & SourceScan \\
19,059 & LiteraryCorpusProfile \\
16,519 & Source \\
13,626 & imdi-corpus \\
10,610 & media-session-profile \\
7,961 & SongAudio \\
7,557 & SymbolicMusicNotation \\
4,485 & LCC DataProviderProfile \\
4,485 & SourceProfile \\
4,417 & Text \\
1,982 & Soundbites-recording \\
1,530 & Performer \\
1,475 & ArthurianFiction \\
939 & LrtInventoryResource \\
873 & teiHeader \\
\hline
  \end{tabu}
\end{center}
\end{table}

\begin{table}
\caption{Top 20 CMD collections, with the respective number of records}
\begin{center}
  \begin{tabu}{ r l }
\hline
\rowfont{\itshape\small} \# records & collection \\
\hline
243,129 & Meertens collection: Liederenbank \\
46,658 & DK-CLARIN Repository \\
46,156 & Nederlands Instituut voor Beeld en Geluid Academia collectie \\
29,266 & childes \\
24,583 & DoBeS archive \\
23,185 & Language and Cognition \\
14,593 & talkbank \\
14,363 & Acquisition \\
14,320 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
12,893 & MPI CGN \\
10,628 & Bavarian Archive for Speech Signals (BAS) \\
7,964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures \\
7,348 & WALS RefDB \\
5,689 & Lund Corpora \\
4,640 & Oxford Text Archive \\
4,492 & Leipzig Corpora Collection \\
3,539 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
3,280 & A Digital Archive of Research Papers in Computational Linguistics \\
3,147 & CLARIN NL \\
3,081 & MPI für Bildungsforschung \\
\hline
  \end{tabu}
\end{center}
\end{table}

We can also observe a large disparity in the number of records between individual providers and profiles. Almost half of all records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by the MPI for Psycholinguistics (\textit{corpus} and \textit{Session} records from \textit{The Language Archive}). On the other hand, there are 25 profiles that have fewer than 10 instances. This may be due both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource).



\section{Other LRT Metadata Formats and Collections}
\label{sec:lrt-md-catalogs}

Next to CLARIN and CMDI, there is a host of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, sketch their ties with CMDI and existing integration efforts.

For a comprehensive overview of formats and standards, the CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} surveys standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} arranges the overwhelming number of existing metadata standards into a systematic visual overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. Despite its aspiration to comprehensiveness, however, it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE.


\subsection{Dublin Core metadata terms}
The work on this metadata format started in 1995 at the Metadata Workshop\furl{http://dublincore.org/workshops/dc1/} organized by OCLC/NCSA in Dublin, Ohio, USA. It is nowadays maintained by the Dublin Core Metadata Initiative.

It is a fixed set of terms for a basic generic description of a range of resources (both virtual and physical), coming in two versions\furl{http://dublincore.org/documents/dcmi-terms/}:
\begin{description}
\item[Dublin Core Metadata Element Set (DCMES)] namespace: \code{/elements/1.1/}\\
 the original set of 15 terms, standardized as IETF RFC 5013, ISO Standard 15836:2009 and NISO Standard Z39.85-2007
\item[Dublin Core metadata terms] namespace: \code{/terms/} \\
the extended `Qualified' set of 55 terms, extending the original 15 (which are replicated in the new namespace for consistency)
\end{description}

The DCMI terms format is very widely used nowadays. Thanks to its simplicity it serves as the common denominator in many applications: content management systems integrate Dublin Core for use in \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description in content repositories (Fedora Commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2,100 data providers.
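
For illustration, a Dublin Core description as exchanged via OAI-PMH might look as follows. This is a minimal sketch of the metadata payload only (the enclosing OAI-PMH response is omitted), using the Dublin Core container format prescribed by the protocol; the bibliographic values are merely sample data.

\begin{lstlisting}[language=XML, caption={Minimal sketch of a Dublin Core record as carried in OAI-PMH (sample values)}]
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- each field is one of the 15 DCMES terms; all are optional and repeatable -->
  <dc:title>Language</dc:title>
  <dc:creator>Bloomfield, Leonard</dc:creator>
  <dc:publisher>New York: Holt</dc:publisher>
  <dc:date>1933</dc:date>
</oai_dc:dc>
\end{lstlisting}

The flat list of optional fields shows both why the format is so easy to adopt and why it offers little room for domain-specific detail.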

There are multiple possible serializations; in particular, a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
Also worth noting is Dublin Core's take on the classification of resources\furl{http://dublincore.org/documents/resource-typelist/}.

The simplicity of the format is also its main drawback when it is considered as a metadata format for research communities. It is too general to capture all the specific details with which individual research groups need to describe different kinds of resources.

\subsection{OLAC}
\label{def:OLAC}

The \xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, which provides a common platform and infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.

The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field}, \code{role}, \code{linguistic-type}, \code{language}, \code{discourse-type}).

\begin{quotation}
Uniform description across archives is ensured by limiting the values of certain metadata elements to the use of terms from agreed-upon controlled vocabularies. [\dots] OLAC adds encoding schemes that are designed specifically for describing language resources, such as subject language and linguistic data type.
\end{quotation}

\lstset{language=XML}
\begin{lstlisting}[label=lst:sampleolac, caption=Sample OLAC record]
<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
           xmlns="http://purl.org/dc/elements/1.1/">
  <!-- unprefixed elements are in the default (Dublin Core) namespace -->
  <creator>Bloomfield, Leonard</creator>
  <date>1933</date>
  <title>Language</title>
  <publisher>New York: Holt</publisher>
</olac:olac>
\end{lstlisting}

OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''.

Note that OLAC archives are harvested by the CLARIN harvester and that OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).



\subsection{TEI / teiHeader}
\label{def:tei}

\begin{quotation}
The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots [Next to] its chief deliverable[,] a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abridged]
\end{quotation}

TEI is a de-facto standard for encoding any kind of digital textual resource and has been developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For text description and metadata encoding (our main concern), the complex top-level element \code{teiHeader} is provided. TEI is descriptive rather than prescriptive: it does not provide one fixed schema, but allows for a certain flexibility with respect to the elements used and their inner structure, making it possible to generate custom schemas adapted to a project's needs. Thus there is also not just one fixed \code{teiHeader}.

Among the data collections encoded in TEI are the corpora of the DWDS\furl{http://www.dwds.de}, the Deutsches Textarchiv\furl{http://www.dwds.de/dta} \cite{Geyken2011deutsches}, and the Oxford Text Archive\furl{http://ota.oucs.ox.ac.uk/}.

There has been intense cooperation between the TEI and CMDI communities on the issue of interoperability, and multiple efforts to express teiHeader in CMDI have been undertaken (cf. \ref{results:tei}) as a starting point for integrating TEI-based data into the CLARIN infrastructure.


\subsection{ISLE/IMDI -- The Language Archive}

\xne{IMDI}\furl{http://www.mpi.nl/imdi/} (\xne{EAGLES/ISLE Meta Data Initiative}) is an elaborate format for detailed descriptions of multi-media/multi-modal language resources, developed within the corresponding project \cite{wittenburg2000eagles} from 2000 to 2003.

To serve the main goal of the project -- easing access to language resources and fostering their reuse -- resource descriptions in this new format were created for a number of collections and made available via a dedicated \xne{IMDI browser}\furl{http://corpus1.mpi.nl/ds/imdi_browser/}, which allowed browsing the collection structure as well as complex advanced search over the deeply structured metadata. A metadata editor was also developed for generating records in this format, with provisions for offline fieldwork and synchronization with the repository.

The project was led by the Technical Group (TG) at the MPI for Psycholinguistics, which was also responsible for running the repository and the whole infrastructure. Since its foundation (together with the Institute itself) in the 1970s\furl{http://tla.mpi.nl/home/history/}, the group has engaged in a number of projects aimed at building a stable technical infrastructure for long-term archiving of and work with language resources. Recently, the group and the established infrastructure have been renamed \xne{TLA -- The Language Archive}\furl{http://tla.mpi.nl/} (``Your partner for language data, tools and archiving''), a single platform on which the host of language resources and their descriptions are preserved and provided, and on which tools for working with this data are offered. The archive is also an aggregator itself, offering various collections from different (also external) projects (like DOBES, CGN, RELISH, etc.).

IMDI can be seen as the predecessor of CMDI, the team of the TG being the driving force behind the development of both. An \xne{imdi-session} profile, the corresponding IMDI to CMDI conversion,
as well as the transformed records were among the first to be added to the new CMD Infrastructure in 2010. The statistics
of CMDI records list around 138,000 \xne{Session} records and around 13,000 \xne{imdi-corpus} records, the latter modelling the collections for the sessions. The metadata editor \xne{Arbil} was also refactored to work with the new data model.


\subsection{META-SHARE}
\label{def:META-SHARE}

META-SHARE (2010--2013) was the subproject covering the technical aspects of META-NET, a Network of Excellence consisting of 60 research centres from 34 countries.


\begin{quotation}
META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.

\end{quotation}

Within the META-SHARE project, a new metadata format was developed \cite{Gavrilidou2012meta}. Although inspired by Component Metadata, META-SHARE metadata imposes a single large schema for all resource types, with a subset of core obligatory elements and many optional components.
%In cooperation between metadata teams from CLARIN and META-SHARE

The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as four CMD profiles, one for each resource type, with all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1,587 elements. However, many of the components and elements are optional (and conditional), so a specific instance will never use all the possible elements. (See \ref{reports-meta-share} for more details about the format based on its integration into CMDI.)

The technical infrastructure of META-SHARE is a distributed network of member repositories, each offering its own subset of resources\furl{http://www.meta-share.eu/}.

Selected member repositories\footnote{7 as of 2013-07} play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network'' \cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
The whole network offers approximately 2,000 resources (the numbers differ even across individual managing nodes).

One point of criticism from the community was the fact that the META-SHARE infrastructure does not provide any interface to the outside world, such as an OAI-PMH endpoint.

%? MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}


\subsection{ELRA}

The European Language Resources Association\furl{http://elra.info} (ELRA) offers a large collection of language resources (over 1,100) with a focus on spoken resources, but also written, terminological and multimodal resources, mostly licensed for a fee (although selected resources are available for free as well).
The available datasets can be searched via the ELRA Catalog\furl{http://catalog.elra.info/}.
Additionally, ELRA runs the so-called \xne{Universal Catalog} -- a repository comprising information regarding Language Resources (LRs) identified all over the world.

\begin{quotation}
ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies.

ELDA\furl{http://www.elda.org/} -- Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community.

ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and
drafts and concludes distribution agreements on behalf of ELRA.
\end{quotation}

\subsection{LDC}

The Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/} (LDC), hosted by the University of Pennsylvania, is another provider/aggregator of high-quality curated language resources. The data is provided for a fee; more than 650 resources have been made available since 1993. The catalog is freely accessible. The metadata is additionally aggregated by OLAC.

\section{Formats and Collections in the World of Libraries}
\label{sec:lib-formats}

There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition, implying rich experience, and the fact that almost all of the resources in libraries are language resources. This argument becomes even more relevant in light of the efforts, pursued by many (national) libraries in recent years, to digitize large portions of their material (cf. the discussion on libraries partnering with Google). And given the amounts of data, even the bibliographic records alone constitute sizable language resources in their own right.

%\item[LoC] Library of Congress \url{http://www.loc.gov}
%\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_ en.htm}
%\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
%\end{description}

\subsection{Formats -- MARC, METS, MODS}

There is a long tradition of standardized metadata formats in the world of Libraries and Information Systems (LIS), with the Library of Congress\furl{http://www.loc.gov/standards/} having played a major role in the standardization for decades.

The \xne{MARC}\furl{http://www.loc.gov/marc/} set of formats (in use since the 1970s) ``are standards for the representation and communication of bibliographic and related information in machine-readable form''. A number of variants developed over the years; the most widespread is \xne{MARC 21} (since 1999), the standard format used for communication among libraries around the world.

MARC 21 consists of five ``communication formats'' for specific types of data (Bibliographic, Authority Data, Holdings Data, Classification, and Community Information). In 2002, the Library of Congress developed the \xne{MARCXML} schema for representing MARC records in XML.

\xne{METS -- Metadata Encoding and Transmission Standard} is a format from the family of Library of Congress standards (since 2001) for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
It is dedicated primarily to capturing the structure of digital objects, to ``record the various relationships that exist between pieces of content, and between the content and metadata that compose a digital library object'' \cite{mets2010manual}.
A METS record acts as a flexible container that accommodates other pieces of data (different levels of metadata and the encoded objects themselves or references to those) in external formats\furl{http://www.loc.gov/standards/mets/mets-extenders.html}.

A number of tools have been developed to author and process the \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html}, and numerous projects (online editions, DAM systems) use METS for structuring and recording their data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html}, though the list seems rather outdated.}, among others \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}.

The \xne{Metadata Object Description Schema} (MODS) ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21 using language-based tags rather than numeric ones,
yet richer than Dublin Core. It is one of the endorsed schemas for extending (i.e. being used inside) METS.

There have been efforts to create a conceptually sounder basis for bibliographic data. In 1998, \xne{Functional Requirements for Bibliographic Records} (FRBR) \cite{FRBR1998} was published, an abstract model for the data expressed as an Entity Relationship Model. A standard based on FRBR, \xne{Resource Description and Access} (RDA), has been proposed as a comprehensive standard for resource description and discovery; it was, however, met with opposition from the LIS community, which questioned the need to abandon established cataloging practices \cite{gorman2007rda}.
Although work on RDA continues, among others at the Library of Congress, there has been no wider adoption of the standard by the LIS community so far.

\subsection{ESE, Europeana Data Model - EDM}

Within the large European initiative \xne{Europeana} (cf. \ref{lit:digi-lib}), information about digitised objects is collected from a great number of cultural institutions from all over Europe; it currently hosts information about 29 million objects from 2,200 institutions in 36 countries\furl{http://www.pro.europeana.eu/web/guest/content}.

For collecting metadata from the content providers, Europeana originally developed and recommended the common format \xne{ESE -- Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation}, a Dublin Core-based application profile\furl{http://www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. It soon became obvious that this format was too limiting, and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model (EDM)\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana,haslhofer2011data,doerr2010europeana}.
EDM is fully compatible with ESE, which is (and will continue to be) accepted from the providers. There is also already a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} for exploring the Europeana data in the new format.
%https://github.com/europeana

%%%%%%%%%%%%%%%%%%
\section{Controlled Vocabularies, Reference Data, Ontologies}
\label{refdata}

One goal of this work being to lay the groundwork for exposing the discussed dataset in the Semantic Web,
one preparatory task is to identify external semantic resources, like controlled vocabularies or ontologies, that the dataset could be linked with\footnote{A similar activity of taking inventory of vocabularies and thesauri was conducted in the context of the \xne{Europeana} initiative:
\url{http://europeanalabs.eu/wiki/WP12Vocabularies}, \url{https://europeanalabs.eu/wiki/DesignSemanticThesauri}.}.

Conceptually, we want to partition these resources into two types: on the one hand, abstract concepts constituting all kinds of classifications, typologies and taxonomies; on the other hand, named entities that exist(ed) in the real world, like persons, organizations or geographical places. The main motivation for this distinction is the insight that, while for named entities there is (mostly) ``something'' in the (physical) world that provides solid ground for equivalence relations between references from different sources (sameAs), for concepts we need to accept a plurality of existing conceptualizations; and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach: simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.

In the following, we take inventory of such resources (cf. tables \ref{table:data-ne}, \ref{table:data-concepts}) covering the domains expected to be needed for linking the original dataset. (Information about the size of each dataset is meant as a rough indication of its ``general weight'', not necessarily as precise, up-to-date information.) The acronyms in the tables are resolved in the glossary \ref{table:vocab-glossary}.
How these resources will be employed is discussed in \ref{sec:values2entities}.
Additionally, some more detailed commentary follows.
308 | |
---|
309 | %\subsubsection{Named entities} |
---|
310 | |
---|
311 | The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called Virtual International Authority File (VIAF), a huge resource in which entries from different authority files that refer to the same entity are merged. This resource can be explored via a search interface, and there is also a search service for applications. |
---|
312 | Other general large-scale resources are the vocabularies curated and provided by the Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}; however, free access is limited, and full access requires a paid license. Recently, though, work was announced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}. |
---|
313 | |
---|
314 | Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology and developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (institutions, persons, projects, tools, etc.) in this field of study \cite{Joerg2010}. |
---|
315 | |
---|
316 | Also worth mentioning is \xne{Yago}\furl{http://www.mpi-inf.mpg.de/yago-naga/yago/}, a large knowledge base created by MPI Informatik that integrates DBpedia, GeoNames and WordNet \cite{Suchanek2007yago}. |
---|
317 | |
---|
318 | Overall, we witness a strong general trend towards the Semantic Web and Linked Open Data. |
---|
319 | |
---|
320 | %Next to these ``global big players'' there are a number of other initiatives on different scale dedicated to a more specific domain. |
---|
321 | |
---|
322 | %Resources that contain different types of data (e.g. persons, places and classifications like GND or Yago) are divided and mentioned in individual tables by type. |
---|
323 | |
---|
324 | %\subsection{Concepts -- Classifications, Taxonomies, \dots} |
---|
325 | |
---|
326 | |
---|
327 | \begin{comment} |
---|
328 | |
---|
329 | VoID (``Vocabulary of Interlinked Datasets'') is an RDF-based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID} |
---|
330 | |
---|
331 | \subsection{schema.org} |
---|
332 | http://schema.org/docs/datamodel.html |
---|
333 | http://www.w3.org/wiki/WebSchemas/ExternalEnumerations |
---|
334 | |
---|
335 | microdata or |
---|
336 | http://www.w3.org/TR/rdfa-lite/ |
---|
337 | Resource Description Framework in attributes |
---|
338 | |
---|
339 | the entire WorldCat cataloging collection made publicly |
---|
340 | available using Schema.org mark-up with library extensions for use by developers and |
---|
341 | search partners such as Bing, Google, Yahoo! and Yandex |
---|
342 | |
---|
343 | OCLC begins adding linked data to WorldCat by appending |
---|
344 | Schema.org descriptive mark-up to WorldCat.org pages, thereby |
---|
345 | making OCLC member library data available for use by intelligent |
---|
346 | Web crawlers such as Google and Bing |
---|
347 | |
---|
348 | \end{comment} |
---|
349 | |
---|
350 | \section{Summary} |
---|
351 | |
---|
352 | In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology. |
---|
353 | We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications), which are needed as input in section \ref{sec:values2entities} on mapping values to entities. |
---|
354 | |
---|
355 | |
---|
356 | |
---|
357 | \begin{landscape} |
---|
358 | \begin{table} |
---|
359 | \caption{Controlled vocabularies of named entities -- Persons, Organizations, Works, Language Names, Geographica} |
---|
360 | \label{table:data-ne} |
---|
361 | % \begin{tabu}{ p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} } |
---|
362 | \begin{tabu}{ >{\sffamily}l l r X X} |
---|
363 | \hline |
---|
364 | \rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\ |
---|
365 | \hline |
---|
366 | VIAF & OCLC + NatLibs & $\gg$ 1E7 & union of national authority files & search service, search app \\ |
---|
367 | GND/p & DNB & 4.6E6 & Persons, universal, lang:de & \href{http://d-nb.info/standards/elementset/gnd}{GND ontology}\\ |
---|
368 | GND/k & '' & 1.2E6 & Organizations, universal, lang:de & \\ |
---|
369 | GND/w & '' & 193,000 & Works, lang:de & \\ |
---|
370 | GND/g & '' & 293,000 & Geographica, lang:de & \\ |
---|
371 | ULAN & Getty & 202,720 / 638,900 & persons, artists & \\ |
---|
372 | TGN & Getty & 992,310 / 1.7E6 & also historical place names & \href{http://www.getty.edu/research/tools/vocabularies/index.html}{web search} \\ |
---|
373 | %CONA & Getty & & records for cultural works & \\ |
---|
374 | dbpedia & Wikipedia & $\sim$ 4E6 & all kinds of entities in up to 111 langs & \href{http://wiki.dbpedia.org/Downloads}{data dumps}, \href{http://dbpedia-live.openlinksw.com/sparql}{live SPARQL endpoint} \\ |
---|
375 | & & \multicolumn{3}{l}{764,000 persons; 333,000 works; 192,000 organizations; 639,000 geographica } \\ |
---|
376 | Yago \cite{Suchanek2007yago} & MPI Informatik & 1E7 / 1.2E8 & huge semantic KB (aggregated from Wikipedia, Wordnet, Geonames) & \href{http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html}{data dumps} \\ |
---|
377 | \href{http://lt-world.de}{LT-World} & DFKI & 3,300 persons & ontology-based portal for LRT & \href{http://www.lt-world.org/kb/}{portal} \\ |
---|
378 | & & 4,600 organizations & & \\ |
---|
379 | Geonames & Geonames & \textgreater 1E7 (2.8E6 / 5.5E6) & ``modern'' place names & data dump + web service \\ |
---|
380 | PKND & prometheus & \textgreater 37,000 & persons, artists & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{XML dump} \\ |
---|
381 | \href{http://gazetteer.dainst.org/}{iDAI.gazetteer} & DAI & & archaeologically relevant places & search interface \\ |
---|
382 | %Pelagios & AIT & 25 datasets & search over 25 datasets of archeologically relevant places & API\furl{https://github.com/pelagios/pelagios-cookbook/wiki/Using-the-Pelagios-API} \\ |
---|
383 | \href{http://pleiades.stoa.org}{Pleiades} & & 34,000 & a community-built gazetteer and graph of ancient places & CSV, KML and RDF data dumps \\ |
---|
384 | LCCN & LoC & \textgreater 1.2E7 & identifier for bibliographic records & \href{http://authorities.loc.gov/}{search service}, search app \\ |
---|
385 | ISO 3166 & ISO & 249 & Official country codes, lang: en, fr & \\ |
---|
386 | ISO-639-1 & ISO & 185 & basic language codes & \href{http://www.loc.gov/standards/iso639-2/php/English_list.php}{static list} \\ |
---|
387 | ISO-639-3 & SIL & $\sim$ 7,679 & 3-letter code for every human language & \href{http://www-01.sil.org/iso639-3/}{view/download} \\ |
---|
388 | CLAVAS & CLARIN & 2,500 & organization names extracted from CMD records & \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\ |
---|
389 | \hline |
---|
390 | \end{tabu} |
---|
391 | \end{table} |
---|
392 | |
---|
393 | \begin{comment} |
---|
394 | \hline |
---|
395 | \end{tabu} |
---|
396 | \end{table} |
---|
397 | |
---|
398 | \begin{table} |
---|
399 | \caption{Controlled vocabularies of named entities -- Geographica} |
---|
400 | \label{table:data-ne-places} |
---|
401 | |
---|
402 | % \begin{tabu}{ p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} } |
---|
403 | \begin{tabu}{ >{\sffamily}l l r X X} |
---|
404 | \hline |
---|
405 | \rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\ |
---|
406 | |
---|
407 | \end{comment} |
---|
408 | |
---|
409 | |
---|
410 | \begin{table} |
---|
411 | \caption{Taxonomies, Classifications, Thesauri} |
---|
412 | \label{table:data-concepts} |
---|
413 | \begin{tabu}{ >{\sffamily}l l r X X} |
---|
414 | \hline |
---|
415 | \rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\ |
---|
416 | \hline |
---|
417 | AAT & Getty & \href{http://www.getty.edu/research/tools/vocabularies/aat/aat_faq.html}{34,880 / 245,530} & subjects in art and architecture & \\ |
---|
418 | LCSH & LoC & & subjects, universal & \href{http://fast.oclc.org/searchfast/}{FAST} (Faceted Application of Subject Terminology), \href{http://experimental.worldcat.org/fast/}{Linked Data FAST} \\ |
---|
419 | LCC & LoC & & universal hierarchical classification & web app: \href{http://classificationweb.net/}{classification web} \\ |
---|
420 | GND/s & DNB & 202,000 & subjects (Schlagwörter), universal, lang:de & \\ |
---|
421 | GTAA & NISL & 3,800 & subjects, describing TV programs & \href{http://datahub.io/de/dataset/gemeenschappelijke-thesaurus-audiovisuele-archieven}{(RDF) data dumps}, \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\ |
---|
422 | DDC & OCLC & & universal classification by field of study, multi langs & \href{http://dewey.info/}{dewey.info} \\ |
---|
423 | UDC & & & & \\ |
---|
424 | Wiki Categories & Wikipedia & 995,911& classification of Wiki articles as skos:Concepts & SKOS Vocabulary, SPARQL \\ |
---|
425 | DBpedia Ontology & Wikipedia & 529 / 2,333 & general classification of Wiki articles as ontology & \href{http://wiki.dbpedia.org/Ontology39?v=g9b}{RDF data}, SPARQL\\ |
---|
426 | ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts & \href{http://www.isocat.org}{web-app}, service \\ |
---|
427 | Object Names Thes. & British Museum & & classification of objects in the collection & \\ |
---|
428 | Material Thes. & British Museum & & classification of material & \\ |
---|
429 | Thes. Monument Types & British Museum & & types of monuments & \\ |
---|
430 | Hornbostel-Sachs-Systematik & & 300 categories & classification of musical instruments & \href{http://www.music.vt.edu/musicdictionary/texth/Hornbostel-Sachs.html}{web page} \\ |
---|
431 | Oberbegriffsdatei & DMB & & a set of vocabularies for museums, lang:de & \url{museumsvokabular.de}, PDF, XML dumps\\ |
---|
432 | Iconclass & RKD & 28,000 & taxonomy of subject of an image & \href{http://iconclass.org/data/iconclass.20121019.nt.gz}{RDF dump} \\ |
---|
433 | \href{http://dirt.projectbamboo.org/}{DiRT} & Project Bamboo & 32 categories & taxonomy of research tools (1,200 tools) & \\ |
---|
434 | %Scholarly Methods Taxonomy & DARIAH & 100 & research activities in a 2-level hierarchy and brief scope notes & in preparation \\ |
---|
435 | \hline |
---|
436 | \end{tabu} |
---|
437 | \end{table} |
---|
438 | |
---|
439 | \end{landscape} |
---|
440 | |
---|
441 | |
---|
442 | |
---|
443 | \begin{table} |
---|
444 | \caption{Glossary of acronyms used in the overview of controlled vocabularies (tables \ref{table:data-ne}, \ref{table:data-concepts}) } |
---|
445 | \label{table:vocab-glossary} |
---|
446 | |
---|
447 | % \begin{tabu}{ >{\sffamily}l p{0.8\textwidth} |
---|
448 | \begin{tabular}{ >{\sffamily}l p{0.8\textwidth}} |
---|
449 | % \hline |
---|
450 | %\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\ |
---|
451 | % \hline |
---|
452 | |
---|
453 | AAT & Art \& Architecture Thesaurus, Getty \\ |
---|
454 | CONA & Cultural Objects Name Authority \\ |
---|
455 | DAI & Deutsches Archäologisches Institut \\ |
---|
456 | DDC & Dewey Decimal Classification \\ |
---|
457 | DFKI & Deutsches Forschungszentrum für Künstliche Intelligenz \\ |
---|
458 | DMB & Deutscher Museumsbund \\ |
---|
459 | DNB & Deutsche Nationalbibliothek \\ |
---|
460 | FAST & Faceted Application of Subject Terminology \\ |
---|
461 | Getty & Getty Research Institute curating the \href{http://www.getty.edu/research/tools/vocabularies/index.html}{vocabularies}, part of Getty Trust \\ |
---|
462 | GND & \emph{Gemeinsame Normdatei} -- Integrated Authority File of the German National Library \\ |
---|
463 | GTAA & Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus for Audiovisual Archives) \\ |
---|
464 | % {quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation} \\ |
---|
465 | ISO & International Organization for Standardization \\ |
---|
466 | LCCN & Library of Congress Control Number \\ |
---|
467 | LCC & Library of Congress Classification \\ |
---|
468 | LCSH & Library of Congress Subject Headings \\ |
---|
469 | LoC & Library of Congress\furl{http://loc.gov} \\ |
---|
470 | OCLC & Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation \\ |
---|
471 | PKND & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{prometheus} KünstlerNamensansetzungsDatei\\ |
---|
472 | RKD & Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History \\ |
---|
473 | TGN & Getty Thesaurus of Geographic Names \\ |
---|
474 | UDC & Universal Decimal Classification \\ |
---|
475 | ULAN & Union List of Artist Names \\ |
---|
476 | VIAF & Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries \\ |
---|
477 | \end{tabular} |
---|
478 | \end{table} |
---|
479 | |
---|