
\chapter{Analysis of the data landscape}
\label{ch:data}
This chapter gives an overview of existing standards and formats for metadata and content annotations in the field of Language Resources and Technology, together with a description of their characteristics and their usage in the relevant projects and initiatives.


\section{Metadata Formats}


\subsection{Component Metadata Framework}
\label{def:CMD}

The \emph{Component Metadata Framework} (CMD) is the data model of the CLARIN metadata infrastructure. (See \ref{CMDI} for information about the infrastructure. The XML schema of CMD -- the general-component-schema -- is featured in appendix \ref{lst:general-component-schema}.)
CMD is used to define so-called \var{profiles}, which are constructed out of reusable \var{components} -- collections of metadata fields. Components can contain other components, and they can be reused in multiple profiles. A profile itself is just a special kind of component (a subclass) with some additional administrative information.
The actual core provision for semantic interoperability is the requirement that each CMD element (i.e. metadata field) refers ``via a PID to exactly one data category\footnote{persistently referenceable concept definition} (cf. \ref{def:DCR}), thus
indicating unambiguously how the content of the field in a metadata description should be interpreted'' \cite{Broeder+2010}.

While the primary registry for data categories used in CMD is the \xne{ISOcat} Data Category Registry (cf. \ref{def:DCR}), other authoritative sources are accepted (so-called ``trusted registries''), especially the set of terms maintained by the Dublin Core Metadata Initiative \cite{DCMI:2005}.

Once the profiles are defined, they are transformed into an XML Schema that prescribes the structure of the instance records.
The generated schema also conveys, as annotations, the information about the referenced data categories.
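
The annotation mechanism can be illustrated with a minimal Python sketch. It assumes that the generated schema attaches the data category PID to each \texttt{xs:element} as an attribute \texttt{dcr:datcat} in the namespace \texttt{http://www.isocat.org/ns/dcr}; this attribute name, the namespace, the element names and the \texttt{example.org} PIDs are assumptions made for illustration and are not taken from the general-component-schema.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
DCR = "http://www.isocat.org/ns/dcr"  # assumed annotation namespace

# Hypothetical fragment of a generated profile schema (invented names and PIDs).
SCHEMA = f"""<xs:schema xmlns:xs="{XS}" xmlns:dcr="{DCR}">
  <xs:element name="Title" dcr:datcat="http://example.org/datcat/DC-0001"/>
  <xs:element name="Language" dcr:datcat="http://example.org/datcat/DC-0002"/>
  <xs:element name="Comment"/>
</xs:schema>"""


def data_category_links(schema_xml):
    """Map each element name to its data category PID (None if not annotated)."""
    root = ET.fromstring(schema_xml)
    return {
        el.get("name"): el.get(f"{{{DCR}}}datcat")
        for el in root.iter(f"{{{XS}}}element")
    }


links = data_category_links(SCHEMA)
# Elements without a data category reference cannot be interpreted unambiguously.
unlinked = [name for name, pid in links.items() if pid is None]
```

Here \texttt{links} maps \textit{Title} and \textit{Language} to their PIDs, while \texttt{unlinked} contains only \textit{Comment}, an element that refers to no data category.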


\subsubsection{CMD Profiles}
In the Component Registry (CR), 124\footnote{All numbers are as of 2013-06 if not stated otherwise} public profiles and 696 components are defined. Table \ref{table:dev_profiles} shows the development of the CR and DCR population over time.

Next to the `native' CMD profiles, a number of profiles have been created that implement existing metadata formats, such as OLAC/DCMI-terms, TEI Header or the META-SHARE schema. The resulting profiles demonstrate the flexibility and expressivity of the CMD metamodel. The individual profiles also differ considerably in their structure: next to flat profiles with just one level of components or elements and 5 to 20 fields (\textit{dublincore}, \textit{collection}, the set of \textit{Bamdes}-profiles) there are complex profiles with up to 10 levels (\textit{ExperimentProfile}, profiles for describing Web Services) and a few hundred elements. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project \cite{Gavrilidou2012meta} for describing corpora, with 419 components and 1587 elements
(when expanded\footnote{The reusability of components results in an element expansion, i.e., elements of a component (e.g. \textit{Contact}) included by three other components (\textit{Project}, \textit{Institution}, \textit{Access}) will appear three times in the instantiated record.}).
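
The difference between distinct and expanded counts can be sketched in a few lines of Python. The component names follow the example in the footnote; the class \texttt{Component}, the two helper functions, the profile name and all element names are hypothetical and only serve to illustrate the counting.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical stand-in for a CMD component (a profile is a top-level one)."""
    name: str
    elements: list = field(default_factory=list)   # names of metadata fields
    children: list = field(default_factory=list)   # nested Component objects


def expanded_elements(component):
    """Count elements as they would appear in an instantiated record."""
    return len(component.elements) + sum(
        expanded_elements(child) for child in component.children
    )


def distinct_elements(component, seen=None):
    """Collect each (component, element) pair once, however often it is reused."""
    seen = set() if seen is None else seen
    seen.update((component.name, element) for element in component.elements)
    for child in component.children:
        distinct_elements(child, seen)
    return seen


# Contact is defined once and reused by three other components.
contact = Component("Contact", elements=["Name", "Email"])
project = Component("Project", elements=["Title"], children=[contact])
institution = Component("Institution", elements=["Department"], children=[contact])
access = Component("Access", elements=["Availability"], children=[contact])
profile = Component("ExampleProfile", children=[project, institution, access])

n_expanded = expanded_elements(profile)        # Contact's fields counted three times
n_distinct = len(distinct_elements(profile))   # Contact's fields counted once
```

In this toy profile the five distinct elements expand to nine, because the two elements of \textit{Contact} are instantiated under each of the three components that include it.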


\begin{table}
\caption{The development of defined profiles and DCs over time}
\label{table:dev_profiles}
\begin{tabular}{ l | r | r | r | r }
\hline
Date & 2011-01 & 2012-06 & 2013-01 & 2013-06 \\
\hline
Profiles & 40 & 53 & 87 & 124 \\
Distinct Components & 164 & 298 & 542 & 828 \\
Expanded Components & 1055 & 1536 & 2904 & 5757 \\
Distinct Elements & 511 & 893 & 1505 & 2399 \\
Expanded Elements & 1971 & 3030 & 5754 & 13232 \\
Distinct data categories & 203 & 266 & 436 & 499 \\
Data categories in the Metadata profile & 277 & 712 & 774 & 791 \\
Ratio of elements without DCs & 24,7\% & 17,6\% & 21,5\% & 26,5\% \\
Components with DCs & 28 & 67 & 115 & 140 \\
\hline
\end{tabular}
\end{table}


\subsection{Instance Data}


\todoin{add historical perspective on data - list overall}

The main CLARIN OAI-PMH harvester\footnote{\url{http://catalog.clarin.eu/oai-harvester/}}
collects records from 69 providers on a daily basis. The complete dataset amounts to 540.065 records.
16 of the providers offer CMDI records; the other 53 provide OLAC/DC records\label{info:olac-records}, which are converted into the corresponding CMD profile after harvesting. Next to these 81.226 original OLAC records, there are a few providers offering their OLAC or DCMI-terms records already converted into CMDI, so that, all in all, OLAC and DCMI-terms records amount to 139.152.
On the other hand, some of the comparatively few providers of `native' CMD records expose multiple profiles (e.g. the Meertens Institute uses 12 different profiles). So we encounter both situations: one profile being used by many providers and one provider using many profiles.
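
For orientation, harvesting over OAI-PMH boils down to a sequence of \texttt{ListRecords} requests. The following Python sketch only builds the request URLs and performs no network access; the endpoint \texttt{repository.example.org} and the token value are hypothetical, and \texttt{oai\_dc} is used because it is the metadata prefix every OAI-PMH repository has to support (the prefixes under which individual providers expose OLAC or CMDI records are not stated here and would have to be looked up per repository).

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    The first request names a metadataPrefix; follow-up requests for the
    remaining pages carry only the resumptionToken returned by the repository.
    """
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: no metadataPrefix alongside it
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


BASE = "http://repository.example.org/oai"  # hypothetical provider endpoint
first_page = list_records_url(BASE, metadata_prefix="oai_dc")
next_page = list_records_url(BASE, resumption_token="token-123")  # invented token
```

A harvester would fetch \texttt{first\_page}, store the returned records, and keep requesting further pages for as long as the response contains a non-empty resumption token.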


\begin{table}
\caption{Top 20 profiles, with the respective number of records}
\begin{center}
\begin{tabular}{ r | l }
\# records & profile \\
\hline
155.403 & Song \\
138.257 & Session \\
92.996 & OLAC-DcmiTerms \\
46.156 & DcmiTerms \\
28.448 & SongScan \\
21.256 & SourceScan \\
19.059 & LiteraryCorpusProfile \\
16.519 & Source \\
13.626 & imdi-corpus \\
10.610 & media-session-profile \\
7.961 & SongAudio \\
7.557 & SymbolicMusicNotation \\
4.485 & LCC DataProviderProfile \\
4.485 & SourceProfile \\
4.417 & Text \\
1.982 & Soundbites-recording \\
1.530 & Performer \\
1.475 & ArthurianFiction \\
939 & LrtInventoryResource \\
873 & teiHeader \\
\hline
\end{tabular}
\end{center}
\end{table}

\begin{table}
\caption{Top 20 collections, with the respective number of records}
\begin{center}
\begin{tabular}{ r | l }
\# records & collection \\
\hline
243.129 & Meertens collection: Liederenbank \\
46.658 & DK-CLARIN Repository \\
46.156 & Nederlands Instituut voor Beeld en Geluid Academia collectie \\
29.266 & childes \\
24.583 & DoBeS archive \\
23.185 & Language and Cognition \\
14.593 & talkbank \\
14.363 & Acquisition \\
14.320 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
12.893 & MPI CGN \\
10.628 & Bavarian Archive for Speech Signals (BAS) \\
7.964 & Pacific And Regional Archive for Digital Sources in Endangered Cultures \\
7.348 & WALS RefDB \\
5.689 & Lund Corpora \\
4.640 & Oxford Text Archive \\
4.492 & Leipzig Corpora Collection \\
3.539 & Institut für Deutsche Sprache, CLARIN-D Zentrum, Mannheim \\
3.280 & A Digital Archive of Research Papers in Computational Linguistics \\
3.147 & CLARIN NL \\
3.081 & MPI für Bildungsforschung \\
\hline
\end{tabular}
\end{center}
\end{table}

We can also observe a large disparity in the number of records between individual providers and profiles. Almost half of all records are provided by the Meertens Institute (\textit{Liederenbank} and \textit{Soundbites} collections), another 25\% by the MPI for Psycholinguistics (\textit{corpus} + \textit{Session} records from \textit{The Language Archive}). On the other hand, there are 25 profiles that have fewer than 10 instances. This can be due both to the state of the respective project (resources and records still being prepared) and to the modelled granularity level (collection vs. individual resource).
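
As a rough check of the `almost half' figure, the \textit{Liederenbank} collection alone (243.129 records in the list of top collections) can be set against the 540.065 records of the complete dataset; the \textit{Soundbites} records come on top of this:
\[
\frac{243{.}129}{540{.}065} \approx 45\,\%
\]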


\subsection{Dublin Core + OLAC}

DC, OLAC

openarchives register: \url{http://www.openarchives.org/Register/BrowseSites}
2006 OAI-repositories

DublinCore Resource Types\furl{http://dublincore.org/documents/resource-typelist/}

DublinCore to RDF mapping\furl{http://dublincore.org/documents/dcq-rdf-xml/}

\label{def:OLAC}
A more specific version of the Dublin Core terms, adapted to the needs of the linguistic community, is the
OLAC\furl{http://www.language-archives.org/} format \cite{Bird2001,Simons2003OLAC}.

\todoin{check http://www.language-archives.org/OLAC/metadata.html}

\begin{quotation}
The OLAC metadata set is the set of metadata elements that participating archives have agreed to use for describing language resources. Uniform description across archives is ensured by limiting the values of certain metadata elements to the use of terms from agreed-upon controlled vocabularies. The OLAC metadata set is equally applicable whether the resources are available online or not. The metadata set consists of the fifteen elements of the Dublin Core Metadata Set, plus the refinements and encoding schemes of the DCMI Metadata Terms -- a widely accepted standard for describing resources of all types. To this general standard, OLAC adds encoding schemes that are designed specifically for describing language resources, such as subject language and linguistic data type. The OLAC Metadata Usage Guidelines describe (with examples) all the elements, refinements, and encoding schemes that may be used in OLAC metadata descriptions. The OLAC Metadata standard defines the XML format that is used for the interchange of metadata descriptions among participating archives.
\end{quotation}
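
To make the combination of Dublin Core elements and OLAC encoding schemes more tangible, the following Python sketch assembles a small record. It assumes the OLAC~1.1 conventions of marking the scheme with \texttt{xsi:type} and the value with \texttt{olac:code}; the namespace URIs and attribute names are given from memory and should be checked against the OLAC Metadata standard, and the title and codes are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as assumed for OLAC 1.1 records (verify against the standard).
NS = {
    "olac": "http://www.language-archives.org/OLAC/1.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, local):
    """Return the qualified name ElementTree expects: {namespace-uri}local."""
    return f"{{{NS[prefix]}}}{local}"


record = ET.Element(q("olac", "olac"))

# Plain Dublin Core element, free text.
title = ET.SubElement(record, q("dc", "title"))
title.text = "Example field recordings"  # invented value

# Dublin Core element narrowed by an OLAC encoding scheme: subject language.
subject = ET.SubElement(record, q("dc", "subject"))
subject.set(q("xsi", "type"), "olac:language")
subject.set(q("olac", "code"), "nld")  # ISO 639-3 code for Dutch

# Dublin Core element narrowed by an OLAC encoding scheme: linguistic data type.
rtype = ET.SubElement(record, q("dc", "type"))
rtype.set(q("xsi", "type"), "olac:linguistic-type")
rtype.set(q("olac", "code"), "primary_text")

xml_text = ET.tostring(record, encoding="unicode")
```

The resulting \texttt{xml\_text} contains the ordinary \texttt{dc:title} next to a \texttt{dc:subject} and a \texttt{dc:type} whose values are drawn from controlled vocabularies, which is the uniformity across archives that the quotation describes.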




\subsection{TEI / teiHeader}
TEI/teiHeader/ODD

\subsection{ISLE/IMDI}

\subsection{MODS/METS}

\subsection{Europeana Data Model - EDM}

\subsection{META-SHARE}
META-SHARE is another multinational project aiming to build an infrastructure for language resources \cite{Piperidis2012meta}, focusing, however, more on the Human Language Technologies domain.\furl{http://meta-share.eu}

\begin{quotation}
META-NET is designing and implementing META-SHARE, a sustainable network of repositories of language data, tools and related web services documented with high-quality metadata, aggregated in central inventories allowing for uniform search and access to resources. Data and tools can be both open and with restricted access rights, free and for-a-fee. META-SHARE targets existing but also new and emerging language data, tools and systems required for building and evaluating new technologies, products and services.
\end{quotation}

\begin{quotation}
META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.

META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).
\end{quotation}

The distributed network of repositories consists of a number of member repositories that each offer their own subset of resources.

A few\footnote{7 as of 2013-07} of the member repositories play the role of managing nodes, providing ``a core set of services critical to the whole of the META-SHARE network'' \cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes).


MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}


\subsection{Other}

OAI-ORE - is this a schema?



\section{Content/Annotation Formats}

CHILDES, TEI, EAF!
(CES/XCES)
Open Annotation Collaboration (OAC)\footnote{\url{http://openannotation.org/}}

[LAF] Linguistic Annotation Framework



\section{Ontologies, Controlled Vocabularies, Reference Data, Authority Files}
\label{refdata}

Based on popular demand, the work on reference data for the SSH community should cover at least the following dimensions (with tentative names of corresponding existing vocabularies):

\begin{itemize}
\item Data Categories / Concepts - ISOcat
\item Languages - ISO-639
\item Countries - country codes
\item Persons - GND, VIAF
\item Organizations - GND, VIAF
\item Subject headings - GND, LCSH
\item Resource Typology -
\end{itemize}

\begin{description}
\item[AAT] Art \& Architecture Thesaurus
\item[GND] Gemeinsame Normdatei (GND ontology\furl{http://d-nb.info/standards/elementset/gnd})
\item[GTAA] Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
\item[VIAF] Virtual International Authority File
\end{description}


Other related activities and initiatives:

\begin{itemize}
\item A broader collection of related initiatives can be found at the German National Library website.\furl{http://www.dnb.de/DE/Standardisierung/LinksAFS/linksafs_node.html}
\item FRBR - Functional Requirements for Bibliographic Records
\item RDA - Resource Description and Access
\item Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011), \url{http://metadaten-twr.org/}
\item At MPDL, within the eSciDoc publication platform, there seems to be (work on) a service for controlled vocabularies (since 2009).\furl{http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities}
\item Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities, developed at the New Zealand Electronic Text Centre (NZETC), \url{http://eats.readthedocs.org/en/latest/}
\end{itemize}


\subsection{ISOcat - Data Category Registry}

ISO 12620

\subsection{Classification Schemes, Taxonomies}
LCSH, DDC


\subsection{Other controlled Vocabularies}
Tagsets: STTS

Language codes ISO-639-1

\subsection{Domain Ontologies, Vocabularies}
Organization-Lists

LT-World !?



\section{LRT Metadata Catalogs/Collections}
\label{sec:lrt-md-catalogs}
\todoin{Overview of catalogs, name, since, \#providers, \#resources}

\todoin{[DFKI/LT-World] - collection or ontology}

\subsection{CMDI}
collections, profiles/Terms, ResourceTypes!

\subsection{OLAC}

\subsection{LAT, TLA}
Language Archiving Technology, now The Language Archive, provided by the Max Planck Institute for Psycholinguistics\footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}}

\subsection{META-NET}


\subsection{ELRA}

\subsection{Other}


\begin{description}
\item[LDC] Linguistic Data Consortium
\item[OTA LR] Archiving Service provided by Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/}
\end{description}

\section{Other Metadata Catalogs/Collections}
\label{sec:other-md-catalogs}

\subsection{(Digital) Libraries}


General (Libraries, Federations):

\begin{description}
\item[OCLC] world's biggest library federation \url{http://www.oclc.org}
\item[LoC] Library of Congress \url{http://www.loc.gov}
\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_en.htm}
\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
\end{description}




\section{Summary}

In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
