Changeset 3681
- Timestamp:
- 10/06/13 18:19:22
- Location:
- SMC4LRT
- Files:
- 5 edited
SMC4LRT/Outline.tex
r3680 r3681 77 77 \listoffigures 78 78 \listoftodos 79 80 79 \begin{comment} 81 82 80 \input{chapters/Introduction} 83 81 84 82 \input{chapters/Literature} 83 85 84 86 85 \input{chapters/Definitions} -
SMC4LRT/chapters/Data.tex
r3680 r3681 66 66 \label{tab:cmd-profiles} 67 67 \begin{center} 68 \begin{tabu lar}{ r l }69 \hline 70 \ # records & profile \\68 \begin{tabu}{ r l } 69 \hline 70 \rowfont{\itshape\small} \# records & profile \\ 71 71 \hline 72 72 155.403 & Song \\ … … 91 91 873 & teiHeader \\ 92 92 \hline 93 \end{tabu lar}93 \end{tabu} 94 94 \end{center} 95 95 \end{table} … … 98 98 \caption{Top 20 CMD collections, with the respective number of records} 99 99 \begin{center} 100 \begin{tabu lar}{ r l }101 \hline 102 \ # records & colleciton \\100 \begin{tabu}{ r l } 101 \hline 102 \rowfont{\itshape\small} \# records & colleciton \\ 103 103 \hline 104 104 243.129 & Meertens collection: Liederenbank \\ … … 123 123 3.081 & MPI für Bildungsforschung \\ 124 124 \hline 125 \end{tabu lar}125 \end{tabu} 126 126 \end{center} 127 127 \end{table} … … 131 131 132 132 133 \section{Other Metadata Formats and Collections }133 \section{Other LRT Metadata Formats and Collections } 134 134 \label{sec:lrt-md-catalogs} 135 135 136 137 Riley and Becker \cite{Riley2010seeing} put the overwhelming amount of existing metadata standards into a systematic comprehensive overview analyzing the use of standards from four aspects: community, domain, function, and purpose. Despite its aspiration on comprehensiveness it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE, TEI? 138 139 The CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. 136 Next to CLARIN and CMDI, there is a hoist of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, we also sketch the ties with CMDI and existing integration efforts.
137 138 Some overview/survey works regarding existing formats are: The CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. And \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} putting the overwhelming amount of existing metadata standards into a systematic comprehensive overview analyzing the use of standards from four aspects: community, domain, function, and purpose. Though despite its aspiration on comprehensiveness it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE, TEI??? 140 139 141 140 … … 145 144 It is a fixed set of terms for a basic generic description of a range of resources (both virtual and physical) coming in two version\furl{http://dublincore.org/documents/dcmi-terms/}: 146 145 \begin{description} 147 \item[Dublin Core Metadata Element Set (DCMES) ] \code{/elements/1.1/}146 \item[Dublin Core Metadata Element Set (DCMES) ] namespace: \code{/elements/1.1/}\\ 148 147 the original set 15 terms, standardized as IETF RFC 5013, ISO Standard 15836-2009 and NISO Standard Z39.85-2007 149 \item[Dublin Core metadata terms ] \code{/terms/}148 \item[Dublin Core metadata terms ] namespace: \code{/terms/} \\ 150 149 the extended `Qualified' set of 55 terms, extending the original 15 ones (replicating them in the new namespace for consistency) 151 150 \end{description} … … 177 176 <publisher>New York: Holt</publisher> 178 177 </olac:olac> 179 \end{ 178 \end{lstlisting} 180 179 181 180 OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''. 182 181 183 182 Note, that OLAC archives are being harvested by CLARIN harvester and OLAC records are part of the CMDI joint metadata domain (cf. 
\ref{tab:cmd-profiles}, \ref{reports:OLAC}). 183 184 184 185 185 186 \subsection{TEI / teiHeader} … … 187 188 188 189 \begin{quotation} 189 The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form.\furl{http://www.tei-c.org/}190 \end{quotation}191 encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics.192 193 \begin{quotation}194 190 The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots [Next to] its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abgridged] 195 \e bnd{quotation}191 \end{quotation} 196 192 197 193 TEI is a de-facto standard for encoding any kind of digital textual resources being developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description, metadata encoding (of main concern for us) the complex top-level element \code{teiHeader} is foreseen. TEI is not prescriptive, but rather descriptive, it does not provide just one fixed schema, but allows for a certain flexibility wrt to elements used and inner structure, allowing to generate custom schemas adopted to projects' needs. Thus there is also not just one fixed \code{teiHeader}. … … 201 197 There has been an intense cooperation between the TEI and CMDI community on the issue of interoperability and multiple efforts to express teiHeader in CMDI were undertaken (cf. \ref{results:tei}) as a starting point for integrating TEI-based data into the CLARIN infrastructure. 
202 198 203 \subsection{ISLE/IMDI} 204 205 IMDI = ISLE Metadata 206 http://www.mpi.nl/imdi/ 207 208 \begin{quotation} 209 The ISLE Meta Data Initiative (IMDI) is a proposed metadata standard to describe multi-media and multi-modal language resources. The standard provides interoperability for browsable and searchable corpus structures and resource descriptions with help of specific tools. 210 \end{quotation} 211 212 213 \subsection{LAT, TLA} 214 Language Archiving Technology, now The Language Archive - provided by Max Planck Insitute for Psycholinguistics \footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}} 215 216 217 Predecessor of CMDI 218 219 \subsection{MODS/METS} 220 221 Metadata Encoding and Transmission Standard - an XML schema for encoding descriptive, administrative, and structural metadata regarding objects within a digital library 222 223 Metadata Object Description Schema - is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications. 224 225 226 227 \subsection{ESE, Europeana Data Model - EDM} 228 229 ESE Europeana Semantic Elements- 230 231 EDM\furl{http://europeana.ontotext.com/resource/edm/hasType?role=all} \cite{doerr2010europeana} 232 233 234 he Linked Data approach will play a major role in the European Digital Library ( 235 http://europeana.eu 236 ) 237 and solutions that can handle data expressed in the newly created, RDF-based 238 Europeana Data Model 239 (EDM) 240 are currently being investigated. 
This report summarizes the results of a study we performed on existing 241 RDF stores, in the context of Europeana and encompasses the following contributions 242 243 244 data.europeana.eu: The Europeana Linked Open Data Pilot\cite{haslhofer2011data} 199 200 \subsection{ISLE/IMDI -- The Language Archive} 201 202 \xne{IMDI}\furl{http://www.mpi.nl/imdi/} (\xne{EAGLES/ISLE Meta Data Initiative}) is an elaborate format for detailed descriptions of multi-media/multi-modal language resoruces developed within the corresponding project\cite{wittenburg2000eagles} 2000 to 2003. 203 204 To serve the main goal of the project, easing access to language resources fostering the reuse, resource description in this new format were created for a number of collections and were made available via a dedicated \xne{IMDI browser}\furl{http://corpus1.mpi.nl/ds/imdi_browser/}, that allowed browsing the collection structure as well as complex advanced search over the deeply structured metadata. Also a metadata editor was developed for generating records in this format, with provisions for offline field-work and synchronization with the repository. 205 206 The project lead and responsible for running the repository and whole infrastructure was the Technical Group at MPI for Psycholinguistics, who has engaged in a number of projects aimed at building a stable technical infrastructure for long-term archiving and work with language resources since its foundation (together with the Institute itself) in 1970s\furl{http://tla.mpi.nl/home/history/}. Recently, the group and the established infrastructure has been renamed to \xne{TLA -- The Language Archive}\furl{http://tla.mpi.nl/} ``Your partner for language data, tools and archiving'', where on one platform both the hoist of language resources and their description are preserved and provided as well as tools for working with this data is offered. 
The archive is also an aggregator itself, offering various collection from different (also external) projects (like DOBES, CGN, RELISH, etc.). 207 208 IMDI can be seen as predecessor of CMDI, the team of the TG being the driving force behind the development of both. A \xne{imdi-session} profile, the corresponding IMDI to CMDI conversion 209 as well as the transformed records were among the first to be added to the new CMD Infrastructure in 2010. The statistics 210 of CMDI records list round 138.000 \xne{Session} records and round 13.000 \xne{imdi-corpus} records, modelling the collections for the sessions. Also, the metadata editor \xne{Arbil} was refactored to work with the new data model. 211 245 212 246 213 \subsection{META-SHARE} 247 214 \label{def:META-SHARE} 248 Within the project META-SHARE format 249 250 META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components. 251 %In cooperation between metadata teams from CLARIN and META-SHARE 252 253 The original META-SHARE schema actually accomodates four models for different resource types. Consequently, the model has been expressed as 4 CMD profiles each for a distinct resource type however all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements. 
254 255 MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology} 256 257 258 259 \subsection{META-NET} 260 215 216 META-SHARE was the subproject (2010-2013) of META-NET, a Network of Excellence consisting of 60 research centres from 34 countries, that covered the technical aspects. 261 217 262 218 … … 264 220 META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role. 265 221 266 META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).267 268 222 \end{quotation} 269 223 270 The distributed networks of repositories consists of a number of member repositories, that offer their own subset of resource. 271 272 A few\footnote{7 as of 2013-07} of the members repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users. 224 Within the project META-SHARE a new metadata format was developed\cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a subset of core obligatory elements and with many optional components. 225 %In cooperation between metadata teams from CLARIN and META-SHARE 226 227 The original META-SHARE schema actually accomodates four models for different resource types. 
Consequently, the model has been expressed as 4 CMD profiles each for a distinct resource type however all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements. (See \ref{reports-meta-share} for more details about the format based on its integration into CMDI) 228 229 The technical infrastructure of META-SHARE represents a distributed network of repositories consists of a number of member repositories, that offer their own subset of resource\furl{http://www.meta-share.eu/}. 230 231 Selected member repositories\footnote{7 as of 2013-07} play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users. 273 232 The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes). 274 233 275 276 MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology} 234 One point of criticism from the community was, the fact, that META-SHARE infrastructure does not provide any interface to the outer world, such as a OAI-PMH endpoint. 235 236 ? 
MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology} 237 277 238 278 239 \subsection{ELRA} 279 240 280 European Language Resources Association\furl{http://elra.info} 241 European Language Resources Association\furl{http://elra.info} ELRA, offers a large collection of language resources, mostly under license for a fee, although some resources are available for free as well. 242 The available datasets can be search for via ELRA Catalog\furl{http://catalog.elra.info/} 243 Additionally ELRA runs the so-called \xne{Universal Catalog} -- a repository comprising information regarding Language Resources (LRs) identified all over the world. 281 244 282 245 \begin{quotation} 283 ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, we offer a range of services, listed below and described in the "Services around Language Resources" section: 246 ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. 247 248 ELDA\furl{http://www.elda.org/} - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community. 249 250 ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and 251 drafts and concludes distribution agreements on behalf of ELRA. 
284 252 \end{quotation} 285 253 286 http://www.elda.org/ 287 Evaluations and Language resources Distribution Agency 288 289 ELDA - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community. Besides, ELDA is involved in HLT evaluation campaigns. 290 291 ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA. 292 293 ELRA Catalog 294 295 http://catalog.elra.info/ 296 297 298 Universal Catalog+ 299 Universal Catalogue is a repository comprising information regarding Language Resources (LRs) identified all over the world. 300 301 302 \subsection{Other} 303 304 OAI-ORE - is this a schema? 305 306 307 308 \section{Ontologies, Controlled Vocabularies, Reference Data, Authority Files} 254 \subsection{LDC} 255 256 Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/} is another provider of high quality curated language resources 257 258 259 \section{Formats and Collections in the World of Libraries} 260 261 There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience and the fact, that almost all of the resources in the libraries are language resources. This argument gets even more relevant in the light of the efforts to digitize large portions of the material pursued in many (national) libraries in the last years (cf. discussion on Libraries partnering with Google). And given the amounts of data, even only the bibliographic records constitute sizable language resources in they own right. 
262 263 %\item[LoC] Library of Congress \url{http://www.loc.gov} 264 %\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_ en.htm} 265 %\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/} 266 %\end{description} 267 268 \subsection{Formats -- MARC, METS, MODS} 269 270 There is a long tradition of standardized metadata formats in the world of Libraries and Information Systems (LIS), major role in the standardization being assumed for decades by the Library of Congress\furl{http://www.loc.gov/standards/}. 271 272 The \xne{MARC}\furl{www.loc.gov/marc/} set of formats (being used since 1970s ) ``are standards for the representation and communication of bibliographic and related information in machine-readable form''. A number of variants developed over the years, the most widely spread is \xne{MARC 21} since 1999 -- is the standard format used for communication among libraries around the world. 273 274 MARC 21 consists of 5 ``communication formats'' for specific types of data (Bibliographic, Authority Data, Holdings Data, Classification, and Community Information), are widely used standards for the representation and exchange of bibliographic, authority, holdings, classification, and community information data in machine-readable form. In 2002, the Library of Congress developed the \xne{MARCXML} schema for representing MARC records in XML; 275 276 \xne{METS -- Metadata Encoding and Transmission Standard} - a format from the family of Library of Congress standards (since 2001) for encoding descriptive, administrative, and structural metadata regarding objects within a digital library. 277 It is dedicated primarily to capture the structure of the digital objects, ``record the various relationships that exist between pieces of content, and between the content and metadata that compose a digital library object'' \cite{mets2010manual}. 
278 A METS record acts as a flexible container that accomodates other pieces of data (different levels of metadata and encoded objects themselves or references to those) in external formats\furl{http://www.loc.gov/standards/mets/mets-extenders.html}. 279 280 Number of tools have been developed to author and process \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html} and numerous projects (online editions, DAM systems) use METS for structuring and recording the data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html} though seems rather outdated} among others also \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html} 281 282 Metadata Object Description Schema - ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21 using language-based tags rather than numeric ones, 283 more than Dublin Core. One of endorsed schemas to extend (be used inside) METS. 284 285 In 1998 a new Entitiy Relationship model - FRBR - Functional Requirements for Bibliographic Records 2002 \cite{FRBR1998} 286 and since ?? RDA - Resource Description and Access 287 288 \subsection{ESE, Europeana Data Model - EDM} 289 290 Within the big european initiative \xne{Europeana} (cf. \ref{lit:digi-lib}) information about digitised objects are collected from a great number of cultural institutions from all of Europe, currently 291 292 originally developed and advised the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation} a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. Soon it became obvious, that this format is very limiting and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model EDM\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana, haslhofer2011data,doerr2010europeana}. 
293 EDM is fully compatible with ESE, which is (and will be) accepted from the providers. There is a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the semantic data of Europeana. 294 %https://github.com/europeana 295 296 %%%%%%%%%%%%%%%%%% 297 \section{Controlled Vocabularies, Reference Data, Ontologies} 309 298 \label{refdata} 310 299 311 Based on popular demand, the work on reference data for the SSH-community should cover at least the following dimensions (with tentative denominations of corresponding existing vocabularies): 312 313 \begin{itemize} 314 \item Data Categories / Concepts - ISOcat 315 \item Languages - ISO-639 316 \item Countries - country codes 317 \item Persons - GND, VIAF 318 \item Organizations - GND, VIAF 319 \item Schlagw\"{o}rter/Subjects - GND, LCSH 320 \item Resource Typology - 321 \end{itemize} 322 323 AAT - international Architecture and Arts Thesaurus 324 GND - Gemeinsame Norm Datei (GND ontology\furl{http://d-nb.info/standards/elementset/gnd} 325 GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives) 326 VIAF - Virtual International Authority File 327 328 329 Other related relevant activities and initiatives 330 331 http://www.w3.org/wiki/WebSchemas/ExternalEnumerations#Controlled_property_values 332 333 A broader collection of related initiatives can be found at the German National Library website: 334 \furl{http://www.dnb.de/DE/Standardisierung/LinksAFS/linksafs_node.html} 335 FRBR - Functional Requirements for Bibliographic Records 2002 \cite{FRBR1998} 336 337 RDA - Resource Description and Access 338 http://metadaten-twr.org/ - Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011) 339 At MPDL, within the escidoc publication platform there seems to be (work on) a service (since 2009 !) 
for controlled vocabularies: \furl{http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities} 340 Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities -- developed at the New Zealand Electronic Text Centre (NZETC). 341 http://eats.readthedocs.org/en/latest/ 342 343 344 \subsection{ISOcat - Data Category Registry} 345 346 ISO12620 347 348 \subsection{Classification Schemes, Taxonomies } 349 LCSH, DDC 350 351 352 \subsection{Other controlled Vocabularies} 353 354 Language codes ISO-639-1 355 356 \subsection{Domain Ontologies, Vocabularies} 357 Organization-Lists 358 LT-World !? 359 360 361 \subsubsection{LT-World} 300 One goal of this work being the groundwork for exposing the discussed dataset in the Semantic Web 301 one preparatory task is to identify external semantic resources like controlled vocabularies or ontologies that the dataset could be linked with\footnote{Similar activity of inventarizing vocabularies and thesauri was conducted in the context of the \xne{Europeana} initiative 302 \url{http://europeanalabs.eu/wiki/WP12Vocabularies}\url{https://europeanalabs.eu/wiki/DesignSemanticThesauri}}. 303 304 Conceptually, we want to partition these resources in two types. On the one hand abstract concepts constituting all kinds of classifications, typologies, taxonomies. On the other hand named entities that exist(ed) in real world, like persons, organizations or geographical places. Main motivation for this distinction is the insight, that while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (sameAs), for concepts we need to accept a plurality of existing conceptualizations and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. 
This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees. 305 306 In the following we inventarize such resources, covering the domains expected in the dataset. (Information about size of the dataset is meant rather as a rough indication of the "general weight" of the dataset, not necessarily a precise up to date information.) The acronyms in the tables are resolved in the subsequent glossary. 307 How this resources will be employed is discussed in \ref{sec:values2entities}. 308 309 %\subsubsection{Named entities} 310 311 The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called Virtual International Authority File, a huge resource, with entries from different authority files referring to the same entity being merged. This resource can be explored via a search interface and there is also a search service for applications. 312 Other general large-scale resources are the vocabularies curated and provided by Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}, however there is only a limited free access and licensed and fee for full access. But recently there work was announced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html} 313 314 Yago is a large knowledge integrating dbpedia, geonames and ..?? 315 362 316 Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study.
\cite{Joerg2010} 363 317 364 365 318 So we witness a strong general trend towards Semantic Web and Linked Open Data. 319 320 %Next to these ``global big players'' there are a number of other initiatives on different scale dedicated to a more specific domain. 321 322 %Resources that contain different types of data (e.g. persons, places and classifications like GND or Yago) are divided and mentioned in individual tables by type. 323 324 %\subsection{Concepts -- Classifications, Taxonomies, \dots} 325 326 327 \begin{landscape} 328 \begin{table} 329 \caption{Controlled vocabularies of named entities -- Persons, Organizations, Works, Language Names, Geographica} 330 \label{table:data-ne} 331 % \begin{tabu}{ p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} } 332 \begin{tabu}{ >{\sffamily}l l r X X} 333 \hline 334 \rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\ 335 \hline 336 VIAF & OCLC + NatLibs & $\gg$ 1E7 & union of national authority files & search service, search app \\ 337 GND/p & DNB & 4.6E6 & Persons, universal, lang:de & \href{http://d-nb.info/standards/elementset/gnd}{GND ontology}\\ 338 GND/k & '' & 1.2E6 & Organizations, universal, lang:de & \\ 339 GND/w & '' & 193,000 & Works, lang:de & \\ 340 GND/g & '' & 293.000 & Geographica, lang:de & \\ 341 ULAN & Getty & 202,720 / 638,900 & persons, artists & \\ 342 TGN & Getty & 992.310 / 1.7E6 & also historical place names & \href{http://www.getty.edu/research/tools/vocabularies/index.html}{web search} \\ 343 %CONA & Getty & & records for cultural works & \\ 344 dbpedia & Wikipedia & $\sim$ 4E6 & all kinds of entities in up to 111 langs & \href{http://wiki.dbpedia.org/Downloads}{data dumps}, \href{http://dbpedia-live.openlinksw.com/sparql}{live SPARQL endpoint} \\ 345 & & \multicolumn{3}{l}{764,000 persons; 333,000 works; 192,000 organizations; 639,000 geographica } \\ 346 Yago \cite{Suchanek2007yago} & MPI Informatik & 1E7 / 1.2E8 & huge 
semantic KB (aggregated from Wikipedia, Wordnet, Geonames) & \href{http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html}{data dumps} \\ 347 \href{http://lt-world.de}{LT-World} & DFKI & 3.300 persons, 4.600 organizations & ontology-based portal for Language Technology & \href{http://www.lt-world.org/kb/}{portal} \\ 348 Geonames & Geonames & \textgreater 1E7 (2.8E6 / 5.5E6) & "modern" place names & data dump + web service \\ 349 PKND & prometheus & \textgreater 37,000 & persons, artists & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{XML dump} \\ 350 \href{http://gazetteer.dainst.org/}{iDAI.gazetteer} & DAI & & archaeologically relevant places & search interface \\ 351 %Pelagios & AIT & 25 datasets & search over 25 datasets of archeologically relevant places & API\furl{https://github.com/pelagios/pelagios-cookbook/wiki/Using-the-Pelagios-API} \\ 352 \href{http://pleiades.stoa.org}{Pleiades} & & 34.000 & A community-built gazetteer and graph of ancient places & CSV, KML and RDF data dumps \\ 353 LCCN & LoC & \textgreater 1.2E7 & identifier for bibliographic records & \href{http://authorities.loc.gov/}{search service}, search app \\ 354 ISO 3166 & ISO & 249 & Official country codes, lang: en, fr & \\ 355 ISO-639-1& ISO & 185 & basic language codes & \href{http://www.loc.gov/standards/iso639-2/php/English_list.php}{static list} \\ 356 ISO-639-3 & SIL & $\sim$ 7.679 & 3-letter code for every human language & \href{http://www-01.sil.org/iso639-3/}{view/download} \\ 357 CLAVAS & CLARIN & 2.500 & organization names extracted from CMD records & \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\ 358 \hline 359 \end{tabu} 360 \end{table} 361 362 \begin{comment} 363 \hline 364 \end{tabu} 365 \end{table} 366 367 \begin{table} 368 \caption{Controlled vocabularies of named entities -- Geographica} 369 \label{table:data-ne-places} 370 371 % \begin{tabu}{ p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} p{0.2\textwidth} } 372 
\begin{tabu}{ >{\sffamily}l l r X X} 373 \hline 374 \rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\ 375 376 \end{comment} 377 378 379 \begin{table} 380 \caption{Taxonomies, Classifications, Thesauri} 381 \label{table:data-concepts} 382 \begin{tabu}{ >{\sffamily}l l r X X} 383 \hline 384 \rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\ 385 \hline 386 AAT & Getty & \href{http://www.getty.edu/research/tools/vocabularies/aat/aat_faq.html}{34,880 / 245,530} & subjects in art and architecture & \\ 387 LCSH & LoC & & subjects, universal & \href{http://fast.oclc.org/searchfast/}{FAST} (Faceted Application of Subject Terminology), \href{http://experimental.worldcat.org/fast/}{Linked Data FAST} \\ 388 LCC & LoC & & universal hierarchical classification & web app: \href{http://classificationweb.net/}{classification web} \\ 389 GND/s & DNB & 202.000 & subjects (Schlagwörter), universal, lang:de & \\ 390 GTAA & NISL & 3.800 & Subjects, describing TV programs & \href{http://datahub.io/de/dataset/gemeenschappelijke-thesaurus-audiovisuele-archieven}{(RDF) data dumps}, \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\ 391 DDC & OCLC & & universal classification by field of study, translated in multiple languages & \href{http://dewey.info/}{dewey.info} \\ 392 UDC & & & & \\ 393 Wiki Categories & Wikipedia & 995,911& classification of Wiki articles as skos:Concepts & SKOS Vocabulary, SPARQL \\ 394 DBpedia Ontology & Wikipedia & 529 / 2333 & general classification of Wiki articles as ontology & \href{http://wiki.dbpedia.org/Ontology39?v=g9b}{RDF data}, SPARQL\\ 395 ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts in a number of thematic groups (Metadata, Lexical Resources, ...) 
& \href{http://www.isocat.org}{web-app}, service \\ 396 Object Names Thesaurus & British Museum & & classification of objects in the collection & \\ 397 Material Thesaurus & British Museum & & classification of material & \\ 398 Thesaurus of Monument Types & British Museum & & types of monuments & \\ 399 Hornbostel-Sachs-Systematik & & 300 categories & classification of musical instruments & \href{http://www.music.vt.edu/musicdictionary/texth/Hornbostel-Sachs.html}{web page} \\ 400 Oberbegriffsdatei & DMB & & a set of vocabularies for museums, lang:de & \url{museumsvokabular.de}, PDF, XML dumps\\ 401 Iconclass & RKD & 28,000 & taxonomy of subject of an image & \href{http://iconclass.org/data/iconclass.20121019.nt.gz}{RDF dump} \\ 402 \href{http://dirt.projectbamboo.org/}{DiRT} & Project Bamboo & 32 categories & taxonomy of research tools (1,200 tools) & \\ 403 %Scholarly Methods Taxonomy & DARIAH & 100 & research activities in a 2-level hierarchy and brief scope notes & in preparation \\ 404 \hline 405 \end{tabu} 406 \end{table} 407 408 \end{landscape} 366 409 367 410 \begin{description} 368 \item[LDC] Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/} 369 \item[OTA LR] Archiving Service provided by Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/} 411 \item[AAT] international Architecture and Arts Thesaurus, Getty 412 \item[CONA] Cultural Objects Name Authority 413 \item[DAI] Deutsches Archäologisches Institut 414 \item[DDC] Dewey Decimal Classification 415 \item[DFKI] Deutsches Forschungszentrum für Künstliche Intelligenz 416 \item[DMB] Deutscher Museumsbund 417 \item[DNB] Deutsche National Bibliothek 418 \item[FAST] Faceted Application of Subject Terminology 419 \item[Getty] Getty Research Institute curating the vocabularies\furl{http://www.getty.edu/research/tools/vocabularies/index.html}, part of Getty Trust 420 \item[GND] \emph{Gemeinsame Norm Datei} - Integrated authority Files of the German National Library 421 \item[GTAA] Gemeenschappelijke 
Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives) 422 \begin{quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation} 423 \item[ISO] International Standardization Organization 424 \item[LCCN] Library of Congress Control Number 425 \item[LCC] Library of Congress Classification 426 \item[LCSH] Library of Congress Subject Headings 427 \item[LoC] Library of Congress\furl{http://loc.gov} 428 \item[OCLC] Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation 429 \item[PKND] prometheus KünstlerNamensansetzungsDatei\furl{http://prometheus-bildarchiv.de/de/tools/pknd} 430 \item[RKD] Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History 431 \item[TGN] Getty Thesaurus of Geographic Names 432 \item[UDC] Universal Decimal Classification 433 \item[ULAN] Union List of Artist Names 434 \item[VIAF] Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries 370 435 \end{description} 371 436 372 \section{Other Metadata Catalogs/Collections} 373 \label{sec:other-md-catalogs} 374 375 \subsection{(Digital) Libraries} 376 377 378 General (Libraries, Federations): 379 380 \begin{description} 381 \item[OCLC] \url{http://www.oclc.org} 382 world's biggest Library Federation 383 \item[LoC] Library of Congress \url{http://www.loc.gov} 384 \item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_ en.htm} 385 \item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/} 386 \end{description} 387 388 437 438 \begin{comment} 389 439 390 440 VoID "Vocabulary of Interlinked Datasets") is an RDF based schema to describe linked 
datasets\furl{http://semanticweb.org/wiki/VoID} 391 441 392 http://www.dnb.de/rdf 393 442 \subsection{schema.org} 443 http://schema.org/docs/datamodel.html 444 http://www.w3.org/wiki/WebSchemas/ExternalEnumerations 445 446 microdata or 447 http://www.w3.org/TR/rdfa-lite/ 448 Resource Description Framework in attributes 394 449 395 450 the entire WorldCat cataloging collection made publicly … … 402 457 Web crawlers such as Google and Bing 403 458 404 405 \subsection{schema.org} 406 407 http://schema.org/docs/datamodel.html 408 409 microdata or 410 http://www.w3.org/TR/rdfa-lite/ 411 Resource Description Framework in attributes 412 459 \end{comment} 413 460 414 461 \section{Summary} 415 462 416 In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology 417 463 In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology. 464 We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications). 465 -
SMC4LRT/chapters/Literature.tex
r3680 r3681 33 33 34 34 \subsubsection{Digital Libraries} 35 \label{lit:digi-lib} 35 36 36 37 In a broader view we should also regard the activities in the world of libraries. … … 44 45 \xne{The European Library}\furl{http://www.theeuropeanlibrary.org/tel4/} offers a search interface over more than 18 million digital items and almost 120 million bibliographic records from 48 National Libraries and leading European Research Libraries. 45 46 46 \xne{Europeana}\furl{http://www.europeana.eu/} \cite{purday2009think} has even broader scope, serving as meta-aggregator and portal for European digitised works, encompassing material not just from libraries, but also museums, archives and all other kinds of collections (In fact, The European Library is the \emph{library aggregator} for Europeana). The auxiliary project \xne{EuropeanaConnect}\furl{http://www.europeanaconnect.eu/} (2009-2011) delivered the core technical components for Europeana as well as further services reusable in other contexts, e.g. the spatio-temporal browser \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo} \cite{janicke2013geotemco}. 47 47 \xne{Europeana}\furl{http://www.europeana.eu/} \cite{purday2009think} is a cultural heritage initiative with even broader scope, serving as ``meta-aggregator and portal for European digitised works'', encompassing material not just from libraries, but also museums, archives and all other kinds of collections (In fact, The European Library is the \emph{library aggregator} for Europeana). 48 49 A large number of projects contribute(d) to Europeana. E.g. the auxiliary project \xne{EuropeanaConnect}\furl{http://www.europeanaconnect.eu/} (2009-2011) delivered the core technical components for Europeana as well as further services reusable in other contexts, e.g. the spatio-temporal browser \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo} \cite{janicke2013geotemco}. 
48 50 Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015) a succession of \xne{Europeana} was established, a Best Practice Network, coordinated by The European Library, designed to establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform - Europeana Research. 49 51 -
SMC4LRT/chapters/Results.tex
r3680 r3681 220 220 \end{table} 221 221 222 DBNL\_Tekst clarin.eu:cr1:p\_1361876010678, 223 clarin.eu:cr1:p 1366279029218 (private) 222 %DBNL\_Tekst clarin.eu:cr1:p\_1361876010678, clarin.eu:cr1:p 1366279029218 (private) 224 223 225 224 % 226 225 \subsubsection{META-SHARE} 227 226 % 227 \label{reports-meta-share} 228 228 229 229 META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components. -
SMC4LRT/utils.tex
r3680 r3681 21 21 \usepackage{tabularx} 22 22 \usepackage{tabu} 23 23 \usepackage{pdflscape} 24 24 25 \usepackage[singlelinecheck=off]{caption} 25 26