Changeset 3681


Timestamp:
10/06/13 18:19:22 (11 years ago)
Author:
vronk
Message:
 
Location:
SMC4LRT
Files:
5 edited

  • SMC4LRT/Outline.tex

    r3680 r3681  
    7777\listoffigures
    7878\listoftodos
    79 
    8079\begin{comment}
    81 
    8280\input{chapters/Introduction}
    8381
    8482\input{chapters/Literature}
     83
    8584
    8685\input{chapters/Definitions}
  • SMC4LRT/chapters/Data.tex

    r3680 r3681  
    6666\label{tab:cmd-profiles}
    6767\begin{center}
    68   \begin{tabular}{ r l }
    69     \hline
    70 \# records & profile \\
     68  \begin{tabu}{ r l }
     69    \hline
     70\rowfont{\itshape\small} \# records & profile \\
    7171    \hline
    7272155.403 & Song \\
     
    9191873 & teiHeader \\
    9292    \hline
    93   \end{tabular}
     93  \end{tabu}
    9494\end{center}
    9595\end{table}
     
    9898\caption{Top 20 CMD collections, with the respective number of records}
    9999\begin{center}
    100   \begin{tabular}{ r l }
    101     \hline
    102 \# records & colleciton \\
     100  \begin{tabu}{ r l }
     101    \hline
     102\rowfont{\itshape\small} \# records & collection \\
    103103    \hline
    104104243.129 & Meertens collection: Liederenbank \\
     
    1231233.081 & MPI für Bildungsforschung \\
    124124\hline
    125   \end{tabular}
     125  \end{tabu}
    126126\end{center}
    127127\end{table}
     
    131131
    132132
    133 \section{Other Metadata Formats and Collections }
     133\section{Other LRT Metadata Formats and Collections }
    134134\label{sec:lrt-md-catalogs}
    135135
    136 
    137 Riley and Becker \cite{Riley2010seeing} put the overwhelming amount of existing metadata standards into a systematic comprehensive overview analyzing the use of standards from four aspects: community, domain, function, and purpose. Despite its aspiration on comprehensiveness it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE, TEI?
    138 
    139 The CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology.
     136Next to CLARIN and CMDI, there is a host of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, also sketch the ties with CMDI and existing integration efforts.
     137
     138Existing formats have been surveyed in several works. The CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} puts the overwhelming number of existing metadata standards into a systematic, comprehensive overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. Despite its aspiration to comprehensiveness, it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE, TEI???
    140139
    141140
     
    145144It is a fixed set of terms for a basic generic description of a range of resources (both virtual and physical) coming in two versions\furl{http://dublincore.org/documents/dcmi-terms/}:
    146145\begin{description}
    147 \item[Dublin Core Metadata Element Set (DCMES) ] \code{/elements/1.1/}
     146\item[Dublin Core Metadata Element Set (DCMES) ] namespace: \code{/elements/1.1/}\\
    148147the original set of 15 terms, standardized as IETF RFC 5013, ISO Standard 15836-2009 and NISO Standard Z39.85-2007
    149 \item[Dublin Core metadata terms ] \code{/terms/}
     148\item[Dublin Core metadata terms ]  namespace: \code{/terms/} \\
    150149the extended `Qualified' set of 55 terms, extending the original 15 ones (replicating them in the new namespace for consistency)
    151150\end{description}
     
    177176   <publisher>New York: Holt</publisher>
    178177</olac:olac>
    179 \end{
     178\end{lstlisting}
    180179
    181180OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''.
    182181
    183182Note that OLAC archives are harvested by the CLARIN harvester and OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).
     183
     184
    184185
    185186\subsection{TEI / teiHeader}
     
    187188
    188189\begin{quotation}
    189 The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form.\furl{http://www.tei-c.org/}
    190 \end{quotation}
    191 encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics.
    192 
    193 \begin{quotation}
    194190 The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots  [Next to] its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abridged]
    195 \ebnd{quotation}
     191\end{quotation}
    196192
    197193TEI is a de facto standard for encoding any kind of digital textual resource and has been developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description and metadata encoding (our main concern), the complex top-level element \code{teiHeader} is provided. TEI is descriptive rather than prescriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and their inner structure, making it possible to generate custom schemas adapted to a project's needs. Thus there is also not just one fixed \code{teiHeader}.
     
    201197There has been intense cooperation between the TEI and CMDI communities on the issue of interoperability, and multiple efforts to express teiHeader in CMDI were undertaken (cf. \ref{results:tei}) as a starting point for integrating TEI-based data into the CLARIN infrastructure.
    202198
    203 \subsection{ISLE/IMDI}
    204 
    205 IMDI = ISLE Metadata 
    206 http://www.mpi.nl/imdi/
    207 
    208 \begin{quotation}
    209 The ISLE Meta Data Initiative (IMDI) is a proposed metadata standard to describe multi-media and multi-modal language resources. The standard provides interoperability for browsable and searchable corpus structures and resource descriptions with help of specific tools.
    210 \end{quotation}
    211 
    212 
    213 \subsection{LAT, TLA}
    214 Language Archiving Technology, now The Language Archive - provided by Max Planck Insitute for Psycholinguistics \footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}}
    215 
    216 
    217 Predecessor of CMDI
    218 
    219 \subsection{MODS/METS}
    220 
    221 Metadata Encoding and Transmission Standard - an XML schema for encoding descriptive, administrative, and structural metadata regarding objects within a digital library
    222 
    223 Metadata Object Description Schema - is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications.
    224 
    225 
    226 
    227 \subsection{ESE, Europeana Data Model - EDM}
    228 
    229 ESE Europeana Semantic Elements-
    230 
    231 EDM\furl{http://europeana.ontotext.com/resource/edm/hasType?role=all} \cite{doerr2010europeana}
    232 
    233 
    234 he Linked Data approach will play a major role in the European Digital Library (
    235 http://europeana.eu
    236 )
    237 and solutions that can handle data expressed in the newly created, RDF-based
    238 Europeana Data Model
    239 (EDM)
    240 are currently being investigated. This report summarizes the results of a study we performed on existing
    241 RDF stores, in the context of Europeana and encompasses the following contributions
    242 
    243 
    244 data.europeana.eu: The Europeana Linked Open Data Pilot\cite{haslhofer2011data}
     199
     200\subsection{ISLE/IMDI -- The Language Archive}
     201
     202\xne{IMDI}\furl{http://www.mpi.nl/imdi/} (\xne{EAGLES/ISLE Meta Data Initiative}) is an elaborate format for detailed descriptions of multi-media/multi-modal language resources, developed within the corresponding project\cite{wittenburg2000eagles} from 2000 to 2003.
     203
     204To serve the main goal of the project, easing access to language resources and fostering their reuse, resource descriptions in this new format were created for a number of collections and made available via a dedicated \xne{IMDI browser}\furl{http://corpus1.mpi.nl/ds/imdi_browser/}, which allowed browsing the collection structure as well as complex advanced search over the deeply structured metadata. A metadata editor was also developed for generating records in this format, with provisions for offline fieldwork and synchronization with the repository.
     205
     206The project was led by the Technical Group at the MPI for Psycholinguistics, which was also responsible for running the repository and the whole infrastructure. Since its foundation (together with the Institute itself) in the 1970s\furl{http://tla.mpi.nl/home/history/}, the group has engaged in a number of projects aimed at building a stable technical infrastructure for long-term archiving of and work with language resources. Recently, the group and the established infrastructure have been renamed to \xne{TLA -- The Language Archive}\furl{http://tla.mpi.nl/} (``Your partner for language data, tools and archiving''), a single platform on which the host of language resources and their descriptions are preserved and provided, and tools for working with this data are offered. The archive is also an aggregator itself, offering various collections from different (also external) projects (like DOBES, CGN, RELISH, etc.).
     207
     208IMDI can be seen as the predecessor of CMDI, the team of the Technical Group being the driving force behind the development of both. An \xne{imdi-session} profile, the corresponding IMDI to CMDI conversion
     209as well as the transformed records were among the first to be added to the new CMD Infrastructure in 2010. The statistics
     210of CMDI records list around 138.000 \xne{Session} records and around 13.000 \xne{imdi-corpus} records, modelling the collections for the sessions. Also, the metadata editor \xne{Arbil} was refactored to work with the new data model.
     211
    245212
    246213\subsection{META-SHARE}
    247214\label{def:META-SHARE}
    248 Within the project META-SHARE format
    249 
    250 META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components.
    251 %In cooperation between metadata teams from CLARIN and META-SHARE
    252 
    253 The original META-SHARE schema actually accomodates four models for different resource types. Consequently, the model has been expressed as 4 CMD profiles each for a distinct resource type however all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
    254 
    255 MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
    256 
    257 
    258 
    259 \subsection{META-NET}
    260 
     215
     216META-SHARE (2010-2013) was the subproject covering the technical aspects of META-NET, a Network of Excellence consisting of 60 research centres from 34 countries.
    261217
    262218
     
    264220META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
    265221
    266 META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).
    267 
    268222\end{quotation}
    269223
    270 The distributed networks of repositories consists of a number of member repositories, that offer their own subset of resource.
    271 
    272 A few\footnote{7 as of 2013-07} of the members repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
     224Within the project META-SHARE a new metadata format was developed\cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a subset of core obligatory elements and with many optional components.
     225%In cooperation between metadata teams from CLARIN and META-SHARE
     226
     227The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as 4 CMD profiles, each for a distinct resource type, with all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements. (See \ref{reports-meta-share} for more details about the format based on its integration into CMDI.)
     228
     229The technical infrastructure of META-SHARE is a distributed network consisting of a number of member repositories, each offering its own subset of resources\furl{http://www.meta-share.eu/}.
     230
     231Selected member repositories\footnote{7 as of 2013-07}  play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
    273232The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes).
    274233
    275 
    276 MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
     234One point of criticism from the community was that the META-SHARE infrastructure does not provide any interface to the outside world, such as an OAI-PMH endpoint.
     235
     236? MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
     237
    277238
    278239\subsection{ELRA}
    279240
    280 European Language Resources Association\furl{http://elra.info}
     241The European Language Resources Association (ELRA)\furl{http://elra.info} offers a large collection of language resources, mostly under license for a fee, although some resources are available for free as well.
     242The available datasets can be searched via the ELRA Catalog\furl{http://catalog.elra.info/}.
     243Additionally, ELRA runs the so-called \xne{Universal Catalog} -- a repository comprising information regarding Language Resources (LRs) identified all over the world.
    281244
    282245\begin{quotation}
    283 ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, we offer a range of services, listed below and described in the "Services around Language Resources" section:
     246ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies.
     247
     248ELDA\furl{http://www.elda.org/} - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community.
     249
     250ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and
     251drafts and concludes distribution agreements on behalf of ELRA.
    284252\end{quotation}
    285253
    286 http://www.elda.org/
    287 Evaluations and Language resources Distribution Agency
    288 
    289 ELDA - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community. Besides, ELDA is involved in HLT evaluation campaigns.
    290 
    291 ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.
    292 
    293 ELRA Catalog
    294 
    295 http://catalog.elra.info/
    296 
    297 
    298 Universal Catalog+
    299  Universal Catalogue is a repository comprising information regarding Language Resources (LRs) identified all over the world.
    300 
    301 
    302 \subsection{Other}
    303 
    304 OAI-ORE - is this a schema?
    305 
    306 
    307 
    308 \section{Ontologies, Controlled Vocabularies, Reference Data, Authority Files}
     254\subsection{LDC}
     255
     256The Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/} is another provider of high-quality curated language resources.
     257
     258
     259\section{Formats and Collections in the World of Libraries}
     260
     261There are at least two reasons to concern ourselves with developments in the world of Libraries and Information Systems (LIS): its long tradition, implying rich experience, and the fact that almost all of the resources in libraries are language resources. This argument becomes even more relevant in the light of the efforts pursued by many (national) libraries in recent years to digitize large portions of their material (cf. discussion on Libraries partnering with Google). And given the amounts of data, even the bibliographic records alone constitute sizable language resources in their own right.
     262
     263%\item[LoC] Library of Congress \url{http://www.loc.gov}
     264%\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_ en.htm}
     265%\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
     266%\end{description}
     267
     268\subsection{Formats  -- MARC, METS, MODS}
     269
     270There is a long tradition of standardized metadata formats in the world of Libraries and Information Systems (LIS), with the Library of Congress\furl{http://www.loc.gov/standards/} having played a major role in the standardization for decades.
     271
     272The \xne{MARC}\furl{www.loc.gov/marc/} set of formats (in use since the 1970s) ``are standards for the representation and communication of bibliographic and related information in machine-readable form''. A number of variants have developed over the years; the most widespread, \xne{MARC 21} (since 1999), is the standard format used for communication among libraries around the world.
     273
     274MARC 21 consists of 5 ``communication formats'' for specific types of data (Bibliographic, Authority Data, Holdings Data, Classification, and Community Information), which are widely used standards for the representation and exchange of these data in machine-readable form. In 2002, the Library of Congress developed the \xne{MARCXML} schema for representing MARC records in XML.
     275
     276\xne{METS -- Metadata Encoding and Transmission Standard} is a format from the family of Library of Congress standards (since 2001) for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
     277It is dedicated primarily to capturing the structure of digital objects, to ``record the various relationships that exist between pieces of content, and between the content and metadata that compose a digital library object'' \cite{mets2010manual}.
     278A METS record acts as a flexible container that accommodates other pieces of data (different levels of metadata and the encoded objects themselves or references to those) in external formats\furl{http://www.loc.gov/standards/mets/mets-extenders.html}.
     279
     280A number of tools have been developed to author and process the \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html}, and numerous projects (online editions, DAM systems) use METS for structuring and recording their data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html}, though this registry seems rather outdated}, among others \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}.
     281
     282The Metadata Object Description Schema (MODS) ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21 using language-based tags rather than numeric ones,
     283yet richer than Dublin Core. It is one of the schemas endorsed for use inside METS (as an extension).
     284
     285In 1998 a new  Entity Relationship model - FRBR - Functional Requirements for Bibliographic Records  2002 \cite{FRBR1998}
     286and since ?? RDA - Resource Description and Access
     287
     288\subsection{ESE, Europeana Data Model - EDM}
     289
     290Within the large European initiative \xne{Europeana} (cf. \ref{lit:digi-lib}), information about digitised objects is collected from a great number of cultural institutions from all over Europe, currently
     291
     292originally developed and advised the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation} a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. It soon became obvious that this format is very limiting, and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model EDM\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana, haslhofer2011data,doerr2010europeana}.
     293EDM is fully compatible with ESE, which is (and will continue to be) accepted from the providers. There is a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the semantic data of Europeana.
     294%https://github.com/europeana
     295
     296%%%%%%%%%%%%%%%%%%
     297\section{Controlled Vocabularies, Reference Data, Ontologies}
    309298\label{refdata}
    310299
    311 Based on popular demand, the work on reference data for the SSH-community should cover at least the following dimensions (with tentative denominations of corresponding existing vocabularies):
    312 
    313 \begin{itemize}
    314 \item Data Categories / Concepts - ISOcat
    315 \item Languages - ISO-639
    316 \item Countries - country codes
    317 \item Persons - GND, VIAF
    318 \item Organizations - GND, VIAF
    319 \item Schlagw\"{o}rter/Subjects - GND, LCSH
    320 \item Resource Typology -
    321 \end{itemize}
    322 
    323 AAT - international Architecture and Arts Thesaurus
    324 GND - Gemeinsame Norm Datei (GND ontology\furl{http://d-nb.info/standards/elementset/gnd}
    325 GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
    326 VIAF - Virtual International Authority File
    327 
    328 
    329 Other related relevant activities and initiatives
    330 
    331 http://www.w3.org/wiki/WebSchemas/ExternalEnumerations#Controlled_property_values
    332 
    333 A broader collection of related initiatives can be found at the German National Library website:
    334 \furl{http://www.dnb.de/DE/Standardisierung/LinksAFS/linksafs_node.html}
    335 FRBR - Functional Requirements for Bibliographic Records  2002 \cite{FRBR1998}
    336 
    337 RDA - Resource Description and Access
    338 http://metadaten-twr.org/ - Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)
    339 At MPDL, within the escidoc publication platform there seems to be (work  on) a service (since 2009 !) for controlled vocabularies: \furl{http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities}
    340 Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities -- developed at the New Zealand Electronic Text Centre (NZETC).
    341 http://eats.readthedocs.org/en/latest/
    342 
    343 
    344 \subsection{ISOcat - Data Category Registry}
    345 
    346 ISO12620
    347 
    348 \subsection{Classification Schemes, Taxonomies }
    349 LCSH, DDC
    350 
    351 
    352 \subsection{Other controlled Vocabularies}
    353 
    354 Language codes ISO-639-1
    355 
    356 \subsection{Domain Ontologies, Vocabularies}
    357 Organization-Lists
    358 LT-World !?
    359 
    360 
    361 \subsubsection{LT-World}
     300Since one goal of this work is to lay the groundwork for exposing the discussed dataset in the Semantic Web,
     301one preparatory task is to identify external semantic resources like controlled vocabularies or ontologies that the dataset could be linked with\footnote{A similar inventory of vocabularies and thesauri was compiled in the context of the \xne{Europeana} initiative:
     302\url{http://europeanalabs.eu/wiki/WP12Vocabularies}, \url{https://europeanalabs.eu/wiki/DesignSemanticThesauri}}.
     303
     304Conceptually, we want to partition these resources into two types: on the one hand, abstract concepts constituting all kinds of classifications, typologies and taxonomies; on the other hand, named entities that exist(ed) in the real world, like persons, organizations or geographical places. The main motivation for this distinction is the insight that, while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (sameAs), for concepts we need to accept a plurality of existing conceptualizations, and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.
     305
     306In the following we take inventory of such resources, covering the domains expected in the dataset. (Information about the size of a dataset is meant as a rough indication of its ``general weight'', not necessarily as precise, up-to-date information.) The acronyms in the tables are resolved in the subsequent glossary.
     307How these resources will be employed is discussed in \ref{sec:values2entities}.
     308
     309%\subsubsection{Named entities}
     310
     311The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called Virtual International Authority File, a huge resource, with entries from different authority files referring to the same entity being merged. This resource can be explored via a search interface and there is also a search service for applications.
     312Other general large-scale resources are the vocabularies curated and provided by the Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}; however, free access is limited, and full access requires a license and a fee. Recently, though, work was announced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}.
     313
     314Yago is a large knowledge base integrating dbpedia, geonames and ..??
     315
    362316Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}},  the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}},  is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study. \cite{Joerg2010}
    363317
    364 
    365 
     318So we witness a strong general trend towards Semantic Web and Linked Open Data.
     319
     320%Next to these ``global big players'' there are a number of other initiatives on different scale dedicated to a more specific domain.
     321
     322%Resources that contain different types of data (e.g. persons, places and classifications like GND or Yago) are divided and mentioned in individual tables by type.
     323
     324%\subsection{Concepts -- Classifications, Taxonomies, \dots}
     325
     326
     327\begin{landscape}
     328\begin{table}
     329\caption{Controlled vocabularies of named entities -- Persons, Organizations, Works, Language Names, Geographica}
     330\label{table:data-ne}
     331%  \begin{tabu}{  p{0.2\textwidth}  p{0.2\textwidth}  p{0.2\textwidth}   p{0.2\textwidth}   p{0.2\textwidth} }
     332  \begin{tabu}{  >{\sffamily}l l r X X}
     333    \hline
     334\rowfont{\itshape\small} name & provider & size (items / facts)  & description & access \\
     335    \hline
     336VIAF & OCLC + NatLibs & $\gg$ 1E7 & union of national authority files & search service, search app \\
     337GND/p & DNB & 4.6E6 & Persons, universal, lang:de  & \href{http://d-nb.info/standards/elementset/gnd}{GND ontology}\\
     338GND/k & '' & 1.2E6 & Organizations, universal, lang:de  & \\
     339GND/w & '' & 193,000 & Works, lang:de  & \\
     340GND/g & '' & 293.000 & Geographica, lang:de & \\
     341ULAN & Getty & 202,720 / 638,900 & persons, artists     & \\
     342TGN & Getty & 992.310 / 1.7E6 & also historical place names & \href{http://www.getty.edu/research/tools/vocabularies/index.html}{web search} \\
     343%CONA & Getty & & records for cultural works & \\       
     344dbpedia & Wikipedia & $\sim$ 4E6 & all kinds of entities in up to 111 langs & \href{http://wiki.dbpedia.org/Downloads}{data dumps}, \href{http://dbpedia-live.openlinksw.com/sparql}{live SPARQL endpoint} \\
     345& & \multicolumn{3}{l}{764,000 persons; 333,000 works; 192,000 organizations; 639,000 geographica } \\
     346Yago \cite{Suchanek2007yago} & MPI Informatik & 1E7 / 1.2E8 & huge semantic KB (aggregated from Wikipedia, Wordnet, Geonames) & \href{http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html}{data dumps} \\
     347\href{http://lt-world.de}{LT-World} & DFKI & 3,300 persons, 4,600 organizations & ontology-based portal for Language Technology & \href{http://www.lt-world.org/kb/}{portal} \\
     348Geonames & Geonames & \textgreater 1E7 (2.8E6 / 5.5E6) & ``modern'' place names & data dump + web service \\
     349PKND     & prometheus & \textgreater 37,000 & persons, artists & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{XML dump} \\
     350\href{http://gazetteer.dainst.org/}{iDAI.gazetteer} & DAI &  & archaeologically relevant places & search interface \\
     351%Pelagios & AIT & 25 datasets & search over 25 datasets of archeologically relevant places & API\furl{https://github.com/pelagios/pelagios-cookbook/wiki/Using-the-Pelagios-API} \\
     352\href{http://pleiades.stoa.org}{Pleiades} & & 34,000 & a community-built gazetteer and graph of ancient places & CSV, KML and RDF data dumps \\
     353LCCN & LoC & \textgreater 1.2E7 & identifier for bibliographic records & \href{http://authorities.loc.gov/}{search service}, search app \\
     354ISO 3166 & ISO & 249 & Official country codes, lang: en, fr &   \\
     355ISO-639-1& ISO & 185 & basic language codes & \href{http://www.loc.gov/standards/iso639-2/php/English_list.php}{static list} \\
     356ISO-639-3 & SIL & $\sim$ 7,679 & 3-letter code for every human language & \href{http://www-01.sil.org/iso639-3/}{view/download} \\
     357CLAVAS & CLARIN & 2,500  & organization names extracted from CMD records & \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\
     358\hline
     359\end{tabu}
     360\end{table}
     361
     362\begin{comment}
     363\hline
     364  \end{tabu}
     365\end{table}
     366
     367\begin{table}
     368\caption{Controlled vocabularies of named entities -- Geographica}
     369\label{table:data-ne-places}
     370
     371%  \begin{tabu}{  p{0.2\textwidth}  p{0.2\textwidth}  p{0.2\textwidth}   p{0.2\textwidth}   p{0.2\textwidth} }
     372  \begin{tabu}{  >{\sffamily}l l r X X}
     373    \hline
     374\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\
     375
     376\end{comment}
     377
     378
     379\begin{table}
     380\caption{Taxonomies, Classifications, Thesauri}
     381\label{table:data-concepts}
     382  \begin{tabu}{  >{\sffamily}l l r X X}
     383    \hline
     384\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\
     385    \hline
     386AAT & Getty & \href{http://www.getty.edu/research/tools/vocabularies/aat/aat_faq.html}{34,880 / 245,530} & subjects in  art and architecture &  \\
     387LCSH & LoC &  & subjects, universal & \href{http://fast.oclc.org/searchfast/}{FAST} (Faceted Application of Subject Terminology), \href{http://experimental.worldcat.org/fast/}{Linked Data FAST} \\
     388LCC  & LoC & & universal hierarchical classification & web app: \href{http://classificationweb.net/}{classification web} \\
     389GND/s & DNB & 202,000 & subjects (Schlagwörter), universal, lang:de & \\
     390GTAA & NISL & 3,800 & subjects, describing TV programs & \href{http://datahub.io/de/dataset/gemeenschappelijke-thesaurus-audiovisuele-archieven}{(RDF) data dumps}, \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\
     391DDC & OCLC & & universal classification by field of study, translated into multiple languages & \href{http://dewey.info/}{dewey.info} \\
     392UDC & & & & \\
     393Wiki Categories & Wikipedia & 995,911& classification of Wiki articles as skos:Concepts & SKOS Vocabulary, SPARQL \\
     394 DBpedia Ontology & Wikipedia & 529 / 2333 & general classification of Wiki articles as ontology & \href{http://wiki.dbpedia.org/Ontology39?v=g9b}{RDF data}, SPARQL\\
     395ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts in a number of thematic groups (Metadata, Lexical Resources, ...) & \href{http://www.isocat.org}{web-app}, service \\
     396Object Names Thesaurus & British Museum & &  classification of objects in the collection & \\
     397Material Thesaurus & British Museum & & classification of material & \\
     398Thesaurus of Monument Types & British Museum & & types of monuments & \\
     399Hornbostel-Sachs-Systematik & & 300 categories & classification of musical instruments & \href{http://www.music.vt.edu/musicdictionary/texth/Hornbostel-Sachs.html}{web page} \\
     400Oberbegriffsdatei  & DMB & & a set of vocabularies for museums, lang:de  & \url{museumsvokabular.de}, PDF, XML dumps\\
     401Iconclass & RKD & 28,000 & taxonomy of subject of an image &  \href{http://iconclass.org/data/iconclass.20121019.nt.gz}{RDF dump} \\
     402\href{http://dirt.projectbamboo.org/}{DiRT} & Project Bamboo & 32 categories & taxonomy of research tools (1,200 tools)  &  \\
     403%Scholarly Methods Taxonomy & DARIAH & 100 & research activities in a 2-level hierarchy and brief scope notes & in preparation \\
     404\hline
     405\end{tabu}
     406\end{table}
     407
     408\end{landscape}
    366409
    367410\begin{description}
    368 \item[LDC]  Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/}
    369 \item[OTA LR] Archiving Service provided by Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/}
     411\item[AAT] Art \& Architecture Thesaurus, Getty
     412\item[CONA] Cultural Objects Name Authority
     413\item[DAI] Deutsches Archäologisches Institut
     414\item[DDC] Dewey Decimal Classification
     415\item[DFKI] Deutsches Forschungszentrum für Künstliche Intelligenz
     416\item[DMB] Deutscher Museumsbund
     417\item[DNB] Deutsche Nationalbibliothek
     418\item[FAST] Faceted Application of Subject Terminology
     419\item[Getty] Getty Research Institute curating the vocabularies\furl{http://www.getty.edu/research/tools/vocabularies/index.html}, part of Getty Trust
     420\item[GND] \emph{Gemeinsame Normdatei} -- Integrated Authority File of the German National Library
     421\item[GTAA] Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
     422\begin{quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation}
     423\item[ISO] International Organization for Standardization
     424\item[LCCN] Library of Congress Control Number
     425\item[LCC] Library of Congress Classification
     426\item[LCSH] Library of Congress Subject Headings
     427\item[LoC] Library of Congress\furl{http://loc.gov}
     428\item[OCLC] Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation
     429\item[PKND] prometheus KünstlerNamensansetzungsDatei\furl{http://prometheus-bildarchiv.de/de/tools/pknd}
     430\item[RKD] Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History
     431\item[TGN] Getty Thesaurus of Geographic Names
     432\item[UDC] Universal Decimal Classification                             
     433\item[ULAN] Union List of Artist Names
     434\item[VIAF] Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries
    370435\end{description}
    371436
    372 \section{Other Metadata Catalogs/Collections}
    373 \label{sec:other-md-catalogs}
    374 
    375 \subsection{(Digital) Libraries}
    376 
    377 
    378 General (Libraries, Federations):
    379 
    380 \begin{description}
    381 \item[OCLC] \url{http://www.oclc.org}
    382     world's biggest Library Federation
    383 \item[LoC] Library of Congress \url{http://www.loc.gov}
    384 \item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_ en.htm}
    385 \item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
    386 \end{description}
    387 
    388 
     437
     438\begin{comment}
    389439
    390440VoID (``Vocabulary of Interlinked Datasets'') is an RDF-based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID}
    391441
    392 http://www.dnb.de/rdf
    393 
     442\subsection{schema.org}
     443http://schema.org/docs/datamodel.html
     444http://www.w3.org/wiki/WebSchemas/ExternalEnumerations
     445
     446microdata or
     447http://www.w3.org/TR/rdfa-lite/
     448 Resource Description Framework in attributes
    394449
    395450the entire WorldCat cataloging collection made publicly
     
    402457Web crawlers such as Google and Bing
    403458
    404 
    405 \subsection{schema.org}
    406 
    407 http://schema.org/docs/datamodel.html
    408 
    409 microdata or
    410 http://www.w3.org/TR/rdfa-lite/
    411  Resource Description Framework in attributes
    412 
     459\end{comment}
    413460
    414461\section{Summary}
    415462
    416 In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology
    417 
     463In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
     464We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications).
     465
  • SMC4LRT/chapters/Literature.tex

    r3680 r3681  
    3333
    3434\subsubsection{Digital Libraries}
     35\label{lit:digi-lib}
    3536
    3637In a broader view we should also regard the activities in the world of libraries.
     
    4445\xne{The European Library}\furl{http://www.theeuropeanlibrary.org/tel4/} offers a search interface over more than 18 million digital items and almost 120 million bibliographic records from 48 National Libraries and leading European Research Libraries.
    4546
    46 \xne{Europeana}\furl{http://www.europeana.eu/} \cite{purday2009think} has even broader scope, serving as meta-aggregator and portal for European digitised works, encompassing material not just from libraries, but also museums, archives and all other kinds of collections (In fact, The European Library is the \emph{library aggregator} for Europeana). The auxiliary project \xne{EuropeanaConnect}\furl{http://www.europeanaconnect.eu/} (2009-2011) delivered the core technical components for Europeana as well as further services reusable in other contexts, e.g. the spatio-temporal browser \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo} \cite{janicke2013geotemco}.
    47 
     47\xne{Europeana}\furl{http://www.europeana.eu/} \cite{purday2009think} is a cultural heritage initiative with an even broader scope, serving as ``meta-aggregator and portal for European digitised works'', encompassing material not just from libraries, but also from museums, archives and all other kinds of collections (in fact, The European Library is the \emph{library aggregator} for Europeana).
     48
     49A large number of projects have contributed to Europeana. For example, the auxiliary project \xne{EuropeanaConnect}\furl{http://www.europeanaconnect.eu/} (2009--2011) delivered the core technical components for Europeana as well as further services reusable in other contexts, such as the spatio-temporal browser \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo} \cite{janicke2013geotemco}.
    4850Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015) a succession of \xne{Europeana} was established, a Best Practice Network, coordinated by The European Library, designed to establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform - Europeana Research.
    4951
  • SMC4LRT/chapters/Results.tex

    r3680 r3681  
    220220\end{table}
    221221
    222 DBNL\_Tekst clarin.eu:cr1:p\_1361876010678,
    223 clarin.eu:cr1:p 1366279029218 (private)
     222%DBNL\_Tekst clarin.eu:cr1:p\_1361876010678, clarin.eu:cr1:p 1366279029218 (private)
    224223
    225224%
    226225\subsubsection{META-SHARE}
    227226%
     227\label{reports-meta-share}
    228228
    229229META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components.
  • SMC4LRT/utils.tex

    r3680 r3681  
    2121\usepackage{tabularx}
    2222\usepackage{tabu}
    23        
     23\usepackage{pdflscape} 
     24
    2425\usepackage[singlelinecheck=off]{caption}
    2526