Changeset 3681

Timestamp:

10/06/13 18:19:22

Author:

vronk

Message:

Location:

SMC4LRT

Files:

: 5 edited

Outline.tex (modified) (1 diff)
chapters/Data.tex (modified) (11 diffs)
chapters/Literature.tex (modified) (2 diffs)
chapters/Results.tex (modified) (1 diff)
utils.tex (modified) (1 diff)


SMC4LRT/Outline.tex

-                      r3680
+                      r3681
 \listoffigures
 \listoftodos
 \begin{comment}
 \input{chapters/Introduction}
 \input{chapters/Literature}
 \input{chapters/Definitions}

SMC4LRT/chapters/Data.tex

-                      r3680
+                      r3681
 \label{tab:cmd-profiles}
 \begin{center}
   \begin{tabular}{ r l }
     \hline
 \# records & profile \\
+  \begin{tabu}{ r l }
+    \hline
+\rowfont{\itshape\small} \# records & profile \\
     \hline
 .403 & Song \\
 …
 & teiHeader \\
     \hline
   \end{tabular}
+  \end{tabu}
 \end{center}
 \end{table}
 …
 \caption{Top 20 CMD collections, with the respective number of records}
 \begin{center}
   \begin{tabular}{ r l }
     \hline
 \# records & collection \\
+  \begin{tabu}{ r l }
+    \hline
+\rowfont{\itshape\small} \# records & collection \\
     \hline
 .129 & Meertens collection: Liederenbank \\
 …
 .081 & MPI für Bildungsforschung \\
 \hline
   \end{tabular}
+  \end{tabu}
 \end{center}
 \end{table}
 …
 \section{Other Metadata Formats and Collections }
+\section{Other LRT Metadata Formats and Collections }
 \label{sec:lrt-md-catalogs}
+Riley and Becker \cite{Riley2010seeing} arrange the overwhelming number of existing metadata standards into a systematic, comprehensive overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. Despite its aspiration to comprehensiveness, it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE, TEI?
+The CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology.
+Next to CLARIN and CMDI, there is a host of related previous and concurrent work. In the following, we briefly introduce some formats and data collections established in the field and, where applicable, also sketch the ties with CMDI and existing integration efforts.
+Two survey works cover existing formats. The CLARIN deliverable \textit{Interoperability and Standards} \cite{CLARIN_D5.C-3} provides an overview of standards, vocabularies and other normative/standardization work in the field of Language Resources and Technology. \textit{Seeing standards: a visualization of the metadata universe} by Riley and Becker \cite{Riley2010seeing} arranges the overwhelming number of existing metadata standards into a systematic, comprehensive overview, analyzing the use of standards from four aspects: community, domain, function, and purpose. Despite its aspiration to comprehensiveness, however, it leaves out some of the formats relevant in the context of this work: IMDI, EDM, ESE, TEI???
 …
 It is a fixed set of terms for a basic generic description of a range of resources (both virtual and physical), coming in two versions\furl{http://dublincore.org/documents/dcmi-terms/}:
 \begin{description}
 \item[Dublin Core Metadata Element Set (DCMES) ] \code{/elements/1.1/}
+\item[Dublin Core Metadata Element Set (DCMES) ] namespace: \code{/elements/1.1/}\\
 the original set of 15 terms, standardized as IETF RFC 5013, ISO Standard 15836-2009 and NISO Standard Z39.85-2007
 \item[Dublin Core metadata terms ] \code{/terms/}
+\item[Dublin Core metadata terms ]  namespace: \code{/terms/} \\
 the extended `Qualified' set of 55 terms, extending the original 15 ones (replicating them in the new namespace for consistency)
 \end{description}
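The two term sets can be told apart in a record by their namespaces. The following is a minimal sketch, not taken from the thesis sources: it assumes the usual `http://purl.org/dc` base in front of the `/elements/1.1/` and `/terms/` suffixes named above, and the helper `build_record`, the wrapper element `record` and the field values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed full namespace URIs: the text above only gives the suffixes
# /elements/1.1/ and /terms/.
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"


def build_record(title, creator, issued):
    """Build a small record mixing DCMES and DC terms elements (hypothetical helper)."""
    # Register readable prefixes so serialization uses dc: and dcterms:.
    ET.register_namespace("dc", DC)
    ET.register_namespace("dcterms", DCTERMS)
    record = ET.Element("record")  # hypothetical wrapper element
    # Two elements from the original 15-term set.
    ET.SubElement(record, f"{{{DC}}}title").text = title
    ET.SubElement(record, f"{{{DC}}}creator").text = creator
    # One element that exists only in the extended /terms/ namespace.
    ET.SubElement(record, f"{{{DCTERMS}}}issued").text = issued
    return record


record = build_record("Example title", "Example, Author", "2013")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping the namespace URIs as constants makes it explicit which of the two sets each element is drawn from.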
 …
    <publisher>New York: Holt</publisher>
 </olac:olac>
 \end{
+\end{lstlisting}
 OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''.
 Note that OLAC archives are harvested by the CLARIN harvester and OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).
 \subsection{TEI / teiHeader}
 …
 \begin{quotation}
-The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form.\furl{http://www.tei-c.org/}
-\end{quotation}
-encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics.
-\begin{quotation}
  The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots  [Next to] its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abridged]
 \ebnd{quotation}
+\end{quotation}
 TEI is a de-facto standard for encoding any kind of digital textual resource and has been developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description and metadata encoding (our main concern), the complex top-level element \code{teiHeader} is foreseen. TEI is descriptive rather than prescriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and their inner structure, making it possible to generate custom schemas adapted to projects' needs. Thus there is also not just one fixed \code{teiHeader}.
 …
 There has been an intense cooperation between the TEI and CMDI community on the issue of interoperability and multiple efforts to express teiHeader in CMDI were undertaken (cf. \ref{results:tei}) as a starting point for integrating TEI-based data into the CLARIN infrastructure.
+\subsection{ISLE/IMDI}
+IMDI = ISLE Metadata
+http://www.mpi.nl/imdi/
+\begin{quotation}
+The ISLE Meta Data Initiative (IMDI) is a proposed metadata standard to describe multi-media and multi-modal language resources. The standard provides interoperability for browsable and searchable corpus structures and resource descriptions with help of specific tools.
+\end{quotation}
+\subsection{LAT, TLA}
+Language Archiving Technology, now The Language Archive, provided by the Max Planck Institute for Psycholinguistics\footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}}
+Predecessor of CMDI
+\subsection{MODS/METS}
+Metadata Encoding and Transmission Standard - an XML schema for encoding descriptive, administrative, and structural metadata regarding objects within a digital library
+Metadata Object Description Schema - is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications.
+\subsection{ESE, Europeana Data Model - EDM}
+ESE Europeana Semantic Elements-
+EDM\furl{http://europeana.ontotext.com/resource/edm/hasType?role=all} \cite{doerr2010europeana}
+The Linked Data approach will play a major role in the European Digital Library (http://europeana.eu) and solutions that can handle data expressed in the newly created, RDF-based Europeana Data Model (EDM) are currently being investigated. This report summarizes the results of a study we performed on existing RDF stores, in the context of Europeana and encompasses the following contributions
+data.europeana.eu: The Europeana Linked Open Data Pilot\cite{haslhofer2011data}
+\subsection{ISLE/IMDI -- The Language Archive}
+\xne{IMDI}\furl{http://www.mpi.nl/imdi/} (\xne{EAGLES/ISLE Meta Data Initiative}) is an elaborate format for detailed descriptions of multi-media/multi-modal language resources, developed within the corresponding project\cite{wittenburg2000eagles} from 2000 to 2003.
+To serve the main goal of the project, easing access to language resources and fostering their reuse, resource descriptions in this new format were created for a number of collections and made available via a dedicated \xne{IMDI browser}\furl{http://corpus1.mpi.nl/ds/imdi_browser/}, which allowed browsing the collection structure as well as complex advanced search over the deeply structured metadata. A metadata editor was also developed for generating records in this format, with provisions for offline field-work and synchronization with the repository.
+The project was led by the Technical Group at the MPI for Psycholinguistics, which was also responsible for running the repository and the whole infrastructure, and which has engaged in a number of projects aimed at building a stable technical infrastructure for long-term archiving of and work with language resources since its foundation (together with the Institute itself) in the 1970s\furl{http://tla.mpi.nl/home/history/}. Recently, the group and the established infrastructure have been renamed \xne{TLA -- The Language Archive}\furl{http://tla.mpi.nl/} (``Your partner for language data, tools and archiving''), a single platform on which the host of language resources and their descriptions are preserved and provided, and tools for working with this data are offered. The archive is also an aggregator itself, offering various collections from different (also external) projects (like DOBES, CGN, RELISH, etc.).
+IMDI can be seen as the predecessor of CMDI, the team of the TG being the driving force behind the development of both. An \xne{imdi-session} profile, the corresponding IMDI to CMDI conversion
+as well as the transformed records were among the first to be added to the new CMD Infrastructure in 2010. The statistics
+of CMDI records list around 138,000 \xne{Session} records and around 13,000 \xne{imdi-corpus} records, modelling the collections for the sessions. Also, the metadata editor \xne{Arbil} was refactored to work with the new data model.
 \subsection{META-SHARE}
 \label{def:META-SHARE}
+Within the project META-SHARE format
+META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components.
+%In cooperation between metadata teams from CLARIN and META-SHARE
+The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as 4 CMD profiles, each for a distinct resource type, with all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements.
+MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
+\subsection{META-NET}
+META-SHARE was the subproject (2010-2013) of META-NET, a Network of Excellence consisting of 60 research centres from 34 countries, that covered the technical aspects.
 …
 META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
-META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).
 \end{quotation}
+The distributed network of repositories consists of a number of member repositories that offer their own subset of resources.
+A few\footnote{7 as of 2013-07} of the member repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
+Within the project META-SHARE a new metadata format was developed\cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a subset of core obligatory elements and with many optional components.
+%In cooperation between metadata teams from CLARIN and META-SHARE
+The original META-SHARE schema actually accommodates four models for different resource types. Consequently, the model has been expressed as 4 CMD profiles, each for a distinct resource type, with all four sharing most of the components, as can be seen in figure \ref{fig:resource_info_5}. The biggest single profile is currently the remodelled maximum schema from the META-SHARE project for describing corpora, with 117 distinct components and 337 elements. When expanded, this translates to 419 components and 1587 elements. However, many of the components and elements are optional (and conditional), thus a specific instance will never use all the possible elements. (See \ref{reports-meta-share} for more details about the format based on its integration into CMDI)
+The technical infrastructure of META-SHARE is a distributed network consisting of a number of member repositories that offer their own subset of resources\furl{http://www.meta-share.eu/}.
+Selected member repositories\footnote{7 as of 2013-07}  play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
 The whole network offers approximately 2,000 resources (the numbers differ even across individual managing nodes).
+MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
+One point of criticism from the community was the fact that the META-SHARE infrastructure does not provide any interface to the outside world, such as an OAI-PMH endpoint.
+? MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
 \subsection{ELRA}
+European Language Resources Association\furl{http://elra.info}
+The European Language Resources Association (ELRA)\furl{http://elra.info} offers a large collection of language resources, mostly under license for a fee, although some resources are available for free as well.
+The available datasets can be searched via the ELRA Catalog\furl{http://catalog.elra.info/}.
+Additionally, ELRA runs the so-called \xne{Universal Catalog} -- a repository comprising information regarding Language Resources (LRs) identified all over the world.
 \begin{quotation}
+ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, we offer a range of services, listed below and described in the "Services around Language Resources" section:
+ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies.
+ELDA\furl{http://www.elda.org/} - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community.
+ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and
+drafts and concludes distribution agreements on behalf of ELRA.
 \end{quotation}
+http://www.elda.org/
+Evaluations and Language resources Distribution Agency
+ELDA - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community. Besides, ELDA is involved in HLT evaluation campaigns.
+ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.
+ELRA Catalog
+http://catalog.elra.info/
+Universal Catalog+
+ Universal Catalogue is a repository comprising information regarding Language Resources (LRs) identified all over the world.
+\subsection{Other}
+OAI-ORE - is this a schema?
+\section{Ontologies, Controlled Vocabularies, Reference Data, Authority Files}
+\subsection{LDC}
+The Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/} is another provider of high-quality curated language resources.
+\section{Formats and Collections in the World of Libraries}
+There are at least two reasons to concern ourselves with the developments in the world of Libraries and Information Systems (LIS): the long tradition implying rich experience, and the fact that almost all of the resources in libraries are language resources. This argument becomes even more relevant in light of the efforts to digitize large portions of the material, pursued in many (national) libraries in recent years (cf. discussion on libraries partnering with Google). And given the amounts of data, even the bibliographic records alone constitute sizable language resources in their own right.
+%\item[LoC] Library of Congress \url{http://www.loc.gov}
+%\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_ en.htm}
+%\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
+%\end{description}
+\subsection{Formats  -- MARC, METS, MODS}
+There is a long tradition of standardized metadata formats in the world of Libraries and Information Systems (LIS), with the Library of Congress\furl{http://www.loc.gov/standards/} having assumed a major role in the standardization for decades.
+The \xne{MARC}\furl{www.loc.gov/marc/} set of formats (in use since the 1970s) ``are standards for the representation and communication of bibliographic and related information in machine-readable form''. A number of variants developed over the years; the most widespread, \xne{MARC 21} (since 1999), is the standard format used for communication among libraries around the world.
+MARC 21 consists of 5 ``communication formats'' for specific types of data (Bibliographic, Authority Data, Holdings Data, Classification, and Community Information). In 2002, the Library of Congress developed the \xne{MARCXML} schema for representing MARC records in XML.
+\xne{METS -- Metadata Encoding and Transmission Standard} is a format from the family of Library of Congress standards (since 2001) for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
+It is dedicated primarily to capturing the structure of the digital objects, to ``record the various relationships that exist between pieces of content, and between the content and metadata that compose a digital library object'' \cite{mets2010manual}.
+A METS record acts as a flexible container that accommodates other pieces of data (different levels of metadata and encoded objects themselves or references to those) in external formats\furl{http://www.loc.gov/standards/mets/mets-extenders.html}.
+A number of tools have been developed to author and process the \xne{METS} format\furl{http://www.loc.gov/standards/mets/mets-tools.html}, and numerous projects (online editions, DAM systems) use METS for structuring and recording the data\footnote{\url{http://www.loc.gov/standards/mets/mets-registry.html}, though the registry seems rather outdated}, among others \xne{austrian literature online}\furl{http://www.loc.gov/standards/mets/mets-registry.html}.
+Metadata Object Description Schema (MODS) ``is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications''. It is a simplified subset of MARC 21, using language-based tags rather than numeric ones,
+yet richer than Dublin Core. It is one of the schemas endorsed for extending (being used inside) METS.
+In 1998 a new Entity Relationship model - FRBR - Functional Requirements for Bibliographic Records  2002 \cite{FRBR1998}
+and since ?? RDA - Resource Description and Access
+\subsection{ESE, Europeana Data Model - EDM}
+Within the big European initiative \xne{Europeana} (cf. \ref{lit:digi-lib}), information about digitised objects is collected from a great number of cultural institutions from all of Europe, currently
+originally developed and advised the common format \xne{ESE Europeana Semantic Elements}\furl{http://pro.europeana.eu/ese-documentation} a Dublin Core-based application profile\furl{www.europeana.eu/schemas/ese/ESE-V3.4.xsd}. It soon became obvious that this format is very limiting, and work started on a Semantic Web compatible RDF-based format -- the Europeana Data Model EDM\furl{http://pro.europeana.eu/edm-documentation} \cite{isaac2012europeana, haslhofer2011data,doerr2010europeana}.
+EDM is fully compatible with ESE, which is (and will be) accepted from the providers. There is a SPARQL endpoint\furl{http://europeana.ontotext.com/sparql} to explore the semantic data of Europeana.
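Exploring such an endpoint usually starts with a generic probe query. The sketch below only builds the request URL and performs no network access; it assumes the endpoint accepts queries via HTTP GET in a `query` parameter (as the SPARQL protocol specifies), and the helper name `build_query_url` and the probe query are illustrative, not part of the thesis. Whether the endpoint is still reachable is not verified here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint URL as quoted in the text above.
ENDPOINT = "http://europeana.ontotext.com/sparql"

# Generic probe: fetch ten arbitrary triples without assuming any vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_query_url(endpoint, query):
    """Return a GET URL carrying the SPARQL query in the 'query' parameter (hypothetical helper)."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
print(url)

# Round-trip check: decoding the URL yields the original query string.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

A vocabulary-free probe like this is a reasonable first step before writing queries against specific EDM classes and properties.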
+%https://github.com/europeana
+%%%%%%%%%%%%%%%%%%
+\section{Controlled Vocabularies, Reference Data, Ontologies}
 \label{refdata}
+Based on popular demand, the work on reference data for the SSH-community should cover at least the following dimensions (with tentative denominations of corresponding existing vocabularies):
+\begin{itemize}
+\item Data Categories / Concepts - ISOcat
+\item Languages - ISO-639
+\item Countries - country codes
+\item Persons - GND, VIAF
+\item Organizations - GND, VIAF
+\item Schlagw\"{o}rter/Subjects - GND, LCSH
+\item Resource Typology -
+\end{itemize}
+AAT - Art \& Architecture Thesaurus
+GND - Gemeinsame Normdatei (GND ontology\furl{http://d-nb.info/standards/elementset/gnd})
+GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
+VIAF - Virtual International Authority File
+Other related relevant activities and initiatives
+http://www.w3.org/wiki/WebSchemas/ExternalEnumerations#Controlled_property_values
+A broader collection of related initiatives can be found at the German National Library website:
+\furl{http://www.dnb.de/DE/Standardisierung/LinksAFS/linksafs_node.html}
+FRBR - Functional Requirements for Bibliographic Records  2002 \cite{FRBR1998}
+RDA - Resource Description and Access
+http://metadaten-twr.org/ - Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)
+At MPDL, within the escidoc publication platform there seems to be (work  on) a service (since 2009 !) for controlled vocabularies: \furl{http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities}
+Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities -- developed at the New Zealand Electronic Text Centre (NZETC).
+http://eats.readthedocs.org/en/latest/
+\subsection{ISOcat - Data Category Registry}
+ISO12620
+\subsection{Classification Schemes, Taxonomies }
+LCSH, DDC
+\subsection{Other controlled Vocabularies}
+Language codes ISO-639-1
+\subsection{Domain Ontologies, Vocabularies}
+Organization-Lists
+LT-World !?
+\subsubsection{LT-World}
+One goal of this work being to lay the groundwork for exposing the discussed dataset in the Semantic Web,
+one preparatory task is to identify external semantic resources, like controlled vocabularies or ontologies, that the dataset could be linked with\footnote{A similar activity of inventorying vocabularies and thesauri was conducted in the context of the \xne{Europeana} initiative:
+\url{http://europeanalabs.eu/wiki/WP12Vocabularies} \url{https://europeanalabs.eu/wiki/DesignSemanticThesauri}}.
+Conceptually, we want to partition these resources into two types: on the one hand, abstract concepts constituting all kinds of classifications, typologies and taxonomies; on the other hand, named entities that exist(ed) in the real world, like persons, organizations or geographical places. The main motivation for this distinction is the insight that, while for named entities there is (mostly) ``something'' in the (physical) world that gives a solid ground for equivalence relations between references from different sources (sameAs), for concepts we need to accept a plurality of existing conceptualizations; and while we can (and have to) try to identify relations between them, the equivalence relation is inherently much weaker. This insight entails a partly different approach -- simply put, while we can aspire to create one large list/index encompassing all named entities, we have to maintain a forest of conceptual trees.
+In the following we inventory such resources, covering the domains expected in the dataset. (Information about the size of a dataset is meant rather as a rough indication of its ``general weight'', not necessarily as precise, up-to-date information.) The acronyms in the tables are resolved in the subsequent glossary.
+How these resources will be employed is discussed in \ref{sec:values2entities}.
+%\subsubsection{Named entities}
+The largest controlled vocabularies of named entities are the authority files of (national) libraries. These are further aggregated into the so-called Virtual International Authority File, a huge resource, with entries from different authority files referring to the same entity being merged. This resource can be explored via a search interface and there is also a search service for applications.
+Other general large-scale resources are the vocabularies curated and provided by the Getty Research Institute\furl{http://www.getty.edu/research/tools/vocabularies/index.html}; however, free access is limited, and full access requires a license and a fee. Recently, though, work was announced to publish the vocabularies as LOD\furl{http://www.getty.edu/research/tools/vocabularies/lod/index.html}.
+Yago is a large knowledge base integrating dbpedia, geonames and ..??
 Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}},  the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}},  is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study. \cite{Joerg2010}
+So we witness a strong general trend towards Semantic Web and Linked Open Data.
+%Next to these ``global big players'' there are a number of other initiatives on different scale dedicated to a more specific domain.
+%Resources that contain different types of data (e.g. persons, places and classifications like GND or Yago) are divided and mentioned in individual tables by type.
+%\subsection{Concepts -- Classifications, Taxonomies, \dots}
+\begin{landscape}
+\begin{table}
+\caption{Controlled vocabularies of named entities -- Persons, Organizations, Works, Language Names, Geographica}
+\label{table:data-ne}
+%  \begin{tabu}{  p{0.2\textwidth}  p{0.2\textwidth}  p{0.2\textwidth}   p{0.2\textwidth}   p{0.2\textwidth} }
+  \begin{tabu}{  >{\sffamily}l l r X X}
+    \hline
+\rowfont{\itshape\small} name & provider & size (items / facts)  & description & access \\
+    \hline
+VIAF & OCLC + NatLibs & $\gg$ 1E7 & union of national authority files & search service, search app \\
+GND/p & DNB & 4.6E6 & Persons, universal, lang:de  & \href{http://d-nb.info/standards/elementset/gnd}{GND ontology}\\
+GND/k & '' & 1.2E6 & Organizations, universal, lang:de  & \\
+GND/w & '' & 193,000 & Works, lang:de  & \\
+GND/g & '' & 293,000 & Geographica, lang:de & \\
+ULAN & Getty & 202,720 / 638,900 & persons, artists     & \\
+TGN & Getty & 992,310 / 1.7E6 & also historical place names & \href{http://www.getty.edu/research/tools/vocabularies/index.html}{web search} \\
+%CONA & Getty & & records for cultural works & \\
+dbpedia & Wikipedia & $\sim$ 4E6 & all kinds of entities in up to 111 langs & \href{http://wiki.dbpedia.org/Downloads}{data dumps}, \href{http://dbpedia-live.openlinksw.com/sparql}{live SPARQL endpoint} \\
+& & \multicolumn{3}{l}{764,000 persons; 333,000 works; 192,000 organizations; 639,000 geographica } \\
+Yago \cite{Suchanek2007yago} & MPI Informatik & 1E7 / 1.2E8 & huge semantic KB (aggregated from Wikipedia, Wordnet, Geonames) & \href{http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html}{data dumps} \\
+\href{http://lt-world.de}{LT-World} & DFKI & 3,300 persons, 4,600 organizations & ontology-based portal for Language Technology & \href{http://www.lt-world.org/kb/}{portal} \\
+Geonames & Geonames & \textgreater 1E7 (2.8E6 / 5.5E6) & "modern" place names & data dump + web service \\
+PKND     & prometheus & \textgreater 37,000 & persons, artists & \href{http://prometheus-bildarchiv.de/de/tools/pknd}{XML dump} \\
+\href{http://gazetteer.dainst.org/}{iDAI.gazetteer} & DAI &  & archaeologically relevant places & search interface \\
+%Pelagios & AIT & 25 datasets & search over 25 datasets of archeologically relevant places & API\furl{https://github.com/pelagios/pelagios-cookbook/wiki/Using-the-Pelagios-API} \\
+\href{http://pleiades.stoa.org}{Pleiades} & & 34,000 & A community-built gazetteer and graph of ancient places & CSV, KML and RDF data dumps \\
+LCCN & LoC & \textgreater 1.2E7 & identifier for bibliographic records & \href{http://authorities.loc.gov/}{search service}, search app \\
+ISO 3166 & ISO & 249 & Official country codes, lang: en, fr &   \\
+ISO-639-1& ISO & 185 & basic language codes & \href{http://www.loc.gov/standards/iso639-2/php/English_list.php}{static list} \\
+ISO-639-3 & SIL & $\sim$ 7,679 & 3-letter code for every human language & \href{http://www-01.sil.org/iso639-3/}{view/download} \\
+CLAVAS & CLARIN & 2,500  & organization names extracted from CMD records & \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\
+\hline
+\end{tabu}
+\end{table}
+\begin{comment}
+\hline
+  \end{tabu}
+\end{table}
+\begin{table}
+\caption{Controlled vocabularies of named entities -- Geographica}
+\label{table:data-ne-places}
+%  \begin{tabu}{  p{0.2\textwidth}  p{0.2\textwidth}  p{0.2\textwidth}   p{0.2\textwidth}   p{0.2\textwidth} }
+  \begin{tabu}{  >{\sffamily}l l r X X}
+    \hline
+\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\
+\end{comment}
+\begin{table}
+\caption{Taxonomies, Classifications, Thesauri}
+\label{table:data-concepts}
+  \begin{tabu}{  >{\sffamily}l l r X X}
+    \hline
+\rowfont{\itshape\small} name & provider & size (items / facts) & description & access \\
+    \hline
+AAT & Getty & \href{http://www.getty.edu/research/tools/vocabularies/aat/aat_faq.html}{34,880 / 245,530} & subjects in  art and architecture &  \\
+LCSH & LoC &  & subjects, universal & \href{http://fast.oclc.org/searchfast/}{FAST} (Faceted Application of Subject Terminology), \href{http://experimental.worldcat.org/fast/}{Linked Data FAST} \\
+LCC  & LoC & & universal hierarchical classification & web app: \href{http://classificationweb.net/}{classification web} \\
+GND/s & DNB & 202,000 & subjects (Schlagwörter), universal, lang:de & \\
+GTAA & NISL & 3,800 & Subjects, describing TV programs & \href{http://datahub.io/de/dataset/gemeenschappelijke-thesaurus-audiovisuele-archieven}{(RDF) data dumps}, \href{https://openskos.meertens.knaw.nl/}{OpenSKOS} -- search service \\
+DDC & OCLC & & universal classification by field of study, translated in multiple languages & \href{http://dewey.info/}{dewey.info} \\
+UDC & & & & \\
+Wiki Categories & Wikipedia & 995,911& classification of Wiki articles as skos:Concepts & SKOS Vocabulary, SPARQL \\
+ DBpedia Ontology & Wikipedia & 529 / 2333 & general classification of Wiki articles as ontology & \href{http://wiki.dbpedia.org/Ontology39?v=g9b}{RDF data}, SPARQL\\
+ISOcat & (CLARIN) & \textgreater 6,500 & data categories defining (linguistic) concepts in a number of thematic groups (Metadata, Lexical Resources, ...) & \href{http://www.isocat.org}{web-app}, service \\
+Object Names Thesaurus & British Museum & &  classification of objects in the collection & \\
+Material Thesaurus & British Museum & & classification of material & \\
+Thesaurus of Monument Types & British Museum & & types of monuments & \\
+Hornbostel-Sachs-Systematik & & 300 categories & classification of musical instruments & \href{http://www.music.vt.edu/musicdictionary/texth/Hornbostel-Sachs.html}{web page} \\
+Oberbegriffsdatei  & DMB & & a set of vocabularies for museums, lang:de  & \url{museumsvokabular.de}, PDF, XML dumps\\
+Iconclass & RKD & 28,000 & taxonomy of subject of an image &  \href{http://iconclass.org/data/iconclass.20121019.nt.gz}{RDF dump} \\
+\href{http://dirt.projectbamboo.org/}{DiRT} & Project Bamboo & 32 categories & taxonomy of research tools (1,200 tools)  &  \\
+%Scholarly Methods Taxonomy & DARIAH & 100 & research activities in a 2-level hierarchy and brief scope notes & in preparation \\
+\hline
+\end{tabu}
+\end{table}
+\end{landscape}
 \begin{description}
+\item[LDC]  Linguistic Data Consortium\furl{http://www.ldc.upenn.edu/}
+\item[OTA LR] Archiving Service provided by Oxford Text Archive \url{http://ota.oucs.ox.ac.uk/}
+\item[AAT] Art \& Architecture Thesaurus, Getty
+\item[CONA] Cultural Objects Name Authority
+\item[DAI] Deutsches Archäologisches Institut
+\item[DDC] Dewey Decimal Classification
+\item[DFKI] Deutsches Forschungszentrum für Künstliche Intelligenz
+\item[DMB] Deutscher Museumsbund
+\item[DNB] Deutsche Nationalbibliothek
+\item[FAST] Faceted Application of Subject Terminology
+\item[Getty] Getty Research Institute curating the vocabularies\furl{http://www.getty.edu/research/tools/vocabularies/index.html}, part of Getty Trust
+\item[GND] \emph{Gemeinsame Normdatei} - Integrated Authority File of the German National Library
+\item[GTAA] Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
+\begin{quotation} The thesaurus consists of several facets for describing TV programs: subjects; people mentioned; named entities (Corporation names, music bands etc); locations; genres; makers and presentators. \end{quotation}
+\item[ISO] International Organization for Standardization
+\item[LCCN] Library of Congress Control Number
+\item[LCC] Library of Congress Classification
+\item[LCSH] Library of Congress Subject Headings
+\item[LoC] Library of Congress\furl{http://loc.gov}
+\item[OCLC] Online Computer Library Center\furl{http://www.oclc.org} -- world's biggest library federation
+\item[PKND] prometheus KünstlerNamensansetzungsDatei\furl{http://prometheus-bildarchiv.de/de/tools/pknd}
+\item[RKD] Rijksbureau voor Kunsthistorische Documentatie -- Netherlands Institute for Art History
+\item[TGN] Getty Thesaurus of Geographic Names
+\item[UDC] Universal Decimal Classification
+\item[ULAN] Union List of Artist Names
+\item[VIAF] Virtual International Authority File -- union of the authority files of \textgreater 20 national (and prominent research) libraries
 \end{description}
+\section{Other Metadata Catalogs/Collections}
+\label{sec:other-md-catalogs}
+\subsection{(Digital) Libraries}
+General (Libraries, Federations):
+\begin{description}
+\item[OCLC] \url{http://www.oclc.org}
+    world's biggest Library Federation
+\item[LoC] Library of Congress \url{http://www.loc.gov}
+\item[EU-Lib] European Library \url{http://www.theeuropeanlibrary.org/portal/organisation/handbook/accessing-collections\_en.htm}
+\item[europeana] virtual European library - cross-domain portal \url{http://www.europeana.eu/portal/}
+\end{description}
+\begin{comment}
 VoID (``Vocabulary of Interlinked Datasets'') is an RDF-based schema for describing linked datasets\furl{http://semanticweb.org/wiki/VoID}
+http://www.dnb.de/rdf
+\subsection{schema.org}
+http://schema.org/docs/datamodel.html
+http://www.w3.org/wiki/WebSchemas/ExternalEnumerations
+microdata or
+http://www.w3.org/TR/rdfa-lite/
+ Resource Description Framework in attributes
 the entire WorldCat cataloging collection made publicly
 …
 Web crawlers such as Google and Bing
+\subsection{schema.org}
+http://schema.org/docs/datamodel.html
+microdata or
+http://www.w3.org/TR/rdfa-lite/
+ Resource Description Framework in attributes
+\end{comment}
 \section{Summary}
+In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology
+In this chapter, we gave an overview of the existing formats and datasets in the broad context of Language Resources and Technology.
+We also gave an overview of the main formats and collections in the domain of Library and Information Services and an inventory of existing controlled vocabularies for named entities and concepts (taxonomies, classifications).

SMC4LRT/chapters/Literature.tex

-                      r3680
+                      r3681
 \subsubsection{Digital Libraries}
+\label{lit:digi-lib}
 In a broader view we should also regard the activities in the world of libraries.
 …
 \xne{The European Library}\furl{http://www.theeuropeanlibrary.org/tel4/} offers a search interface over more than 18 million digital items and almost 120 million bibliographic records from 48 National Libraries and leading European Research Libraries.
+\xne{Europeana}\furl{http://www.europeana.eu/} \cite{purday2009think} has even broader scope, serving as meta-aggregator and portal for European digitised works, encompassing material not just from libraries, but also museums, archives and all other kinds of collections (In fact, The European Library is the \emph{library aggregator} for Europeana). The auxiliary project \xne{EuropeanaConnect}\furl{http://www.europeanaconnect.eu/} (2009-2011) delivered the core technical components for Europeana as well as further services reusable in other contexts, e.g. the spatio-temporal browser \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo} \cite{janicke2013geotemco}.
+\xne{Europeana}\furl{http://www.europeana.eu/} \cite{purday2009think} is a cultural heritage initiative with even broader scope, serving as ``meta-aggregator and portal for European digitised works'', encompassing material not just from libraries, but also museums, archives and all other kinds of collections (In fact, The European Library is the \emph{library aggregator} for Europeana).
+A large number of projects have contributed to Europeana. For example, the auxiliary project \xne{EuropeanaConnect}\furl{http://www.europeanaconnect.eu/} (2009-2011) delivered the core technical components for Europeana as well as further services reusable in other contexts, such as the spatio-temporal browser \xne{GeoTemCo}\furl{https://github.com/stjaenicke/GeoTemCo} \cite{janicke2013geotemco}.
 Most recently, with \xne{Europeana Cloud}\furl{http://pro.europeana.eu/web/europeana-cloud} (2013 to 2015) a succession of \xne{Europeana} was established, a Best Practice Network, coordinated by The European Library, designed to establish a cloud-based system for Europeana and its aggregators, providing new content, new metadata, a new linked storage system, new tools and services for researchers and a new platform - Europeana Research.

SMC4LRT/chapters/Results.tex

-                      r3680
+                      r3681
 \end{table}
+DBNL\_Tekst clarin.eu:cr1:p\_1361876010678,
+clarin.eu:cr1:p 1366279029218 (private)
+%DBNL\_Tekst clarin.eu:cr1:p\_1361876010678, clarin.eu:cr1:p 1366279029218 (private)
+%
 \subsubsection{META-SHARE}
+%
+\label{reports-meta-share}
 META-SHARE created a new metadata model \cite{Gavrilidou2012meta}. Although inspired by the Component Metadata, META-SHARE metadata imposes a single large schema for all resource types with a minimal core subset of obligatory metadata elements and with many optional components.

SMC4LRT/utils.tex

-                      r3680
+                      r3681
 \usepackage{tabularx}
 \usepackage{tabu}
+\usepackage{pdflscape}
 \usepackage[singlelinecheck=off]{caption}
