Data.tex

Timestamp:

10/04/13 22:47:37 (11 years ago)

Author:

vronk

Message:

adding Schema Matching info and application

File:

: 1 edited

SMC4LRT/chapters/Data.tex (modified) (12 diffs)


SMC4LRT/chapters/Data.tex

-                      r3671
+                      r3680
 \section{Other Metadata Formats and Collections }
+\label{sec:lrt-md-catalogs}
 …
+\subsection{Dublin Core metadata terms + OLAC}
+Since 1995
+Maintained Dublin Core Metadata Initiative
+DC, OLAC
+"Dublin" refers to Dublin, Ohio, USA where the work originated during the 1995 invitational OCLC/NCSA Metadata Workshop, hosted by the Online Computer Library Center (OCLC), a library consortium based in Dublin, and the National Center for Supercomputing Applications (NCSA).
+comes in two versions: 15 core elements and 55 qualified terms ?
+\begin{quotation}
+Early Dublin Core workshops popularized the idea of "core metadata" for simple and generic resource descriptions. The fifteen-element "Dublin Core" achieved wide dissemination as part of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and has been ratified as IETF RFC 5013, ANSI/NISO Standard Z39.85-2007, and ISO Standard 15836:2009.
+\end{quotation}
+Given its simplicity it is used as the common denominator in many applications, among others it is the base format in the OAI-PMH protocol.
+It is required/expected as the base
+openarchives register: \url{http://www.openarchives.org/Register/BrowseSites}
+OAI-repositories
+DublinCore Resource Types\furl{http://dublincore.org/documents/resource-typelist/}
+DublinCore to RDF mapping\furl{http://dublincore.org/documents/dcq-rdf-xml/}
+\subsection{Dublin Core metadata terms}
+The work on this metadata format started in 1995 at the Metadata Workshop\furl{http://dublincore.org/workshops/dc1/} organized by OCLC/NCSA in Dublin, Ohio, USA. It is now maintained by the Dublin Core Metadata Initiative.
+It is a fixed set of terms for a basic generic description of a range of resources (both virtual and physical), coming in two versions\furl{http://dublincore.org/documents/dcmi-terms/}:
+\begin{description}
+\item[Dublin Core Metadata Element Set (DCMES) ] \code{/elements/1.1/}
+the original set of 15 terms, standardized as IETF RFC 5013, ISO Standard 15836:2009 and ANSI/NISO Standard Z39.85-2007
+\item[Dublin Core metadata terms ] \code{/terms/}
+the extended `Qualified' set of 55 terms, which includes the original 15 (replicated in the new namespace for consistency)
+\end{description}
+Today, the Dublin Core metadata terms are very widely used. Thanks to their simplicity they serve as the common denominator in many applications: content management systems integrate Dublin Core into the \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description in content repositories (Fedora-commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
+There are multiple possible serializations; in particular, a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
+Worth noting is Dublin Core's take on classification of resources\furl{http://dublincore.org/documents/resource-typelist/}.
+The simplicity of the format is also its main drawback when considered as a metadata format in the research communities: it is too general to capture all the specific details with which individual research groups need to describe different kinds of resources.
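As an illustration of Dublin Core's role as the base format in OAI-PMH, the following minimal sketch builds a \code{ListRecords} request for the \code{oai\_dc} metadata prefix and reads the DCMES elements out of a record. The base URL in the usage note and the inline sample record are hypothetical, and the helper names \code{list\_records\_url} and \code{dc\_fields} are illustrative only; nothing is fetched over the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of the 15-element set (the /elements/1.1/ namespace named above).
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample record in the oai_dc container format (assumption, not harvested data).
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Language</dc:title>
  <dc:creator>Bloomfield, Leonard</dc:creator>
  <dc:date>1933</dc:date>
</oai_dc:dc>"""


def list_records_url(base_url):
    """Build an OAI-PMH ListRecords request asking for Dublin Core (oai_dc)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    return base_url + "?" + query


def dc_fields(xml_text):
    """Collect the Dublin Core elements of one record as {element name: [values]}."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # ElementTree exposes qualified names as "{namespace}local-name".
        if child.tag.startswith("{" + DC_NS + "}"):
            name = child.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields
```

For example, \code{list\_records\_url("http://example.org/oai")} yields the request URL for a hypothetical data provider, and \code{dc\_fields(SAMPLE)} returns the title, creator and date of the sample record. Values are kept in lists because every Dublin Core element is repeatable.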
+\subsection{OLAC}
 \label{def:OLAC}
+\xne{OLAC Metadata}\furl{http://www.language-archives.org/}format\cite{Bird2001},OLAC \cite{Simons2003OLAC} is a more specialized version of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community:
+\xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, providing a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.
+The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field, linguistic-type, language, role, discourse-type}).
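The attribute-based extension can be sketched as follows, assuming the OLAC 1.1 convention of marking an element with \code{xsi:type} (naming the vocabulary) and \code{olac:code} (the controlled value). The sample record and its code values are hypothetical, and the helper name \code{controlled\_values} is illustrative only.

```python
import xml.etree.ElementTree as ET

# Attribute names as ElementTree sees them: "{namespace}local-name".
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"
OLAC_CODE = "{http://www.language-archives.org/OLAC/1.1/}code"

# Hypothetical record; the code values are chosen for illustration only.
SAMPLE = """<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Language</dc:title>
  <dc:subject xsi:type="olac:linguistic-field" olac:code="general_linguistics"/>
  <dc:language xsi:type="olac:language" olac:code="eng"/>
</olac:olac>"""


def controlled_values(xml_text):
    """Return (vocabulary, code) pairs for elements carrying an OLAC extension."""
    root = ET.fromstring(xml_text)
    pairs = []
    for element in root:
        vocabulary = element.get(XSI_TYPE)
        code = element.get(OLAC_CODE)
        # Plain Dublin Core elements (such as dc:title here) have neither attribute.
        if vocabulary and code:
            pairs.append((vocabulary, code))
    return pairs
```

Elements without the two attributes remain ordinary Dublin Core, which is why a generic Dublin Core consumer can still read the record while ignoring the OLAC refinements.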
 \begin{quotation}
 …
 \end{quotation}
+The \xne{OLAC Metadata} is the set of metadata elements that archives participating in OLAC have agreed to use for describing language resources.
+\todoin{check http://www.language-archives.org/OLAC/metadata.html}
+ OLAC Archives contain over 100,000 records, covering resources in half of the world's living languages. More statistics on coverage.
+http://www.language-archives.org/
+Most of the OLAC records are integrated into CMDI (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC})
+\lstset{language=XML}
+\begin{lstlisting}[label=lst:sampleolac, caption=Sample OLAC record]
+<olac:olac>
+   <creator>Bloomfield, Leonard</creator>
+   <date>1933</date>
+   <title>Language</title>
+   <publisher>New York: Holt</publisher>
+</olac:olac>
+\end{lstlisting}
+OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''.
+Note that OLAC archives are harvested by the CLARIN harvester, and OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).
 \subsection{TEI / teiHeader}
 …
 \begin{quotation}
 The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form.
+The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form.\furl{http://www.tei-c.org/}
 \end{quotation}
+\url{http://www.tei-c.org/}
+TEI is a de-facto standard for encoding any kind of digital textual resources, developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description and metadata encoding, the complex top-level element \code{teiHeader} is foreseen. TEI is descriptive rather than prescriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and the inner structure, so that custom schemas adapted to projects' needs can be generated.
+Thus there is also not just one fixed \xne{teiHeader}.
+ TEI/teiHeader/ODD,
+encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics.
+\begin{quotation}
+ The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots [Next to] its chief deliverable, a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abridged]
+\end{quotation}
+TEI is a de-facto standard for encoding any kind of digital textual resources, developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description and metadata encoding (of main concern for us), the complex top-level element \code{teiHeader} is foreseen. TEI is descriptive rather than prescriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and the inner structure, so that custom schemas adapted to projects' needs can be generated. Thus there is also not just one fixed \code{teiHeader}.
+Some of the data collections encoded in TEI are the corpora of the DWDS\furl{http://www.dwds.de}, Deutsches Textarchiv\furl{http://www.dwds.de/dta} \cite{Geyken2011deutsches} and the Oxford Text Archive\furl{http://ota.oucs.ox.ac.uk/}.
+There has been intense cooperation between the TEI and CMDI communities on the issue of interoperability, and multiple efforts to express teiHeader in CMDI were undertaken (cf. \ref{results:tei}) as a starting point for integrating TEI-based data into the CLARIN infrastructure.
 \subsection{ISLE/IMDI}
 …
 http://www.mpi.nl/imdi/
+\begin{quotation}
 The ISLE Meta Data Initiative (IMDI) is a proposed metadata standard to describe multi-media and multi-modal language resources. The standard provides interoperability for browsable and searchable corpus structures and resource descriptions with help of specific tools.
+\end{quotation}
+\subsection{LAT, TLA}
+Language Archiving Technology, now The Language Archive - provided by the Max Planck Institute for Psycholinguistics \footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}}
 Predecessor of CMDI
 …
 Metadata Object Description Schema - is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications.
 \subsection{ESE, Europeana Data Model - EDM}
 …
+\subsection{META-NET}
+\begin{quotation}
+META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
+META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).
+\end{quotation}
+The distributed network of repositories consists of a number of member repositories, each offering its own subset of resources.
+A few\footnote{7 as of 2013-07} of the member repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
+The whole network offers approximately 2,000 resources (the numbers differ even across individual managing nodes).
+MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
+\subsection{ELRA}
+European Language Resources Association\furl{http://elra.info}
+\begin{quotation}
+ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, we offer a range of services, listed below and described in the "Services around Language Resources" section:
+\end{quotation}
+http://www.elda.org/
+Evaluations and Language resources Distribution Agency
+ELDA - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community. Besides, ELDA is involved in HLT evaluation campaigns.
+ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.
+ELRA Catalog
+http://catalog.elra.info/
+Universal Catalog+
+ Universal Catalogue is a repository comprising information regarding Language Resources (LRs) identified all over the world.
 \subsection{Other}
 …
 \item Persons - GND, VIAF
 \item Organizations - GND, VIAF
 \item SchlagwÃ¶rter/Subjects - GND, LCSH
+\item Schlagw\"{o}rter/Subjects - GND, LCSH
 \item Resource Typology -
 \end{itemize}
 …
 Other related relevant activities and initiatives
+http://www.w3.org/wiki/WebSchemas/ExternalEnumerations#Controlled_property_values
 A broader collection of related initiatives can be found at the German National Library website:
 \furl{http://www.dnb.de/DE/Standardisierung/LinksAFS/linksafs_node.html}
 …
 http://metadaten-twr.org/ - Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)
 At MPDL, within the escidoc publication platform there seems to be (work  on) a service (since 2009 !) for controlled vocabularies: \furl{http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities}
 Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities â developed at the New Zealand Electronic Text Centre (NZETC).
+Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities -- developed at the New Zealand Electronic Text Centre (NZETC).
 http://eats.readthedocs.org/en/latest/
 …
 \subsubsection{LT-World}
+Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{\textit{Deutsches Forschungszentrum f\"{u}r K\"{u}nstliche Intelligenz} - \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study \cite{Joerg2010}.
+\section{LRT Metadata Catalogs/Collections}
+\label{sec:lrt-md-catalogs}
+\todoin{Overview of catalogs, name, since, \#providers, \#resources}
+\todoin{[DFKI/LT-World]  - collection or ontology}
+\subsection{CMDI}
+collections, profiles/Terms, ResourceTypes!
+\subsection{OLAC}
+\subsection{LAT, TLA}
+Language Archiving Technology, now The Language Archive - provided by the Max Planck Institute for Psycholinguistics \footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}}
+\subsection{META-NET}
+\begin{quotation}
+META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
+META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).
+\end{quotation}
+The distributed network of repositories consists of a number of member repositories, each offering its own subset of resources.
+A few\footnote{7 as of 2013-07} of the member repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
+The whole network offers approximately 2,000 resources (the numbers differ even across individual managing nodes).
+MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
+\subsection{ELRA}
+European Language Resources Association
+\furl{http://elra.info}
+ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, we offer a range of services, listed below and described in the "Services around Language Resources" section:
+http://www.elda.org/
+Evaluations and Language resources Distribution Agency
+ELDA - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community. Besides, ELDA is involved in HLT evaluation campaigns.
+ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.
+ELRA Catalog
+http://catalog.elra.info/
+Universal Catalog+
+ Universal Catalogue is a repository comprising information regarding Language Resources (LRs) identified all over the world.
+\subsection{Other}
+Regarding existing domain-specific semantic resources, \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum f\"{u}r K\"{u}nstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study \cite{Joerg2010}.
 …
+VoID (``Vocabulary of Interlinked Datasets'') is an RDF-based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID}
+http://www.dnb.de/rdf
+the entire WorldCat cataloging collection made publicly
+available using Schema.org mark-up with library extensions for use by developers and
+search partners such as Bing, Google, Yahoo! and Yandex
+OCLC begins adding linked data to WorldCat by appending
+Schema.org descriptive mark-up to WorldCat.org pages, thereby
+making OCLC member library data available for use by intelligent
+Web crawlers such as Google and Bing
+\subsection{schema.org}
+http://schema.org/docs/datamodel.html
+microdata or
+http://www.w3.org/TR/rdfa-lite/
+ Resource Description Framework in attributes
 \section{Summary}
