Timestamp:
10/04/13 22:47:37 (11 years ago)
Author:
vronk
Message:

adding Schema Matching info and application

File:
1 edited

  • SMC4LRT/chapters/Data.tex

    r3671 r3680  
    132132
    133133\section{Other Metadata Formats and Collections }
     134\label{sec:lrt-md-catalogs}
    134135
    135136
     
    139140
    140141
    141 \subsection{Dublin Core metadata terms + OLAC}
    142 Since 1995
    143 Maintained Dublin Core Metadata Initiative
    144 DC, OLAC
    145 
    146 "Dublin" refers to Dublin, Ohio, USA where the work originated during the 1995 invitational OCLC/NCSA Metadata Workshop,[8] hosted by the Online Computer Library Center (OCLC), a library consortium based in Dublin, and the National Center for Supercomputing Applications (NCSA).
    147 
    148 comes in two version: 15 core elements  and 55 qualified terms ?
    149 
    150 \begin{quotation}
    151 Early Dublin Core workshops popularized the idea of "core metadata" for simple and generic resource descriptions. The fifteen-element "Dublin Core" achieved wide dissemination as part of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and has been ratified as IETF RFC 5013, ANSI/NISO Standard Z39.85-2007, and ISO Standard 15836:2009.
    152 \end{quotation}
    153 
    154 
    155 
    156 Given its simplicity it is used as the common denominator in many applications, among others it is the base format in the OAI-PMH protocol.
    157 
    158 It is required/expected as the base
    159 openarchives register: \url{http://www.openarchives.org/Register/BrowseSites}
    160  2006 OAI-repositories
    161 
    162 DublinCore Resource Types\furl{http://dublincore.org/documents/resource-typelist/}
    163 
    164 DublinCore to RDF mapping\furl{http://dublincore.org/documents/dcq-rdf-xml/}
    165 
     142\subsection{Dublin Core metadata terms}
     143The work on this metadata format started in 1995 at the Metadata Workshop\furl{http://dublincore.org/workshops/dc1/} organized by OCLC/NCSA in Dublin, Ohio, USA. It is now maintained by the Dublin Core Metadata Initiative.
     144
     145It is a fixed set of terms for a basic, generic description of a range of resources (both virtual and physical) that comes in two versions\furl{http://dublincore.org/documents/dcmi-terms/}:
     146\begin{description}
     147\item[Dublin Core Metadata Element Set (DCMES) ] \code{/elements/1.1/}
     148the original set of 15 terms, standardized as IETF RFC 5013, ISO Standard 15836:2009 and ANSI/NISO Standard Z39.85-2007
     149\item[Dublin Core metadata terms ] \code{/terms/}
     150the `Qualified' set of 55 terms, which extends the original 15 (replicating them in the new namespace for consistency)
     151\end{description}
     152
     153Today, the Dublin Core metadata terms are very widely used. Thanks to their simplicity, they serve as the common denominator in many applications: content management systems embed Dublin Core in the \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description in content repositories (Fedora Commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers.
     154
     155There are multiple possible serializations; in particular, a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}.
     156Worth noting is Dublin Core's take on classification of resources\furl{http://dublincore.org/documents/resource-typelist/}.
     157
     158The simplicity of the format is also its main drawback when considered as a metadata format for the research communities. It is too general to capture all the specific details that individual research groups need in order to describe different kinds of resources.
     159
     160\subsection{OLAC}
    166161\label{def:OLAC}
    167162
    168 \xne{OLAC Metadata}\furl{http://www.language-archives.org/}format\cite{Bird2001},OLAC \cite{Simons2003OLAC} is a more specialized version of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community:
     163The \xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile \cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, which provides a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}.
     164
     165The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field, linguistic-type, language, role, discourse-type}).
    169166
    170167\begin{quotation}
     
    172169\end{quotation}
    173170
    174 The \xne{OLAC Metadata} is the set of metadata elements archives participating in have agreed to use for describing language resources.
    175 
    176 \todoin{check http://www.language-archives.org/OLAC/metadata.html}
    177 
    178  OLAC Archives contain over 100,000 records, covering resources in half of the world's living languages. More statistics on coverage.
    179 http://www.language-archives.org/
    180 
    181 Most of the OLAC records are integrated into CMDI (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC})
    182 
     171\lstset{language=XML}
     172\begin{lstlisting}[label=lst:sampleolac, caption=Sample OLAC record]
     173<olac:olac xmlns:olac="http://www.language-archives.org/OLAC/1.1/" xmlns="http://purl.org/dc/elements/1.1/">
     174   <creator>Bloomfield, Leonard</creator>
     175   <date>1933</date>
     176   <title>Language</title>
     177   <publisher>New York: Holt</publisher>
     178</olac:olac>
     179\end{lstlisting}
     180
     181OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''.
     182
     183Note that OLAC archives are harvested by the CLARIN harvester and OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}).
    183184
    184185\subsection{TEI / teiHeader}
     
    186187
    187188\begin{quotation}
    188 The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. 
     189The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form.\furl{http://www.tei-c.org/}
    189190\end{quotation}
    190 
    191 \url{http://www.tei-c.org/}
    192 
    193 TEI is a de-facto standard for encoding any kind of digital textual resources being developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description, metadata encoding the complex top-level element \code{teiHeader} is foreseen. TEI is not prescriptive, but rather descriptive, it does not provide just one fixed schema, but allows for a certain flexibility wrt to elements used and inner structure, allowing to generate custom schemas adopted to projects' needs.
    194 
    195 Thus there is also not just one fixed \xne{teiHeader}.
    196 
    197  TEI/teiHeader/ODD,
    198 
    199 
     191encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics.
     192
     193\begin{quotation}
     194 The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots  [Next to] its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abridged]
     195\end{quotation}
     196
     197TEI is a de facto standard for encoding any kind of digital textual resource and has been developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description and metadata encoding (our main concern here), the complex top-level element \code{teiHeader} is provided. TEI is descriptive rather than prescriptive: it does not provide just one fixed schema, but allows a certain flexibility with respect to the elements used and their inner structure, so that custom schemas adapted to a project's needs can be generated. Thus there is also not just one fixed \code{teiHeader}.
     198
     199Data collections encoded in TEI include the corpora of the DWDS\furl{http://www.dwds.de}, the Deutsches Textarchiv\furl{http://www.dwds.de/dta} \cite{Geyken2011deutsches} and the Oxford Text Archive\furl{http://ota.oucs.ox.ac.uk/}.
     200
     201There has been intense cooperation between the TEI and CMDI communities on the issue of interoperability, and multiple efforts to express \code{teiHeader} in CMDI were undertaken (cf. \ref{results:tei}) as a starting point for integrating TEI-based data into the CLARIN infrastructure.
    200202
    201203\subsection{ISLE/IMDI}
     
    204206http://www.mpi.nl/imdi/
    205207
     208\begin{quotation}
    206209The ISLE Meta Data Initiative (IMDI) is a proposed metadata standard to describe multi-media and multi-modal language resources. The standard provides interoperability for browsable and searchable corpus structures and resource descriptions with help of specific tools.
     210\end{quotation}
     211
     212
     213\subsection{LAT, TLA}
     214Language Archiving Technology, now The Language Archive - provided by the Max Planck Institute for Psycholinguistics \footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}}
     215
    207216
    208217Predecessor of CMDI
     
    213222
    214223Metadata Object Description Schema - is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications.
     224
     225
    215226
    216227\subsection{ESE, Europeana Data Model - EDM}
     
    245256
    246257
     258
     259\subsection{META-NET}
     260
     261
     262
     263\begin{quotation}
     264META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
     265
     266META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).
     267
     268\end{quotation}
     269
     270The distributed network of repositories consists of a number of member repositories, each of which offers its own subset of resources.
     271
     272A few\footnote{7 as of 2013-07} of the member repositories play the role of managing nodes, providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, in particular collecting the resource descriptions from other members and exposing the aggregated information to the users.
     273The whole network offers approximately 2,000 resources (the numbers differ even across individual managing nodes).
     274
     275
     276MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
     277
     278\subsection{ELRA}
     279
     280European Language Resources Association\furl{http://elra.info}
     281
     282\begin{quotation}
     283ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, we offer a range of services, listed below and described in the "Services around Language Resources" section:
     284\end{quotation}
     285
     286http://www.elda.org/
     287Evaluations and Language resources Distribution Agency
     288
     289ELDA - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community. Besides, ELDA is involved in HLT evaluation campaigns.
     290
     291ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.
     292
     293ELRA Catalog
     294
     295http://catalog.elra.info/
     296
     297
     298Universal Catalog+
     299 Universal Catalogue is a repository comprising information regarding Language Resources (LRs) identified all over the world.
     300
     301
    247302\subsection{Other}
    248303
     
    262317\item Persons - GND, VIAF
    263318\item Organizations - GND, VIAF
    264 \item Schlagwörter/Subjects - GND, LCSH
     319\item Schlagw\"{o}rter/Subjects - GND, LCSH
    265320\item Resource Typology -
    266321\end{itemize}
     
    274329Other related relevant activities and initiatives
    275330
     331http://www.w3.org/wiki/WebSchemas/ExternalEnumerations#Controlled_property_values
     332
    276333A broader collection of related initiatives can be found at the German National Library website:
    277334\furl{http://www.dnb.de/DE/Standardisierung/LinksAFS/linksafs_node.html}
     
    281338http://metadaten-twr.org/ - Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)
    282339At MPDL, within the escidoc publication platform there seems to be (work  on) a service (since 2009 !) for controlled vocabularies: \furl{http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities}
    283 Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities – developed at the New Zealand Electronic Text Centre (NZETC).
     340Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities -- developed at the New Zealand Electronic Text Centre (NZETC).
    284341http://eats.readthedocs.org/en/latest/
    285342
     
    303360
    304361\subsubsection{LT-World}
    305 Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}},  the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{\textit{Deutsches Forschungszentrum für Künstliche Intelligenz} - \url{http://www.dfki.de}},  is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study. \cite{Joerg2010}
    306 
    307 
    308 
    309 \section{LRT Metadata Catalogs/Collections}
    310 \label{sec:lrt-md-catalogs}
    311 \todoin{Overview of catalogs, name, since, \#providers, \#resources}
    312 
    313 \todoin{[DFKI/LT-World]  - collection or ontology}
    314 
    315 \subsection{CMDI}
    316 collections, profiles/Terms, ResourceTypes!
    317 
    318 \subsection{OLAC}
    319 
    320 \subsection{LAT, TLA}
    321 Language Archiving Technology, now The Language Archive - provided by Max Planck Insitute for Psycholinguistics \footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}}
    322 
    323 \subsection{META-NET}
    324 
    325 
    326 
    327 \begin{quotation}
    328 META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role.
    329 
    330 META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.).
    331 
    332 \end{quotation}
    333 
    334 The distributed networks of repositories consists of a number of member repositories, that offer their own subset of resource.
    335 
    336 A few\footnote{7 as of 2013-07} of the members repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users.
    337 The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes).
    338 
    339 
    340 MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology}
    341 
    342 
    343 
    344 \subsection{ELRA}
    345 
    346 European Language Resources Association
    347 
    348 \furl{http://elra.info}
    349 
    350 
    351 ELRA’s missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, we offer a range of services, listed below and described in the "Services around Language Resources" section:
    352 
    353 
    354 http://www.elda.org/
    355 Evaluations and Language resources Distribution Agency
    356 
    357 ELDA - Evaluations and Language resources Distribution Agency – is ELRA’s operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT – Human Language Technology – community. Besides, ELDA is involved in HLT evaluation campaigns.
    358 
    359 ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA.
    360 
    361 ELRA Catalog
    362 
    363 http://catalog.elra.info/
    364 
    365 
    366 Universal Catalog+
    367  Universal Catalogue is a repository comprising information regarding Language Resources (LRs) identified all over the world.
    368 
    369 
    370 \subsection{Other}
     362Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}},  the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum f\"{u}r K\"{u}nstliche Intelligenz, \url{http://www.dfki.de}},  is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study. \cite{Joerg2010}
     363
     364
    371365
    372366
     
    394388
    395389
     390VoID (``Vocabulary of Interlinked Datasets'') is an RDF-based schema for describing linked datasets\furl{http://semanticweb.org/wiki/VoID}.
     391
     392http://www.dnb.de/rdf
     393
     394
     395the entire WorldCat cataloging collection made publicly
     396available using Schema.org mark-up with library extensions for use by developers and
     397search partners such as Bing, Google, Yahoo! and Yandex
     398
     399OCLC begins adding linked data to WorldCat by appending
     400Schema.org descriptive mark-up to WorldCat.org pages, thereby
     401making OCLC member library data available for use by intelligent
     402Web crawlers such as Google and Bing
     403
     404
     405\subsection{schema.org}
     406
     407http://schema.org/docs/datamodel.html
     408
     409microdata or
     410http://www.w3.org/TR/rdfa-lite/
     411 Resource Description Framework in attributes
     412
    396413
    397414\section{Summary}