Changeset 3680 for SMC4LRT/chapters/Data.tex
Timestamp: 10/04/13 22:47:37
Files: 1 edited
SMC4LRT/chapters/Data.tex
r3671 r3680 132 132 133 133 \section{Other Metadata Formats and Collections } 134 \label{sec:lrt-md-catalogs} 134 135 135 136 … … 139 140 140 141 141 \subsection{Dublin Core metadata terms + OLAC} 142 Since 1995 143 Maintained Dublin Core Metadata Initiative 144 DC, OLAC 145 146 "Dublin" refers to Dublin, Ohio, USA where the work originated during the 1995 invitational OCLC/NCSA Metadata Workshop,[8] hosted by the Online Computer Library Center (OCLC), a library consortium based in Dublin, and the National Center for Supercomputing Applications (NCSA). 147 148 comes in two version: 15 core elements and 55 qualified terms ? 149 150 \begin{quotation} 151 Early Dublin Core workshops popularized the idea of "core metadata" for simple and generic resource descriptions. The fifteen-element "Dublin Core" achieved wide dissemination as part of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and has been ratified as IETF RFC 5013, ANSI/NISO Standard Z39.85-2007, and ISO Standard 15836:2009. 152 \end{quotation} 153 154 155 156 Given its simplicity it is used as the common denominator in many applications, among others it is the base format in the OAI-PMH protocol. 157 158 It is required/expected as the base 159 openarchives register: \url{http://www.openarchives.org/Register/BrowseSites} 160 2006 OAI-repositories 161 162 DublinCore Resource Types\furl{http://dublincore.org/documents/resource-typelist/} 163 164 DublinCore to RDF mapping\furl{http://dublincore.org/documents/dcq-rdf-xml/} 165 142 \subsection{Dublin Core metadata terms} 143 The work on this metadata format started in 1995 at the Metadata Workshop\furl{http://dublincore.org/workshops/dc1/} organized by OCLC/NCSA in Dublin, Ohio, USA. It is now maintained by the Dublin Core Metadata Initiative. 
144 145 It is a fixed set of terms for a basic generic description of a range of resources (both virtual and physical), coming in two versions\furl{http://dublincore.org/documents/dcmi-terms/}: 146 \begin{description} 147 \item[Dublin Core Metadata Element Set (DCMES) ] \code{/elements/1.1/} 148 the original set of 15 terms, standardized as IETF RFC 5013, ISO Standard 15836:2009 and NISO Standard Z39.85-2007 149 \item[Dublin Core metadata terms ] \code{/terms/} 150 the extended `Qualified' set of 55 terms, which includes the original 15 (replicated in the new namespace for consistency) 151 \end{description} 152 153 Today, the Dublin Core metadata terms are very widely used. Thanks to its simplicity, the format serves as the common denominator in many applications: content management systems integrate Dublin Core for use in the \code{meta} tags of served pages (\code{<meta name="DC.Publisher" content="publisher-name" >}), and it is the default minimal description in content repositories (Fedora Commons, DSpace). It is also the obligatory base format in the OAI-PMH protocol. The OpenArchives register\furl{http://www.openarchives.org/Register/BrowseSites} lists more than 2100 data providers. 154 155 There are multiple possible serializations; in particular, a mapping to RDF is specified\furl{http://dublincore.org/documents/dcq-rdf-xml/}. 156 Worth noting is Dublin Core's take on classification of resources\furl{http://dublincore.org/documents/resource-typelist/}. 157 158 The simplicity of the format is also its main drawback when considered as a metadata format in the research communities. It is too general to capture all the specific details with which individual research groups need to describe different kinds of resources. 
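The use of Dublin Core in \code{meta} tags mentioned above can be illustrated with a minimal sketch in Python. It is an illustration under stated assumptions, not part of any content management system: the function name \code{dc\_meta\_tags} and the sample values are hypothetical, and the \code{DC.} name prefix simply follows the \code{DC.Publisher} example given in the text.

```python
from html import escape

# The 15 element names of the Dublin Core Metadata Element Set (DCMES 1.1).
DCMES_ELEMENTS = (
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
)


def dc_meta_tags(record):
    """Render a dict of DCMES element -> value as HTML meta tags.

    Hypothetical helper for illustration only; the "DC." prefix mirrors
    the DC.Publisher example in the text.
    """
    tags = []
    for element, value in record.items():
        if element not in DCMES_ELEMENTS:
            raise ValueError(f"not a DCMES element: {element}")
        name = "DC." + element.capitalize()
        # Escape the value so quotes and ampersands stay valid in the attribute.
        content = escape(value, quote=True)
        tags.append(f'<meta name="{name}" content="{content}">')
    return tags


if __name__ == "__main__":
    # Sample values are illustrative only.
    for tag in dc_meta_tags({"title": "Language", "publisher": "New York: Holt"}):
        print(tag)
```

Restricting the keys to the fifteen DCMES elements reflects the fixed nature of the element set; a description that needs more specific fields would have to use the extended terms or a domain-specific profile instead.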
159 160 \subsection{OLAC} 166 161 \label{def:OLAC} 167 162 168 \xne{OLAC Metadata}\furl{http://www.language-archives.org/}format\cite{Bird2001},OLAC \cite{Simons2003OLAC} is a more specialized version of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community: 163 \xne{OLAC Metadata}\furl{http://www.language-archives.org/} format \cite{Bird2001} is an application profile\cite{heery2000application} of the \xne{Dublin Core metadata terms}, adapted to the needs of the linguistic community. It is developed and maintained by the \xne{Open Language Archives Community}, which provides a common platform and an infrastructure for ``creating a worldwide virtual library of language resources'' \cite{Simons2003OLAC}. 164 165 The OLAC schema\furl{http://www.language-archives.org/OLAC/1.1/olac.xsd} extends the dcterms schema mainly by adding attributes with controlled vocabularies for domain-specific semantic annotation (\code{linguistic-field, linguistic-type, language, role, discourse-type}). 169 166 170 167 \begin{quotation} … … 172 169 \end{quotation} 173 170 174 The \xne{OLAC Metadata} is the set of metadata elements archives participating in have agreed to use for describing language resources. 175 176 \todoin{check http://www.language-archives.org/OLAC/metadata.html} 177 178 OLAC Archives contain over 100,000 records, covering resources in half of the world's living languages. More statistics on coverage. 179 http://www.language-archives.org/ 180 181 Most of the OLAC records are integrated into CMDI (cf. 
\ref{tab:cmd-profiles}, \ref{reports:OLAC}) 182 171 \lstset{language=XML} 172 \begin{lstlisting}[label=lst:sampleolac, caption=Sample OLAC record] 173 <olac:olac> 174 <creator>Bloomfield, Leonard</creator> 175 <date>1933</date> 176 <title>Language</title> 177 <publisher>New York: Holt</publisher> 178 </olac:olac> 179 \end{lstlisting} 180 181 OLAC provides a ``search over 100,000 records collected from 44 archives\furl{http://www.language-archives.org/archives}, covering resources in half of the world's living languages''. 182 183 Note that OLAC archives are harvested by the CLARIN harvester, and OLAC records are part of the CMDI joint metadata domain (cf. \ref{tab:cmd-profiles}, \ref{reports:OLAC}). 183 184 184 185 \subsection{TEI / teiHeader} … … 186 187 187 188 \begin{quotation} 188 The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. 189 The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form.\furl{http://www.tei-c.org/} 189 190 \end{quotation} 190 191 \url{http://www.tei-c.org/} 192 193 TEI is a de-facto standard for encoding any kind of digital textual resources being developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For the purposes of text description, metadata encoding the complex top-level element \code{teiHeader} is foreseen. TEI is not prescriptive, but rather descriptive, it does not provide just one fixed schema, but allows for a certain flexibility wrt to elements used and inner structure, allowing to generate custom schemas adopted to projects' needs. 194 195 Thus there is also not just one fixed \xne{teiHeader}. 196 197 TEI/teiHeader/ODD, 198 199 191 encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. 
192 193 \begin{quotation} 194 The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form \dots [Next to] its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics, \dots the Consortium provides a variety of TEI-related resources, training events and software. [abridged] 195 \end{quotation} 196 197 TEI is a de-facto standard for encoding any kind of digital textual resource, developed by a large community since 1994. It defines a set of elements to annotate individual aspects of the text being encoded. For text description and metadata encoding (our main concern), the complex top-level element \code{teiHeader} is provided. TEI is descriptive rather than prescriptive: it does not provide just one fixed schema, but allows for a certain flexibility with respect to the elements used and the inner structure, making it possible to generate custom schemas adapted to a project's needs. Thus there is also not just one fixed \code{teiHeader}. 198 199 Some of the data collections encoded in TEI are the DWDS corpora\furl{http://www.dwds.de}, Deutsches Textarchiv\furl{http://www.dwds.de/dta} \cite{Geyken2011deutsches} and the Oxford Text Archive\furl{http://ota.oucs.ox.ac.uk/}. 200 201 There has been intense cooperation between the TEI and CMDI communities on the issue of interoperability, and multiple efforts to express teiHeader in CMDI were undertaken (cf. \ref{results:tei}) as a starting point for integrating TEI-based data into the CLARIN infrastructure. 200 202 201 203 \subsection{ISLE/IMDI} … … 204 206 http://www.mpi.nl/imdi/ 205 207 208 \begin{quotation} 206 209 The ISLE Meta Data Initiative (IMDI) is a proposed metadata standard to describe multi-media and multi-modal language resources. 
The standard provides interoperability for browsable and searchable corpus structures and resource descriptions with help of specific tools. 210 \end{quotation} 211 212 213 \subsection{LAT, TLA} 214 Language Archiving Technology, now The Language Archive - provided by the Max Planck Institute for Psycholinguistics \footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}} 215 207 216 208 217 Predecessor of CMDI … … 213 222 214 223 Metadata Object Description Schema - is a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications. 224 225 215 226 216 227 \subsection{ESE, Europeana Data Model - EDM} … … 245 256 246 257 258 259 \subsection{META-NET} 260 261 262 263 \begin{quotation} 264 META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role. 265 266 META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.). 267 268 \end{quotation} 269 270 The distributed network of repositories consists of a number of member repositories, each offering its own subset of resources. 271 272 A few\footnote{7 as of 2013-07} of the member repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, in particular collecting the resource descriptions from other members and exposing the aggregated information to the users. 273 The whole network offers approximately 2,000 resources (the numbers differ even across individual managing nodes). 
274 275 276 MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology} 277 278 \subsection{ELRA} 279 280 European Language Resources Association\furl{http://elra.info} 281 282 \begin{quotation} 283 ELRA's missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, we offer a range of services, listed below and described in the "Services around Language Resources" section: 284 \end{quotation} 285 286 http://www.elda.org/ 287 Evaluations and Language resources Distribution Agency 288 289 ELDA - Evaluations and Language resources Distribution Agency -- is ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community. Besides, ELDA is involved in HLT evaluation campaigns. 290 291 ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA. 292 293 ELRA Catalog 294 295 http://catalog.elra.info/ 296 297 298 Universal Catalog+ 299 Universal Catalogue is a repository comprising information regarding Language Resources (LRs) identified all over the world. 
300 301 247 302 \subsection{Other} 248 303 … … 262 317 \item Persons - GND, VIAF 263 318 \item Organizations - GND, VIAF 264 \item Schlagwörter/Subjects - GND, LCSH 319 \item Schlagw\"{o}rter/Subjects - GND, LCSH 265 320 \item Resource Typology - 266 321 \end{itemize} … … 274 329 Other related relevant activities and initiatives 275 330 331 http://www.w3.org/wiki/WebSchemas/ExternalEnumerations#Controlled_property_values 332 276 333 A broader collection of related initiatives can be found at the German National Library website: 277 334 \furl{http://www.dnb.de/DE/Standardisierung/LinksAFS/linksafs_node.html} … … 281 338 http://metadaten-twr.org/ - Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011) 282 339 At MPDL, within the escidoc publication platform there seems to be (work on) a service (since 2009 !) for controlled vocabularies: \furl{http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities} 283 Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities – developed at the New Zealand Electronic Text Centre (NZETC). 340 Entity Authority Tool Set - a web application for recording, editing, using and displaying authority information about entities -- developed at the New Zealand Electronic Text Centre (NZETC). 284 341 http://eats.readthedocs.org/en/latest/ 285 342 … … 303 360 304 361 \subsubsection{LT-World} 305 Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{\textit{Deutsches Forschungszentrum für Künstliche Intelligenz} - \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study. 
\cite{Joerg2010} 306 307 308 309 \section{LRT Metadata Catalogs/Collections} 310 \label{sec:lrt-md-catalogs} 311 \todoin{Overview of catalogs, name, since, \#providers, \#resources} 312 313 \todoin{[DFKI/LT-World] - collection or ontology} 314 315 \subsection{CMDI} 316 collections, profiles/Terms, ResourceTypes! 317 318 \subsection{OLAC} 319 320 \subsection{LAT, TLA} 321 Language Archiving Technology, now The Language Archive - provided by Max Planck Insitute for Psycholinguistics \footnote{\url{http://www.mpi.nl/research/research-projects/language-archiving-technology}} 322 323 \subsection{META-NET} 324 325 326 327 \begin{quotation} 328 META-SHARE is an open, integrated, secure and interoperable sharing and exchange facility for LRs (datasets and tools) for the Human Language Technologies domain and other applicative domains where language plays a critical role. 329 330 META-SHARE is implemented in the framework of the META-NET Network of Excellence. It is designed as a network of distributed repositories of LRs, including language data and basic language processing tools (e.g., morphological analysers, PoS taggers, speech recognisers, etc.). 331 332 \end{quotation} 333 334 The distributed networks of repositories consists of a number of member repositories, that offer their own subset of resource. 335 336 A few\footnote{7 as of 2013-07} of the members repositories play the role of managing nodes providing ``a core set of services critical to the whole of the META-SHARE network''\cite{Piperidis2012meta}, especially collecting the resource descriptions from other members and exposing the aggregated information to the users. 337 The whole network offers approximately 2.000 resources (the numbers differ even across individual managing nodes). 
338 339 340 MetaShare ontology\furl{http://metashare.ilsp.gr/portal/knowledgebase/TheMetaShareOntology} 341 342 343 344 \subsection{ELRA} 345 346 European Language Resources Association 347 348 \furl{http://elra.info} 349 350 351 ELRA’s missions are to promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies. To achieve these two major missions, we offer a range of services, listed below and described in the "Services around Language Resources" section: 352 353 354 http://www.elda.org/ 355 Evaluations and Language resources Distribution Agency 356 357 ELDA - Evaluations and Language resources Distribution Agency – is ELRA’s operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT – Human Language Technology – community. Besides, ELDA is involved in HLT evaluation campaigns. 358 359 ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA. 360 361 ELRA Catalog 362 363 http://catalog.elra.info/ 364 365 366 Universal Catalog+ 367 Universal Catalogue is a repository comprising information regarding Language Resources (LRs) identified all over the world. 368 369 370 \subsection{Other} 362 Regarding existing domain-specific semantic resources \texttt{LT-World}\footnote{\url{http://www.lt-world.org/}}, the ontology-based portal covering primarily Language Technology being developed at DFKI\footnote{Deutsches Forschungszentrum für Künstliche Intelligenz, \url{http://www.dfki.de}}, is a prominent resource providing information about the entities (Institutions, Persons, Projects, Tools, etc.) in this field of study. 
\cite{Joerg2010} 363 364 371 365 372 366 … … 394 388 395 389 390 VoID ("Vocabulary of Interlinked Datasets") is an RDF-based schema to describe linked datasets\furl{http://semanticweb.org/wiki/VoID} 391 392 http://www.dnb.de/rdf 393 394 395 the entire WorldCat cataloging collection made publicly 396 available using Schema.org mark-up with library extensions for use by developers and 397 search partners such as Bing, Google, Yahoo! and Yandex 398 399 OCLC begins adding linked data to WorldCat by appending 400 Schema.org descriptive mark-up to WorldCat.org pages, thereby 401 making OCLC member library data available for use by intelligent 402 Web crawlers such as Google and Bing 403 404 405 \subsection{schema.org} 406 407 http://schema.org/docs/datamodel.html 408 409 microdata or 410 http://www.w3.org/TR/rdfa-lite/ 411 Resource Description Framework in attributes 412 396 413 397 414 \section{Summary}