
(See also implementation notes for MDRepository)

CLARIN Metadata Repository

The metadata repository contains metadata descriptions harvested from the CLARIN metadata providers. It does not do any semantic interpretation of the metadata descriptions, but offers an API that allows other software components to search for and retrieve them.

A CLARIN metadata repository should use the OAI-PMH protocol to harvest the CLARIN metadata providers and store the harvested metadata records. The metadata providers to be harvested can be found in the CLARIN metadata provider registry [see Appendix ].
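One harvesting step over OAI-PMH can be sketched as follows. This is a minimal illustration and not the repository's actual harvester: the function names, the provider URL `http://example.org/oai`, the `cmdi` metadata prefix and the sample response are assumptions made for the example. Only the `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments and the OAI-PMH 2.0 namespace come from the protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build the ListRecords request URL for one harvesting step."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, payload_element_or_None), ...], next_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("oai:ListRecords/oai:record", OAI_NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
        metadata = rec.find("oai:metadata", OAI_NS)
        # Deleted records carry a header but no metadata element.
        payload = metadata[0] if metadata is not None and len(metadata) else None
        records.append((identifier, payload))
    token = root.findtext("oai:ListRecords/oai:resumptionToken", namespaces=OAI_NS)
    token = token.strip() if token else None
    return records, (token or None)


# Hypothetical response, shortened to the elements used above.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:rec1</identifier></header>
      <metadata><CMD xmlns="http://www.clarin.eu/cmd/"><Header/></CMD></metadata>
    </record>
    <resumptionToken>tok-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

records, next_token = parse_list_records(SAMPLE_RESPONSE)
```

A harvester would fetch `list_records_url(...)`, store each payload, and repeat with the returned token until no token is left.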

As a service for “other” metadata service providers, the CLARIN metadata repository should implement its own metadata provider using OAI-PMH.

To give software components within the same VM efficient access to the metadata repository services, a Java API should be provided. REST and SOAP API implementations should be available for remote software components.

All metadata descriptions are XML records, so it is natural to use XPath & XQuery as a specification format to search for specific descriptions.
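As an illustration of a purely structural search, the sketch below runs one XPath expression over a handful of records and returns the identifiers of those that match. The record contents, the element names and the `search` function are hypothetical. Python's `xml.etree.ElementTree` implements only a subset of XPath (the `[.='text']` predicate needs Python 3.7 or later), so this stands in for, and does not replace, the full XPath/XQuery support a native XML database would offer.

```python
import xml.etree.ElementTree as ET

# Hypothetical records with different schemas, keyed by record identifier.
RECORDS = {
    "rec1": "<record><Language>Dutch</Language><Title>Corpus A</Title></record>",
    "rec2": "<record><Language>German</Language><Title>Corpus B</Title></record>",
    "rec3": "<resource><info><Language>Dutch</Language></info></resource>",
}


def search(records, xpath, namespaces=None):
    """Return the ids of all records in which the XPath matches at least one element."""
    hits = []
    for record_id, xml_text in records.items():
        root = ET.fromstring(xml_text)
        if root.findall(xpath, namespaces):
            hits.append(record_id)
    return hits


dutch_records = search(RECORDS, ".//Language[.='Dutch']")
```

The query matches on element names and text only; no mapping between the two record schemas is involved.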

Repository Statistics services

The CLARIN metadata repository also offers some basic statistics about its holdings, for example: how many metadata records are available in the repository, broken down per metadata provider; when a particular provider was last harvested; and how often, if at all, problems occurred during harvesting.
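These figures could be derived from a simple harvest log, as in the sketch below. The `HarvestRun` structure and the `summarize` function are hypothetical, and the example assumes that `records_stored` counts the records newly added by a run, so that summing over successful runs gives the holdings per provider.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class HarvestRun:
    provider: str
    finished: datetime
    records_stored: int          # assumption: records newly added by this run
    error: Optional[str] = None  # None means the run succeeded


def summarize(runs):
    """Aggregate record counts, last successful harvest and problem counts per provider."""
    stats = {}
    for run in runs:
        entry = stats.setdefault(
            run.provider,
            {"records": 0, "last_harvest": None, "problems": 0, "runs": 0},
        )
        entry["runs"] += 1
        if run.error is not None:
            entry["problems"] += 1
            continue  # failed runs do not contribute records
        entry["records"] += run.records_stored
        if entry["last_harvest"] is None or run.finished > entry["last_harvest"]:
            entry["last_harvest"] = run.finished
    return stats
```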

Configuration

Configuration options for the metadata repository should include (a) the registry specification for the CLARIN metadata providers list, (b) a (black)list of providers that can overrule the complete list in the registry, and (c) the harvesting frequency.
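A minimal sketch of how these three options might fit together is given below. The class name, field names, default interval and URLs are all hypothetical; the only behaviour taken from the text is that the blacklist overrules the provider list obtained from the registry.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RepositoryConfig:
    registry_url: str                                # (a) where the provider list comes from
    blacklist: frozenset = frozenset()               # (b) providers never to harvest
    harvest_interval: timedelta = timedelta(days=1)  # (c) harvesting frequency


def providers_to_harvest(registry_providers, config):
    """Return the registry's providers minus the blacklisted ones, keeping registry order."""
    return [p for p in registry_providers if p not in config.blacklist]


config = RepositoryConfig(
    registry_url="http://example.org/registry",  # hypothetical URL
    blacklist=frozenset({"http://example.org/broken-provider"}),
    harvest_interval=timedelta(hours=12),
)
```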

Implementation

Multiple implementations of the metadata repository are possible. First we would like to investigate the performance of the native XML database “eXist”. An XML database is expected to be well suited to our case, where we have to deal with large numbers of XML records (100k to 1M) complying with many different schemas (<1000).

Alternative Implementations

                        | eXist   | Zebra/Z39.50/jzkit | solr
query language          | XPath   | X-PQF(?), CQL      | superset of lucene-search
harvest                 | ?       | OAI                | OAI(?)
performance/scalability | ?       | "more than ten gigabytes of data, tens of millions of records" | ?
document model          | any xml | customisable       | doc/field (one table)
faceted browser         | no      | yes(?)             | yes

DC (Dublin Core) Indexes

The OAI-PMH specification requires metadata providers to make a DC (Dublin Core) record available for every metadata record. For compatibility purposes we rely on the CLARIN metadata repository to provide a DC metadata service.

Logical sets

The OAI-PMH protocol (version 2) allows metadata providers to encode the existence of (hierarchical) sets of resources. For repositories that offer hierarchical sets, the OAI-PMH “ListSets” verb can be used to query the set structure, for instance for browsing purposes. Unfortunately OAI-PMH does not permit querying the set structure step by step; the “ListSets” verb returns the complete set structure at once.

Archives using IMDI metadata usually offer a browsable structure that can be expressed in such hierarchical sets and conveyed in the “ListSets” response.
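Because the set structure arrives in one piece, a client has to rebuild the hierarchy itself. In OAI-PMH each `setSpec` is a colon-separated path from the root of the set hierarchy, so a browsable tree can be reconstructed as sketched below. The function names and the sample response are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_sets(xml_text):
    """Return {setSpec: setName} for every set in a ListSets response."""
    root = ET.fromstring(xml_text)
    sets = {}
    for node in root.findall("oai:ListSets/oai:set", OAI_NS):
        spec = node.findtext("oai:setSpec", namespaces=OAI_NS)
        name = node.findtext("oai:setName", namespaces=OAI_NS)
        sets[spec] = name
    return sets


def build_tree(set_specs):
    """Turn colon-separated setSpec paths into a nested dict of sub-sets."""
    tree = {}
    for spec in set_specs:
        node = tree
        for part in spec.split(":"):
            node = node.setdefault(part, {})
    return tree


# Hypothetical response, shortened to the elements used above.
SAMPLE_SETS = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>corpora</setSpec><setName>Corpora</setName></set>
    <set><setSpec>corpora:dutch</setSpec><setName>Dutch corpora</setName></set>
    <set><setSpec>corpora:german</setSpec><setName>German corpora</setName></set>
  </ListSets>
</OAI-PMH>
"""

set_names = parse_sets(SAMPLE_SETS)
set_tree = build_tree(set_names)
```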

The CLARIN metadata infrastructure also offers a method to specify collections or sets (see Appendix D. Hierarchical Collections and Sets).

Metadata Provenance

?

APIs

There are three classes of services:

  • Repository services: Storing and extracting metadata records
  • Statistics: Giving information about the harvesting history and the status of the repository
  • XML search services: Enable searching for specific metadata records based on the XML structure and content. This service does NOT make use of any semantic mapping or translation.
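The first class of services, storing and extracting records, can be illustrated with a small in-memory stand-in. This is a sketch under assumptions and not the planned Java, REST or SOAP API: the class and method names are hypothetical, and rejecting records that are not well-formed XML is a design choice made for the example.

```python
import xml.etree.ElementTree as ET


class InMemoryRepository:
    """Hypothetical stand-in for the repository's store/retrieve services."""

    def __init__(self):
        self._records = {}

    def store(self, record_id, xml_text):
        """Store a record; raises ET.ParseError if it is not well-formed XML."""
        ET.fromstring(xml_text)  # well-formedness check only, no schema validation
        self._records[record_id] = xml_text

    def retrieve(self, record_id):
        """Return the stored XML text, or None if the record is unknown."""
        return self._records.get(record_id)

    def __len__(self):
        return len(self._records)
```

The record text is stored unchanged, in keeping with the repository's role of holding descriptions without interpreting them.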

The different services of the Metadata Repository component and their interaction with the other components are shown in fig. 1.

Last modified on 11/14/10 18:25:45
