wiki:CmdiCollections

Version 1 (modified by broeder, 15 years ago) (diff)

--

Introduction

The CLARIN metadata infrastructure (CMDI) describes a model to specify collections of resources [1]. The basis of CLARIN collection specification is that we can use a metadata description as the incarnation of a collection. That metadata description should then serve as an interface through which all collection wide operations are performed on the collection’s resources.

This together with the CLARIN metadata model itself offers a number of options:

  1. Have a metadata description that points directly to all resources of the collection.
  2. Have a metadata description that points to other metadata descriptions that point either to another metadata description or to a resource.
  3. Have a combination of 1 and 2.

All three options are permissible and have each their specific advantages in different use cases. We should however try to identify “core” metadata that is essential for the collection metadata to function within different CLARIN usage scenarios. Therefore we will first now describe the different use cases that have been identified by the CLARIN community.

Usage of CLARIN collection metadata

Currently we have identified the following cases:

  1. Unique identification of collections
  2. Registry of collections for future use
  3. Citation & reference of collections
  4. Searching for relevant collections
  5. Browsing the internal collection structure
  6. Extension & modification of collections
  7. Obtaining access permission for all the collection’s resources or reporting where access permissions are not (automatically) available.
  8. Usage of collections in workflow scenarios
  1. Unique identification of collections

This depends on the adoption of a suitable identifier scheme that guaranties uniqueness and also on embedding the identifier in the collection metadata description. CLARIN accepts suitable identifier scheme for resources and metadata are described in “Persistent and Unique Identifiers” (http://www.clarin.eu/filemanager/active?fid=235) . It should be emphasized that also suitable stable URIs (Cool URIs) are permissible, however special services that may be developed on the basis of PID systems will not be available for these. OAI identifiers from the OAI header in OAI records should not be accepted, they are useful for harvesting metadata and usually the OAI identifier maps on a (persistent) identifier from another more suitable identifier scheme but that cannot be relied upon. For the sake of coherence we should require a suitable identifier being available in an <Identifier> element of the collection metadata itself. It is important to notice that for collections the identifier is interpreted as identifying the collection.

2 Registry of collection metadata Collections may result from search or browsing actions making them virtual collections (VC) rather than intended “published” collections. These VCs however may be required to be citable for future use. CLARIN therefore stated its intention to provide a registry for such collections, allowing researchers to store VC descriptions that can be referred to from documents or other resources. Such registries do not impose special requirements other than the registries should be persistent, and that the collection metadata should, for administrative reasons, be able to identify the registry. Depending on the usage of the collection provenance or journaling information how the collection came into existence can be required. This is already foreseen in the basic CLARIN component metadata model, see the JournalFileProxy? element in Figure 4 of [1].

  1. Citation and referencing Collections

First of course a persistent identifier is required that references the collection’s metadata. PID For citation purposes it seems suitable to have metadata that describes purpose of the collection and identifies who is responsible for the creation and registration (of the collection description, not the resources themselves).

  1. Metadata search.

Because VCs are “derived” collections and are of a different importance than published corpora (metadata is created by the resource providers), it should be possible to filter out the VCs from metadata search. So VCs should be recognizable as such (VC flag). But when purposefully searching for suitable VCs we need the possibility to search on “purpose” and “creator”. A suitable “purpose” vocabulary should be investigated.

We can expect that intended published collections

  1. Browsing the internal collection structure.

This implicates both VC and published collection metadata. The CLARIN metadata component model supports browsing the internal structure (corpus – subcorpus hierarchies), no special metadata is needed. However browsing tools should be aware of this and enable descending into this structure and enable display of further layers of metadata.

  1. Extension & modification of collections

This point is very much connected to the versioning policy that the collection’s resources providers implement. If a resource in a VC is modified without issuing a new PID, the VC’s PID will inherit the same versioning policy. It becomes worse if the VC contains resources from providers with different versioning policies. It should be made explicit from the VC metadata what versioning policy for the VC results. A suitable vocabulary for a versioning policy metadata field should be proposed.

  1. Authorization issues

It is important that a user of a VC or published collection is able to determine if he has access to all resources and if not, what procedure should be followed to make a request for access. For individual resources and published collection we already have a “strongly recommended” metadata component [1] with such information. A VC registry tool should exploit that information and inform a VC’s prospective user about current access status and the possible steps to take to obtain access.

  1. CLARIN workflow

In order to create interoperability with CLARIN workflow mechanisms collection metadata should (1) allow extension with processing results (bundle concept in [1]) and (2) allow processing modules to analyze the collection and obtain the individual resources and the resource’s technical metadata necessary for establishing suitability and processing. The intrinsic model satisfies both requirements provided every individual resource is covered by the metadata.

Attachments (2)

Download all attachments as: .zip