Metadata Curation Taskforce

The primary concern of this taskforce is the quality of the CMD records (i.e. the instances); however, the quality of the schemas/profiles obviously plays a role as well.

Anecdotal Checks

Currently, the VLO features a link to report errors in metadata encountered during browsing.

This input is propagated to a few persons, who either try to resolve the issue directly or create a ticket in the metadata curation queue {18}.

Automatic Checks - Quality Assessment Service

There is an obvious need for systematic automatic checks of the data.

Possible aspects to check:

  1. schema validation (already performed by the harvester?)
  2. valid links (in the ResourceProxy, but also in the record body) (a sketch of checks 1 and 2 follows this list)
  3. presence of "required" data categories
  4. adherence to controlled vocabularies for selected fields/data categories
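
As a rough illustration of what checks 1 and 2 could look like, here is a minimal Python sketch. It assumes lxml and requests are available; the CMDI 1.1 namespace and the ResourceProxy XPath reflect one reading of the envelope schema, and the function names are made up for this sketch.

{{{#!python
# Minimal sketch of checks 1 and 2 - not the actual harvester code.
from lxml import etree
import requests

CMD_NS = {"cmd": "http://www.clarin.eu/cmd/"}  # CMDI 1.1 envelope namespace

def validate_record(record_path, schema_path):
    """Check 1: validate a CMD instance against its profile schema (XSD)."""
    schema = etree.XMLSchema(etree.parse(schema_path))
    doc = etree.parse(record_path)
    return [] if schema.validate(doc) else [str(e) for e in schema.error_log]

def check_links(record_path):
    """Check 2: try to resolve every ResourceProxy link via HTTP HEAD."""
    doc = etree.parse(record_path)
    broken = []
    for url in doc.xpath("//cmd:ResourceProxy/cmd:ResourceRef/text()",
                         namespaces=CMD_NS):
        try:
            resp = requests.head(url, allow_redirects=True, timeout=10)
            if resp.status_code >= 400:
                broken.append("%s -> HTTP %d" % (url, resp.status_code))
        except requests.RequestException as exc:
            broken.append("%s -> %s" % (url, exc))
    return broken
}}}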

Points 3 and 4 need to be configurable, and the community (or at least some representatives) needs to agree on which data categories are important and which are the corresponding authoritative vocabularies; a possible shape of such a configuration is sketched below.
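
Such a configuration could be as simple as a mapping from fields/data categories to required flags and vocabularies; in the sketch below all PIDs and values are placeholders, to be replaced by whatever the community agrees on.

{{{#!python
# Hypothetical configuration for checks 3 and 4; the data category PIDs
# and vocabulary values are placeholders, not agreed-upon choices.
REQUIRED_DATCATS = {
    "http://www.isocat.org/datcat/DC-0000",  # placeholder PID
    "http://www.isocat.org/datcat/DC-0001",  # placeholder PID
}

FIELD_VOCABULARIES = {
    # field/data category -> allowed values (e.g. pulled from OpenSKOS)
    "languageID": {"eng", "deu", "nld"},  # tiny ISO 639-3 excerpt
}

def check_vocabulary(field, value):
    """Check 4: accept a value only if the field's vocabulary contains it;
    fields without a configured vocabulary are not constrained."""
    vocab = FIELD_VOCABULARIES.get(field)
    return vocab is None or value in vocab
}}}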

Also, the check has to be able to operate on two levels, schema and instance, i.e. it must be able to assess the quality of a profile as well as of an instance. (And it should take the result of the profile assessment into account when assessing the corresponding instance.) Different aspects apply to the two levels (links are not present on the schema level, but data categories are, etc.).
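
One conceivable way to fold the profile assessment into the instance assessment is a weighted combination, as in the sketch below; the 70/30 weighting is an arbitrary assumption, not an agreed formula.

{{{#!python
# Sketch: combine a profile-level assessment with an instance-level one.
from dataclasses import dataclass, field

@dataclass
class Assessment:
    score: float                       # 0.0 (worst) .. 1.0 (best)
    issues: list = field(default_factory=list)

def assess_instance(instance, profile, profile_weight=0.3):
    # The weighting is illustrative only.
    combined = (1 - profile_weight) * instance.score \
               + profile_weight * profile.score
    return Assessment(combined, instance.issues + profile.issues)
}}}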

Ideally, the processing should be encapsulated as a service that can be called both by the data providers (i.e. the metadata editors) directly during metadata creation, and by the harvester when post-processing the data collected from the repositories.
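
Such a service could be as small as a single HTTP endpoint wrapping the checks. The sketch below uses Flask purely for illustration; run_all_checks() is a hypothetical aggregator of checks 1-4, not an existing function.

{{{#!python
# Illustrative only: one endpoint that editors and the harvester can call.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/assess", methods=["POST"])
def assess():
    record_xml = request.data            # the posted CMD record (XML)
    report = run_all_checks(record_xml)  # hypothetical aggregator of checks 1-4
    return jsonify(report)

if __name__ == "__main__":
    app.run(port=8080)
}}}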

The output of the quality check could be fed into the VLO (e.g. as a recommendation/ranking widget in the individual MD-record view), but should primarily be available as a report to the content providers, so that they can see what needs to be fixed in their data. As such, it has ties to the monitoring work package.

Curation of individual facets/fields/data categories

An alternative to inspecting individual records is to go by individual data categories or fields, the goal being harmonization/normalization of the values. A notorious example is the dozens of variants of "MPI for Psycholinguistics" in the CMD records. Obvious candidates for inspection are the fields used in the VLO, but this can be gradually extended (e.g. SizeInfo, Availability/Licensing). This task requires – at least partially – manual work, although there is certainly plenty of room for sophisticated automatic preprocessing, as sketched below.
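
To give an idea of such preprocessing, the following sketch clusters spelling variants with simple fuzzy matching (difflib from the standard library; the cutoff is a tuning assumption), leaving the final merge decisions to a curator.

{{{#!python
# Sketch: group near-identical field values so a curator only has to
# confirm (or split) the proposed groups.
import difflib

def group_variants(names, cutoff=0.85):
    groups = {}
    for name in names:
        match = difflib.get_close_matches(name, list(groups), n=1, cutoff=cutoff)
        groups.setdefault(match[0] if match else name, []).append(name)
    return groups

variants = [
    "MPI for Psycholinguistics",
    "MPI f. Psycholinguistics",
    "Max Planck Institute for Psycholinguistics",
]
print(group_variants(variants))
# Variants the fuzzy match misses (e.g. the spelled-out form) end up in
# their own group - hence the remaining manual work mentioned above.
}}}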

A possible workflow has already been tested within the CLAVAS project (Vocabulary Service for CLARIN) for organisation names: the list of organisation names taken from the VLO was manually (pre-)normalized and imported into the vocabulary repository (OpenSKOS), so that individual organizations (and departments) are now modelled as skos:Concept and their spelling variants as skos:altLabel. This initial import is just a bootstrapping step to seed the vocabulary. Further work (further normalization, adding new entities) happens in the Vocabulary Repository/Editor, which allows collaborative work, provides a REST API, etc. The Vocabulary Repository also already provides the list of language codes (ISO 639-3) and the closed data categories (with their values) from ISOcat, and it is open to support further vocabularies. (See also the meeting on OpenSKOS and controlled vocabularies.)
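
In code, the SKOS modelling of one organisation and its variants looks roughly as follows (an rdflib sketch; the concept URI is a placeholder, not the real OpenSKOS identifier).

{{{#!python
# Sketch: an organisation as skos:Concept, spelling variants as altLabels.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import RDF, SKOS

g = Graph()
org = URIRef("http://example.org/vocab/org/mpi-psl")  # placeholder URI
g.add((org, RDF.type, SKOS.Concept))
g.add((org, SKOS.prefLabel,
       Literal("Max Planck Institute for Psycholinguistics", lang="en")))
g.add((org, SKOS.altLabel, Literal("MPI for Psycholinguistics", lang="en")))
g.add((org, SKOS.altLabel, Literal("MPI f. Psycholinguistics", lang="en")))
print(g.serialize(format="turtle"))
}}}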

Once a normalized vocabulary exists (though it will never be complete), there are two options for using it (a toy sketch of both follows the list):

  1. Metadata authoring tools use the vocabulary repository as a source for suggest functionality (autocomplete) for the user. This would require an (optional) extension to the definition of CMD elements to carry information about the location of a vocabulary for a given metadata element/field. (E.g., AFAIK, MPI's MD editor Arbil is planned to support the OpenSKOS vocabularies.)
  2. Exploitation tools (like the VLO) apply the normalization to the existing data.
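
Both options boil down to a lookup against the vocabulary's labels. In the toy sketch below, the variant-to-preferred-label mapping is hard-coded; in reality it would be fetched from the vocabulary repository's REST API.

{{{#!python
# Sketch covering both options: prefix matching for autocomplete (1) and
# the same mapping used to normalize already existing values (2).
LABELS = {  # variant -> preferred label (toy data, normally from OpenSKOS)
    "MPI for Psycholinguistics": "Max Planck Institute for Psycholinguistics",
    "MPI f. Psycholinguistics": "Max Planck Institute for Psycholinguistics",
    "Max Planck Institute for Psycholinguistics":
        "Max Planck Institute for Psycholinguistics",
}

def suggest(prefix, limit=5):
    """Option 1: preferred labels whose variants match the user's input."""
    hits = {pref for variant, pref in LABELS.items()
            if variant.lower().startswith(prefix.lower())}
    return sorted(hits)[:limit]

def normalize(value):
    """Option 2: map a stored value to its preferred label, if known."""
    return LABELS.get(value, value)

print(suggest("mpi"))
print(normalize("MPI f. Psycholinguistics"))
}}}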

CMDI 1.2 will integrate the possibility to refer to vocabularies, and to concepts within them, which would allow the MD records to be enriched right away.

Integration of foreign metadata formats

This is the third area of interest for the MD curation task.

Aspiring to encompass all resources in the LRT domain, the CMD community is eager to incorporate existing MD formats. The approach is to create CMD profiles that remodel these foreign formats (binding the fields to data categories!) and to provide routines that transform data from the original formats into instances of the corresponding CMD profiles, as sketched below.
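
Applying such a conversion routine can be as simple as running an XSLT stylesheet over the original record; below is a minimal lxml sketch, where the stylesheet path stands in for whichever conversion (e.g. OLAC-to-CMDI) is being used.

{{{#!python
# Sketch: transform a record in a foreign format into a CMD instance.
from lxml import etree

def to_cmdi(record_path, stylesheet_path):
    transform = etree.XSLT(etree.parse(stylesheet_path))
    result = transform(etree.parse(record_path))
    return etree.tostring(result, pretty_print=True, encoding="unicode")
}}}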

This has been done for Dublin Core/OLAC and IMDI, and there is also work on TEI (teiHeader) and METASHARE (resourceInfo). However, there are currently multiple profiles for most of these external formats. One challenge is therefore to unify/harmonize the different approaches where possible, or at least to clarify why multiple profiles are needed and to offer guidance to future projects and modellers as to which profile is best suited for which scenario.

See also the CMDI Interoperability Workshop (with presentations on DC/OLAC, TEI, METASHARE).

One tool for exploring and inspecting the profiles is the SMC Browser. It compiles information from the Component Registry and ISOcat into a graph, visualizing the reuse of components and data categories. It is planned to hook it up with the instance data, which would make it possible to see where which profiles, components, and data categories are actually used.
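
A toy illustration of the kind of graph involved (networkx; the node names are invented, not taken from the Component Registry):

{{{#!python
# Sketch: profiles, components and data categories as nodes, usage as
# edges; nodes with more than one incoming edge make the reuse visible.
import networkx as nx

g = nx.DiGraph()
g.add_edge("profile:OLAC-DcmiTerms", "component:title")  # invented names
g.add_edge("profile:teiHeader", "component:title")
g.add_edge("component:title", "datcat:resourceTitle")

print([n for n in g.nodes if g.in_degree(n) > 1])
}}}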