
Metadata Curation

Some notes on the issue of metadata curation from the stance of the Curation Taskforce.

The primary concern of this taskforce is the quality of the CMD records (i.e. instances); the quality of the schemas/profiles also plays a role.

The ultimate goal, however, is to improve resource discovery, since that is what metadata is mainly for.

Anecdotal Checks

Currently, the VLO features a link for reporting metadata errors encountered during browsing.

These reports are forwarded to a few people, who either try to resolve them directly or create a ticket in the metadata curation queue {18}.

Automatic Checks - Quality Assessment Service

Ensuring certain quality aspects (such as schema validation and link resolution) requires systematic, regular, automatic checks.

Hence the proposal for a Metadata Quality Assessment Service.

For more details see MDQAS.
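As a rough illustration of what one such automatic check might look like, the following sketch performs a purely syntactic pre-check on resource references before any actual link resolution is attempted. The function name `check_ref` and the decision to accept `http`, `https` and `hdl:` references are assumptions made for this example; they do not describe the MDQAS proposal itself.

```python
from typing import Optional
from urllib.parse import urlsplit


def check_ref(ref: str) -> Optional[str]:
    """Return a short problem description, or None if the reference
    looks syntactically resolvable (hypothetical helper)."""
    ref = ref.strip()
    if not ref:
        return "empty reference"
    parts = urlsplit(ref)
    if parts.scheme in ("http", "https"):
        # An absolute web URL needs a host part.
        return None if parts.netloc else "missing host"
    if parts.scheme == "hdl":
        # Assumption: handle PIDs written as hdl:<prefix>/<suffix> are accepted.
        return None if parts.path else "empty handle"
    return f"unsupported or missing scheme: {parts.scheme!r}"


if __name__ == "__main__":
    for ref in ["https://example.org/a.wav", "hdl:1839/00-0000", "corpus/file.wav"]:
        print(ref, "->", check_ref(ref) or "ok")
```

A check of this kind only catches malformed references; whether a link actually resolves would still have to be tested by a request against the target.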

Value Normalization - Controlled Vocabularies

We are confronted with massive variability in the element values, which seriously affects the usability of the faceted search. The main reasons are spelling variations, the use of synonyms, and differing interpretations of the semantics of a given field: different projects use different vocabularies or conceptual domains in the same element (or, rather, possibly in different elements that are linked to the same data category).

There is a lot we can do about it. The most important measure is the systematic use of common vocabularies: ideally already during metadata authoring, realistically in the curation step after harvesting and before ingestion into the catalogue.
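A minimal sketch of how such a normalization step could work in the curation stage: raw facet values are folded to a comparison key and looked up in a variant-to-preferred-label map. The map `LANGUAGE_VOCAB` and the function names are hypothetical and serve only as an illustration; a real setup would draw its mappings from a shared controlled vocabulary.

```python
import unicodedata
from typing import Dict, Tuple

# Hypothetical variant -> preferred label map for a "language" facet.
LANGUAGE_VOCAB: Dict[str, str] = {
    "english": "English",
    "eng": "English",
    "en": "English",
    "german": "German",
    "deutsch": "German",
    "deu": "German",
}


def fold(value: str) -> str:
    """Build a comparison key: Unicode NFKC, case-folded, whitespace collapsed."""
    return " ".join(unicodedata.normalize("NFKC", value).casefold().split())


def normalize(value: str, vocab: Dict[str, str]) -> Tuple[str, bool]:
    """Return (label, matched). Unknown values are kept (trimmed) and
    flagged as unmatched so that they can be reviewed by a curator."""
    key = fold(value)
    if key in vocab:
        return vocab[key], True
    return value.strip(), False


if __name__ == "__main__":
    for raw in ["  ENGLISH ", "Deutsch", "eng", " Klingon "]:
        print(repr(raw), "->", normalize(raw, LANGUAGE_VOCAB))
```

Keeping unmatched values instead of discarding them means that no information is lost, and the list of unmatched values shows which variants the vocabulary still has to cover.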

See more under Value Normalization.

Granularity

The CMD records describe all kinds of resources, in particular also aggregated resources (collections, corpora, ...). CMD also features an elaborate mechanism for modelling recursive collections. However, it seems rather challenging to accommodate this information in a faceted search setup. Thus, currently only the simple literal value of the header element MdCollectionDisplayName is used as a facet. In addition, collections are mostly not included in the VLO, because the import is restricted to records that have at least one ResourceProxy of type Resource, SearchPage or LandingPage. A collection usually has only Metadata ResourceProxies (links to the collection items). (@VLO-team: correct me if I'm wrong.)
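The import restriction described above can be sketched roughly as follows. This is not the VLO importer's code: the function names and the two sample records are made up for illustration, and elements are matched by local name only, so the sketch does not depend on a particular CMD namespace.

```python
import xml.etree.ElementTree as ET
from typing import Set

# Proxy types that qualify a record for import, as listed in the text above.
IMPORTABLE_TYPES = {"Resource", "SearchPage", "LandingPage"}


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def proxy_types(cmd_xml: str) -> Set[str]:
    """Collect the text of all ResourceType elements in a record."""
    root = ET.fromstring(cmd_xml)
    return {
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == "ResourceType" and el.text
    }


def is_importable(cmd_xml: str) -> bool:
    """True if at least one proxy has an importable type."""
    return bool(proxy_types(cmd_xml) & IMPORTABLE_TYPES)


# Made-up minimal records: a collection with Metadata proxies only,
# and an item with a LandingPage proxy.
COLLECTION = """<CMD xmlns="http://www.clarin.eu/cmd/"><Resources><ResourceProxyList>
<ResourceProxy id="p1"><ResourceType>Metadata</ResourceType>
<ResourceRef>http://example.org/item1.cmdi</ResourceRef></ResourceProxy>
</ResourceProxyList></Resources></CMD>"""

ITEM = COLLECTION.replace(">Metadata<", ">LandingPage<")

if __name__ == "__main__":
    print("collection importable:", is_importable(COLLECTION))
    print("item importable:", is_importable(ITEM))
```

Under this rule a record whose proxies are all of type Metadata is skipped, which is why collections rarely end up in the catalogue.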

Integration of foreign metadata formats

Aspiring to encompass all resources in the LRT domain, the CMD community is eager to incorporate existing MD formats. The approach is to create CMD profiles that remodel these foreign formats (binding the fields to data categories!) and to provide routines that transform data from the original formats into instances of the corresponding CMD profiles.

This has been done for Dublin Core/OLAC and IMDI, and there is also work on TEI (teiHeader) and METASHARE (resourceInfo). However, there are currently multiple profiles for most of these external formats. So one challenge is to unify or harmonize the different approaches where possible, or at least to clarify why multiple profiles are needed and to offer guidance to future projects and modellers as to which profile is best suited for which scenario.

See also CmdiInterop, CMDI Interoperability Workshop (with presentations on DC/OLAC, TEI, METASHARE).

One tool for exploring and inspecting the profiles is the SMC Browser. It compiles the information from the Component Registry and ISOcat into a graph that visualizes the reuse of components and data categories. It is planned to link it to the instance data, which would make it possible to see which profiles, components and data categories are used where.

Curation in the broader picture

The following diagram sketches a tentative generic processing workflow for metadata from harvesting to presentation in the catalogue, with the Curation module at the core of the process:

Sketch of a generic processing workflow for metadata from harvesting via curation to presentation

Last modified on 10/22/14 17:00:20