Version 5 (modified by Twan Goosen, 10 years ago)

This page is a subpage of CMDI 1.2

External vocabularies in CMDI 1.2: Executive summary

This page provides an executive summary of the issue and proposed solution fully described in CMDI 1.2/Vocabularies.

Issue description

The objective is to utilise external vocabularies as value domains for CMDI elements and CMDI attributes. While the solution on the model level should be generic and service-agnostic, it will be designed specifically to work with the OpenSKOS-based CLAVAS vocabulary service.

CLAVAS

OpenSKOS is a vocabulary service through which users can publish, manage and use vocabulary data in the SKOS format. The data can be accessed through a set of publicly available RESTful APIs. Some Dublin Core elements are also included, but not indexed. CLAVAS is CLARIN-NL's application of OpenSKOS, hosted by the Meertens Institute (contact is Hennie Brugman?). The vocabularies will be available through a set of web services hosted at Meertens, supporting exploration, search and autocomplete. More information is available in the CLAVAS work plan document.

The proposed workflow is that metadata modellers can associate a vocabulary (identified by its URI) with an element in their components and profiles. The metadata creator will then be able to pick values from the specified vocabulary or (in the case of open vocabularies) still choose to use a custom value that does not appear in the vocabulary. Editors like Arbil need to be extended to access the CLAVAS public API for retrieving potential values from vocabularies.

Currently, the following vocabularies are available in CLAVAS:

  • Languages
  • Organisations
  • All public ISOcat categories (only simple ones?)

Description of proposed solution

This section describes the solution as proposed at the Taskforce Meeting in Utrecht on 2014-2-12.

There are two ways of using the OpenSKOS vocabularies in CMDI:

  • Importing vocabularies as closed value domains for a CMD_Element or Attribute. Since the vocabulary items are enumerated explicitly as a choice list in the element or attribute in question, validation is possible.
  • Using one or a combination of OpenSKOS vocabularies for dynamic lookup and retrieval of values for a CMD_Element or Attribute. Here a non-exclusive (open) use of items from the vocabulary must be assumed, as validation against such external vocabularies is not practicable.
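
The difference between the two modes can be sketched in a few lines of Python. This is only an illustration: the function name `accept_value` and the sample language codes are assumptions made for this example and are not part of the proposal.

```python
from typing import Iterable, Optional


def accept_value(value: str, enumeration: Optional[Iterable[str]] = None) -> bool:
    """Decide whether a value is acceptable for an element or attribute.

    enumeration given -> closed value domain: the value must be one of the items.
    enumeration None  -> open vocabulary: any non-empty value is accepted,
                         because the external vocabulary is not validated against.
    """
    if enumeration is None:
        return bool(value.strip())
    return value in set(enumeration)


closed_domain = ["nld", "eng", "deu"]  # hypothetical imported items
print(accept_value("nld", closed_domain))  # True
print(accept_value("xyz", closed_domain))  # False
print(accept_value("xyz"))                 # True (open use, custom value)
```

In the open case a tool would typically still offer the vocabulary items as suggestions; it just cannot reject a value that is absent from them.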

Schema changes

The following changes to the General Component Schema accommodate vocabulary use for both CMD_Element and Attribute:

  • New element Vocabulary in ValueScheme
  • Vocabulary optionally has an enumeration element. If present, it defines an internal, closed vocabulary (imported or locally specified). If enumeration is not present, the Vocabulary is considered external and open, and should be accessed through the API by tools that support this.
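
As a rough sketch of how a tool might tell the two cases apart, the following Python reads a ValueScheme fragment and checks for an enumeration child. The XML fragments are hypothetical: the exact nesting, the `item` element name and the example.org URIs are assumptions for illustration, not the actual General Component Schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments of a component specification (assumed layout).
CLOSED = """
<ValueScheme>
  <Vocabulary URI="http://example.org/vocab/languages" ValueProperty="notation">
    <enumeration>
      <item>nld</item>
      <item>eng</item>
    </enumeration>
  </Vocabulary>
</ValueScheme>
"""

OPEN = """
<ValueScheme>
  <Vocabulary URI="http://example.org/vocab/languages" ValueProperty="prefLabel"/>
</ValueScheme>
"""


def vocabulary_mode(value_scheme_xml: str) -> str:
    """Return 'closed' if Vocabulary has an enumeration child, 'open' if it
    has none, and 'none' if the ValueScheme contains no Vocabulary at all."""
    root = ET.fromstring(value_scheme_xml.strip())
    vocabulary = root.find("Vocabulary")
    if vocabulary is None:
        return "none"
    return "closed" if vocabulary.find("enumeration") is not None else "open"


print(vocabulary_mode(CLOSED))  # closed
print(vocabulary_mode(OPEN))    # open
```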

Attributes for <Vocabulary>

  • @URI (mandatory)
  • @ValueProperty (mandatory; which field of the vocabulary items to return, typically prefLabel)
  • @ValueLanguage (optional; preferred language of the item field value)
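
The mandatory/optional split of these attributes could be checked as follows. This is a minimal sketch: the class `VocabularyRef`, the function `read_vocabulary_attributes` and the example URI are hypothetical names introduced here, and only the three attribute names come from the list above.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VocabularyRef:
    uri: str                              # @URI (mandatory)
    value_property: str                   # @ValueProperty (mandatory)
    value_language: Optional[str] = None  # @ValueLanguage (optional)


def read_vocabulary_attributes(attrs: Mapping[str, str]) -> VocabularyRef:
    """Build a VocabularyRef, rejecting input that lacks a mandatory attribute."""
    missing = [name for name in ("URI", "ValueProperty") if not attrs.get(name)]
    if missing:
        raise ValueError("missing mandatory attribute(s): " + ", ".join(missing))
    return VocabularyRef(attrs["URI"], attrs["ValueProperty"], attrs.get("ValueLanguage"))


ref = read_vocabulary_attributes(
    {"URI": "http://example.org/vocab/languages", "ValueProperty": "prefLabel"}
)
print(ref.value_property, ref.value_language)
```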

Instance changes

An attribute ValueConceptLink (in the CMDI namespace) will be allowed on fields that have a vocabulary linked. It holds the URI of the selected value, semantically marking the chosen option.
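
To illustrate what this could look like in an instance, the sketch below builds a field carrying the attribute. The namespace URI, the `cmd` prefix, the element name and the concept URI are all placeholders chosen for this example; in particular, `http://example.org/cmd` is not the real CMDI namespace.

```python
import xml.etree.ElementTree as ET

CMD_NS = "http://example.org/cmd"  # placeholder, NOT the real CMDI namespace URI
ET.register_namespace("cmd", CMD_NS)


def make_field(name, value, concept_uri=None):
    """Create an instance field; attach ValueConceptLink only when the value
    was picked from a vocabulary (custom values carry no link)."""
    field = ET.Element(name)
    field.text = value
    if concept_uri is not None:
        field.set("{%s}ValueConceptLink" % CMD_NS, concept_uri)
    return field


field = make_field("language", "Dutch", "http://example.org/vocab/languages/nld")
print(ET.tostring(field, encoding="unicode"))
```

A custom value entered under an open vocabulary would simply be created without the third argument, leaving the attribute out.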

Impact on tools

  • Metadata editors must facilitate vocabulary lookup. Arbil, as the most generic editor, should be prioritized.
  • Component Registry must facilitate import of vocabularies. The interface for specifying value domains for elements and attributes must be updated.
  • Discovery services (VLO among others) could assist users through vocabularies, e.g. with vocabulary-based browsing.

Comments/concerns

The proposed solution allows abuse to a certain degree, and it is vital to describe and motivate good practice before bad practice proliferates. The main concern relates to the possibility of importing vocabularies as controlled value ranges for CMD_Element and Attribute.

Avoiding multiplication of large vocabularies in CR

Since imported vocabularies are to be part of elements, and elements are not reusable, great care must be taken that large enumeration lists are not duplicated across components. One way of achieving this is to:

  1. consider which vocabularies are likely to be relevant in many profiles
  2. for each concept property that is relevant as ValueProperty for some element in CR, define a component in CR containing one element only and import the property values of the vocabulary concepts as its closed value domain.
    • Example: The component iso-language-639-3 contains one element only, iso-639-3-code, taking values from a controlled vocabulary of language codes. (With the proposed 1.2 model, and given the CLAVAS vocabulary of languages, ValueProperty would be set to "notation".) Some modellers may prefer to store the language names instead of or in addition to codes. To make sure this can be reused independently of language codes, another component containing a language name element (with ValueProperty=prefLabel) should be defined.
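
The cost of embedding the enumeration directly can be made concrete with a little arithmetic. The sketch below is illustrative only; the function name and the figures (a vocabulary of 7,000 items used in 25 components) are made up for the example.

```python
def stored_enumeration_items(vocabulary_size: int, components_using_it: int, wrapped: bool) -> int:
    """Count enumeration items held in the registry for one vocabulary.

    wrapped=True : the enumeration lives in a single one-element component
                   that the other components reuse, so it is stored once.
    wrapped=False: every component carries its own copy of the enumeration.
    """
    if wrapped:
        return vocabulary_size
    return vocabulary_size * components_using_it


print(stored_enumeration_items(7000, 25, wrapped=False))  # 175000
print(stored_enumeration_items(7000, 25, wrapped=True))   # 7000
```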

Note: If the same effect is to be obtained for Attributes, they will also have to be wrapped separately in a component.

  • Example: Consider the component OLAC-DcmiTerms. Here all elements have a similar DcmiType attribute, with the entire value list enumerated for each occurrence. If this is to be avoided in the proposed 1.2 model, elements and attributes must be wrapped separately into components throughout.

Importing partial vocabularies hampers reuse

The proposed model does not force the modeller to import entire vocabularies only; it is possible to import just a subset of a larger vocabulary. For example, in a specific language element, the component creator may choose to import only the languages relevant to their own user community. Such practice should be discouraged, as it renders the component unusable for anyone who needs access to more or other languages, even though the component might otherwise be perfectly suitable. Not supporting partial imports while retaining the external vocabulary reference in the Component Registry should, however, drastically limit the number of such occurrences.
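
A registry or review tool could flag partial imports by comparing an element's enumeration with the full vocabulary it refers to. A minimal sketch, assuming both lists are already available locally (the function name and the sample codes are hypothetical):

```python
def missing_items(imported, full_vocabulary):
    """Return the vocabulary items absent from an imported enumeration,
    sorted for stable reporting. An empty list means a complete import."""
    return sorted(set(full_vocabulary) - set(imported))


full = ["deu", "eng", "fra", "nld"]  # stand-in for the complete vocabulary
partial = ["eng", "nld"]             # subset imported by a component creator

print(missing_items(partial, full))  # ['deu', 'fra']
print(missing_items(full, full))     # []
```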