Version 1 (modified by Twan Goosen, 10 years ago)

bootstrapped page based on selection from CMDI 1.2/Vocabularies

This page is a subpage of CMDI 1.2

External vocabularies in CMDI 1.2: Executive summary

This page provides an executive summary of the issue and proposed solution fully described in CMDI 1.2/Vocabularies.

Issue description

The issue concerns using external vocabularies as value domains for CMDI elements and CMDI attributes (i.e. Attribute elements), and more specifically how to do this using the OpenSKOS-based CLAVAS vocabulary service.

This has been described and explored on the page CmdiClavasIntegration. The issue description below is based on the one found on that page, and subsequent discussions within the CMDI Taskforce.

CLAVAS

OpenSKOS is a vocabulary service through which users can publish, manage and use SKOS-ified vocabulary data. The data can be accessed through a publicly available RESTful API. At this point only basic SKOS is supported. Some Dublin Core elements are also included, but not indexed.

CLAVAS is CLARIN-NL's application of OpenSKOS. It is hosted by the Meertens Institute; the contact is Hennie Brugman. The vocabularies will be available through a set of web services hosted at Meertens. Vocabularies map to 'concept schemes' in OpenSKOS. More information is available in the CLAVAS work plan document.

This page covers integration of CLAVAS with CMDI. The proposed workflow is that metadata modelers can associate a vocabulary (identified by its URI) with an element in their components and profiles. The metadata creator will then be able to pick values from the specified vocabulary or (in the case of open vocabularies) still choose to use a custom value that does not appear in the vocabulary. Editors like Arbil need to be extended to access the CLAVAS public API for retrieving potential values from vocabularies.

PUBLIC API, FIND CONCEPTS, AUTOCOMPLETE

Available data

Solution description (proposed)

This section describes the solution as proposed at the Utrecht Taskforce Meeting on 21.2.2014.

There are two main ways of using the OpenSKOS vocabularies in CMDI:

  • Importing vocabularies as closed value domains for CMD_Element or Attribute. Since the vocabulary items are enumerated explicitly as a choice list in the elements in question, validation is possible.
  • Using one or a combination of OpenSKOS vocabularies for dynamic lookup and retrieval of values for a CMD_Element or Attribute. Here a non-exclusive (open) use of items from the vocabulary must be assumed, as validation against such external vocabularies is not practicable.

Schema changes

The following changes to the General Component Schema accommodate vocabulary use for both CMD_Element and Attribute:

  • New element <Vocabulary> in <ValueScheme>
  • <Vocabulary> may have an <enumeration> element. If it does, the vocabulary is internal and closed (imported or locally specified). If there is no <enumeration>, the vocabulary is considered external and is used as a lookup mechanism.

Attributes for <Vocabulary>

  • @URI
  • @ValueProperty (which field of the vocabulary items to return, typically prefLabel)
  • @ValueLanguage (preferred language of the item field value)

General Component Schema changes

<xs:complexType name="ValueScheme_type">
        <xs:choice>
            <xs:element name="pattern" type="xs:string" maxOccurs="1">
                <xs:annotation>
                    <xs:documentation>Specification of a regular expression the element should
                        comply with.</xs:documentation>
                </xs:annotation>
            </xs:element>
            <xs:element name="Vocabulary" type="Vocabulary_type">
                <xs:annotation>
                    <xs:documentation>Specification of an open or closed vocabulary</xs:documentation>
                </xs:annotation>
            </xs:element>
        </xs:choice>
    </xs:complexType>

    <xs:complexType name="Vocabulary_type">
         <xs:sequence>
            <xs:element name="enumeration" type="enumeration_type" minOccurs="0" maxOccurs="1">
                <xs:annotation>
                    <xs:documentation>A list of the allowed values of a controlled
                        vocabulary.</xs:documentation>
                </xs:annotation>
            </xs:element>
        </xs:sequence>
        <xs:attribute name="URI" type="xs:anyURI"/>
        <xs:attribute name="ValueProperty" type="xs:string"/> <!-- optionally selects a label -->
        <xs:attribute name="ValueLanguage" type="xs:string"/> <!-- optionally selects a language -->
    </xs:complexType>

CMD_Element example

       <CMD_Element 
            name="Language"
            CardinalityMax="1" 
            CardinalityMin="1"> 
            <ValueScheme>
                <Vocabulary
                    URI="http://openskos.org/api/languages"
                    ValueProperty="skos:prefLabel"
                    ValueLanguage="en">
                    <enumeration>
                        <item ConceptLink="http://cdb.iso.org/lg/CDB-00138580-001">Dutch</item>
                        <item ConceptLink="http://cdb.iso.org/lg/CDB-00138512-001">French</item>
                        ....
                    </enumeration>
                </Vocabulary>
            </ValueScheme>
        </CMD_Element>
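
For comparison, a minimal sketch of the external (open) case under the same proposed schema: the <Vocabulary> element carries no <enumeration>, so a tool would treat the URI as a lookup source rather than as a list to validate against. The element name and attribute values are simply reused from the example above for illustration; this fragment is an assumption and is not taken from an actual component.

```xml
<CMD_Element
    name="Language"
    CardinalityMax="1"
    CardinalityMin="1">
    <ValueScheme>
        <!-- No <enumeration> child: the vocabulary is external and open,
             so values outside the vocabulary remain allowed. -->
        <Vocabulary
            URI="http://openskos.org/api/languages"
            ValueProperty="skos:prefLabel"
            ValueLanguage="en"/>
    </ValueScheme>
</CMD_Element>
```

Since <enumeration> is declared with minOccurs="0" in Vocabulary_type, both this form and the closed form fit the same type definition.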

Impact on tools

  • Metadata editors must facilitate vocabulary lookup. Arbil, as the most generic editor, should be prioritized.
  • The Component Registry must facilitate import of vocabularies. The interface for specifying value domains for elements and attributes must be updated.
  • Discovery services (the VLO among others) could assist users through vocabularies, e.g. with vocabulary-based browsing.

Comments/concerns

The proposed solution allows abuse to a certain degree, and it is vital to describe and motivate good practices before bad practice proliferates. The main concern is the possibility of importing vocabularies as controlled value ranges for CMD_Element and Attribute.

Avoiding multiplication of large vocabularies in CR

Since imported vocabularies are to be part of elements, and elements are not reusable, great care must be taken so that large enumeration lists are not duplicated across components. One way of achieving this is:

  1. to consider which vocabularies are likely to be relevant in many profiles
  2. for each concept property that is relevant as ValueProperty for some element in the CR, define a component in the CR containing one element only and import the property values of the vocabulary concepts as its closed value domain.
    • Example: The component iso-language-639-3 contains one element only, iso-639-3-code, taking values from a controlled vocabulary of language codes. (With the proposed 1.2 model, and given the CLAVAS vocabulary of languages, ValueProperty would have been set to "notation".) Some modelers may prefer to store the language names instead of or in addition to codes. To make sure this can be reused independently of language codes, another component containing a language name element (with ValueProperty=prefLabel) should be defined.
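
The wrapping described above could look roughly like the following fragment. It is a sketch only: the component iso-language-639-3 and its element iso-639-3-code are named in the text, but the second component name (language-name), its element name, the cardinality values, the vocabulary URI (reused from the CMD_Element example) and the skos: prefix on the property names are assumptions made for illustration. The imported items are left out rather than invented.

```xml
<!-- Component holding only the language code (ValueProperty = notation) -->
<CMD_Component name="iso-language-639-3" CardinalityMin="1" CardinalityMax="1">
    <CMD_Element name="iso-639-3-code" CardinalityMin="1" CardinalityMax="1">
        <ValueScheme>
            <Vocabulary URI="http://openskos.org/api/languages"
                        ValueProperty="skos:notation">
                <enumeration>
                    <!-- imported code values omitted -->
                </enumeration>
            </Vocabulary>
        </ValueScheme>
    </CMD_Element>
</CMD_Component>

<!-- Separate, hypothetical component holding only the language name
     (ValueProperty = prefLabel), reusable independently of the codes -->
<CMD_Component name="language-name" CardinalityMin="1" CardinalityMax="1">
    <CMD_Element name="name" CardinalityMin="1" CardinalityMax="1">
        <ValueScheme>
            <Vocabulary URI="http://openskos.org/api/languages"
                        ValueProperty="skos:prefLabel"
                        ValueLanguage="en">
                <enumeration>
                    <!-- imported name values omitted -->
                </enumeration>
            </Vocabulary>
        </ValueScheme>
    </CMD_Element>
</CMD_Component>
```

Each large enumeration then lives in exactly one component, and profiles reuse that component instead of repeating the list in their own elements.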

Note: If the same effect is to be obtained for Attributes, they also will have to be wrapped separately in a component.

  • Example: Consider the component OLAC-DcmiTerms. Here all elements have a similar DcmiType attribute, with the entire value list enumerated for each occurrence. If this is to be avoided in the proposed 1.2 model, elements and attributes must be wrapped separately into components throughout.

Importing partial vocabularies hampers reuse

The proposed model does not force the modeller to import entire vocabularies only; it is also possible to import a subset of a larger vocabulary. For example, in a specific language element, the component creator may choose to import only the languages relevant to their user community. Such practice should be discouraged, as it renders the component unusable for anyone who needs access to more or other languages, even though the component might otherwise be perfectly suitable. Not supporting partial imports, while retaining the external vocabulary reference in the Component Registry, should however drastically limit the number of such occurrences.