This page is a subpage of CMDI 1.2

External vocabularies in CMDI 1.2

An executive summary is available at CMDI 1.2/Vocabularies/Summary

The issue

The issue is about utilising external vocabularies as value domains for CMDI elements and CMDI attributes (i.e. Attribute elements), and more specifically how to do this using the OpenSKOS-based CLAVAS vocabulary service.

This has been described and explored on the page CmdiClavasIntegration. The issue description below is based on the one found on that page, and subsequent discussions within the CMDI Taskforce.

CLAVAS

OpenSKOS is a vocabulary service by which users may publish, manage and use SKOS-ified vocabulary data. The data can be accessed by a publicly available RESTful API. At this point only basic SKOS is supported. Also, some Dublin Core elements are included, but not indexed.

CLAVAS is CLARIN-NL's application of OpenSKOS. It is hosted by the Meertens Institute; the contact is Hennie Brugman. The vocabularies will be available through a set of web services hosted at Meertens. Since CLAVAS is based on OpenSKOS, vocabularies map to 'concept schemes' in OpenSKOS. More information is available in the CLAVAS work plan document.

This page covers integration of CLAVAS with CMDI. The proposed workflow is that metadata modelers can associate a vocabulary (identified by its URI) with an element in their components and profiles. The metadata creator will then be able to pick values from the specified vocabulary or (in the case of open vocabularies) still choose to use a custom value that does not appear in the vocabulary. Editors like Arbil need to be extended to access the CLAVAS public API for retrieving potential values from vocabularies.

Status of CLAVAS

Public API

The RESTful API accepts queries in the Apache Lucene Query Parser Syntax. This allows one to search for concepts within a ConceptScheme (vocabulary).

In addition the vocabulary data may be harvested from an OAI-PMH endpoint, at which the data are organised in sets corresponding to individual vocabularies.
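As a rough illustration of set-based harvesting, the sketch below builds a standard OAI-PMH `ListRecords` request restricted to one set. The verb and the `metadataPrefix` and `set` parameters come from the OAI-PMH protocol itself. The base URL and the set name are placeholders (assumptions), not the actual CLAVAS endpoint or an existing vocabulary.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute the real OAI-PMH base URL.
OAI_BASE_URL = "https://example.org/oai-pmh"


def list_records_url(base_url, set_spec, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL limited to one set (one vocabulary)."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,  # oai_dc is mandatory for every OAI-PMH repository
        "set": set_spec,                    # hypothetical set name for a vocabulary
    }
    return f"{base_url}?{urlencode(params)}"


print(list_records_url(OAI_BASE_URL, "example-vocabulary"))
```

A harvester would typically follow the resumption tokens returned by the endpoint; that part is left out here.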

Find concepts

Example queries:

Autocomplete

The autocomplete API is a simplified version of Find concepts, by which single property values may be retrieved as a list. However, restricting the search to individual vocabularies does not appear to be possible with autocomplete.

Example query: https://openskos.meertens.knaw.nl/api/autocomplete/Norw?returnLabel=prefLabel&searchLabel=prefLabel
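An editor would typically assemble such a request from whatever the user has typed so far. The following is a minimal sketch that reproduces the example query above. The function name `autocomplete_url` is hypothetical, and only the two query parameters shown in the example are used.

```python
from urllib.parse import quote, urlencode

# Base path taken from the example query above.
AUTOCOMPLETE_BASE = "https://openskos.meertens.knaw.nl/api/autocomplete"


def autocomplete_url(prefix, search_label="prefLabel", return_label="prefLabel"):
    """Build an autocomplete request URL for the text typed so far."""
    params = urlencode({"returnLabel": return_label, "searchLabel": search_label})
    # Percent-encode the typed text so spaces and slashes stay inside the path segment.
    return f"{AUTOCOMPLETE_BASE}/{quote(prefix, safe='')}?{params}"


print(autocomplete_url("Norw"))
```

The sketch only builds the URL; it does not send the request or assume anything about the response format.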

There is also an autocomplete variant of Find concepts, for example:

Harvesting

Example query:

Eventually a CLAVAS-specific (as opposed to openskos.org) service will become available.

Management interface and editor

A concept scheme editor is available but not publicly accessible.

Available data

Solution description (proposed)

This section describes the solution as proposed at the Utrecht Taskforce Meeting on 21.2.2014.

There are mainly 2 ways of using the OpenSKOS vocabularies in CMDI:

  • Importing vocabularies as closed value domains for CMD_elements or Attribute. Since the vocabulary items are enumerated explicitly as a choice list in the elements in question, validation is possible.
  • Using one or a combination of OpenSKOS vocabularies for dynamic lookup and retrieval of values for a CMDI element or Attribute. Here a non-exclusive (open) use of items from the vocabulary must be assumed, as validation against such external vocabularies is not practicable.
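The practical difference between the two modes can be sketched as follows. This is an illustration only: the function name `accepts` and the sample values are hypothetical, and a real editor or validator would work on the schema rather than on Python sets.

```python
def accepts(value, enumeration=None):
    """Decide whether a value is acceptable for an element.

    enumeration is the imported choice list for a closed value domain,
    or None when the vocabulary is only used for lookup (open use).
    """
    if enumeration is None:
        # Open use: vocabulary items are suggestions, custom values are allowed.
        return True
    # Closed use: only explicitly enumerated items validate.
    return value in enumeration


closed_domain = {"Dutch", "French"}
print(accepts("Dutch", closed_domain))      # enumerated item
print(accepts("Norwegian", closed_domain))  # not enumerated, rejected
print(accepts("Norwegian"))                 # open use, accepted
```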

Schema changes

The following changes to the General Component Schema accommodate vocabulary use for both CMD_Element and Attribute:

  • New element <Vocabulary> in <ValueScheme>
  • <Vocabulary> may have an <enumeration> element. If so, we have an internal, closed vocabulary (imported or locally specified). If there is no <enumeration>, the vocabulary is to be considered external and used as a lookup mechanism.
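A tool reading a component specification could apply this rule along the following lines. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `vocabulary_mode` is hypothetical, and the fragments are simplified (no namespaces).

```python
import xml.etree.ElementTree as ET


def vocabulary_mode(value_scheme_xml):
    """Return 'closed', 'open', or None for a <ValueScheme> fragment."""
    scheme = ET.fromstring(value_scheme_xml)
    vocabulary = scheme.find("Vocabulary")
    if vocabulary is None:
        return None  # e.g. a <pattern> value scheme, no vocabulary involved
    # An <enumeration> child means the items were imported: internal, closed.
    if vocabulary.find("enumeration") is not None:
        return "closed"
    # No <enumeration>: external vocabulary, used for lookup only.
    return "open"


closed = '<ValueScheme><Vocabulary URI="http://openskos.org/api/languages"><enumeration/></Vocabulary></ValueScheme>'
external = '<ValueScheme><Vocabulary URI="http://openskos.org/api/languages"/></ValueScheme>'
print(vocabulary_mode(closed), vocabulary_mode(external))
```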

Attributes for <Vocabulary>

  • @URI
  • @ValueProperty (which field of the vocabulary items to return, typically prefLabel)
  • @ValueLanguage (preferred language of the item field value)

General Component Schema changes

<xs:complexType name="ValueScheme_type">
        <xs:choice>
            <xs:element name="pattern" type="xs:string" maxOccurs="1">
                <xs:annotation>
                    <xs:documentation>Specification of a regular expression the element should
                        comply with.</xs:documentation>
                </xs:annotation>
            </xs:element>
            <xs:element name="Vocabulary" type="Vocabulary_type">
                <xs:annotation>
                    <xs:documentation>Specification of an open or closed vocabulary</xs:documentation>
                </xs:annotation>
            </xs:element>
        </xs:choice>
    </xs:complexType>

    <xs:complexType name="Vocabulary_type">
         <xs:sequence>
            <xs:element name="enumeration" type="enumeration_type" minOccurs="0" maxOccurs="1">
                <xs:annotation>
                    <xs:documentation>A list of the allowed values of a controlled
                        vocabulary.</xs:documentation>
                </xs:annotation>
            </xs:element>
        </xs:sequence>
        <xs:attribute name="URI" type="xs:anyURI"/>
        <xs:attribute name="ValueProperty" type="xs:string"/> <!-- optionally selects a label -->
        <xs:attribute name="ValueLanguage" type="xs:string"/> <!-- optionally selects a language -->
    </xs:complexType>

CMD_Element example

       <CMD_Element 
            name="Language"
            CardinalityMax="1" 
            CardinalityMin="1"> 
            <ValueScheme>
                <Vocabulary
                    URI="http://openskos.org/api/languages"
                    ValueProperty="skos:prefLabel"
                    ValueLanguage="en">
                    <enumeration>
                        <item ConceptLink="http://cdb.iso.org/lg/CDB-00138580-001">Dutch</item>
                        <item ConceptLink="http://cdb.iso.org/lg/CDB-00138512-001">French</item>
                        ....
                    </enumeration>
                </Vocabulary>
            </ValueScheme>
        </CMD_Element>
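To show how a client might consume such a specification, the sketch below parses the example (with the elided items left out) and maps each allowed value to its ConceptLink. The function name `allowed_values` is hypothetical; the element and attribute names are those of the example.

```python
import xml.etree.ElementTree as ET

CMD_ELEMENT = """
<CMD_Element name="Language" CardinalityMax="1" CardinalityMin="1">
  <ValueScheme>
    <Vocabulary URI="http://openskos.org/api/languages"
                ValueProperty="skos:prefLabel" ValueLanguage="en">
      <enumeration>
        <item ConceptLink="http://cdb.iso.org/lg/CDB-00138580-001">Dutch</item>
        <item ConceptLink="http://cdb.iso.org/lg/CDB-00138512-001">French</item>
      </enumeration>
    </Vocabulary>
  </ValueScheme>
</CMD_Element>
"""


def allowed_values(cmd_element_xml):
    """Map each enumerated value to the ConceptLink of its vocabulary item."""
    root = ET.fromstring(cmd_element_xml)
    items = root.findall("./ValueScheme/Vocabulary/enumeration/item")
    return {item.text: item.get("ConceptLink") for item in items}


values = allowed_values(CMD_ELEMENT)
print(values["Dutch"])
print("Norwegian" in values)
```

Because the items are enumerated in the specification itself, this check needs no call to the vocabulary service.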

Impact on tools

  • Metadata editors must facilitate vocabulary lookup. Arbil, as the most generic editor, should be prioritized.
  • Component Registry must facilitate import of vocabularies. Interface for specifying value domains for elements and Attributes must be updated.
  • Discovery services (VLO a.o.) could provide assistance for users through vocabularies. E.g. vocabulary-based browsing.

Comments/concerns

The proposed solution allows abuse to a certain degree, and it is vital to describe and motivate good practices before bad practice proliferates. The main concern is the possibility of importing vocabularies as controlled value ranges for CMD_Element and Attribute.

Avoiding multiplication of large vocabularies in CR

Since imported vocabularies are to be part of elements, and elements are not reusable, great care must be taken so that large enumeration lists are not duplicated across components. One way of achieving this is

  1. to consider which vocabularies are likely to be relevant in many profiles
  2. for each concept property that is relevant as ValueProperty for some element in CR, define a component in CR containing one element only and import the property values of the vocabulary concepts as its closed value domain.
    • Example: The component iso-language-639-3 contains one element only - iso-639-3-code - taking values from a controlled vocabulary of language codes. (With the proposed 1.2 model, and given the CLAVAS vocabulary of languages, ValueProperty would have been set to "notation"). Some modelers may prefer to store the language names instead of or in addition to codes. To make sure this can be reused independently of language codes, another component containing a language name element (with ValueProperty=prefLabel) should be defined.

Note: If the same effect is to be obtained for Attributes, they also will have to be wrapped separately in a component.

  • Example: Consider the component OLAC-DcmiTerms. Here all elements have similar attributes DcmiType with the entire value list enumerated for each occurrence. If this is to be avoided in the proposed 1.2 model, elements and attributes must be wrapped separately into components throughout.

Importing partial vocabularies hampers reuse

The proposed model does not force the modeller to import entire vocabularies only; it is possible to import only subsets from a larger vocabulary. For example, in a specific language element, the component creator may choose to import only the languages relevant in his/her user community. Such practice should be discouraged, as it renders the component unusable for anyone who needs access to more/other languages, even though the component otherwise might be perfectly suitable. Not supporting partial imports while retaining the external vocabulary reference in the Component Registry should however drastically limit the number of such occurrences.

Solution description (old)

(This section describes the initial proposal, later replaced by the solution above. It is included for provenance reasons.)

Open/closed vocabularies

Open vocabularies

Each concept within a vocabulary is identified by an OpenSKOS specific URI and optionally has a reference to a 'source' URI (e.g. ISOCat). For fields that link to vocabularies as open vocabularies, we want to store one of these URIs as an attribute in CMDI metadata instances. There will be a deterministic fall-back mechanism in Arbil that chooses the source URI if available, otherwise the CLAVAS URI.
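The fall-back rule described above is simple enough to state in code. The sketch below is an illustration, not Arbil's actual implementation; the function name `concept_link` and the sample URIs are only examples.

```python
def concept_link(clavas_uri, source_uri=None):
    """Pick the URI to store in the instance: source URI if present, else the CLAVAS URI."""
    # An empty string is treated the same as a missing source URI.
    return source_uri if source_uri else clavas_uri


# Concept with a reference to a source URI (e.g. ISOCat): the source URI wins.
print(concept_link("http://openskos.org/api/languages/nld",
                   "http://cdb.iso.org/lg/CDB-00138580-001"))

# Concept without a source URI: fall back to the CLAVAS (OpenSKOS) URI.
print(concept_link("http://openskos.org/api/institution/beng"))
```

Since the choice depends only on the two URIs, the same concept always yields the same stored link, which is what makes the mechanism deterministic.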

The value that serves as the text content should come from one of the child elements of the concept definition. Typically this will be the preferred label as specified in the vocabulary item returned from the API but it could also come from another element (e.g. to choose between item full name and item code). Which path to use should be determined in the component specification (possibly part of the vocabulary URI).
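As a sketch of this selection step, assume the concept returned by the API has already been reduced to a flat dictionary of property values. That shape is an assumption made for illustration, as are the function name `display_value` and the sample concept.

```python
def display_value(concept, value_property="prefLabel"):
    """Return the text content for an instance field from one concept property."""
    if value_property not in concept:
        raise KeyError(f"concept has no property {value_property!r}")
    return concept[value_property]


# Hypothetical language concept offering both a full name and a code.
dutch = {"prefLabel": "Dutch", "notation": "nld"}

print(display_value(dutch))              # default: preferred label
print(display_value(dutch, "notation"))  # component spec asks for the code instead
```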

Closed vocabularies

Closed vocabularies will be no different from standard CMDI closed vocabularies on the instance level. All required information will be available from the schema due to the vocabulary import (see below).

Centre impact

Infrastructure:

  • Component Registry

Tools:

  • Editors (Proforma, Arbil) should present the vocabulary items to the user and insert the URI and display value of the chosen item into the CMDI instance
  • Viewers and catalogues could provide a link to the vocabulary item in addition to showing the display value

Implementation examples

We distinguish between two largely separate cases: open and closed vocabularies. Open vocabularies imply a link to an external vocabulary, while closed vocabularies are statically imported into the component specification.

Open Vocabularies

Component and profile specifications

Specified using attributes on CMD_Element

  • ValueScheme has to be string. For this we add a schematron rule to the general component schema.
  • Vocabulary URI (not necessarily a URL) gets specified in Vocabulary attribute
  • Value property field optionally specified (default=prefLabel) as a separate attribute VocabValueProperty
    • Use case is language vocabulary that provides versions (e.g. full name, code) of ISO-639 per item
    • It would be nice if we could pass the 'label' selection on to the API so that a pre-selection can happen server side, returning it in a specially marked element or attribute (making the processing of the response more uniform)

Example:

        <CMD_Element
            name="Language"
            CardinalityMax="1"
            CardinalityMin="1">
            <ValueScheme>
                <Vocabulary
                    URI="http://openskos.org/api/institutions"
                    ValueProperty="skos:notation">
                </Vocabulary>
            </ValueScheme>
        </CMD_Element>

Profile XSD's

The values of the vocabulary related attributes could go straight into the generated profile XSD, pretty much like the "datcat"-attributes and read like that from the schema by client applications.

Example:

<xs:element 
  name="Institute"  
  cmd:Vocabulary="http://openskos.org/institutions" 
  cmd:VocabValueProperty="skos:notation">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute ref="xml:lang"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>

Instances

An attribute ValueConceptLink (in the CMDI namespace to disambiguate and prevent clashes with potential custom attributes of this name, see CMDI 1.2/Attributes) will be allowed on fields that have a linked vocabulary. It holds the URI of the selected value, semantically marking the chosen option.

<Institution cmd:ValueConceptLink="http://openskos.org/api/institution/beng">Sound and Vision</Institution>
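A viewer or catalogue could read both the display value and the concept link from such a field as sketched below. The namespace URI bound to the `cmd` prefix is an assumption for this illustration (the fragment above does not declare it), and the function name `read_field` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the cmd prefix; adjust to the one the schema actually uses.
CMD_NS = "http://www.clarin.eu/cmd/"

INSTANCE = (
    f'<Institution xmlns:cmd="{CMD_NS}" '
    'cmd:ValueConceptLink="http://openskos.org/api/institution/beng">'
    "Sound and Vision</Institution>"
)


def read_field(xml_text):
    """Return (display value, concept URI or None) for a vocabulary-linked field."""
    field = ET.fromstring(xml_text)
    # ElementTree addresses namespaced attributes as {namespace-uri}local-name.
    link = field.get(f"{{{CMD_NS}}}ValueConceptLink")
    return field.text, link


print(read_field(INSTANCE))
```

A field filled with a custom value (open use) simply has no ValueConceptLink, in which case the second element of the result is `None`.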

Closed Vocabularies

Component and profile specifications

Closed vocabularies will be 'imported' into the component at design time, resulting in an internalized 'snapshot copy' of the vocabulary at the time of creation. The ComponentRegistry will be extended with functionality to allow this. The vocabulary URI will be stored in the component specification and transferred to the schema so that editors can query the API for additional information, but this is optional as all information including the item URIs will be available from the schema.

We will add the vocabulary URI as an attribute to the element and re-use the existing ConceptLink attribute on the enumeration items to store the identifier of individual vocabulary items.

        <CMD_Element
            name="Language"
            CardinalityMax="1"
            CardinalityMin="1">
            <ValueScheme>
                <Vocabulary
                    URI="http://openskos.org/api/languages"
                    ValueProperty="skos:prefLabel"
                    ValueLanguage="en">
                    <enumeration>
                        <item ConceptLink="http://cdb.iso.org/lg/CDB-00138580-001">Dutch</item>
                        <item ConceptLink="http://cdb.iso.org/lg/CDB-00138512-001">French</item>
                    </enumeration>
                </Vocabulary>
            </ValueScheme>
        </CMD_Element>

Text content comes from the selected label. ConceptLink has the URI for each item in the vocabulary. There probably is no need for AppInfo (separate display label). Notice that there currently is no way to represent multilingual vocabularies, so the language will have to be specified in the vocabulary URI with a fallback to the default language of the vocabulary.
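The language fallback could work roughly as follows at import time. The sketch assumes the labels of one vocabulary item are available as a mapping from language code to label; that data shape, the function name `pick_label` and the sample labels are all assumptions for illustration.

```python
def pick_label(labels, language=None, default_language="en"):
    """Select the label to freeze into the enumeration for one vocabulary item.

    labels maps language codes to label text; default_language stands for
    the default language of the vocabulary.
    """
    if language is not None and language in labels:
        return labels[language]
    # Requested language missing or not given: fall back to the default language.
    return labels[default_language]


french = {"en": "French", "nl": "Frans", "fr": "français"}

print(pick_label(french, "nl"))  # requested language available
print(pick_label(french, "de"))  # not available, falls back to the default
```

Whatever language is selected at import time is then fixed in the snapshot, since the enumeration stores a single label per item.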

Profile XSD's

As in the component spec, this is a small extension of the way closed vocabularies are represented in profile schemata in CMDI 1.1. Only the cmd:Vocabulary and cmd:VocabValueProperty attributes are added to the simpleType.

Example:

<xs:simpleType 
  name="simpletype-iso-639-3-code-clarin.eu.cr1.c_123456789" 
  cmd:Vocabulary="http://openskos.org/api/languages"
  cmd:VocabValueProperty="iso-639-3"
  cmd:VocabValueLanguage="en">
  <xs:restriction base="xs:string">
    <xs:enumeration value="Dutch" dcr:datcat="http://cdb.iso.org/lg/CDB-00138580-001" />
    <xs:enumeration value="French" dcr:datcat="http://cdb.iso.org/lg/CDB-00138512-001" />
  </xs:restriction>
</xs:simpleType>

Instances

No different from closed vocabularies in CMDI 1.1. The item URI is specified in the component spec/schema.

General component schema

A modification of ValueScheme_type as it exists in CMDI 1.1:

...
    <xs:complexType name="ValueScheme_type">
        <xs:choice>
            <xs:element name="pattern" type="xs:string" maxOccurs="1">
                <xs:annotation>
                    <xs:documentation>Specification of a regular expression the element should
                        comply with.</xs:documentation>
                </xs:annotation>
            </xs:element>
            <xs:element name="Vocabulary" type="Vocabulary_type">
                <xs:annotation>
                    <xs:documentation>Specification of an open or closed vocabulary</xs:documentation>
                </xs:annotation>
            </xs:element>
        </xs:choice>
    </xs:complexType>

    <xs:complexType name="Vocabulary_type">
         <xs:sequence>
            <xs:element name="enumeration" type="enumeration_type" minOccurs="0" maxOccurs="1">
                <xs:annotation>
                    <xs:documentation>A list of the allowed values of a controlled
                        vocabulary.</xs:documentation>
                </xs:annotation>
            </xs:element>
        </xs:sequence>
        <xs:attribute name="URI" type="xs:anyURI" use="required"/>
        <xs:attribute name="ValueProperty" type="xs:string"/> <!-- optionally selects a label -->
        <xs:attribute name="ValueLanguage" type="xs:string"/> <!-- optionally selects a language -->
    </xs:complexType>
...

Tickets

Tickets in the CMDI 1.2 milestone with the keyword clavas:

Ticket | Summary | Owner | Component | Priority | Status
#369 | Add support for open and closed vocabularies in component specifications and instances | Dieter Van Uytvanck | ComponentSchema | major | closed

Discussion

Oddrun: Some issues:

SKOS or SKOS++? I find the above a bit confusing. According to my understanding, OpenSKOS provides APIs specifically to SKOS, and I take this to mean that the vocabularies we are trying to integrate with CMDI are all represented in SKOS. As you know, SKOS (plain version) has 3 types of labels: prefLabel, altLabel and hiddenLabel. Still, in the above examples additional label elements seem to be anticipated, like "name" and "organisation-name". So apparently, openSKOS.org allows for richer descriptions than SKOS? Does openSKOS.org have preferences for other ontologies, like FOAF, for instance?

Twan & Menzo: You are right, the SKOS namespace does not allow for such specifications. Either we will be limited to selecting one of these three label types (which in itself is problematic because there can be any number of altLabels), or we would need to depend on some extension available in CLAVAS. This is a bit unclear at the moment, so let's try to get things a bit more clear w.r.t. CLAVAS. In any case the code example on this page is not the best possible example so we will try to adapt.

Open vs. closed vocabularies As far as I understand, whether a vocabulary is open or closed represents - in this context at least - merely a mode of application; it is not a feature of the vocabulary as such. Maybe I am missing something, but I fail to see the sense in treating open and closed vocabularies differently in the way proposed here. More specifically, I question the wisdom of snapshotting (possibly huge) vocabularies into the components and profiles. After all, the vocabularies are not static, and how do we make sure that the vocabulary copies are kept updated? My guess is that this - after a time - will result in as many vocabulary variants as there are components using it. So why not refer to closed vocabularies in the same way as open ones? Then we probably need to encode in another way whether the vocabulary is to be used in a closed or open way.

Twan: I can see how this is not entirely clear. The first distinction we can make is external/internal vocabularies. Internal vocabularies (which is the only type we have in 1.1) can only be closed because they are expressed as choice in XSD and have implications for the validation. External vocabularies could conceptually be applied either way, which I think is the point you are trying to make. However giving an external vocabulary a closed status does not strictly limit a user to use its values, i.e. there is no straightforward validation mechanism. The best thing that could be achieved realistically is for this to make the editing tool(s) limit the entry to items in the vocabulary. Hence in the proposal the distinction external/internal is equivalent to the distinction open/closed. But we can of course consider a 'semi-closed' variant for external vocabularies.

Another thing: Would it be useful to be able to select from a union of vocabularies, instead of just one? True, such a need could be satisfied by choosing one vocabulary and declaring it open. But if all the additional items are to be selected from another vocabulary, it would be nice to be able to say so.

Twan & Menzo: This could be done but maybe that could better be solved in CLAVAS, e.g. by having 'virtual vocabularies' (either pre-defined or dynamic unions of arbitrary concept schemes).

Multilingual vocabularies You say: "Notice that there currently is no way to represent multilingual vocabularies, so the language will have to be specified in the vocabulary URI with a fallback to the default language of the vocabulary". Do you mean "no way to represent in CMDI?" If closed vocabularies are handled "by reference", as the open ones, there is no need to represent multilinguality in CMDI components, at least in the cases where multilingual SKOS vocabularies are realised by generating one prefLabel per language, (e.g. "French"@en), retaining one single URI for the whole vocabulary. If prefLabel is to be used as default VocabValueProperty, CMDI 1.2 must be able to handle multilingual prefLabel sets from the start.

Twan & Menzo: That's correct, see also our point above about open and closed. Internal vocabularies of course get frozen not just on the value domain but also on the selected label - that includes a language selection. And we probably don't want to consider multilingual closed (internal) vocabularies in CMDI 1.2 although the idea itself is interesting.

Discuss the topic in general below this point

Last modified on 09/12/14 13:46:13