
Version 24 (modified by teckart@informatik.uni-leipzig.de, 9 years ago)

First draft for CCSL specification

NOTE: This page is currently under development and should be considered a draft. If you wish to contribute, please contact the authors.

Notes from a recent meeting concerning the CMDI specification can be found here

Component Metadata Infrastructure (CMDI) 1.2 [DRAFT]

Introduction

The goal of the Component Metadata Infrastructure (CMDI) specification...

TODO

History

TODO

Terminology

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC2119.

Glossary

Work in progress. Responsible for this section: Thorsten & Twan

Please do not edit here, but use the Google Docs version!

  • CMD model, Component Metadata model
    • The component based metadata model described in the present specification
  • CMDI, Component Metadata Infrastructure
    • Metadata description framework consisting of the CMD model and infrastructure
  • CCSL, CMDI Component Specification Language
    • XML based language for describing components according to the CMD model
  • CLARIN
  • resource, language resource
    • A (digitally) accessible entity that can be described in terms of its content and technical properties, referenced by a Uniform Resource Identifier
  • digital object
    • Resource in a repository stored in one repository container that can be addressed by an identifier; a digital object can be seen as a generalization of a directory in a file system containing one or more files which are the data stream(s). Digital objects can exist in databases, hence the comparison to directory and file structures falls short.
  • metadata
    • A description of a resource, usually given as a set of properties in the form of attribute-value pairs. This description may contain information about the resource, aspects or parts of the resource and/or artefacts and actors connected to the resource.
  • persistent identifier, PID
    • Unique Uniform Resource Identifier that assures permanent access for a digital object by providing access to it independently of its physical location or current ownership
  • concept
    • An abstract or generic idea generalized from particular instances (source: Merriam-Webster)
  • semantic registry
    • A list/directory/system maintaining (authoritative) definitions of terms, concepts or data categories. These registries should also provide persistent identifiers for their entries.
  • concept link
    • A reference from a CMD profile, CMD component, CMD element, CMD attribute or a value in a controlled vocabulary to an entry in a semantic registry via its persistent identifier.
  • CLARIN Concept Registry
  • CMD instance, metadata instance, CMDI file, metadata record, CMD record
    • A file that conforms to the general CMDI instance structure as described in this specification, and at the instance payload level follows the specific structure defined by the CMD specification it relates to
  • Instance header
    • The section of a metadata instance marked as ‘header’, providing information on that metadata instance as such, not the resource that is described by the metadata file
  • Resource proxy, CMD resource reference
    • A representation of a resource within a metadata instance containing a Uniform Resource Identifier as a reference to the resource itself and a specification of its type (one of: Resource, Metadata, SearchPage, SearchService, LandingPage)
  • Resource proxy reference
    • A reference from any point within the instance payload to any of the resource proxies
  • Instance payload(?)
    • The section of a metadata instance that follows the structure defined by the profile it references and contains the description of the resources to which that metadata instance relates
  • CMD specification, component specification/definition, profile specification/definition
    • The implementation of a CMD component or CMD profile by means of the CCSL
  • Specification header, component header, profile header
    • The section of a CMD specification marked as ‘header’, providing information on that specification as such that is not part of the defined structure
  • CMD component, component
    • A reusable, structured template for the description of (an aspect of) a resource, defined by means of a CMD specification document, with the potential of embedding other components by reference
  • CMD profile, profile definition, profile
    • A CMD component that is used to describe a class of resources and is not embedded into other components, and therefore provides the complete structure for an instance payload
  • CMD element, element definition
    • A unit of a CMD component that describes the level of the metadata instance that can carry atomic values constrained by a value scheme, and does not contain further levels except for that of the CMD attribute
  • CMD attribute
    • A unit of a CMD element that describes the level at which properties of a CMD element can be provided by means of value scheme constrained atomic values.
  • value scheme
    • A set of constraints governing the range of values allowed for a specific CMD element or CMD attribute in a metadata instance, expressed in terms of an XML schema datatype, controlled vocabulary, or regular expression
  • controlled vocabulary, closed/open vocabulary
    • A set of values that can be used either to constrain the set of permissible values or to provide suggestions for applicable values in a given context
  • regular expression

Normative References

RFC2119
Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119, March 1997,
http://www.ietf.org/rfc/rfc2119.txt
XML-Namespaces
Namespaces in XML 1.0 (Third Edition), W3C, 8 December 2009,
http://www.w3.org/TR/2009/REC-xml-names-20091208/

Non-Normative References

RFC3023
XML Media Types, IETF RFC 3023, January 2001,
http://www.ietf.org/rfc/rfc3023.txt

Typographic and XML Namespace conventions

The following typographic conventions for XML fragments will be used throughout this specification:

  • <prefix:Element>
    An XML element with the Generic Identifier Element that is bound to an XML namespace denoted by the prefix prefix.
  • @attr
    An XML attribute with the name attr
  • string
    The literal string must be used either as element content or attribute value.

The following XML namespace names and prefixes are used throughout this specification. The column "Recommended Syntax" indicates which syntax variant SHOULD be used by the Endpoint to serialize the XML response.

Prefix | Namespace Name | Comment | Recommended Syntax
cmd | http://clarin.eu/cmd | CMDI instance | prefixed

TODO: update namespaces

Structure of CMDI-files

Responsible for this section: Oddrun

A CMDI file contains the actual metadata of one specific resource (hereafter referred to as the described resource), and might also be referred to as a CMDI record. All CMDI files have the same structure at the top level. At a lower level, parts of its structure are defined by the CMDI profile upon which it is based.

The main structure

A CMDI file has the root element CMD with four subelements:

  • The Header element, containing administrative information about the CMDI file, i.e. metadata about the file itself
  • The Resources element, listing resource proxies and their interrelations by means of its subelements (see the resources section)
  • The IsPartOf list, containing a list of IsPartOf elements, each referencing a larger external resource of which the described resource (as a whole) forms a part
  • The Components element, containing one subelement that corresponds to, and is in turn structured according to, the CMDI profile applied

The profile substructure exists in the profile-specific namespace; everything else is in the cmd namespace.
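The top-level layout described above can be sketched in Python as follows. This is only an illustration: it assumes the `http://clarin.eu/cmd` namespace from the namespace table in this draft (which is still marked for update) and the child element name `IsPartOfList` used in the notes on the Is-Part-of List; `build_skeleton` and `top_level_names` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace as listed in this draft's namespace table ("TODO: update").
CMD_NS = "http://clarin.eu/cmd"
# Assumption: element names as used in this draft; "IsPartOfList" comes from the section notes.
TOP_LEVEL = ["Header", "Resources", "IsPartOfList", "Components"]


def build_skeleton():
    """Build an empty CMD root element with the four top-level subelements."""
    root = ET.Element(f"{{{CMD_NS}}}CMD")
    for name in TOP_LEVEL:
        ET.SubElement(root, f"{{{CMD_NS}}}{name}")
    return root


def top_level_names(root):
    """Return the local names of the root's children in document order."""
    return [child.tag.split("}", 1)[1] for child in root]


skeleton = build_skeleton()
print(top_level_names(skeleton))
```

The skeleton leaves the profile-specific payload under Components empty, since that part lives in a different namespace and depends on the profile applied.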

<About local attributes here>

In the following the main parts are described in detail

The header

Name: MdCreator
Description: Denotes the creator of this metadata file
Value type: A string
Occurrences: 0 to unbounded
Attributes:

State purpose of header. List elements in a table, giving name, "definition", type, and cardinality for each.

The resources section

The Resource proxy list

State purpose of Resource Proxy list (and which files should be listed here). Specify in detail how resource proxies are represented:

  • all possible elements and attributes with definition, type, cardinality/obligation

The Journal File Proxy List

State purpose of Journal File Proxy list (and which files should be listed here). Specify in detail how journal file proxies are represented:

  • all possible elements and attributes with definition, type, cardinality/obligation

The Resource Relation List

State purpose of Resource Relation List (representing binary relations between resources (proxies) and/or other resources). Specify in detail how resource relations are represented:

  • all possible elements and attributes with definition, type, cardinality/obligation

The Is-Part-of List

State purpose of Is-Part-of List (representing external resources that the described resource is a part of). (NOTE: IsPartOfList is no longer in the Resources section.) Specify in detail how an Is-part-of relation is represented:

  • all possible elements and attributes with definition, type, cardinality/obligation

The components

State purpose of components section, and its dependency upon the profile (as given in the header: MdProfile).

The CMDI Component Specification Language

Responsible for this section: Thomas

The CMDI Component Specification Language (CCSL) is used to describe a CMD component or CMD profile. Hence, a CCSL document provides the structure of an aspect of a resource or (in the case of a profile specification) the complete structure of the instance's payload. It is also the basis for the generation of the XML schema file that is used to validate a CMD instance (see section Transformation of CCSL into a CMD profile schema for details). A CCSL document consists of two sections, the CCSL header and the actual CMD component description. Its root element must contain an XML attribute isProfile to indicate whether the document specifies a CMD profile or a CMD component. Figure XY shows the relation of the individual elements of the CCSL.

CCSL header

The CCSL header provides the information needed to identify and describe the component: a persistent identifier, the name, and a description of the component. The header also records the status of the specification, using a mandatory element that indicates the component's status in its lifecycle (one of development, production, or deprecated) and an optional element statusComment that explains the reason for the current status. If a deprecated specification was succeeded by a new specification, the identifier of the direct successor should be stored in the element Successor.
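The header rules above can be illustrated with a small Python sketch. The class `SpecHeader`, its field names, and the function `header_problems` are hypothetical and not part of the CCSL. The check that a successor only appears on a deprecated specification is an assumption derived from the wording above, not a stated requirement.

```python
from dataclasses import dataclass
from typing import Optional

# The three lifecycle states named in this section.
LIFECYCLE_STATES = {"development", "production", "deprecated"}


@dataclass
class SpecHeader:
    """Hypothetical in-memory view of a CCSL header."""
    identifier: str
    name: str
    description: str
    status: str
    status_comment: Optional[str] = None  # optional statusComment element
    successor: Optional[str] = None       # optional Successor element


def header_problems(header):
    """Return a list of human-readable problems found in the header."""
    problems = []
    if header.status not in LIFECYCLE_STATES:
        problems.append(f"unknown status: {header.status!r}")
    # Assumption: a successor is only expected on a deprecated specification.
    if header.successor is not None and header.status != "deprecated":
        problems.append("successor given for a specification that is not deprecated")
    return problems


old = SpecHeader("id-1", "Actor", "Describes an actor", "deprecated", successor="id-2")
print(header_problems(old))
```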

CMD Component definition

Components are defined as a sequence of elements which may be followed by other components. The latter is allowed because components may be embedded in other components. The specification of a CMD component contains the name of the component, the component's identifier, an optional concept link, and information about the allowed cardinality of the component. Furthermore, documentation texts and further CMD attributes may be specified. The following table contains a summary of allowed specifications for a CMD component.

Name | Element/Attribute | Value type | Description
name | Attribute | xs:Name | Name of the component
ComponentId | Attribute | xs:anyURI | Identifier of the component
ConceptLink | Attribute | xs:anyURI | Concept link
CardinalityMin | Attribute | xs:nonNegativeInteger | Minimum number of times this component has to occur
CardinalityMax | Attribute | xs:nonNegativeInteger or “unbounded” | Maximum number of times this component may occur
Documentation | Element | xs:string | Documentation about the purpose of the component
AttributeList | Element | xs:complexType | Additional attributes specified by the component creator
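To make the cardinality attributes concrete, the following Python sketch checks whether a given number of occurrences satisfies a CardinalityMin/CardinalityMax pair. The function name `cardinality_allows` is hypothetical; the only assumption taken from the table is that the maximum is either a non-negative integer or the literal "unbounded".

```python
def cardinality_allows(count, minimum, maximum):
    """Check an occurrence count against CardinalityMin and CardinalityMax.

    `maximum` is either a non-negative integer or the string "unbounded".
    """
    if count < minimum:
        return False
    if maximum == "unbounded":
        return True
    return count <= int(maximum)


# An optional, non-repeatable component (0..1) seen twice is rejected.
print(cardinality_allows(2, 0, 1))
# A mandatory, repeatable component (1..unbounded) seen 50 times is accepted.
print(cardinality_allows(50, 1, "unbounded"))
```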

CMD element definition

CMD elements are templates for storing atomic values constrained by a value scheme in a CMD instance. All relevant information and restrictions for such an element are contained in the CMD element definition. Most of this information is stored in XML attributes. This includes the mandatory name of the element, an optional concept link, the value scheme, and information about the allowed cardinality of the element. Furthermore, it can be indicated whether the element may have different instance values in multiple languages, and hence an unlimited upper cardinality bound. Besides standard XML schema datatypes, the value of a CMD element can be constrained by using regular expressions or vocabularies. The latter can be specified by giving the complete list of allowed values or by stating the URI of an external vocabulary (for details see Value restrictions for elements and attributes). If the instance's content of the element can be derived from other values, the element AutoValue may be used to indicate the derivation function. The CCSL does not prescribe or suggest a specific set of derivation functions. The following table contains a summary of allowed specifications for a CMD element.

Name | Element/Attribute | Value type | Description
name | Attribute | xs:Name | Name of the element
ConceptLink | Attribute | xs:anyURI | Concept link
ValueScheme | Attribute | Subset of XSD datatypes | Allowed data type if simple XML type is used
CardinalityMin | Attribute | xs:nonNegativeInteger or “unbounded” | Minimum number of times this element has to occur
CardinalityMax | Attribute | xs:nonNegativeInteger or “unbounded” | Maximum number of times this element may occur
Multilingual | Attribute | xs:boolean | Indication that the element can have values in multiple languages
Documentation | Element | xs:string | Documentation about the purpose of the element
AttributeList | Element | xs:complexType | Additional attributes specified by the component creator
ValueScheme | Element | xs:complexType | Value restrictions based on a regular expression or a specified vocabulary
AutoValue | Element | xs:string | Derivation rules for the element's content
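The interaction between the Multilingual flag and the upper cardinality bound can be sketched as follows. The class `ElementDefinition`, its defaults of 1 for both bounds, and the function `effective_max` are hypothetical. The sketch only encodes the statement above that a multilingual element has an unlimited upper bound.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ElementDefinition:
    """Hypothetical in-memory view of a CMD element definition."""
    name: str
    value_scheme: str                      # name of a simple XSD datatype, e.g. "string"
    cardinality_min: int = 1               # assumed default, not stated in this draft
    cardinality_max: Union[int, str] = 1   # integer or "unbounded"; assumed default
    multilingual: bool = False


def effective_max(definition):
    """Return the upper bound that applies to the element in an instance."""
    if definition.multilingual:
        # Values in multiple languages imply an unlimited upper bound.
        return "unbounded"
    return definition.cardinality_max


title = ElementDefinition(name="Title", value_scheme="string", multilingual=True)
print(effective_max(title))  # prints "unbounded"
```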

CMD attribute definition

Both the CMD element description and the CMD component description allow the specification of additional CMD attributes. Every CMD attribute is specified using attributes and elements similar to those used for CMD elements. The following table contains a summary of allowed specifications for a CMD attribute.

Name | Element/Attribute | Value type | Description
name | Attribute | xs:Name | Name of the attribute
ConceptLink | Attribute | xs:anyURI | Concept link
ValueScheme | Attribute | Subset of XSD datatypes | Allowed data type if simple XML type is used
Required | Attribute | xs:boolean | Indication if attribute is required
Documentation | Element | xs:string | Documentation about the purpose of the attribute
ValueScheme | Element | xs:complexType | Value restrictions based on a regular expression or a specified vocabulary
AutoValue | Element | xs:string | Derivation rules for the attribute's content

Value restrictions for elements and attributes

Apart from standard XML schema datatypes, the content of a CMD element or attribute instance can be restricted in two ways. The ValueScheme element may contain either an XML element pattern, specifying a regular expression that the content should comply with, or the definition of a vocabulary of allowed values. CMDI 1.2 supports two approaches to describe such a vocabulary:

  • specifying all allowed values with optional attributes for every value to include a concept link and a description of the specific value, or
  • referring to an external vocabulary via a URI specified in the attribute URI. Optional XML attributes ValueProperty and ValueLanguage may be used to give more information about preferred label and language in the chosen vocabulary.
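The two kinds of restriction described above can be sketched in Python as follows. The class `ValueScheme` and the function `is_valid` are hypothetical names. The sketch covers only a pattern and an enumerated vocabulary; an external vocabulary referenced by URI is not resolved. Python's `re` syntax also differs in details from XML Schema regular expressions, so this is a simplification rather than a faithful validator.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ValueScheme:
    """Hypothetical value scheme: either a pattern or a closed list of values."""
    pattern: Optional[str] = None
    vocabulary: Optional[frozenset] = None


def is_valid(value, scheme):
    """Check a single instance value against the scheme."""
    if scheme.pattern is not None:
        # XML Schema patterns match the whole value, hence fullmatch.
        return re.fullmatch(scheme.pattern, value) is not None
    if scheme.vocabulary is not None:
        return value in scheme.vocabulary
    # No restriction given: accept any value.
    return True


language_code = ValueScheme(pattern=r"[a-z]{3}")
media_type = ValueScheme(vocabulary=frozenset({"audio", "video", "text"}))

print(is_valid("deu", language_code))   # True
print(is_valid("image", media_type))    # False
```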

Cues attributes

All CMD attribute, element, and component specifications may contain additional attributes with the namespace “http://www.clarin.eu/cmdi/cues/display/1.0”. These may be used to give information about how the payload contained in CMD instances should be presented. Different styles for the same CMD component may be developed. The CCSL does not prescribe or suggest a specific set of these cue attributes.
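As a rough illustration, a cue attribute can be attached to a definition by placing it in the cues namespace. In the Python sketch below, the element name `Element`, the cue name `DisplayPriority`, the prefix `cue`, and the helper functions are all hypothetical; only the namespace name is taken from the text above.

```python
import xml.etree.ElementTree as ET

# Namespace name as given in this section.
CUE_NS = "http://www.clarin.eu/cmdi/cues/display/1.0"
# The prefix "cue" is an arbitrary choice for serialisation.
ET.register_namespace("cue", CUE_NS)


def add_cue(definition, cue_name, value):
    """Attach a cue attribute in the cues namespace to a definition element."""
    definition.set(f"{{{CUE_NS}}}{cue_name}", value)


def cues_of(definition):
    """Return all cue attributes of a definition as a plain dictionary."""
    prefix = f"{{{CUE_NS}}}"
    return {
        key[len(prefix):]: value
        for key, value in definition.attrib.items()
        if key.startswith(prefix)
    }


# Hypothetical element definition with one hypothetical cue attribute.
element_def = ET.Element("Element", {"name": "Title"})
add_cue(element_def, "DisplayPriority", "1")
print(ET.tostring(element_def, encoding="unicode"))
```

Keeping the cues in their own namespace means that a tool which does not understand them can ignore them without affecting the structural definition.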

Transformation of CCSL into a CMD profile schema

Responsible for this section: Twan

A CMD instance document that is serialised as XML according to this specification SHOULD reference the location of a CMD profile schema. The infrastructure MUST provide a mechanism to derive such a schema for any specific CMD profile on the basis of its definition and that of the CMD components that it references. This section specifies how different aspects of a CMD specification should be transformed into elements of a schema document. The primary schema language targeted is XML Schema, although the infrastructure MAY provide support for other schema languages, such as DDML or Relax NG.

  • CMD profile schemas SHOULD NOT (MUST NOT?) be derived from CMD specifications that are not CMD profiles.
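Since the detailed transformation rules are still open in this draft, the following Python sketch shows only one plausible shape of the derivation. It assumes that a profile becomes a global xs:element and that the cardinality of each embedded component maps to minOccurs/maxOccurs. The function `profile_to_schema` and its parameters are hypothetical, and the refusal to process non-profiles follows the restriction stated above.

```python
import xml.etree.ElementTree as ET

XS_NS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS_NS)


def xs(local_name):
    """Return the qualified tag name for an XML Schema element."""
    return f"{{{XS_NS}}}{local_name}"


def profile_to_schema(name, is_profile, children):
    """Derive a minimal schema skeleton from a profile description.

    `children` is a list of (name, CardinalityMin, CardinalityMax) tuples.
    Assumption: cardinalities map directly to minOccurs/maxOccurs.
    """
    if not is_profile:
        raise ValueError("schemas are derived from CMD profiles only")
    schema = ET.Element(xs("schema"))
    root = ET.SubElement(schema, xs("element"), {"name": name})
    complex_type = ET.SubElement(root, xs("complexType"))
    sequence = ET.SubElement(complex_type, xs("sequence"))
    for child_name, minimum, maximum in children:
        ET.SubElement(sequence, xs("element"), {
            "name": child_name,
            "minOccurs": str(minimum),
            "maxOccurs": str(maximum),
        })
    return schema


schema = profile_to_schema("ExampleProfile", True, [("Actor", 0, "unbounded")])
print(ET.tostring(schema, encoding="unicode"))
```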

Global schema properties

  • Linked components should be included, expanded
  • A CMD profile schema MUST be a single document [or set of linked documents with a single entry point](?) that allows for the evaluation of a CMD instance on all levels of description defined in one specific CMD profile.
  • The CMD profile schema MAY include, as a matter of annotation, a copy of (a subset of) the header information contained in the CMD profile from which it is defined.
  • The CMD profile schema MUST use the following namespaces:
    • targeted namespace
    • for annotation and documentation purposes that are outside the scope of instance validation
    • for embedded semantic annotation

Interpretation of CMD header

Interpretation of CMD component definitions in the CCSL

  • Interpretation of hierarchies in the CCSL
  • concept links
  • order of children
  • elements -> see elements
  • attributes -> see attributes

Interpretation of CMD element definitions in the CCSL

  • content model (value scheme)
  • concept links
  • order of children
  • attributes -> see attributes

Interpretation of CMD attribute definitions in the CCSL

  • content model (value scheme) - same as element??
  • concept links

Appendices

Bibliography

IETF RFC 2045, Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies

IETF RFC 2046, Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types

IETF RFC 5646, Tags for Identifying Languages

ISO 639‐1, Codes for the representation of names of languages — Part 1: Alpha-2 code

ISO 639‐3, Codes for the representation of names of languages -- Part 3: Alpha-3 code for comprehensive coverage of languages

ISO 3166‐1, Codes for the representation of names of countries and their subdivisions — Part 1: Country codes

ISO 8601, Data elements and interchange formats — Information interchange — Representation of dates and times

ISO/IEC 10646‐1, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane

XML Schema Part 2: Datatypes, Biron, P.V. and Malhotra, A. (eds.), W3C Recommendation 02 May 2001, available at <http://www.w3.org/TR/xmlschema-2/>
