{{{#!div class="system-message"
'''NOTE''': This page is currently under development and should be considered a draft. If you wish to contribute, please contact the authors.
}}}

{{{#!div class="notice system-message"
Notes from a recent meeting concerning the CMDI specification can be found [[Taskforces/CMDI/Meeting20140730|here]]
}}}

= Component Metadata Infrastructure (CMDI) 1.2 [DRAFT] =
[[PageOutline(1-5)]]

== Introduction ==
The goal of the ''Component Metadata Infrastructure (CMDI)'' specification...

=== Terminology ===
The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `SHOULD NOT`, `RECOMMENDED`, `MAY`, and `OPTIONAL` in this document are to be interpreted as described in [#REF_RFC_2119 RFC2119].

=== Glossary ===
{{{#!comment
TODO: Add definitions
}}}
 CLARIN-FCS, FCS:: CLARIN federated content search, an interface specification to allow searching within resource content of repositories.
 PID:: A persistent identifier is a long-lasting reference to a digital object.
 attribute:: synonym of XML attribute
 bundle:: collection in which the resources are tied together, having the same origin and being distributed together, e.g. a media file and its annotation created and distributed by the same person
 CCSL, CMDI Component Specification Language:: XML-based language for describing components according to the CMDI model
 CMDI, Component Metadata Infrastructure:: metadata description framework consisting of the CMDI model and infrastructure
 collection:: set of resources described by common metadata and distributed as a unit, i.e. referenced by a single persistent identifier
 component:: list of metadata elements and other components of which every data element corresponds to a metadata category. Together they describe an aspect of a component, e.g. name, language, other metadata properties of an LRT
 data category, datcat:: result of the specification of a given data field 1. a type of data field, such as /definition/. 2.
   ISO 12620:2009 provides for the creation of an inventory of data categories.
 data category registry:: set of data categories to be used as a reference for the definition of linguistic annotation schemes or any other formats in the domain of language resources
 DCR, Data Category Registry, ISO TC37 Data Category Registry:: data category registry used for ISO Technical Committee 37. The DCR is available at http://www.isocat.org.
 DCS, Data Category Selection:: set of data categories selected from the DCR
 data category selection:: set of attributes used to fully describe a given data element concept
 data stream:: constituent of a digital object, e.g. an individual file in a digital object
 digital object, DO:: resource in a repository, stored in one repository container, that can be addressed by an identifier. A digital object can be seen as a generalization of a directory in a file system containing one or more files, which are the data stream(s). Digital objects can also exist in databases, hence the comparison to directory and file structures falls short.
 element:: synonym of XML element
 information unit, IU:: elementary piece of information attached to a level of the metamodel
 mimetype:: type of file as defined by IETF RFC 2045 and IETF RFC 2046 and registered by IANA
 namespace:: synonym of XML namespace
 persistent identifier, PID:: unique Uniform Resource Identifier (URI) that assures permanent access for a digital object by providing access to it independently of its physical location or current ownership
 profile:: component that can be translated into a schema for metadata for a specific type of resource
 proxy:: placeholder for external data; a proxy provides a standard way of addressing otherwise unreachable resources
 registry:: central directory designed for persistent provision of negotiated information that can reliably be accessed
 repository:: computer system for long-term storage of resources
 resource:: entity that can be referenced by a URI
 resource type:: classification of a resource
 UML, Unified Modelling Language:: language for specifying, visualizing, constructing and documenting the artifacts of software systems
 URI, Uniform Resource Identifier:: identifier for locating resources on the internet
 value:: property of an attribute
 virtual collection:: collection in which the individual resources are loosely combined in a registry and do not necessarily exist in one digital object.

=== Normative References ===
 RFC2119[=#REF_RFC_2119]:: Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119, March 1997, \\ [http://www.ietf.org/rfc/rfc2119.txt]
 XML-Namespaces[=#REF_XML_Namespaces]:: Namespaces in XML 1.0 (Third Edition), W3C, 8 December 2009, \\ [http://www.w3.org/TR/2009/REC-xml-names-20091208/]
{{{#!comment
TODO: add references
}}}

=== Non-Normative References ===
 RFC3023[=#REF_RFC_3023]:: XML Media Types, IETF RFC 3023, January 2001, \\ [http://www.ietf.org/rfc/rfc3023.txt]
{{{#!comment
TODO: add references
}}}

=== Typographic and XML Namespace conventions ===
The following typographic conventions for XML fragments will be used throughout this specification:
 * `<prefix:Element>` \\
   An XML element with the Generic Identifier ''Element'' that is bound to an XML namespace denoted by the prefix ''prefix''.
 * `@attr` \\
   An XML attribute with the name ''attr''
{{{#!comment
 * `@prefix:attr` \\
   An XML attribute with the name ''attr'' that is bound to an XML namespace denoted by the prefix ''prefix''.
}}}
 * `string` \\
   The literal ''string'' must be used either as element content or attribute value.

The following XML namespace names and prefixes are used throughout this specification. The column "Recommended Syntax" indicates which syntax variant `SHOULD` be used by the Endpoint to serialize the XML response.
||=Prefix =||=Namespace Name =||=Comment =||=Recommended Syntax =||
|| `cmd` || `http://clarin.eu/cmd` || CMDI instance || prefixed ||
{{{#!comment
TODO: Add payload, envelope namespace and namespace for specification
}}}

== CMDI Component Metadata Model ==
{{{#!comment
TODO: Write section
}}}

== CMDI Component and Profile Specification Level ==
{{{#!comment
TODO: Write section
}}}

= Structure of CMDI-files =
An example of a CMDI file can be found in Annex B, showing the overall structure of a metadata instance serialization. The structure of such an instance is described here.
A metadata instance that is compliant with this standard must follow this structure. Structurally, a CMDI instance consists of three sections: the header, the resources and the components. The header and the resources section are statically defined and remain constant in the generation of an evaluative metadata schema. They are described here as they are required for creating the schema from the component specification.

== The header ==
The header element is a container element intended to provide information on the metadata file as such, not on the resource that is described by the metadata file. To make this more explicit and human-readable, the data categories contained in the header are prefixed by Md for Metadata. The following elements are part of the header; each is marked as optional or mandatory:
 * !MdCreator (optional): Name of the person who created the metadata file. This is defined as a string.
 * !MdCreationDate (optional): Date of the creation of the metadata file. This is defined as the data type date, i.e. the date is specified in the form yyyy-mm-dd (four digits for the year, followed by a dash, followed by two digits for the month, followed by a dash, followed by two digits for the day of the month).
 * !MdSelfLink (optional): Persistent identifier for the metadata file (see PISA, ISO 24619:2011) in the form of a URI.
 * !MdProfile (mandatory): Persistent identifier of the profile used to create this metadata file. This information is partially implied by the value of the schemaLocation attribute of the root element, but the profile identifier may refer to the complete description of the profile such as the CCSL.
 * !MdCollectionDisplayName (mandatory): The name for a collection as it is supposed to be displayed by an application.
   This element is used because metadata is often shared and institutions display the names of the collections in applications.
 * !MdRevisionGrp (optional): The group for storing metadata revisions, if any, with at least one but possibly many child elements !MdRevision, each containing the name of the editor (element `by` with string content), the date of editing (element `date` of type `xs:date`) and a verbose note on the revision (element `note` of type string).

It is recommended to fill in all possible fields here. The idea behind these fields is to structure the data and make information available, providing some background for the users of the metadata.

A potential problem, intentionally left open, is how to deal with changed metadata files: should the !MdCreator and !MdCreationDate be adjusted? If yes, how persistent is the !MdSelfLink? As the metadata is created during the archiving state of a resource, potential updates are currently not dealt with.

== The resources section ==
The resources section in a metadata file lists all information relevant for the individual resource, but does not describe the resource as such. The description is part of the components; the resources section provides the location of the resource or of its parts if it consists of more than one, provenance information on the resource, information on the relation between the parts of the resource, if applicable, and information on a greater body the resource is part of, also if applicable.

=== The Resource proxy list ===
The resource proxy list defines metadata-file-internal placeholders, called proxies, for each part of a resource. For example, if a resource consists of one specific file, this file is referenced in the !ResourceRef element, which holds the PID of this file, in the form of a URI.
As resources can be composed of other resources, which are identified by their metadata, the !ResourceType element specifies whether the PID refers to metadata (another metadata file) or to a resource such as a binary file or data. To further specify the type, !ResourceType takes `mimetype` as an attribute, with the value specifying the mimetype of the referenced resource. Providing the mimetype is optional. Resources can consist of more than one data stream or file, hence the !ResourceProxyList may contain more than one !ResourceProxy. To be able to refer to each of these parts individually, each !ResourceProxy receives an id attribute for internal reference within the metadata file.

=== The Journal File Proxy List ===
For many resources that are developed over a longer period of time, changes and updates are frequent. Provenance data is not part of the CMDI model, but it is possible to store provenance data outside of the metadata file in sensible forms. Provenance metadata is referred to as !JournalFile in CMDI documents. The !JournalFileProxyList contains the list of all !JournalFiles for a resource; the !JournalFileRef holds the URI as a reference to the !JournalFile containing the provenance data.

=== The Resource Relation List ===
Resource files do not exist independently of each other if a resource consists of more than one file. For example, audio files and transcriptions are related to each other. The !ResourceProxyList only lists these files; the !ResourceRelationList makes the relation between pairs of files explicit. For this purpose the !ResourceRelation contains a triple of elements defining a directed relation between a first resource (the source) and a second resource (the target), each of which is referenced by a ref pointer to the id of a !ResourceProxy. The relation between the two is given as a string in the !RelationType element, with relations defined in a data category registry. The identifier of the relation type is given as dcr:datcat.
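
The link between the proxy list and the relation list described above can be illustrated with a small consistency check. The following Python sketch is only an illustration under stated assumptions: the class names, field names and the helper `dangling_refs` are hypothetical and not part of CMDI, and the ids `res1` and `res2` are invented for the example. It models proxies and relations as plain data objects and reports every ref pointer in a relation that does not match the id of a listed proxy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ResourceProxy:
    """One proxy entry (hypothetical model, not a CMDI API)."""
    proxy_id: str                   # id attribute, unique within the metadata file
    resource_type: str              # "Resource" or "Metadata"
    resource_ref: str               # PID of the referenced file, as a URI
    mimetype: Optional[str] = None  # optional mimetype of the referenced file


@dataclass(frozen=True)
class ResourceRelation:
    """A directed relation between two proxies (hypothetical model)."""
    relation_type: str  # e.g. "annotates"
    source_ref: str     # ref pointer to the id of the source proxy
    target_ref: str     # ref pointer to the id of the target proxy


def dangling_refs(proxies: List[ResourceProxy],
                  relations: List[ResourceRelation]) -> List[str]:
    """Return the ref pointers that match no proxy id, in order of first use."""
    known = {p.proxy_id for p in proxies}
    missing: List[str] = []
    for rel in relations:
        for ref in (rel.source_ref, rel.target_ref):
            if ref not in known and ref not in missing:
                missing.append(ref)
    return missing


# Example data; the handles are placeholders, the ids are invented.
proxies = [
    ResourceProxy("res1", "Resource", "http://hdl.handle.net/THERESOURCEPID1"),
    ResourceProxy("res2", "Resource", "http://hdl.handle.net/THERESOURCEPID2"),
]
relations = [ResourceRelation("annotates", "res2", "res1")]
```

With this data, `dangling_refs(proxies, relations)` returns an empty list, because both ref pointers resolve to a proxy. A relation that points to an id that is not in the proxy list would be reported instead of being silently accepted.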

=== The Is-Part-of List ===
Resources that are defined in bundles are listed under !ResourceProxy. The individual parts can be seen as independent resources as well, such as a subcorpus that can also be distributed on its own. To point out that a resource is part of a larger unit or was created as part of a larger unit, the !IsPartOfList is introduced, referring to one or more larger units by giving the PID of each larger unit in an !IsPartOf element.

Potential problem: it is (perhaps intentionally) unclear what the PID points to: the resource (e.g. a landing page) or the MD (e.g. a CMDI in a repo).

== The components ==
The components are the content section of the CMDI-files to be processed by users. The structure of the components varies according to the intended use. In general, the components list the data categories from a data category registry in order, provide the cardinality of these data categories and possibly a controlled vocabulary. Components are very varied, and hence a general mechanism for describing them is more adequate than providing individual examples. The general mechanism for describing the components is the CMDI Component Specification Language (CCSL).

For the component metadata infrastructure the header and the components are described separately. In practice it is possible to keep them separate until the concrete schema is being generated. The instances contain the header section and the component part. For the description of the components a specification language is used, described in the following section.

= The CMDI Component Specification Language =
The CMDI Component Specification Language (CCSL) is designed to describe the variable, component-specific part of the CMDI schema. In a CCSL file the metadata elements are defined and grouped, and other components are referenced. Figure 1 shows the relation of the individual elements of the CCSL.

Figure 1: Schematic architecture of the CMDI Component Specification Language

Instances of the component specification language contain two parts, namely a header section and the component description.

== CCSL header ==
The CCSL header provides simple data warehousing information on the component description, namely an identifier for the component description, which must be unique and should be persistent (see also ISO 24619:2011), a name for the component, and a prose description of the component.

== Component definition ==
Components are defined as a sequence of elements, which can be followed by other components, as components can be embedded in other components. Additionally, components can take any number of attributes. These attributes and their possible values are also specified in the component description.

== Element definition ==
Elements are the part of metadata instances containing the content, i.e., the field descriptors. When introducing elements, the content model is also specified, i.e. a value scheme, which can be either a specific pattern or a closed vocabulary.

== Cardinality of elements and components ==
For practical considerations the cardinality of components and elements is specified according to the needs in the metadata instance. Both elements and components can be specified as occurring a specific number of times. It is possible to provide a lower and an upper bound for each, though the upper bound must be greater than or equal to the lower bound. The cardinality can be 0, any positive integer, or unbounded.

== Describing multilingual content ==
To describe multilingual content, elements are specified with a boolean attribute for multilinguality. For elements that are specified as multilingual, conformant applications must adjust the cardinality so that such an element can be used in many languages (i.e.
upper bound of the cardinality is unlimited) and allow the specification of the language of the element content by an appropriate attribute (i.e. xml:lang).

== Attributes for elements and components ==
Besides the cardinality, the specifications of components and elements share two attributes: the name and the concept link. The name attribute is required to specify the name of the element in the instance, while the concept link should be used to provide an external definition of the concept behind the element or component. For those elements where a concept link cannot be provided, the documentation may be provided in prose as part of another element attribute. It is however preferred to provide a concept link with reference to a data category registry as defined in ISO 12620:2009.

For implementation purposes there is an optional attribute !SupersetLabel that, when set, indicates that the content of this element should be used to identify a superset of elements by an enabled application. The value of this attribute is a numeric value used as a rank. An enabled application uses the rank only when multiple indicators to identify subsets are set, indicating which one takes priority. The highest priority is then given to the element with the rank 1; should the same rank be used multiple times, the first one in document order will receive a higher priority.

For components, the component ID is provided as an attribute. This is required when a component is used that is not specified internally but only referenced by this identifier. In the case where a component specification includes another component specification internally, the component identifier is optional.

= Transformation of CCSL into a schema =
An application conforming to this standard must process the component specification language together with the static portions of the header and resources section and provide an evaluative schema for the assessment of metadata instances.
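
As a rough illustration of how the cardinality and multilinguality rules described above might be carried over into a generated schema, the following Python sketch maps a lower bound, an upper bound and the multilingual flag to the `minOccurs` and `maxOccurs` values of W3C XML Schema. It assumes XML Schema as the target language. The function name, the parameter names and the use of `None` for an unbounded upper limit are assumptions made for this example and are not taken from the CCSL.

```python
from typing import Optional, Tuple


def occurrence_bounds(lower: int,
                      upper: Optional[int],
                      multilingual: bool = False) -> Tuple[str, str]:
    """Map a cardinality to (minOccurs, maxOccurs) strings.

    `upper=None` stands for an unbounded upper limit (an assumption of
    this sketch). The checks mirror the rules in the text: bounds are 0
    or positive, and the upper bound is not smaller than the lower bound.
    """
    if lower < 0:
        raise ValueError("lower bound must be 0 or a positive integer")
    if upper is not None and upper < lower:
        raise ValueError("upper bound must be greater than or equal to lower bound")
    if multilingual:
        # A multilingual element must be repeatable without limit so that
        # it can be given once per language.
        upper = None
    return str(lower), "unbounded" if upper is None else str(upper)
```

For example, `occurrence_bounds(0, 1)` yields `("0", "1")`, while `occurrence_bounds(1, 1, multilingual=True)` yields `("1", "unbounded")`, since the multilingual flag overrides the stated upper bound.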
Various schema languages could be used, including XML Schema and RelaxNG. This standard specifies how the different parts of the component specification are to be interpreted by an application creating a schema. The intended serialization of the metadata instances is valid (and well-formed) XML, which must be provided by an enabled application. Other serializations that are equivalent, for example as JSON objects, may be provided in addition to that.

== Interpretation of hierarchies of the CCSL ==
Components are to be realized as container elements in the XML serialization, containing elements and components as specified. The name of a component or element is the one specified in the CCSL by the respective name attribute. As XML is case sensitive, the case of the name attribute value is to be retained. The content model of an element is provided by the value scheme, i.e. a closed vocabulary, a regular-expression-like pattern or a data type.

== Interpretation of the order of elements ==
The specification of the elements provides the sequence of elements and components. The order of elements is fixed in general to allow for the specification of the cardinality of elements. For components that contain elements and components, the elements have to be specified first, before the (sub-)components.

== Interpretation of attributes ==
The CCSL allows the specification of the attributes of elements and components. The !AttributeList element of the CCSL provides the mechanism to define attributes with appropriate value schemes. An enabled application must interpret the attributes specified in an attribute list so that the parent element or component allows the attribute with exactly that name and the content model as specified by the CCSL. For semantic interoperability the CCSL provides a concept link to the external definition and description of the semantics of the attribute. The content model is provided either by the type or by the value scheme (i.e.
a closed vocabulary or a regular-expression-like pattern).

== CMDI Instance Level ==
{{{#!comment
TODO: Write section
}}}

== Normative Appendix ==
=== XML schema of the CMDI component specification language ===
This Annex comprises the specification of the CCSL format using the XML Schema Part 2: Datatypes syntax. This schema shall be used as a reference to check the conformity of any data represented in CCSL, so long as it does not contain any additional markup module. In any other case, the schema shall be modified to incorporate the definition of the namespaces to be associated with the external markup to be used. The schema was developed within CLARIN-NL and is included for reference.
{{{#!xml
At the root level there should always be a Component. The AttributeList child of an element contains a set of XML attributes for that element. When an element is linked to a regular expression or a controlled vocabulary, the ValueScheme sub-element contains more information about this. Specification of a regular expression the element should comply with. A list of the allowed values of a controlled vocabulary. The name of the attribute. A link to the ISOcat data category registry (or any other concept registry). For the use of simple XML types as the type of the attribute. For the use of a regular expression or a controlled vocabulary as the type of the attribute. The name of the element. A link to the ISOcat data category registry (or any other concept registry). Used to specify that an element has a simple XML type (string, integer, etc) Minimal number of occurrences. Maximal number of occurrences. Some information an application (eg Arbil) can display to give guidance to the user when entering metadata. The element with the highest priority will be displayed as the label for a metadata file (eg in Arbil) Indicates that this element can have values in multiple languages (and thus is repeatable).
This will result in the possibility of using the xml:lang attribute in the metadata instances that are created. Indicates that a component (using its unique ComponentId issued by the ComponentRegistry) should be included. A link to the ISOcat data category registry (or any other concept registry). Currently not used. Outdated way of including an external component. Here for backward compatibility with the XML-cmdi-toolkit. cardinality for elements and components Subset of XSD types that are allowed as CMD type controlled vocabularies An item from a controlled vocabulary. End-user guidance about the value of the controlled vocabulary as a whole. Currently not used. A link to the ISOcat data category registry (or any other concept registry) related to this controlled vocabulary item. End-user guidance about the value of this controlled vocabulary item.
}}}

== Non-normative Appendix ==
=== Example CMDI instance ===
{{{
#!xml
The following example shows an example CMDI-instance without the components.
Reinhild Barkey 2011-03-31 http://hdl.handle.net/XXXX/XXXXXXXXXXXX clarin.eu:cr1:p_1290431694580 Tübingen Language Resource Repository Thorsten Trippel 2012-01-24 Fixed encoding, added Thorsten Trippel 2012-01-24 Urgently needed example for second revision inserted
Resource http://hdl.handle.net/THERESOURCEPID1 Resource http://hdl.handle.net/THERESOURCEPID2 http://hdl.handle.net/ThePIDtoPROVENANCEfile annotates http://hdl.handle.net/SomeOtherBiggerResourceThisIsPartOf http://hdl.handle.net/SomeOtherEvenBiggerResourceThisIsPartOf ...
}}}

=== Example instance of a component specification ===
==== General Information component specification ====
This section provides an example description of a component using the CCSL.
{{{#!xml
clarin.eu:cr1:c_1290431694495 GeneralInfo Component contains general information about the resource, e.g. its name, title, the time coverage of the data, etc.
Lexicon Corpus Tool Grammar Fieldwork Material Experimental Data Survey Data Test Data Toolchain ResourceBundle planned development released production withdrawn retired superseded unknown archived published AD AE AF AG AI AL AM AN type short long
}}}

= Bibliography =
 * IETF RFC 2045, Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies
 * IETF RFC 2046, Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types
 * IETF RFC 5646, Tags for Identifying Languages
 * ISO 639-1, Codes for the representation of names of languages -- Part 1: Alpha-2 code
 * ISO 639-3, Codes for the representation of names of languages -- Part 3: Alpha-3 code for comprehensive coverage of languages
 * ISO 3166-1, Codes for the representation of names of countries and their subdivisions -- Part 1: Country codes
 * ISO 8601, Data elements and interchange formats -- Information interchange -- Representation of dates and times
 * ISO/IEC 10646-1, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane
 * XML Schema Part 2: Datatypes, Biron, P.V. and Malhotra, A. (eds.), W3C Recommendation 02 May 2001, available at