wiki:CMDI 1.2/Schema sanity/Namespaces

This page is a subpage of CMDI 1.2

Namespace per profile

The issue

Section 3.4 of the OAI-PMH specification, "metadataPrefix and Metadata Schema", makes clear that, to be compliant, a metadata record should have a schema location that matches the URL of the schema registered for its metadataPrefix. Due to the flexible nature of CMDI there can currently be many schemata associated with a single metadataPrefix, i.e., one per CMD profile.

The solutions for this issue can either be based on agreements within the CLARIN community on how to use OAI-PMH for CMDI (solutions 1 to 3), or require changes to CMDI with regard to namespaces (the other solutions).

Proposed solutions

First solution: be pragmatic

One can be pragmatic and conclude that we have been using OAI-PMH for harvesting CMDI for several years now, so this non-compliance can be ignored.

This was the preferred option in 2012, see this note.

Pros

Everything stays as it is

Cons

Non-compliance does indicate a problem, and will puzzle implementers

Centre impact

None, as nothing changes

Implementation examples

None, as nothing changes

Discussion

Oliver (IDS): NACK: CLARIN is about standards, interfaces and sustainability; this solution uses OAI-PMH in non-obvious ways and therefore violates CLARIN's principles. We should not do this.

Second solution: profile specific metadataPrefixes

A metadataPrefix per profile, e.g., cmdi0554, cmdi0571, cmdi2312. Each of these metadataPrefixes is linked to a different schema.

A first version of this has been implemented. The harvester can list multiple metadataPrefixes per provider endpoint. When a provider adds a new metadataPrefix, the harvester configuration currently still has to be updated before the records offered for that prefix are actually requested. The CLARIN community could agree on a pattern, e.g., harvest every metadataPrefix starting with 'cmdi'. In that case the harvester needs no additional configuration and can infer by itself which metadataPrefixes to use.

Pros

Compliance, and partially implemented

Cons

Needs additional configuration per provider or CLARIN specific agreements on the use of OAI-PMH

Centre impact

Centers that currently use multiple CMD profiles but only one cmdi metadataPrefix need to implement the metadataPrefix per profile approach

Implementation examples

None

Discussion

Oliver (IDS): NACK: This is a crude hack rather than a solution, because it (again) uses OAI-PMH in non-obvious ways. We should not do this.

Menzo: Update: the harvester will change to selecting the metadata prefixes by matching the metadataNamespace against a regex. This namespace will always be the same for any CMDI-related metadataPrefix, even in CMDI 1.2. To make the current implementation robust it will likely be implemented as a http://www.clarin.eu/cmd/.* regex, so when 1.2 introduces a new namespace it should fit this.
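The namespace-based selection described in this comment can be sketched as follows. This is a minimal illustration, not the harvester's actual code: the function name, the sample response and the prefixes cmdi0554 and cmdi0571 bound to the namespace http://www.clarin.eu/cmd/ are assumptions made for the example, and the dots in the pattern are escaped here although the comment quotes it unescaped.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
# Pattern taken from the comment above; escaping the dots and using fullmatch is our choice.
CMDI_NAMESPACE_PATTERN = re.compile(r"http://www\.clarin\.eu/cmd/.*")


def select_cmdi_prefixes(list_metadata_formats_xml: str) -> List[str]:
    """Return the metadataPrefixes whose metadataNamespace looks like a CMDI namespace."""
    root = ET.fromstring(list_metadata_formats_xml)
    prefixes = []
    for fmt in root.iter(f"{{{OAI_NS}}}metadataFormat"):
        prefix = fmt.findtext(f"{{{OAI_NS}}}metadataPrefix", default="").strip()
        namespace = fmt.findtext(f"{{{OAI_NS}}}metadataNamespace", default="").strip()
        if prefix and CMDI_NAMESPACE_PATTERN.fullmatch(namespace):
            prefixes.append(prefix)
    return prefixes


# Hypothetical ListMetadataFormats response with two profile specific prefixes.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>cmdi0554</metadataPrefix>
      <metadataNamespace>http://www.clarin.eu/cmd/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>cmdi0571</metadataPrefix>
      <metadataNamespace>http://www.clarin.eu/cmd/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
"""

print(select_cmdi_prefixes(SAMPLE_RESPONSE))
```

With this approach a provider can add a new profile specific prefix without any change to the harvester configuration, as long as the namespace it announces matches the pattern.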

Third solution: up to the centers

Leave it up to the centers to choose between the first or second solution.

Pros

If you don't care about compliance you can leave everything as it is

Cons

Mixed compliance within the CLARIN community. Still needs some additional configuration or CLARIN specific agreements on the use of OAI-PMH

Centre impact

Depends on the wanted compliance level

Implementation examples

None

Discussion

Oliver (IDS): NACK: Mixed compliance within CLARIN is a recipe for disaster in the (near|distant) future. We should definitely not do this.

Fourth solution: CMD envelope and payload specific schemas and namespaces

The envelope of a CMD record is fixed and described by the minimal CMD schema (TODO: needs to be synced with the latest version of the envelope generated by the CMDI XSD XSLT). We can bind this schema to the metadataPrefix and also use it in the instance. The profile specific schema would then only describe the profile specific part of the CMD record. However, the namespace to schema binding in xsi:schemaLocation only allows us to use a namespace once, which means we need two namespaces: one for the envelope and one for the payload.

Pros

Compliance with OAI-PMH

Cons

Namespace changes for all CMD records

Centre impact

  • All tools that work with CMD records need to be changed
  • All CMD records need to be changed

Implementation examples

OAIHandler?verb=ListMetadataFormats

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
    <responseDate>2013-12-02T17:28:30Z</responseDate>
    <request verb="ListMetadataFormats"
        >http://oai.clarin-beta.dans.knaw.nl/oaicat/OAIHandler</request>
    <ListMetadataFormats>
        <metadataFormat>
            <metadataPrefix>oai_dc</metadataPrefix>
            <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
            <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
        </metadataFormat>
        <metadataFormat>
            <metadataPrefix>cmdi</metadataPrefix>
            <schema>http://infra.clarin.eu/cmd/xsd/minimal-cmdi.xsd</schema>
            <metadataNamespace>http://www.clarin.eu/cmd/envelope</metadataNamespace>
        </metadataFormat>
    </ListMetadataFormats>
</OAI-PMH>

Minimal CMDI XSD

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:cmd="http://www.clarin.eu/cmd/envelope" xmlns:dcr="http://www.isocat.org"
  targetNamespace="http://www.clarin.eu/cmd/envelope" attributeFormDefault="unqualified" elementFormDefault="qualified">
    ...
</xs:schema>

Profile schema

<xs:schema xmlns:cmd="http://www.clarin.eu/cmd/payload" xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:dcr="http://www.isocat.org/ns/dcr" xmlns:ann="http://www.clarin.eu"
  targetNamespace="http://www.clarin.eu/cmd/payload" elementFormDefault="qualified">
    ...
    <xs:element name="ToolService">
        <xs:complexType>
            <xs:sequence>
                ...
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    ...
</xs:schema>

CMD record

<cmd-e:CMD xmlns:cmd-e="http://www.clarin.eu/cmd/envelope" xmlns:cmd-p="http://www.clarin.eu/cmd/payload"
  xsi:schemaLocation="http://www.clarin.eu/cmd/envelope http://infra.clarin.eu/cmd/xsd/minimal-cmdi.xsd
    http://www.clarin.eu/cmd/payload http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1311927752306/xsd" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" CMDVersion="1.2">
    <cmd-e:Header>
        ...
    </cmd-e:Header>
    ...
    <cmd-e:Components>
        <cmd-p:ToolService>
            ...
        </cmd-p:ToolService>
    </cmd-e:Components>
</cmd-e:CMD>
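To illustrate how such a record relates to the ListMetadataFormats response above, the following sketch reads the namespace/location pairs from xsi:schemaLocation and checks whether the record binds the registered namespace to the registered schema URL. The function names and the trimmed-down sample record are assumptions made for this example; keeping only the first location hint per namespace is also an assumption, modelled on how validators commonly treat repeated hints.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def schema_locations(record_xml):
    """Map each namespace in the root's xsi:schemaLocation to its first location hint."""
    root = ET.fromstring(record_xml)
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    if len(tokens) % 2:
        raise ValueError("xsi:schemaLocation must contain namespace/location pairs")
    bindings = {}
    for namespace, location in zip(tokens[0::2], tokens[1::2]):
        # Assumption: a namespace can effectively be bound only once, so keep the first hint.
        bindings.setdefault(namespace, location)
    return bindings


def matches_registered_format(record_xml, metadata_namespace, schema_url):
    """True if the record binds the registered namespace to the registered schema URL."""
    return schema_locations(record_xml).get(metadata_namespace) == schema_url


# Trimmed-down version of the CMD record above (elided parts left out).
RECORD = """
<cmd-e:CMD xmlns:cmd-e="http://www.clarin.eu/cmd/envelope"
  xmlns:cmd-p="http://www.clarin.eu/cmd/payload"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.clarin.eu/cmd/envelope http://infra.clarin.eu/cmd/xsd/minimal-cmdi.xsd
    http://www.clarin.eu/cmd/payload http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1311927752306/xsd"
  CMDVersion="1.2">
  <cmd-e:Components><cmd-p:ToolService/></cmd-e:Components>
</cmd-e:CMD>
"""

print(matches_registered_format(
    RECORD,
    "http://www.clarin.eu/cmd/envelope",
    "http://infra.clarin.eu/cmd/xsd/minimal-cmdi.xsd",
))
```

The check only looks at the envelope binding, which is the part that is identical for every profile in this solution; the payload binding is free to differ per record.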

Discussion

Oliver (IDS): This is better than solutions 1-3, but still has the issue of using XML namespaces in non-obvious ways. Section 3, "Declaring Namespaces", of the XML namespace specification states the following on uniqueness:

The namespace name, to serve its intended purpose, SHOULD have the characteristics of uniqueness and persistence.

If we use the same namespace name (= URI) for different schemas, we are violating the XML namespace specification. We should not do this.

Fifth solution: profile specific payload namespaces

Same as the fourth solution but instead of a fixed namespace to be used by all profiles each profile payload gets its own namespace.

Pros

  • Compliance with OAI-PMH.
  • Unique namespaces per profile payload, which enables better default XML handling:
    • schema based object mappings are often based on the assumption that a combo of namespace and element name is unique
    • validator may cache schemas based on namespaces, with reuse of a namespace for a different profile the cache might have to be explicitly flushed

Cons

  • Namespace changes for all CMD records
  • Generic tools need to be able to handle the diversity of namespaces, e.g., by ignoring or skipping them:
    • XPath 1.0: *[local-name()='ToolService']
    • XPath 2.0: *:ToolService
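The namespace-agnostic lookups listed above have a counterpart in Python's standard library, sketched below. The function name and the two minimal records are hypothetical; the second payload namespace is made up for the example. The "{*}" wildcard requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def find_by_local_name(record_xml, local_name):
    """Return all descendant elements with the given local name, in any namespace."""
    root = ET.fromstring(record_xml)
    # "{*}" is ElementTree's namespace wildcard, comparable to *:name in XPath 2.0.
    return root.findall(f".//{{*}}{local_name}")


# Hypothetical minimal records: same element name, different payload namespaces.
RECORD_A = (
    '<e:CMD xmlns:e="http://www.clarin.eu/cmd/envelope" '
    'xmlns:p="http://www.clarin.eu/cmd/payload/clarin.eu:cr1:p_1311927752306">'
    '<e:Components><p:ToolService/></e:Components></e:CMD>'
)
RECORD_B = (
    '<e:CMD xmlns:e="http://www.clarin.eu/cmd/envelope" '
    'xmlns:p="http://www.clarin.eu/cmd/payload/example:p_0">'
    '<e:Components><p:ToolService/></e:Components></e:CMD>'
)

for record in (RECORD_A, RECORD_B):
    print([element.tag for element in find_by_local_name(record, "ToolService")])
```

A generic tool written this way finds ToolService in both records, at the price of no longer distinguishing between profiles that happen to use the same element name.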

Centre impact

  • All tools that work with CMD records need to be changed
  • All CMD records need to be changed

Implementation examples

Profile schema

<xs:schema xmlns:cmd="http://www.clarin.eu/cmd/payload/clarin.eu:cr1:p_1311927752306" xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:dcr="http://www.isocat.org/ns/dcr" xmlns:ann="http://www.clarin.eu"
  targetNamespace="http://www.clarin.eu/cmd/payload/clarin.eu:cr1:p_1311927752306" elementFormDefault="qualified">
    ...
    <xs:element name="ToolService">
        <xs:complexType>
            <xs:sequence>
                ...
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    ...
</xs:schema>

CMD record

<cmd-e:CMD xmlns:cmd-e="http://www.clarin.eu/cmd/envelope" xmlns:cmd-p="http://www.clarin.eu/cmd/payload/clarin.eu:cr1:p_1311927752306"
  xsi:schemaLocation="http://www.clarin.eu/cmd/envelope http://infra.clarin.eu/cmd/xsd/minimal-cmdi.xsd
    http://www.clarin.eu/cmd/payload/clarin.eu:cr1:p_1311927752306 http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1311927752306/xsd" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" CMDVersion="1.2">
    <cmd-e:Header>
        ...
    </cmd-e:Header>
    ...
    <cmd-e:Components>
        <cmd-p:ToolService>
            ...
        </cmd-p:ToolService>
    </cmd-e:Components>
</cmd-e:CMD>

Discussion

Oliver (IDS): Even though this solution has the largest impact on centres, it is (IMHO) the best solution, because it is the most standards compliant and (if the references issue is solved properly) allows a very smooth integration with OAI-PMH and off-the-shelf XML tools. The longer we postpone this solution, the larger the pain for the centers will become, so we better make that decision now and be done with it.

Sixth solution: profile and component specific payload namespaces

Same as the fifth solution but now not only the profile gets its own namespace, but also each component.

Pros

Additional to the pros of the fifth solution:

  • reused components can be directly identified by the namespace, i.e., the @ComponentId is incorporated in the namespace

Cons

  • Many namespace changes for all CMD records
  • Generic tools need to be able to handle the diversity of namespaces, which can be cumbersome, especially when CMDI records or conversions, e.g., stylesheets, are created by hand
  • each component needs its own XSD (see below), which upsets the XSD based processing that some tools, e.g., the VLO importer, do, i.e., the XSD doesn't directly reflect the component hierarchy anymore

Centre impact

  • All tools that work with CMD records need to be changed
  • All CMD records need to be changed
  • Some tools will be especially hit hard, e.g., VLO importer, conversion stylesheets

Implementation examples

Profile schema

<xs:schema
    xmlns:cmd="http://www.clarin.eu/cmd/profile/clarin.eu:cr1:p_1311927752306"
    xmlns:cmd-c1="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438123"
    xmlns:cmd-c2="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438134"
    xmlns:cmd-c3="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438125"
    xmlns:cmd-c4="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438104"
    xmlns:cmd-c5="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1280305685290"
    xmlns:dcr="http://www.isocat.org/ns/dcr"
    xmlns:ann="http://www.clarin.eu"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.clarin.eu/cmd/profile/clarin.eu:cr1:p_1311927752306"
    elementFormDefault="qualified">
    <xs:import namespace="http://www.w3.org/XML/1998/namespace"
        schemaLocation="http://www.w3.org/2001/xml.xsd"/>
    <xs:import namespace="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438123" schemaLocation="http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438123/xsd"/>
    <xs:import namespace="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438134" schemaLocation="http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438134/xsd"/>
    <xs:import namespace="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438125" schemaLocation="http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438125/xsd"/>
    <xs:import namespace="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438104" schemaLocation="http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438104/xsd"/>
    <xs:import namespace="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1280305685290" schemaLocation="http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1280305685290/xsd"/>
    <xs:element name="ToolService">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="toolType" dcr:datcat="http://www.isocat.org/datcat/DC-3810"
                    minOccurs="0" maxOccurs="unbounded" ann:displaypriority="1">
                    <xs:complexType>
                        <xs:simpleContent>
                            <xs:extension base="xs:string"/>
                        </xs:simpleContent>
                    </xs:complexType>
                </xs:element>
                <xs:element name="applicationType" dcr:datcat="http://www.isocat.org/datcat/DC-3786"
                    minOccurs="1" maxOccurs="unbounded">
                    <xs:complexType>
                        <xs:simpleContent>
                            <xs:extension base="xs:string"/>
                        </xs:simpleContent>
                    </xs:complexType>
                </xs:element>
                <xs:element name="website" dcr:datcat="http://www.isocat.org/datcat/DC-2546"
                    minOccurs="0" maxOccurs="1">
                    <xs:complexType>
                        <xs:simpleContent>
                            <xs:extension base="xs:string"/>
                        </xs:simpleContent>
                    </xs:complexType>
                </xs:element>
                <xs:element ref="cmd-c1:GeneralInfo" minOccurs="1" maxOccurs="1"/>
                <xs:element ref="cmd-c2:Creators" minOccurs="1" maxOccurs="1"/>
                <xs:element ref="cmd-c3:Project" minOccurs="1" maxOccurs="1"/>
                <xs:element ref="cmd-c4:Country" minOccurs="1" maxOccurs="1"/>
                <xs:element ref="cmd-c5:Documentation" minOccurs="0" maxOccurs="unbounded"/>
                <xs:element name="Tool" minOccurs="0" maxOccurs="unbounded">
                ...
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

One of many component schemas

<xs:schema xmlns:cmd="http://www.clarin.eu/cmd/" xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:dcr="http://www.isocat.org/ns/dcr" xmlns:ann="http://www.clarin.eu"
    xmlns:cmd-c1="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438118"
    targetNamespace="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438123"
    elementFormDefault="qualified">
    <xs:import namespace="http://www.w3.org/XML/1998/namespace"
        schemaLocation="http://www.w3.org/2001/xml.xsd"/>
    <xs:import namespace="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438118" schemaLocation="http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/components/clarin.eu:cr1:c_1271859438118/xsd"/>
    <xs:element name="GeneralInfo">
        <xs:complexType>
            <xs:sequence>
                ...
                <xs:element ref="cmd-c1:Description" minOccurs="0" maxOccurs="1"/>
            </xs:sequence>
            <xs:attribute name="ref"/>
            <xs:attribute name="ComponentId" type="xs:anyURI" fixed="clarin.eu:cr1:c_1271859438123"/>
        </xs:complexType>
    </xs:element>
</xs:schema>

CMD record

<cmd:CMD 
    xmlns:cmd="http://www.clarin.eu/cmd/"
    xmlns:cmd-p="http://www.clarin.eu/cmd/profile/clarin.eu:cr1:p_1311927752306"
    xmlns:cmd-c1="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438123"
    xmlns:cmd-c2="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438118"
    xmlns:cmd-c3="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438134"
    xmlns:cmd-c4="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438125"
    xmlns:cmd-c5="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438113"
    xmlns:cmd-c6="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438116"
    xmlns:cmd-c7="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438104"
    xmlns:cmd-c8="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1280305685290"
    xmlns:cmd-c9="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438130"
    xmlns:cmd-c10="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438111"
    xmlns:cmd-c11="http://www.clarin.eu/cmd/component/clarin.eu:cr1:c_1271859438110"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="
        http://www.clarin.eu/cmd/ CMDI-envelope.xsd
        http://www.clarin.eu/cmd/profile/clarin.eu:cr1:p_1311927752306 ToolService-profile.xsd
    " CMDVersion="1.2">
    <cmd:Header>
        ...
    </cmd:Header>
    <cmd:Resources>
        <cmd:ResourceProxyList>
            ...
        </cmd:ResourceProxyList>
        <cmd:JournalFileProxyList/>
        <cmd:ResourceRelationList/>
    </cmd:Resources>
    <cmd:Components>
        <cmd-p:ToolService>
            <cmd-p:toolType>lemmatizer</cmd-p:toolType>
            <cmd-p:applicationType>web application</cmd-p:applicationType>
            <cmd-p:applicationType>web service</cmd-p:applicationType>
            <cmd-p:website>http://adelheid.ruhosting.nl/</cmd-p:website>
            <cmd-c1:GeneralInfo>
                <cmd-c1:Name>Adelheid</cmd-c1:Name>
                <cmd-c2:Description>
                    <cmd-c2:Description>A web-application with which an end user can have historical
                        Dutch text tokenized, lemmatized and part-of-speech tagged, using the most appropriate resources
                        (such as lexica) for the text in question. The need to consistently use appropriate resources leads to the
                        intuitively obvious strategy of placing this service in the Clarin infrastructure. For each specific text,
                        the user can then select the best resources from those available in Clarin, wherever they might reside,
                        and where necessary supplemented by own lexica.</cmd-c2:Description>
                </cmd-c2:Description>
            </cmd-c1:GeneralInfo>
            <cmd-c3:Creators>
                <cmd-c3:Creator>
                    <cmd-c3:Role>Technology Provider</cmd-c3:Role>
                    <cmd-c3:Role>User</cmd-c3:Role>
                    <cmd-c3:Contact>
                        <cmd-c3:Person>Dr. Hans van Halteren</cmd-c3:Person>
                        <cmd-c3:Address>Centre for Language and Speech Technology, P.O. Box 9103, 6500 HD Nijmegen, Netherlands</cmd-c3:Address>
                        <cmd-c3:Email>hvh@let.ru.nl</cmd-c3:Email>
                        <cmd-c3:Organisation>Radboud University Nijmegen</cmd-c3:Organisation>
                        <cmd-c3:Telephone>+31 - 24 361 2836</cmd-c3:Telephone>
                    </cmd-c3:Contact>
                </cmd-c3:Creator>
                <cmd-c3:Creator>
                    <cmd-c3:Role>Data Provider</cmd-c3:Role>
                    <cmd-c3:Role>User</cmd-c3:Role>
                    <cmd-c3:Contact>
                        <cmd-c3:Person>Dr. Margit Rem</cmd-c3:Person>
                        <cmd-c3:Address>Dept. of Dutch, P.O. Box 9103, 6500 HD Nijmegen, Netherlands</cmd-c3:Address>
                        <cmd-c3:Email>M.Rem@let.ru.nl</cmd-c3:Email>
                        <cmd-c3:Organisation>Radboud University Nijmegen</cmd-c3:Organisation>
                        <cmd-c3:Telephone>+31 - 24 361 2899</cmd-c3:Telephone>
                    </cmd-c3:Contact>
                </cmd-c3:Creator>
                <cmd-c3:Creator>
                    <cmd-c3:Role>Clarin centre representative</cmd-c3:Role>
                    <cmd-c3:Contact>
                        <cmd-c3:Person>Drs. D. Broeder</cmd-c3:Person>
                        <cmd-c3:Address>Wundtlaan 1, 6525 XD  Nijmegen, The Netherlands</cmd-c3:Address>
                        <cmd-c3:Email>Daan.Broeder@mpi.nl</cmd-c3:Email>
                        <cmd-c3:Organisation>Max Planck Institute for Psycholinguistics (MPI)</cmd-c3:Organisation>
                        <cmd-c3:Telephone>+31 - 24 - 3521103</cmd-c3:Telephone>
                    </cmd-c3:Contact>
                </cmd-c3:Creator>
            </cmd-c3:Creators>
            <cmd-c4:Project>
                <cmd-c4:Name>Adelheid</cmd-c4:Name>
                <cmd-c4:Title>A Distributed Lemmatizer for Historical Dutch</cmd-c4:Title>
                <cmd-c4:Funder>CLARIN-NL</cmd-c4:Funder>
                <cmd-c4:Website>http://adelheid.ruhosting.nl/</cmd-c4:Website>
                <cmd-c2:Description>
                    <cmd-c2:Description LanguageID="en">This project aims at providing a web-application with which an end user can have historical
                        Dutch text tokenized, lemmatized and part-of-speech tagged, using the most appropriate resources
                        (such as lexica) for the text in question. The need to consistently use appropriate resources leads to the
                        intuitively obvious strategy of placing this service in the Clarin infrastructure. For each specific text,
                        the user can then select the best resources from those available in Clarin, wherever they might reside,
                        and where necessary supplemented by own lexica. During the project a demonstrator for the
                        distributed automatic lemmatization will be created, with some 14th century charters as test texts as
                        well as corresponding resources.</cmd-c2:Description>
                </cmd-c2:Description>
                <cmd-c5:Contact>
                    <cmd-c5:Person>Dr. Hans van Halteren</cmd-c5:Person>
                    <cmd-c5:Address>Centre for Language and Speech Technology, P.O. Box 9103, 6500 HD Nijmegen, Netherlands</cmd-c5:Address>
                    <cmd-c5:Email>hvh@let.ru.nl</cmd-c5:Email>
                    <cmd-c5:Organisation>Radboud University Nijmegen</cmd-c5:Organisation>
                    <cmd-c5:Telephone>+31 - 24 361 2836</cmd-c5:Telephone>
                </cmd-c5:Contact>
                <cmd-c6:Duration>
                    <cmd-c6:StartYear>2010</cmd-c6:StartYear>
                    <cmd-c6:CompletionYear>2012</cmd-c6:CompletionYear>
                </cmd-c6:Duration>
            </cmd-c4:Project>
            <cmd-c7:Country>
                <cmd-c7:Code>NL</cmd-c7:Code>
            </cmd-c7:Country>
            <cmd-c8:Documentation>
                <cmd-c8:URL>http://adelheid.ruhosting.nl/</cmd-c8:URL>
                <cmd-c9:DocumentationLanguages>
                    <cmd-c10:Language>
                        <cmd-c10:LanguageName>English</cmd-c10:LanguageName>
                        <cmd-c11:ISO639>
                            <cmd-c11:iso-639-3-code>eng</cmd-c11:iso-639-3-code>
                        </cmd-c11:ISO639>
                    </cmd-c10:Language>
                </cmd-c9:DocumentationLanguages>
            </cmd-c8:Documentation>
            <cmd-p:Tool ref="h0">...</cmd-p:Tool>
        </cmd-p:ToolService>
    </cmd:Components>
</cmd:CMD>

Discussion

Menzo Windhouwer (TLA): As the example shows, one has to deal with potentially many namespaces. Tools can manage with some effort, e.g., by analyzing the XSD or the CMD XML profile/component document. Handwritten conversions will already be quite cumbersome and potentially hard to maintain, as the namespace is potentially different for each XML element.

Some tools, e.g., the VLO importer, analyze the nested xs:element statements in the XSD to find paths to certain facets based on the ISOcat data category. This analysis will become much harder due to the imports now needed for the component namespaces. Whether or not it is a good idea to depend on the way the XSD is generated, the impact is too severe for the CMDI 1.2 upgrade.

Discussion

Discuss the topic in general below this point


The goal of XML namespaces is to cleanly partition the "name universe" for XML. The XML Namespace recommendation states:

"We envision applications of Extensible Markup Language (XML) where a single XML document may contain elements and attributes (here referred to as a "markup vocabulary") that are defined for and used by multiple software modules. .. Such documents, containing multiple markup vocabularies, pose problems of recognition and collision. Software modules need to be able to recognize the elements and attributes which they are designed to process, even in the face of "collisions" occurring when markup intended for some other software package uses the same element name or attribute name." Namespaces in XML 1.0, Section 1 "Motivation and Summary"

The overall approach of CMDI is to define a set of (metadata) components that get compiled into profiles. Authors of components are free to choose the (inner) structure and the names of their components. Two independent metadata authors can easily come up with profiles that contain components which use the same name but a different structure, e.g., one Creator component that has no further inner structure and another Creator component that is further structured into Name, Organization and Email.

The CMDI profiles are compiled into XML schema documents and components become XML elements. In the current implementation, CMDI puts all XML elements into one generic CMDI XML namespace, i.e., the two Creator elements are identified by the same QName. However, these two Creators are conceptually different things, because they have a different structure and thus their XML representations have conflicting content models. As long as XML instances created from these XML schemas are not used together in one context, things work. But if they are used together, this is a recipe for problems. For example, consider the following scenarios:

  • XML Parsers: an XML parser that parses and validates a batch of CMDI instances from various profiles can cache an internal representation of a parsed XML schema, keyed on the XML namespace, to speed up processing. However, if it first caches an XML schema that contains no "Creator" elements, it will reject all instances that happen to contain "Creator"s. Or, if it caches the "flat" Creator, it will reject the structured Creator.
    For example, Xerces-J has the following comment in org.apache.xerces.impl.xs.XMLSchemaValidator.java:1564 ff

"store the external schema locations. they are set when reset is called, so any other schemaLocation declaration for the same namespace will be effectively ignored. because we choose to take first location hint available for a particular namespace."

The comment is in the reset() method of the SchemaValidator. This hints that the parser might cache schemas based on XML namespace names if instances of the parser get re-used.

This is confirmed by the Xerces2-J grammar caching FAQ (http://xerces.apache.org/xerces2-j/faq-grammars.html#faq-2), which states:

In XML Schemas, hashing is simply carried out on the target namespace of the schema. Thus, two grammars are considered equal (by our default implementation) if and only if their XMLGrammarDescriptions are instances of org.apache.xerces.impl.xs.XSDDescription (our schema implementation of XMLGrammarDescription) and the targetNamespace fields of those objects are identical.

Oliver created a validation tool, which uses the Xerces validator and caching. Running this on a set of mixed CMDI records, i.e., using different profiles, exposes the problem:

$ mvn exec:java -Dexec.args=/.../cmdi/
[DEBUG] done pre-caching grammar(s)
[DEBUG] processing directory: /.../cmdi
[DEBUG] processing directory: /.../cmdi/A_Digital_Archive_of_Research_Papers_in_Computational_Linguistics
[DEBUG] processing file: /.../cmdi/A_Digital_Archive_of_Research_Papers_in_Computational_Linguistics/oai_acl_sr_language_archives_org_A00_1000.xml
...
[DEBUG] ... stop after parse limit of 10 file(s)
[DEBUG] processing directory: /.../cmdi/Academia_Sinica_Balanced_Corpus_of_Modern_Chinese
[DEBUG] processing file: /.../cmdi/Academia_Sinica_Balanced_Corpus_of_Modern_Chinese/oai_sinica_edu_tw_EarlyMandarin.xml
...
[DEBUG] ... stop after parse limit of 10 file(s)
...
[DEBUG] processing directory: /.../cmdi/Bayerisches_Archiv_f_r_Sprachsignale
[DEBUG] processing file: /.../cmdi/Bayerisches_Archiv_f_r_Sprachsignale/oai_BAS_repo_Center_BAS.xml
[ERROR] cvc-complex-type.2.4.a: Invalid content was found starting with element 'CenterProfile'. One of '{"http://www.clarin.eu/cmd/":OLAC-DcmiTerms}' is expected. [line=105, column=24]

This run shows that validation succeeds as long as all the records use the same CMD profile, i.e., OLAC-DcmiTerms. Then a record using the CenterProfile is hit and the validator complains about the root element, i.e., that it's not OLAC-DcmiTerms. However, oai_BAS_repo_Center_BAS.xml is in fact a valid CMDI file.

Another test carried out with (schema-aware) Saxon-EE yielded similar results, because Saxon likewise caches the XML schema based on the XML namespace. It shows similar issues as Xerces-J and fails to validate CMDI instances from different profiles. (NB: We cannot provide the demo for download, because Saxon-EE is a commercial product and requires a license).

  • XML Databases: native XML databases might also run into the XML schema caching issue if validation is turned on. The popular eXist-db uses Xerces-J under the hood, so the above mentioned issue with Xerces-J may also lead to problems with eXist-db.

This is confirmed by turning on validation in eXist-db and loading two different CMDI records. The first one, which uses the OLAC-DcmiTerms profile, loads fine and leads to caching of its XSD. The second one, which uses the WebLicht profile, can't be imported due to a validation error:

(Screenshot: eXist-db can't import a valid CMD record due to XSD caching)

  • Using CMDI in other contexts, i.e., embedded in other protocols. One example is OAI-PMH. CLARIN requires centers to make their metadata available for harvesting by means of OAI-PMH. OAI-PMH requires each metadataPrefix to be linked to a unique XML schema (Section 3.4, "metadataPrefix and Metadata Schema"): it defines "The XML namespace URI that is a global identifier of the metadata format" and "The metadata schema URL - the URL of an XML schema to test validity of metadata expressed according to the format". As long as centers only use one profile, they are fine. However, if their repository contains CMDI records from various profiles, it's harder to select an appropriate schema to announce via OAI-PMH.

It turns out that caching the XSD schema based on the namespace is correct behavior, in line with the XSD W3C Recommendation:

"Schema Representation Constraint: Schema Document Location Strategy Given a namespace name (or none) and (optionally) a URI reference from xsi:schemaLocation or xsi:noNamespaceSchemaLocation, schema-aware processors may implement any combination of the following strategies, in any order:

  1. Do nothing, for instance because a schema containing components for the given namespace name is already known to be available, or because it is known in advance that no efforts to locate schema documents will be successful (for example in embedded systems);
  2. Based on the location URI, identify an existing schema document, either as a resource which is an XML document or a <schema> element information item, in some local schema repository;
  3. Based on the namespace name, identify an existing schema document, either as a resource which is an XML document or a <schema> element information item, in some local schema repository;
  4. Attempt to resolve the location URI, to locate a resource on the web which is or contains or references a <schema> element;
  5. Attempt to resolve the namespace name to locate such a resource.

Whenever possible configuration and/or invocation options for selecting and/or ordering the implemented strategies should be provided." XML Schema Part 1: Structures Second Edition - Section 4.3.2 How schema definitions are located on the web

Strategy 3 corresponds to the use of a schema cache with the namespace as the key, as provided by Xerces-J and Saxon-EE.
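The effect of such a cache can be made concrete with a toy model. The sketch below is not Xerces-J or Saxon-EE code: the class, the file names olac.xsd and center.xsd, the two profile specific namespaces, and the reduction of a "schema" to the name of its allowed payload root element are all simplifying assumptions for illustration.

```python
class NamespaceKeyedSchemaCache:
    """Toy schema cache that uses the namespace name as its only key."""

    def __init__(self):
        self._schemas = {}

    def schema_for(self, namespace, location, load):
        # The first schema seen for a namespace wins; later location hints are ignored.
        if namespace not in self._schemas:
            self._schemas[namespace] = load(location)
        return self._schemas[namespace]


# Hypothetical schema store: location -> allowed payload root element.
SCHEMAS = {
    "olac.xsd": "OLAC-DcmiTerms",
    "center.xsd": "CenterProfile",
}


def validate(cache, namespace, location, payload_root):
    """Accept the record only if its payload root matches the cached schema."""
    allowed_root = cache.schema_for(namespace, location, SCHEMAS.__getitem__)
    return payload_root == allowed_root


# One generic namespace for all profiles: the second profile is rejected.
shared = NamespaceKeyedSchemaCache()
generic = "http://www.clarin.eu/cmd/"
print(validate(shared, generic, "olac.xsd", "OLAC-DcmiTerms"))
print(validate(shared, generic, "center.xsd", "CenterProfile"))

# Profile specific namespaces (hypothetical names): both profiles are accepted.
per_profile = NamespaceKeyedSchemaCache()
print(validate(per_profile, "http://www.clarin.eu/cmd/payload/olac", "olac.xsd", "OLAC-DcmiTerms"))
print(validate(per_profile, "http://www.clarin.eu/cmd/payload/center", "center.xsd", "CenterProfile"))
```

In the shared-namespace run the second record is rejected although it is valid against its own schema, which mirrors the CenterProfile error in the validation log above.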

To avoid these problems at least the fifth solution, i.e., profile specific namespaces, would be needed. Solution six, profile and component specific namespaces, raises the complexity of namespaces too much. Solution five actually uses components the way chameleon schema modules are used in XSD, i.e., the imported elements are absorbed into the namespace of the importing schema. The drawback is that the fact that the same element is used by different vocabularies is lost. However, the CMD Infrastructure has its own solutions for that:

  1. The root elements of components and profiles have a fixed, but optional, ComponentId attribute, which allows shared components to be identified;
  2. All levels in a CMD record can be associated with a ConceptLink, which can even indicate shared semantics between various components.

The cleanest solution to avoid these problems is thus to have a general CMDI XML namespace for the wrapper elements of a CMDI instance (Header, ...) and a profile specific namespace for the elements below the Components element. The following approaches could be used to lessen the impact of this change for users who don't care about namespaces:

  • ignore Namespaces in XPath:
    • in XPath 1.0: *[local-name()='ToolService']
    • in XPath 2.0: *:ToolService
  • use a "flattener XSLT": this stylesheet puts all elements into the same namespace (deliberately breaking validation). Users can then use their traditional XPath expressions. However, "flattened" CMDI instances SHOULD only be used in transient contexts, i.e., they MUST NOT be used for long term storage or exchange. (Menzo has prepared such a stylesheet. Thanks!)
  • use a SAX filter to change incoming namespaces (here is an example for adding such a filter to JAXB)
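The flattening idea from the list above can also be sketched in Python instead of XSLT. This is not the stylesheet mentioned in the list: the function name, the choice of http://www.clarin.eu/cmd/ as the single target namespace and the minimal sample record are assumptions for illustration. Attributes such as xsi:schemaLocation are left untouched, so the result is deliberately no longer valid and should only be used transiently.

```python
import xml.etree.ElementTree as ET

# Assumed target namespace for the flattened, transient view of a record.
FLAT_NS = "http://www.clarin.eu/cmd/"


def flatten_namespaces(record_xml, target_ns=FLAT_NS):
    """Parse a record and move every element into one namespace (in memory only)."""
    root = ET.fromstring(record_xml)
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments or processing instructions, if the parser kept any
        local_name = element.tag.rsplit("}", 1)[-1]
        element.tag = f"{{{target_ns}}}{local_name}"
    return root


# Hypothetical minimal record with envelope and profile specific payload namespaces.
RECORD = (
    '<e:CMD xmlns:e="http://www.clarin.eu/cmd/envelope" '
    'xmlns:p="http://www.clarin.eu/cmd/payload/clarin.eu:cr1:p_1311927752306">'
    '<e:Components><p:ToolService><p:toolType>lemmatizer</p:toolType>'
    '</p:ToolService></e:Components></e:CMD>'
)

flat = flatten_namespaces(RECORD)
# A "traditional" single-namespace path works again on the flattened tree.
path = f"{{{FLAT_NS}}}Components/{{{FLAT_NS}}}ToolService/{{{FLAT_NS}}}toolType"
print(flat.findtext(path))
```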
Last modified on 04/01/14 12:13:19