Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#283 closed defect (fixed)

The id generator of olac2cmdi.xsl is too weak

Reported by: knappen Owned by: dietuyt
Priority: major Milestone:
Component: MetadataCuration Version:
Keywords: olac2cmdi, birthday paradox, oai-pmh Cc:

Description

olac2cmdi.xsl uses the XSLT function generate-id(), which guarantees that ids are unique only within the generated CMDI file. The OAI-PMH ListRecords verb requires stronger uniqueness: the ids must differ across all listed records.

Since the transformation cannot know the ids of all other CMDI datastreams in a repository, the id needs to be a rather large random number because of the birthday paradox (see: http://en.wikipedia.org/wiki/Birthday_paradox).
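To make the birthday-paradox argument concrete, here is a minimal sketch that estimates the chance of at least one collision when ids are drawn uniformly at random. It uses the standard approximation 1 - exp(-n(n-1)/(2N)). The function name and the sample sizes are illustrative assumptions and are not taken from olac2cmdi.xsl.

```python
import math


def collision_probability(n_ids: int, bits: int) -> float:
    """Approximate probability that at least two of n_ids random ids collide.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2N)),
    where N = 2**bits is the size of the id space.
    """
    space = 2 ** bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


if __name__ == "__main__":
    for bits in (32, 64, 128):
        print(bits, collision_probability(100_000, bits))
```

Under these assumptions, 100,000 ids drawn from a 32-bit space collide with a probability well above one half, while a 64-bit space brings the estimate below one in a billion.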

Please re-categorise if I picked the wrong category.

Change History (7)

comment:1 Changed 11 years ago by larlam

In a sense it is the OAI provider's fault for taking separately created XML trees and bundling them together in a single document, and an argument could be made that nothing is wrong with the xsl transformation and that it should be the provider's responsibility to ensure no id collisions (i.e. by modifying the tree). However changing the xslt is the easier approach, so I think that's what should be done.

How about generating the id by concatenating generate-id() with a hash of the handle (or URL) of the record it appears in? This would be unique. There is an XSLT implementation of Fletcher's checksum here, which could be adapted for this.

Note I haven't tried the code I linked to. If the computation is slow it may not be practical.
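As a rough illustration of the suggestion (in Python rather than XSLT, and not the linked implementation), the sketch below computes a Fletcher-16 checksum of a record handle and prepends it to the local id. The 16-bit width, the function names, the `h` prefix, and the `_` separator are all assumptions made for this example. A checksum this short can itself collide between two handles, so this only reduces the risk rather than removing it.

```python
def fletcher16(data: bytes) -> int:
    """Fletcher-16 checksum: two running sums, each taken modulo 255."""
    sum1 = 0
    sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


def record_scoped_id(handle: str, local_id: str) -> str:
    """Prefix a per-file id (e.g. the output of generate-id()) with a
    checksum of the record's handle or URL.

    The leading 'h' keeps the result a valid XML name even when the
    hex digits start with a number.
    """
    checksum = fletcher16(handle.encode("utf-8"))
    return f"h{checksum:04x}_{local_id}"
```

Two records with different handles would then usually get different prefixes even when generate-id() returns the same value in both files.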

comment:2 Changed 11 years ago by oschonef

Actually, the underlying problem is wrong assumptions about XML and how things interact within OAI-PMH:
In the XSD for CMDI, the "id" attribute on e.g. ResourceProxy is defined as xs:ID, and by definition this value must be unique within an XML instance.

The OAI providers at most centers concatenate a bunch of CMDI files, each of which has unique IDs of its own, into a larger XML instance to build a response to a ListRecords operation. When this is done, the IDs may no longer be unique, so schema validation will fail due to duplicate IDs. From the XML point of view, an OAI response is always a single XML instance.

Several solutions are possible:

  1. force centers to use center-wide unique IDs for the ResourceProxy elements in their CMDI files
  2. redefine the type of attribute "id" on ResourceProxy and friends as xs:string (of course, then you'll lose the ID/IDREF stuff that is built into XML)
  3. OAI providers at centers must ensure that IDs within a single OAI response are unique, e.g. with on-the-fly rewriting

Solution 1 is no real solution, because it is IMHO a rather strange requirement that only masks the problem. So this leaves only solutions 2 and 3.
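A minimal sketch of what the on-the-fly rewriting in solution 3 could look like, assuming the provider has the assembled ListRecords response available as a string. It prefixes every "id" attribute inside each OAI record with a per-record prefix and rewrites the matching references. The assumption that references are carried in a whitespace-separated "ref" attribute, the function name, and the prefix scheme are illustrative choices, not a description of any existing provider.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def make_ids_unique(response_xml: str) -> str:
    """Prefix id/ref attributes so they are unique across all records."""
    root = ET.fromstring(response_xml)
    for index, record in enumerate(root.iter(f"{{{OAI_NS}}}record")):
        prefix = f"r{index}_"
        for element in record.iter():
            if "id" in element.attrib:
                element.set("id", prefix + element.get("id"))
            if "ref" in element.attrib:
                # Assumption: "ref" may hold several ids separated by spaces.
                refs = element.get("ref").split()
                element.set("ref", " ".join(prefix + r for r in refs))
    return ET.tostring(root, encoding="unicode")
```

Because ids and references are rewritten together within a record, the ID/IDREF links inside each CMDI file stay intact while the combined response no longer contains duplicates.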

P.S. Btw, I raised that issue already a few months ago in bi-lateral talks; first during the FCS workshop in Mannheim, last in the CLARIN-D dev meeting in Hamburg ;)


comment:3 Changed 11 years ago by larlam

My earlier concatenated id suggestion falls under Oliver's option 1, but I don't believe that option will actually be feasible in the long term.

Consider the case of an aggregator that harvests metadata from various centres and makes all of it available over OAI-PMH (something that has in fact been planned, or at least suggested, for CLARIN too). To ensure no id conflicts in its ListRecords response, all ResourceProxy ids would have to be unique over all participating repositories, or the aggregator would have to implement option 3 anyway.

So really there are two options.

comment:4 Changed 11 years ago by knappen

What is that id used for? Maybe we can drop it completely?

comment:5 Changed 11 years ago by oschonef

They are used to link Components to ResourceProxies, e.g. see the CMDI toy example at http://www.clarin.eu/cmd/example/example-md-instance.cmdi

comment:6 Changed 11 years ago by dietuyt

Resolution: fixed
Status: new → closed

@knappen: they are used in Arbil, for example, and to enrich machine-readable information about ResourceProxies

@larlam: if we combine the MdSelfLink with an ID for the ResourceProxy that is unique within the file, we have a guaranteed unique ID

@all: my response by mail is below. The only option is (1), but we can do no more than ask the centers; forcing them is too much

Given the OAI provider has a complete implementation for
resumptionTokens, a work-around (or better a "hack") could be, to only
disseminate *one* CMDI record per ListRecords request. A conforming
harvester would then keep on going until the provider indicated that no
more records are left. The downside is more round-trips between
harvester and provider (= harvesting takes longer) and increased network
traffic.
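The harvester side of this work-around can be sketched as a plain resumptionToken loop. This is only an illustration: `fetch_page` is a hypothetical callable standing in for the HTTP request and response parsing, and it is assumed to return the records of one response together with the next token (or None when the provider signals that the list is complete).

```python
from typing import Callable, Iterator, List, Optional, Tuple

# One response: the records it carried and the token for the next request.
Page = Tuple[List[str], Optional[str]]


def harvest(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[str]:
    """Yield records until the provider stops handing out resumption tokens."""
    token: Optional[str] = None
    while True:
        records, token = fetch_page(token)
        yield from records
        if not token:
            return
```

With one record per response, a list of n records costs n round-trips, which is the downside described above.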

Hi all,

thanks for bringing this up. In fact, the behaviour mentioned above (1 record being asked per request) is the way our OAI harvester works. So for that purpose there is no problem.

Other than that I think that:

option (2 - making the ID a string) will break existing software and takes away an important way of referencing ResourceProxies

option (3 - rewriting OAI responses) will be unmaintainable as it requires changing some standard software components

So if one wants to provide valid ListRecords responses containing more
than one record (n>1), I would recommend generating ids as follows:

new_id = xml_id_compatible_version_of(MdSelfLink) + existing_id
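One possible reading of the formula above, as a hedged sketch: `xml_id_compatible` replaces every character outside a conservative ASCII subset of the XML name characters with an underscore and makes sure the result does not start with a digit, dot, or hyphen. The function names, the allowed character set, and the `_` separator between the two parts are assumptions for illustration. The replacement is lossy, so two different self links could map to the same prefix.

```python
import re


def xml_id_compatible(text: str) -> str:
    """Map an arbitrary string onto an ASCII subset of valid XML ids."""
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", text)
    # An XML id must start with a letter or underscore.
    if not re.match(r"[A-Za-z_]", safe):
        safe = "_" + safe
    return safe


def new_id(md_self_link: str, existing_id: str) -> str:
    """Combine the sanitised MdSelfLink with the per-file id."""
    return xml_id_compatible(md_self_link) + "_" + existing_id
```

For example, a self link such as `hdl:1839/00-0000` (a made-up value) combined with the local id `d1e42` yields `hdl_1839_00-0000_d1e42`.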

comment:7 Changed 11 years ago by knappen

The idea of using the MdSelfLink to generate unique identifiers does not work, because the self link is not present in the DC or OLAC files; it is specific to the CMDI file and can only be added later!

There may be [everything in Dublin Core is optional] some presumably unique identifiers in <dc:identifier> that can be used.
