Opened 8 years ago

Last modified 8 years ago

#974 assigned defect

Provider - http://weblicht.sfs.uni-tuebingen.de/oaiprovider/

Reported by: tomasz.naskret@pwr.edu.pl Owned by: Marie Hinrichs
Priority: minor Milestone:
Component: Harvesting Version:
Keywords: OAI Cc: Menzo Windhouwer

Description

All data records are generated with the same datestamp.
This behavior prevents incremental harvesting.

Attachments (1)

proai-1.1.3.jar (131.9 KB) - added by Menzo Windhouwer 8 years ago.
patched proai-1.1.3.jar


Change History (7)

comment:1 Changed 8 years ago by Dieter Van Uytvanck

Owner: changed from Menzo.Windhouwer@mpi.nl to Marie Hinrichs
Status: new → assigned

Hi Marie, could you have a look at this - would it be something that can be changed easily?

comment:2 Changed 8 years ago by Marie Hinrichs

I will look into it once people are back from vacation.

comment:3 Changed 8 years ago by Marie Hinrichs

We looked at the date stamps and they are correct, so we are not sure exactly what the problem is. Our proai tables only get regenerated if we delete a record, which is rare - the last time was in January 2016. Otherwise, the dates seem to be updated correctly as far as we can tell.

Can you provide some more information?

Thanks.

comment:4 Changed 8 years ago by Dieter Van Uytvanck

Cc: Menzo Windhouwer added

Menzo is checking the default prOAI behaviour and will report back.

comment:5 Changed 8 years ago by Menzo Windhouwer

My DO in FC has as date:

2016-09-08T11:06:04.427Z

The FC oaiprovider requests the right information from the FC resource index:

  <result>
    <item uri="info:fedora/lat:1839_00_0000_0000_0001_367F_7"/>
    <itemID>oai:flat.example.com:lat:1839_00_0000_0000_0001_367F_7</itemID>
    <date datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-09-08T11:06:04.427Z</date><state uri="info:fedora/fedora-system:def/model#Active"/>
  </result>

This gets properly stored by Proai in its cache:

  <record>
    <header>
      <identifier>oai:flat.example.com:lat:1839_00_0000_0000_0001_367F_7</identifier>
      <datestamp>2016-09-08T11:06:04Z</datestamp>
    </header>
    <metadata>
      <CMD xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">...</CMD>
    </metadata>
  </record>

In its bookkeeping Proai stores a timestamp of the poll in its database:

         1 |       1 |         1 | 1473332872624 | 2016/09/08/11/07/45.729.xml

The epoch 1473332872624 is Thu, 08 Sep 2016 11:07:52.624 GMT, which is what we see if we request the record via OAI:

  <header>
    <identifier>oai:flat.example.com:lat:1839_00_0000_0000_0001_367F_7</identifier>
    <datestamp>2016-09-08T11:07:52Z</datestamp>
  </header>
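
To double-check the conversion, here is a minimal Python sketch that formats the stored epoch (in milliseconds) as an OAI-PMH datestamp with seconds granularity. The function name is hypothetical and this is not Proai code; it only reproduces the arithmetic:

```python
from datetime import datetime, timezone


def epoch_ms_to_oai_datestamp(epoch_ms: int) -> str:
    """Format epoch milliseconds as a UTC datestamp with seconds granularity."""
    # Integer division drops the milliseconds, as in the delivered datestamp.
    seconds = epoch_ms // 1000
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


# Poll timestamp taken from the Proai database row above.
print(epoch_ms_to_oai_datestamp(1473332872624))  # 2016-09-08T11:07:52Z
```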

While delivering the cached record, Proai replaces the right datestamp with the one it stored in its database:

https://github.com/fcrepo3/proai/blob/v1.1.3/src/java/proai/cache/CachedContent.java#L80

The reason why seems to be lost in the mists of time. I can experiment with a Proai fork where we disable this line ...


Changed 8 years ago by Menzo Windhouwer

Attachment: proai-1.1.3.jar added

patched proai-1.1.3.jar

comment:6 Changed 8 years ago by Menzo Windhouwer

I've created a fork of Proai 1.1.3, which doesn't overwrite the cached modification date:

https://github.com/menzowindhouwer/proai/commit/870b2c759afdcca6fa49458e59c5b2e607ed8123

(I'll attach the proai-1.1.3.jar, which can simply replace the JAR in the tomcat/webapps/oaiprovider/WEB-INF/lib/ directory)

However, this might be only a partial solution as the from/until OAI query will still be evaluated against the poll timestamp in the Proai database.

But is the use of the poll timestamp really a problem for incremental harvesting?

  • t1: record 1 is created
  • t2: record 2 is created
  • t3: record 3 is created
  • t4: Proai comes by and sees the created records 1, 2 and 3
  • t5: a full harvest gets records 1, 2 and 3 from Proai with datestamp t4
  • t6: record 3 is updated
  • t7: Proai comes by and sees the updated record 3
  • t8: an incremental harvest requests records since t5, so Proai delivers record 3 with datestamp t7
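
The timeline above can be sketched as a small Python simulation. The PollingCache class and its methods are hypothetical stand-ins, not Proai's actual API, and t1..t8 are modelled as the integers 1..8. The only assumption is that the cache stamps each changed record with the time of the poll that noticed it, and that "from" is an inclusive lower bound:

```python
from typing import Dict, List, Optional, Tuple


class PollingCache:
    """Hypothetical cache that uses its poll time as the record datestamp."""

    def __init__(self) -> None:
        self._datestamps: Dict[str, int] = {}

    def poll(self, now: int, changed: List[str]) -> None:
        # Every record seen as new or updated gets the poll time as datestamp.
        for record_id in changed:
            self._datestamps[record_id] = now

    def list_records(self, from_time: Optional[int] = None) -> List[Tuple[str, int]]:
        # Selective harvesting: keep records whose datestamp is >= from_time.
        return sorted(
            (record_id, stamp)
            for record_id, stamp in self._datestamps.items()
            if from_time is None or stamp >= from_time
        )


cache = PollingCache()
cache.poll(now=4, changed=["record1", "record2", "record3"])  # t4
full = cache.list_records()                                    # t5: full harvest
cache.poll(now=7, changed=["record3"])                         # t7: sees update from t6
incremental = cache.list_records(from_time=5)                  # t8: harvest since t5

print(full)         # all three records, datestamp 4
print(incremental)  # only record3, datestamp 7
```

In this sketch the update at t6 is not lost, because its poll-time datestamp (t7) is later than the previous harvest (t5).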

Using the poll timestamp just as a means to manage the deltas seems fine to me and is allowed by OAI-PMH: https://www.openarchives.org/OAI/openarchivesprotocol.html#SelectiveHarvestingandDatestamps

So, I think the incremental harvest will work fine with Proai as it is. But one can use this patch if the OAI record datestamp should contain the actual modification time of the record instead of the poll timestamp.
