Responsible for this page: Menzo Windhouwer.
Last content check: 28-10-2015
Purpose
The purpose of this page is to collect relevant information about the OAI Harvester.
Project: OAI Harvester
The OAI Harvester manages the regular harvests of CMD records from endpoints provided by the CLARIN centers and additionally harvests of OLAC and DC records from various other endpoints.
Contents
People
- Twan Goosen
- Menzo Windhouwer
- Tomasz Naskret(?)
Getting code
- Browse the code at https://github.com/clarin-eric/oai-harvest-manager
- Git clone from:
git@github.com:clarin-eric/oai-harvest-manager.git
Usage
The deployment package contains a script to start the app, run-harvester.sh (for Unix systems including Mac OS X; we can add a Windows batch file if anyone wants it). The simplest usage is:
run-harvester.sh config.xml
where config.xml is the configuration file you wish to use. Additionally, parameters can be defined on the command line. For example:
run-harvester.sh timeout=30 config.xml
will set the connection timeout to 30 seconds. This value overrides the timeout value defined in config.xml, if any. The first parameter that does not contain = is taken as the configuration file name.
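The argument rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the harvester's actual (Java) implementation; the function name parse_args is hypothetical, and ignoring any further arguments without = after the first one is an assumption.

```python
def parse_args(argv):
    """Split harvester-style arguments into overrides and a config file name.

    Illustrative sketch only: arguments containing '=' are treated as
    key=value overrides; the first argument without '=' is the config file.
    Assumption: any later argument without '=' is ignored.
    """
    overrides = {}
    config_file = None
    for arg in argv:
        if "=" in arg:
            # Split on the first '=' only, so values may themselves contain '='.
            key, value = arg.split("=", 1)
            overrides[key] = value
        elif config_file is None:
            config_file = arg
    return overrides, config_file


if __name__ == "__main__":
    print(parse_args(["timeout=30", "config.xml"]))
```

With the arguments from the example above, the sketch yields the override timeout=30 and the configuration file name config.xml.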
Configuration
The behaviour of the app is determined by a single configuration file. The configuration file is composed of four sections:
- settings, where options such as directory paths and timeouts are set;
- directories, where output paths are defined;
- actions, the most complex section, where actionSequences of actions can be defined for different metadata formats (actions include semantic transformations and saving intermediary or final results into a file); and
- providers, where endpoints for the providers to be harvested are listed.
To get a clear idea of the structure of the configuration file, see the sample configuration files.
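As a rough aid when writing a configuration file, the following Python sketch reports which of the four sections are missing. It assumes, without confirmation from the sample files, that each section is a direct child element of the root, that the element names equal the section names listed above, and that no XML namespace is used. The root element name config in the usage example is hypothetical.

```python
import xml.etree.ElementTree as ET

# Section names as listed on this page; treating them as element names
# is an assumption, so check the sample configuration files first.
REQUIRED_SECTIONS = ("settings", "directories", "actions", "providers")


def missing_sections(config_xml):
    """Return the required section names that are not direct children of the root."""
    root = ET.fromstring(config_xml)
    present = {child.tag for child in root}
    return [name for name in REQUIRED_SECTIONS if name not in present]


if __name__ == "__main__":
    # Hypothetical, minimal document lacking the providers section.
    sample = "<config><settings/><directories/><actions/></config>"
    print(missing_sections(sample))
```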
System Requirements
Dependencies
This application does not itself contain an implementation of the OAI-PMH protocol; it uses the OCLC harvester2 library for OAI-PMH requests.
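For orientation, the kind of request that is delegated to the library looks roughly like the following. The sketch builds a ListRecords URL as defined by the OAI-PMH 2.0 specification; it does not show how the harvester2 library is called. The function name list_records_url and the endpoint https://example.org/oai are hypothetical, and oai_dc is used only because every OAI-PMH repository must support it.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (illustrative sketch)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it is sent
        # with the verb only, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical endpoint, used for illustration only.
    print(list_records_url("https://example.org/oai"))
```

A response to such a request can carry several records plus a resumption token for the next batch, which is why one request may map to many records.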
Building and Deploying
mvn clean package assembly:assembly
The above build process creates a package named oai-harvest-manager-x.y.z.tar.gz (where x.y.z is a version number). This package can be deployed where needed.
Interfaces
Design
Tickets
https://github.com/clarin-eric/oai-harvest-manager/issues
Status, Planning and Roadmap
Status: active
Planning and roadmap:
- create a new OAI harvester viewer
The new OAI harvester viewer
This new viewer should provide some advantages over the current viewer:
- paged listing of records harvested
- jump from a record to the OAI request it originates from (in the ListRecords scenario a single request can contain multiple records, which the current viewer cannot handle)
- keep some statistics on past runs, so a warning can be sent when a 'sudden' drop in the number of records is experienced
- can also provide access to the archived harvests
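The statistics item in the list above could be approached along the following lines. This is a minimal sketch under stated assumptions, not a design decision: the function name sudden_drop, the use of the median of past runs as a baseline, and the 50% threshold are all hypothetical choices.

```python
from statistics import median


def sudden_drop(past_counts, current_count, max_drop=0.5):
    """Return True if current_count fell more than max_drop below the baseline.

    Assumptions (not from the roadmap): the baseline is the median record
    count of past runs, and a drop of more than 50% counts as 'sudden'.
    """
    if not past_counts:
        return False  # no history yet, nothing to compare against
    baseline = median(past_counts)
    if baseline == 0:
        return False  # an endpoint that was always empty cannot drop
    return current_count < baseline * (1 - max_drop)


if __name__ == "__main__":
    history = [1000, 1020, 990]
    print(sudden_drop(history, 400))
    print(sudden_drop(history, 950))
```

The median is used rather than the mean so that a single earlier failed run does not lower the baseline much; a real viewer would probably want the threshold to be configurable per endpoint.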
Additionally, the viewer can be an access point for tools to assess the quality of the CMD records, i.e., tools from the Metadata Curation TF.
More in the OAI domain, we could also trigger a run against an OAI validator, e.g., http://validator.oaipmh.com/, and/or allow triggering a harvest for a specific endpoint. The latter might need a specific setup/installation so that it does not interfere with the periodic CLARIN harvest.
Resources
History
- Kees Jan van de Looij
- Lari Lampen