
Version 3 (modified by Menzo Windhouwer, 9 years ago)

Filled in some more sections.

Responsible for this page: Menzo Windhouwer.
Last content check: 28-10-2015

Purpose

The purpose of this page is to collect relevant information about the OAI Harvester.

Project: OAI Harvester

The OAI Harvester manages the regular harvests of CMD records from endpoints provided by the CLARIN centers, as well as harvests of OLAC and DC records from various other endpoints.


Contents

  1. Project: OAI Harvester
    1. People
    2. Getting code
    3. Usage
    4. System Requirements
    5. Dependencies
    6. Building and Deploying
    7. Interfaces
    8. Design
    9. Tickets
    10. Status, Planning and Roadmap
    11. Resources
    12. History


People


Getting code


Usage

The deployment package contains a script to start the app, run-harvester.sh (for Unix systems including Mac OS X; we can add a Windows batch file if anyone wants it). The simplest usage is:

run-harvester.sh config.xml

where config.xml is the configuration file you wish to use. Additionally, parameters can be defined on the command line. For example:

run-harvester.sh timeout=30 config.xml

will set the connection timeout to 30 seconds. This value will override the timeout value defined in config.xml, if any. The first parameter that does not contain = is taken as the configuration file name.
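The rule described above can be sketched in a few lines. This is only an illustration in Python, not the harvester's actual code: the function names split_args and effective_settings are hypothetical, and the handling of extra arguments without = (they are ignored here) is an assumption, since this page does not say what happens to them.

```python
from typing import Dict, List, Optional, Tuple


def split_args(argv: List[str]) -> Tuple[Optional[str], Dict[str, str]]:
    """Split arguments into a configuration file name and name=value overrides."""
    config_file: Optional[str] = None
    overrides: Dict[str, str] = {}
    for arg in argv:
        if "=" in arg:
            # A name=value parameter; split on the first '=' only.
            name, value = arg.split("=", 1)
            overrides[name] = value
        elif config_file is None:
            # The first argument without '=' is the configuration file name.
            config_file = arg
        # Assumption: any further argument without '=' is ignored.
    return config_file, overrides


def effective_settings(file_settings: Dict[str, str],
                       overrides: Dict[str, str]) -> Dict[str, str]:
    """Command-line values take precedence over values from the file."""
    merged = dict(file_settings)
    merged.update(overrides)
    return merged


config_file, overrides = split_args(["timeout=30", "config.xml"])
# A timeout read from the file (here a made-up value of 60) is overridden.
settings = effective_settings({"timeout": "60"}, overrides)
```

With the arguments from the example above, config_file is "config.xml" and the effective timeout is "30", whatever the file itself says.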

Configuration

The behaviour of the app is determined by a single configuration file. The configuration file is composed of four sections:

  • settings, where options such as directory paths and timeouts are set;
  • directories, where output paths are defined;
  • actions, the most complex section, where sequences of actions can be defined for different metadata formats (actions include semantic transformations and saving intermediate or final results to a file); and
  • providers, where endpoints for the providers to be harvested are listed.

To get a clear idea of the structure of the configuration file, see the sample configuration files.
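As a rough illustration of the four-section layout, the following Python sketch checks whether a configuration document contains all four sections. It assumes that each section is a direct child element of the root and is named after the section (settings, directories, actions, providers), and it uses a hypothetical root element called config; the sample configuration files remain the authoritative reference for the real element names and contents.

```python
import xml.etree.ElementTree as ET
from typing import List

# Section names as listed on this page; assumed to be the element names too.
EXPECTED_SECTIONS = ("settings", "directories", "actions", "providers")


def missing_sections(xml_text: str) -> List[str]:
    """Return the expected sections that are not direct children of the root."""
    root = ET.fromstring(xml_text)
    present = {child.tag for child in root}
    return [name for name in EXPECTED_SECTIONS if name not in present]


# Hypothetical, deliberately incomplete configuration skeleton.
SAMPLE = """<config>
  <settings/>
  <directories/>
  <actions/>
</config>"""

print(missing_sections(SAMPLE))  # prints ['providers']
```

A check like this only confirms that the sections exist; it says nothing about whether their contents are valid.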


System Requirements


Dependencies

This application does not itself contain an implementation of the OAI-PMH protocol; it uses the OCLC harvester2 library for OAI-PMH requests.


Building and Deploying

mvn clean package assembly:assembly

The above build process creates a package named oai-harvest-manager-x.y.z.tar.gz (where x.y.z is a version number). This package can be deployed where needed.


Interfaces


Design


Tickets

https://github.com/TheLanguageArchive/oai-harvest-manager/issues


Status, Planning and Roadmap

Status: active

Planning and roadmap:

  • switch to the ListRecords scenario
  • get rid of the OCLC harvester2 library, which prevents setting timeouts etc. for specific endpoints
  • get rid of always building a DOM, which blows up memory consumption
  • create a new OAI harvester viewer

Resources


History

  • Kees Jan van de Looij
  • Lari Lampen