[=#topofpage]



''Responsible for this page: [[mwindhouwer|Menzo Windhouwer]].''\\
''Last content check: 28-10-2015''

{{{
#!html
<h3>Purpose</h3>
}}}

The purpose of this page is to collect relevant information about the OAI Harvester.


= Project: OAI Harvester =

The OAI Harvester manages the regular harvests of CMD records from endpoints provided by the CLARIN centers. It additionally harvests OLAC and DC records from various other endpoints.

{{{
#!comment
----
== Subpages ==

If there are subpages to this page, uncomment this section and add links to these pages.
}}}

----
{{{
#!comment
This section can be skipped for short pages.
}}}

{{{
#!html
<h3>Contents</h3>
}}}

[[PageOutline(1-2, , inline)]]
----
== People ==

* [[mwindhouwer|Menzo Windhouwer]]

----
== Getting code ==

 * Browse the code on [https://github.com/TheLanguageArchive/oai-harvest-manager GitHub].
 * Clone the repository with Git: {{{git clone https://github.com/TheLanguageArchive/oai-harvest-manager.git}}}

----
== Usage ==

The deployment package contains a script to start the app, `run-harvester.sh` (for Unix systems including Mac OS X; we can add a Windows batch file if anyone wants it). The simplest usage is:

{{{
run-harvester.sh config.xml
}}}

where `config.xml` is the configuration file you wish to use. Additionally, parameters can be defined on the command line. For example:

{{{
run-harvester.sh timeout=30 config.xml
}}}

will set the connection timeout to 30 seconds. This value will override the timeout value defined in `config.xml`, if any. The first parameter that does not contain `=` is taken as the configuration file name.
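
The parameter handling described above can be sketched in shell as follows. This only illustrates the rule and is not the harvester's actual implementation; the function name `split_args` and its output format are hypothetical, and ignoring any further parameters without `=` after the first one is an assumption.

{{{
#!sh
# Hypothetical helper illustrating the rule: parameters containing "=" are
# treated as overrides; the first parameter without "=" is the configuration
# file. Later parameters without "=" are ignored (an assumption).
split_args() {
    config=""
    overrides=""
    for arg in "$@"; do
        case "$arg" in
            *=*) overrides="${overrides:+$overrides }$arg" ;;
            *)   if [ -z "$config" ]; then config="$arg"; fi ;;
        esac
    done
    echo "config=$config overrides=$overrides"
}
}}}

For example, `split_args timeout=30 config.xml` prints `config=config.xml overrides=timeout=30`.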

=== Configuration ===

The behaviour of the app is determined by a single configuration file. The configuration file is composed of four sections:

 * ''settings'', where options such as directory paths and timeouts are set;
 * ''directories'', where output paths are defined;
 * ''actions'', the most complex section, where sequences of actions can be defined for different metadata formats (actions include semantic transformations and saving intermediary or final results into a file); and
 * ''providers'', where endpoints for the providers to be harvested are listed.

To get a clear idea of the structure of the configuration file, see the sample configuration files.

----
== System Requirements ==


----
== Dependencies ==

This application does not itself contain an implementation of the OAI-PMH protocol; it uses the [https://code.google.com/p/oaiharvester2/ OCLC harvester2 library] for OAI-PMH requests.

----
== Building and Deploying ==

{{{
mvn clean package assembly:assembly
}}}

The above build process creates a package named `oai-harvest-manager-x.y.z.tar.gz` (where `x.y.z` is the version number). This package can be unpacked wherever the harvester needs to be deployed.
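
As a rough sketch of a deployment, the package can be unpacked and the start script located inside it. The following assumes that the archive contains `run-harvester.sh` somewhere in its tree; its exact location inside the package is not stated here, so the sketch searches for it. The function name `unpack_harvester` and the directory names are hypothetical.

{{{
#!sh
# Hypothetical helper: unpack the package into a target directory and print
# the path of the start script found inside it.
unpack_harvester() {
    package="$1"   # e.g. oai-harvest-manager-x.y.z.tar.gz
    dest="$2"      # deployment directory (created if missing)
    mkdir -p "$dest" || return 1
    tar -xzf "$package" -C "$dest" || return 1
    # The location of the script inside the archive is an assumption,
    # so search for it rather than hard-coding a path.
    find "$dest" -type f -name run-harvester.sh | head -n 1
}
}}}

For example, `unpack_harvester oai-harvest-manager-x.y.z.tar.gz deploy` unpacks the package into `deploy` and prints the path of the start script, which can then be run with a configuration file as described under Usage.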

----
== Interfaces ==

----
== Design ==

{{{
#!comment
Internal design of the project; class diagrams etc.
}}}

----
== Tickets ==

Tickets are tracked in the [https://github.com/TheLanguageArchive/oai-harvest-manager/issues GitHub issue tracker].

----
== Status, Planning and Roadmap ==

Status: active

Planning and roadmap:
* switch to the !ListRecords scenario, where batches of records are requested from the providers
* replace the OCLC harvester2 library, which prevents per-endpoint settings such as specific timeouts
* stop building a DOM for every record, which inflates memory consumption
* create a new OAI harvester viewer

=== The new OAI harvester viewer ===

This new viewer should provide some advantages over the current viewer:

* paged listing of the harvested records
* jump from a record to the OAI request it originates from (in the !ListRecords scenario a request can contain multiple records, which the current viewer cannot handle)
* keep some statistics over past runs, so that a warning can be sent when a 'sudden' drop in the number of records occurs
* provide access to the archived harvests
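
To illustrate the kind of check meant by a 'sudden' drop, a minimal sketch could compare the record count of the current run with that of the previous run. The function name `sudden_drop` and the percentage threshold are hypothetical; what counts as a sudden drop has not been decided here.

{{{
#!sh
# Hypothetical check: succeed (exit status 0) when the record count fell by
# more than the given percentage compared to the previous run.
sudden_drop() {
    previous="$1"    # number of records in the previous run
    current="$2"     # number of records in the current run
    threshold="$3"   # allowed drop in percent (an assumed parameter)
    # No baseline, or no decrease at all: nothing to warn about.
    if [ "$previous" -le 0 ] || [ "$current" -ge "$previous" ]; then
        return 1
    fi
    drop=$(( (previous - current) * 100 / previous ))
    [ "$drop" -gt "$threshold" ]
}
}}}

For example, `sudden_drop 1000 600 25 && echo "warning: record count dropped"` prints the warning, because the count fell by 40 percent.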

Additionally, the viewer can be an access point for tools that assess the quality of the CMD records:
* run XSD validation on the records/a record
* run Schematron rules against the records/a record, e.g., to check against best practices
* run the VLO importer to see if the records/a record would be included in the VLO and which facet values it would deliver
* check the profiles used, e.g., in CMDI 1.2 one could check whether deprecated profiles are used, or, already now, how well the profiles cover the VLO facets
* calculate a quality score (see [http://www.lrec-conf.org/proceedings/lrec2014/pdf/1011_Paper.pdf LREC 2014 paper])

These tools could be run by default on all records, on a specific selected record, or on an uploaded record.

More in the OAI domain, we could also trigger a run against an OAI validator, e.g., [http://validator.oaipmh.com/], and/or allow triggering a harvest for a specific endpoint. The latter might need a specific setup/installation so that it does not interfere with the periodic CLARIN harvest.

----
== Resources ==

{{{
#!comment
Link to (external) documents, e.g. documentation, papers, requirement analyses, relevant to this project in this section.
}}}

----
== History ==

* Kees Jan van de Looij
* Lari Lampen