wiki:Collaborations/Perseus

See http://perseus.tufts.edu/

Dieter spoke to Bridget Almas (Tufts University) at the 5th RDA plenary. It was agreed to create this wiki page to gather some concrete ideas of how to connect Perseus data and tools to the CLARIN infrastructure.

Some first ideas (please extend):

Possible RDA Collaboration Project:

Objective:

Leverage the PID Types API and Data Types Registry to define and implement a common, interoperable model for the relationship between images, OCR scans and books.

This solves a data management problem for Perseus/OPP and makes the data and solution available to other CLARIN centers.

Secondary benefits:

  • makes Perseus/OPP Catalog data available through CLARIN Federated Search
  • provides a step towards interoperability between CTS URNs and Handles.

User Stories:

  • I want to be able to search for medieval German texts from the 11th and 12th centuries and retrieve pictures of these books so I can feed them to (1) an OCR pipeline, (2) a transcription/crowdsourcing platform and (3) other machine learning processes.
  • I want to be able to automatically identify the differences between several versions of a picture of a page of a book (e.g. black & white, resolution, filters applied).
  • I want to be able to create a monograph using linked data where I reference a picture of the manuscript of a cited work.

Data Perseus and OPP have:

  • Images of Manuscripts and Books [ OPP ]
  • OCR Scans of Manuscripts and Books [ OPP ]
  • Catalog Metadata (MODS and MADS and CTS) [ Perseus ]

What we want to be able to do:

  • clearly distinguish between these data types
  • assign a persistent identifier to each image
  • assign a persistent identifier to each OCR scan
  • be able to differentiate image types
  • be able to register new CTS URN identifiers for eventual TEI transcriptions resulting from scans
  • link Catalog metadata to images and OCR scans
  • make all of this data searchable and retrievable through a CLARIN Federated Search endpoint
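
As a starting point for discussion, the goals above could be modelled roughly as follows. This is only a minimal sketch: the class names, field names and all identifier values are assumptions made for illustration and do not reflect an existing Perseus, OPP or CLARIN schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Image:
    pid: str         # persistent identifier (placeholder value below)
    image_type: str  # e.g. "black-and-white"; the vocabulary is an assumption


@dataclass
class OcrScan:
    pid: str               # persistent identifier of the OCR output
    source_image_pid: str  # PID of the image the OCR was run on


@dataclass
class CatalogRecord:
    cts_urn: str  # identifier of the catalogued work
    image_pids: List[str] = field(default_factory=list)
    ocr_scan_pids: List[str] = field(default_factory=list)

    def link(self, image: Image, scan: OcrScan) -> None:
        """Attach an image and the OCR scan derived from it to this record."""
        if scan.source_image_pid != image.pid:
            raise ValueError("OCR scan was not produced from this image")
        self.image_pids.append(image.pid)
        self.ocr_scan_pids.append(scan.pid)


# Hypothetical example values; the Handle prefix "00000" is a placeholder.
image = Image(pid="hdl:00000/example-image-1", image_type="black-and-white")
scan = OcrScan(pid="hdl:00000/example-ocr-1", source_image_pid=image.pid)
record = CatalogRecord(cts_urn="urn:cts:greekLit:tlg0012.tlg001")
record.link(image, scan)
```

Keeping images and OCR scans as separate types with their own PIDs, and letting the catalog record hold only references, is one way to keep the three data types clearly distinguishable.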

Relevant RDA Components:

  • Data Types Registry to manage and describe the data types
  • PID Types API to abstract differences between the CTS URN and Handle based identifier systems
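
To illustrate what "abstracting the differences" between the two identifier systems could mean in practice, here is a small sketch that recognises either form and returns a uniform structure. It is not the RDA PID Types API; the function name, the `Identifier` class and the simplified patterns are assumptions for illustration, and the example Handle is a placeholder.

```python
import re
from dataclasses import dataclass
from typing import Dict

# Simplified patterns (assumptions): urn:cts:<namespace>:<work>[:<passage>]
# and [hdl:]<prefix>/<suffix> with a dotted numeric prefix.
CTS_URN = re.compile(
    r"^urn:cts:(?P<namespace>[^:]+):(?P<work>[^:]+)(?::(?P<passage>.+))?$"
)
HANDLE = re.compile(r"^(?:hdl:)?(?P<prefix>\d+(?:\.\d+)*)/(?P<suffix>\S+)$")


@dataclass
class Identifier:
    scheme: str            # "cts-urn" or "handle"
    value: str             # normalised identifier string
    parts: Dict[str, str]  # named components found in the identifier


def parse_identifier(raw: str) -> Identifier:
    """Classify a string as a CTS URN or a Handle and split it into parts."""
    text = raw.strip()
    match = CTS_URN.match(text)
    if match:
        parts = {k: v for k, v in match.groupdict().items() if v is not None}
        return Identifier("cts-urn", text, parts)
    match = HANDLE.match(text)
    if match:
        value = f"{match['prefix']}/{match['suffix']}"
        return Identifier("handle", value, match.groupdict())
    raise ValueError(f"not a recognised CTS URN or Handle: {raw!r}")


urn = parse_identifier("urn:cts:greekLit:tlg0012.tlg001:1.1-1.10")
handle = parse_identifier("hdl:00000/example-image-1")
```

Client code could then branch on `scheme` only where it has to, and treat `value` as an opaque identifier everywhere else.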

Workplan

CLARIN:

  • CMDI mapping of Perseus Catalog metadata
  • Provide a repository and PIDs for Images and OCR Scans (implement the RDA PID Types API if not already done?)

OPP:

  • update OCR workflow to register for PIDs with CLARIN
  • implement PID Types API for CTS URNs (augments the Catalog CITE Collections API)
  • deploy demonstrator Data Types Registry (CORDRA) and register Perseus/OPP data types

Perseus:

  • OAI-PMH interface to the Perseus Catalog
  • register as a CLARIN center
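
For reference, a harvester would talk to such an OAI-PMH interface with plain HTTP requests. The sketch below only builds a `ListRecords` request URL; the base URL is a placeholder (no such endpoint exists yet), the function name is hypothetical, and `oai_dc` is used as the default because every OAI-PMH repository must support it. Whether a MODS or CMDI metadata prefix would be offered is an open question.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    resumption_token: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # A resumption token is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL, for illustration only.
url = list_records_url("https://catalog.example.org/oai")
```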
Last modified on 09/25/15 15:50:15