Changes between Version 3 and Version 4 of Collaborations/Europeana/NewspaperOCR


Timestamp:
01/24/17 09:10:49
Author:
Twan Goosen
Comment:

updated info from Clemens

  • Collaborations/Europeana/NewspaperOCR

    v3 v4  
    11= Details of OCRed newspapers =
    22For context, see the [[..|Europeana page]].
     3
     4== Information from Clemens Neudecker ==
     5By e-mail ([mailto:clemens.neudecker@europeana-newspapers.eu]), January 2017
     6
     7* All the full-text content is in ALTO XML format (see [http://www.loc.gov/standards/alto/ 2a] and [https://github.com/altoxml 2b]), a LoC standard. While we have also extracted plain text for various post-processing tasks (and done some basic crosswalks to TEI), much can be gained by using the source ALTO XML files for further processing, because a lot of useful information (string coordinates, word/character recognition confidence, layout information, etc.) gets lost in the flat text or in TEI. Still, as the plain text can be useful for some applications, most of it has now been released via http://research.europeana.eu/itemtype/newspapers.
     8
     9* Due to size/performance constraints, the ALTO XML will become available via Europeana only with the launch of the newspapers API (which is separate from the main Europeana API). If you/CLARIN would like access to the full-text files in ALTO earlier, we can arrange this bilaterally. For example, our (Berlin State Library) entire dataset (i.e. images, ALTO XML and METS) has already been released into the public domain via Wikimedia, though it is around 6 TB, or 700 GB of compressed XML if you strip the images, so let me know if this or any subset thereof would be of value to you at this stage.
     10
     11* Here is a small subset of ALTO XML files from the National Library of the Netherlands that we used for creating a CRF classifier: http://lab.kbresearch.nl/static/data/eunews_nl.alto.zip
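
To illustrate the kind of information that is kept in ALTO but lost in plain text, here is a minimal Python sketch that reads word content, coordinates and word confidence from ALTO `String` elements. It assumes the files carry the `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT` and `WC` attributes defined by the ALTO schema, and that a downloaded archive stores its ALTO files with an `.xml` extension; neither has been checked against the actual data. The names `Word`, `iter_words` and `iter_zip_words` are hypothetical helpers introduced for this example only.

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import Iterator, NamedTuple, Optional, Tuple


class Word(NamedTuple):
    """One ALTO <String> element (hypothetical helper type)."""
    text: str
    hpos: Optional[float]
    vpos: Optional[float]
    width: Optional[float]
    height: Optional[float]
    confidence: Optional[float]  # ALTO "WC" attribute, 0.0 to 1.0 when present


def _local(tag: str) -> str:
    # Drop the "{namespace}" prefix so the code does not depend on
    # which ALTO schema version (and thus namespace) a file declares.
    return tag.rsplit("}", 1)[-1]


def _num(value: Optional[str]) -> Optional[float]:
    return float(value) if value is not None else None


def iter_words(alto_xml: bytes) -> Iterator[Word]:
    """Yield every <String> element of one ALTO document in document order."""
    root = ET.fromstring(alto_xml)
    for el in root.iter():
        if isinstance(el.tag, str) and _local(el.tag) == "String":
            yield Word(
                el.get("CONTENT", ""),
                _num(el.get("HPOS")),
                _num(el.get("VPOS")),
                _num(el.get("WIDTH")),
                _num(el.get("HEIGHT")),
                _num(el.get("WC")),
            )


def iter_zip_words(zip_path) -> Iterator[Tuple[str, Word]]:
    """Yield (member name, Word) for each .xml member of a zip archive.

    Assumption: the archive stores ALTO files with an .xml extension.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xml"):
                for word in iter_words(archive.read(name)):
                    yield name, word


# A tiny hand-written ALTO fragment, for demonstration only.
SAMPLE = b"""<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Amsterdam" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="12" WC="0.95"/>
    <SP WIDTH="5"/>
    <String CONTENT="Courant" HPOS="185" VPOS="200" WIDTH="60" HEIGHT="12" WC="0.40"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for word in iter_words(SAMPLE):
        print(word)
```

With a local copy of the subset archive, `iter_zip_words("eunews_nl.alto.zip")` would walk all words in all files; filtering on `word.confidence` is one simple way to set aside poorly recognised tokens before further processing.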
    312
    413== Information from Nuno Freire ==