
Details on OCR'd newspapers

For context, see the Europeana page.

Information from Clemens Neudecker

By e-mail (clemens.neudecker@europeana-newspapers.eu), January 2017

  • All the full-text content is in ALTO XML format (see 2a and 2b), a Library of Congress standard. While we have also extracted plain text for various post-processing tasks (and done some basic crosswalks to TEI), much can be gained by using the source ALTO XML files for further processing, as they carry a lot of useful information (string coordinates, word/character recognition confidence, layout information, etc.) that is lost in the flat text or in TEI. Still, as the plain text can be useful for some applications, most of it has now been released via http://research.europeana.eu/itemtype/newspapers.
  • Due to size/performance constraints, the ALTO XML will become available via Europeana only with the launch of the newspapers API (distinct from the main Europeana API). But if you/CLARIN would like access to the full-text files in ALTO earlier, we can arrange this bilaterally: e.g. our (Berlin State Library) entire dataset (i.e. images, ALTO XML and METS) has already been released into the public domain via Wikimedia. It amounts to around 6 TB, or 700 GB of compressed XML if you strip the images, so let me know if this or any subset thereof would be of value to you at this stage.
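As an illustration of what the ALTO files carry beyond flat text, the sketch below pulls word content, recognition confidence (`WC`) and coordinates (`HPOS`/`VPOS`) out of an ALTO document. The element and attribute names follow the ALTO schema, but the inline sample record is invented for illustration, and since the namespace URI varies by ALTO version, the code matches on local element names only.

```python
# Sketch: extract word text, OCR confidence, and coordinates from ALTO XML.
# The sample record below is illustrative, not a real Europeana file.
import xml.etree.ElementTree as ET

SAMPLE_ALTO = """<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Berliner" WC="0.93" HPOS="120" VPOS="340"/>
    <String CONTENT="Tageblatt" WC="0.41" HPOS="410" VPOS="340"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

def extract_words(alto_xml):
    """Yield (text, confidence, x, y) for each String element."""
    root = ET.fromstring(alto_xml)
    for el in root.iter():
        # Compare on the local name so any ALTO namespace version works.
        if el.tag.rsplit('}', 1)[-1] == 'String':
            yield (el.get('CONTENT'),
                   float(el.get('WC', 'nan')),
                   int(el.get('HPOS', '0')),
                   int(el.get('VPOS', '0')))

words = list(extract_words(SAMPLE_ALTO))
```

With the word confidence available, low-quality passages (such as the second token above, at 0.41) can be filtered or flagged instead of silently polluting a corpus.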

Information from Nuno Freire

Meanwhile, maybe you can have a look at what we have on GitHub. It contains only one newspaper title, but you can see the metadata available and the full text (as plain text files). https://github.com/nfreire/HistoricalNewspapersCorpus/

Some more notes:

  • We will not continue using GitHub (so don't expect the materials to remain available through Git)
  • The metadata formats in there are the final available ones: Dublin Core, EDM (Europeana Data Model - a richer format used in the Europeana network)
  • If we find the storage capacity, we will also make the ALTO files available, along with the plain text.
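For readers unfamiliar with the formats mentioned above: EDM records are RDF/XML that reuse Dublin Core properties, so the descriptive fields can be read with a standard XML parser. In the sketch below only the rdf: and dc: namespace URIs are standard; the sample record itself is invented for illustration.

```python
# Sketch: read Dublin Core fields from a minimal, invented EDM-style
# RDF/XML record. Real EDM records carry many more properties.
import xml.etree.ElementTree as ET

NS = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

SAMPLE_EDM = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="#issue">
    <dc:title>Berliner Tageblatt, 1927-01-11</dc:title>
    <dc:date>1927-01-11</dc:date>
    <dc:language>de</dc:language>
  </rdf:Description>
</rdf:RDF>"""

root = ET.fromstring(SAMPLE_EDM)
desc = root.find('rdf:Description', NS)
# Map each child's local name (title, date, language) to its text.
record = {el.tag.rsplit('}', 1)[-1]: el.text for el in desc}
```

This is also why keeping the metadata record next to the text file matters: a plain-text issue with no attached record loses exactly these fields.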

Feedback

By Christian & Susanne, BBAW:

About the full texts

  • One version of the data should be provided that contains as much structural information as possible (e.g. the respective OCR source file in ALTO or a similar format)
  • The full-text download is not very attractive and is hard to handle:

(a) The text has been sliced into parts at places which are not plausible given the textual context (sentences are cut in half ...). Is there any reason why the text of an issue is not provided as a whole?

(b) The metadata are provided in separate documents which are not part of the zip file containing the text data. Thus the full text can accidentally get detached from its metadata. Best practice would be to keep the separate metadata records, but also include the most important metadata, or at least a link to the complete metadata record, within the text file of each issue (see pt. a).

(c) Is there a possibility to batch-download the whole corpus at once? (Downloading a separate zip per issue is impractical.)

  • Recognition quality is *very* poor! The Latvian texts in particular seem to be a huge problem, but the German texts as well. This concerns older as well as more recent issues. Here it might help to at least have the OCR source files, for more information on the confidence of the OCR etc.
  • Apart from the general problem of recognition quality, additional problems are caused by the lack of tagging; e.g. in berliner_tageblatt_19270111_1.txt, tokens are separated character by character with whitespace, presumably because the original text was printed in spaced print.
  • It would be nice to have text-image links within the full text, leading to the original source image for each text page.
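The spaced-print problem noted above ("B e r l i n" instead of "Berlin") can be partially repaired heuristically. The sketch below is an assumption on our part, not part of any Europeana tooling: it rejoins runs of three or more single characters separated by single spaces, and will occasionally merge legitimate single-letter sequences.

```python
# Heuristic cleanup for spaced print (Sperrdruck) in OCR plain text:
# rejoin runs of 3+ single non-space characters separated by single
# spaces. This is a sketch, not an official tool, and can produce
# false merges on sequences like "a b c".
import re

SPACED = re.compile(r'\b(?:\S ){2,}\S\b')

def unspace(text):
    """Collapse 'B e r l i n' -> 'Berlin'; leave normal text alone."""
    return SPACED.sub(lambda m: m.group(0).replace(' ', ''), text)
```

A cleanup like this is only safe to apply selectively; having the ALTO source (with styling and confidence information) would make it possible to target such passages precisely instead of guessing from the flat text.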

About the metadata records

Additional information needed on:

  • Publication: date of creation of the digital text, date of its publication
  • Source images: Where are they accessible, how can I get access?
  • Information about the source: which copy of the text source was used, where is it kept (library, shelfmark, PPN, URN) --> should be provided for each issue and each collection
  • Digitization procedure: Which OCR software has been used? Which version, with which parameters ...? Were quality-improvement methods applied to the text after OCR (if so, which)? etc.
  • Information about the corpus in general: Is it complete, i.e. does it consist of all issues the newspaper ever published, or only of a certain part? If the latter, which subcollection was chosen, and from which total?
  • (nice to have:) more information about the respective issue, i.e. (in addition to the publication date, which is already given in the current MD records) the issue number, publication place, publisher ...

General remark

What is the purpose of the link in tel:BibliographicResource/dc:identifier[3]?

Last modified on 01/24/17 09:10:49