wiki:Collaborations/Europeana/NewspaperOCR

Version 2 (modified by Twan Goosen, 8 years ago)

Details on OCRed newspapers

For context, see ...

Information from Nuno Freire

Meanwhile, maybe you can have a look at what we have on GitHub. It contains only one newspaper title, but you can see the available metadata and the full text (as plain text files). https://github.com/nfreire/HistoricalNewspapersCorpus/

Some more notes:

  • We will not continue using GitHub (so don't expect the materials to be available through Git).
  • The metadata formats in there are the final ones: Dublin Core and EDM (Europeana Data Model, a richer format used in the Europeana network).
  • If we find the storage capacity, we will also make the ALTO files available, along with the plain text.
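
Should the ALTO files become available, word-level text and recognition confidence could be read from them roughly as follows. This is a minimal sketch, not part of the corpus tooling: the sample XML is hand-written for illustration, the function names are hypothetical, and it assumes the OCR engine filled in the optional `WC` (word confidence) attribute on `String` elements. The 0.5 threshold is arbitrary.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO-like sample for illustration only (not taken from the corpus).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Berliner" WC="0.92"/>
      <SP/>
      <String CONTENT="Tageblatt" WC="0.41"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code does not depend on the ALTO version."""
    return tag.rsplit("}", 1)[-1]


def read_alto_words(xml_text):
    """Return (word, confidence) pairs; confidence is None when WC is absent."""
    root = ET.fromstring(xml_text)
    words = []
    for elem in root.iter():
        if local_name(elem.tag) == "String":
            wc = elem.get("WC")
            confidence = float(wc) if wc is not None else None
            words.append((elem.get("CONTENT", ""), confidence))
    return words


def low_confidence(words, threshold=0.5):
    """Return the words whose confidence is below the (arbitrary) threshold."""
    return [w for w, c in words if c is not None and c < threshold]


if __name__ == "__main__":
    words = read_alto_words(SAMPLE_ALTO)
    print(" ".join(w for w, _ in words))
    print("low confidence:", low_confidence(words))
```

Matching on the local tag name rather than a fixed namespace is a deliberate choice here, since the ALTO version of the files is not known yet.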

Feedback

by Christian & Susanne, BBAW:

About the full texts

  • One version of the data should contain as much structured text as possible (e.g. provide the respective OCR source file in ALTO format or similar).
  • The full-text download is unattractive and hard to handle:

(a) The text has been split into parts at places that are implausible given the context (sentences are cut in half ...). Is there any reason why the text of one issue is not provided as a whole?

(b) The metadata are provided in separate documents which are not part of the zip file containing the text data. Thus the full text can accidentally get detached from its metadata. Best practice would be to keep the separate metadata records, but also to include the most important metadata, or at least a link to the complete metadata record, within the text file of each issue (see point (a)).

(c) Is there a possibility to batch-download the whole corpus at once? (Downloading a separate zip for each issue is impractical.)

  • Recognition quality is *very* poor! The Latvian texts in particular seem to be a huge problem, but so are the German texts. This concerns the older as well as the more recent issues. It might help to at least have the OCR source files, which would give more information on the confidence of the OCR etc.
  • Apart from the general problem of recognition quality, additional problems are caused by the lack of tagging, e.g. in berliner_tageblatt_19270111_1.txt tokens are separated per character by whitespace, presumably because the original text was printed in spaced print.
  • It would be nice to have text-image links within the full text that lead to the original source image for each text page.
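
As an illustration of the spaced-print problem, the following is a rough sketch of how such character-separated runs could be detected and collapsed in the plain text. It is a heuristic under stated assumptions, not a fix for the missing tagging: the function name and the `min_len` parameter are made up for this example, punctuation inside a spaced run is not handled, and word boundaries inside a run can only be recovered if the OCR left a wider gap (two or more spaces) between words.

```python
import re


def collapse_spaced_print(line, min_len=4):
    """Join runs of at least `min_len` single characters separated by one space or tab.

    Runs shorter than `min_len` are left alone to avoid gluing together
    legitimate one-letter tokens.
    """
    # (?<!\S) and (?!\S) make sure the run starts and ends at a token boundary.
    pattern = re.compile(r"(?<!\S)(?:\w[ \t]){%d,}\w(?!\S)" % (min_len - 1))
    return pattern.sub(lambda m: re.sub(r"[ \t]", "", m.group(0)), line)


if __name__ == "__main__":
    print(collapse_spaced_print("Das B e r l i n e r Tageblatt"))
    # A double space between spaced words is kept, so the boundary survives.
    print(collapse_spaced_print("B e r l i n e r  T a g e b l a t t"))
```

If the OCR output uses a single space everywhere, two adjacent spaced words are merged into one token, which is why markup in the source files would be the more reliable solution.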

About the metadata records

Additional information needed on:

  • Publication: date of creation of the digital text, date of its publication
  • Source images: Where are they accessible, how can I get access?
  • Information about the source: which copy of the text source was used and where it is kept (library, shelfmark, PPN, URN); this should be provided for each issue and each collection
  • Digitization procedure: Which OCR software has been used? Which version, with which parameters ...? Were any quality improvement methods applied to the text after OCR (if so, which)? etc.
  • Information about the corpus in general: Is it complete, i.e. does it consist of all issues the newspaper has ever published or only of a certain part? If the latter, which subcollection was chosen, out of what total?
  • (nice to have:) more information about the respective issue, i.e. (in addition to the publication date, which is already given in the current MD records) the issue number, publication place, publisher ...

General remark

What is the purpose of the link in tel:BibliographicResource/dc:identifier[3]?
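
To make clear which element is meant, here is a small sketch of how that path selects the third dc:identifier child of a record. The sample record and its values are invented for illustration, and the `tel` namespace URI is a placeholder, since the real one would have to be taken from the actual metadata files. Only the Dublin Core namespace is the standard one.

```python
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",  # standard Dublin Core namespace
    "tel": "http://example.org/tel#",          # placeholder, NOT the real URI
}

# Invented sample record for illustration only.
SAMPLE_RECORD = """<tel:BibliographicResource
    xmlns:tel="http://example.org/tel#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>id-one</dc:identifier>
  <dc:title>Example title</dc:title>
  <dc:identifier>id-two</dc:identifier>
  <dc:identifier>http://example.org/some-link</dc:identifier>
</tel:BibliographicResource>"""


def third_identifier(xml_text):
    """Return the text of the third dc:identifier child, or None if there is none."""
    root = ET.fromstring(xml_text)  # root is the tel:BibliographicResource element
    # The position predicate is 1-based and counts only dc:identifier siblings.
    node = root.find("dc:identifier[3]", NS)
    return None if node is None else node.text


if __name__ == "__main__":
    print(third_identifier(SAMPLE_RECORD))
```

Note that the index counts only dc:identifier elements, so other children (such as dc:title in the sample) do not shift the position.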