Changes between Version 11 and Version 12 of Collaborations/Europeana


Ignore:
Timestamp:
01/27/16 12:41:20 (8 years ago)
Author:
Dieter Van Uytvanck
Comment:

added feedback to raw data, table of contents

Legend:

Unmodified
Added
Removed
Modified
  • Collaborations/Europeana

== our contact persons ==
[[PageOutline(1-5)]]

Christian Thomas, Susanne Haaf, Axel Herold (BBAW), Dieter Van Uytvanck
     
 * It is important that we avoid two unnecessary metadata conversions (CMDI > EDM > CMDI), so harvest the source CMDI from the CLARIN centres directly
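CLARIN centres expose their CMDI metadata through OAI-PMH, so the direct harvest suggested above amounts to issuing `ListRecords` requests against a centre's endpoint. A minimal sketch (the endpoint URL and set name below are hypothetical; real endpoints are listed in the CLARIN centre registry):

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="cmdi", set_spec=None):
    # Build an OAI-PMH ListRecords request URL for harvesting CMDI
    # records directly from a centre, skipping the EDM round-trip.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"

# Hypothetical endpoint and set name, for illustration only.
print(list_records_url("https://example-centre.eu/oai", set_spec="newspapers"))
```

Paging through the result with the `resumptionToken` the endpoint returns would complete the harvest loop.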
== Details OCRed newspapers ==

=== information from Nuno Freire ===

Meanwhile, maybe you can have a look at what we have on GitHub. It contains only one newspaper title, but you can see the metadata available and the full text (as plain text files).

Some more notes:
* We will not continue using GitHub (so don't expect to have the materials available through Git)
* The metadata formats in there are the final available ones: Dublin Core and EDM (Europeana Data Model, a richer format used in the Europeana network)
* If we find the storage capacity, we will also make available the ALTO files, along with the plain text.
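The ALTO files mentioned above would preserve more than the plain-text dumps do: ALTO records each recognised word as a `String` element with a `CONTENT` attribute and, optionally, a `WC` word-confidence attribute. A minimal sketch of reading that out (the inline sample is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Tiny invented ALTO fragment; real files contain full page layout.
sample = """\
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Berliner" WC="0.93"/>
    <String CONTENT="Tageblatt" WC="0.41"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

# Collect (word, confidence) pairs, ignoring the namespace prefix.
words = [
    (s.get("CONTENT"), float(s.get("WC", "0")))
    for s in ET.fromstring(sample).iter()
    if s.tag.endswith("}String")
]
print(words)  # [('Berliner', 0.93), ('Tageblatt', 0.41)]
```

This is exactly the OCR-confidence information that the feedback below asks for and that the plain-text export loses.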

=== Feedback ===

by Christian & Susanne, BBAW:

==== About the full texts ====

* One version of the data should be provided that contains as much structured text as possible (e.g. by providing the respective OCR source file, in ALTO format or the like)
* The full-text download is not very attractive and is hard to handle:

(a) The text has been sliced into parts at places that are not plausible given the textual context (sentences are cut in half ...). Is there any reason why the text of an issue is not provided as a whole?

(b) The metadata are provided in separate documents that are not part of the zip file containing the text data; thus the full text can accidentally become detached from its metadata. Best practice would be to keep the separate metadata records, but also to include the most important metadata, or at least a link to the complete metadata record, within the text file of each issue (see pt. a).

(c) Is there a possibility to batch-download the whole corpus at once? (Downloading each zip per issue is impractical.)
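Until a bulk download exists, the per-issue zips can at least be fetched in a loop. A sketch, assuming a per-issue zip URL pattern modelled on the metadata URL quoted further down this page; the actual pattern may well differ:

```python
from urllib.request import urlretrieve

BASE = "http://data.theeuropeanlibrary.org/download/newspapers"

def issue_zip_url(title, date):
    # date in YYYYMMDD form, e.g. "19270111".
    # The "<title>_<date>.zip" naming is an assumption.
    return f"{BASE}/{title}/{title}_{date}.zip"

def download_issues(title, dates, dest="."):
    # Fetch one zip per issue date into dest.
    for d in dates:
        urlretrieve(issue_zip_url(title, d), f"{dest}/{title}_{d}.zip")
```

A real batch endpoint (one archive per title) would still be far preferable to scripting around the per-issue files.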

* Recognition quality is *very* poor! The Latvian texts in particular seem to be a huge problem, but the German texts as well, and this concerns the older as well as the more recent issues. Here it would perhaps help to at least have the OCR source files, for more information on the confidence of the OCR etc.
* Apart from the general problem of recognition quality, additional problems arise from the lack of tagging: e.g. in berliner_tageblatt_19270111_1.txt, tokens are separated character by character by whitespace, presumably because the original text was printed in spaced type.
* It would be nice to have text-image links within the full text that lead to the original source image for each text page.
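The spaced-type problem in the second point can be patched over heuristically on the consumer side, though only the OCR source would fix it properly. A sketch that joins runs of single letters separated by single spaces, assuming word gaps in the spaced passages are double spaces (it would also wrongly merge genuine single-letter enumerations like "a b c"):

```python
import re

def despace(text):
    """Collapse spaced print: "B e r l i n" -> "Berlin" (heuristic)."""
    # Join runs of single word characters separated by single spaces.
    joined = re.sub(r"\b\w(?: \w)+\b",
                    lambda m: m.group(0).replace(" ", ""), text)
    # Word gaps were double spaces; normalise them to single spaces.
    return re.sub(r" {2,}", " ", joined)

print(despace("B e r l i n e r  T a g e b l a t t"))  # Berliner Tageblatt
```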

==== About the metadata records ====
Additional information needed on:
* Publication: date of creation of the digital text, date of its publication
* Source images: Where are they accessible? How can I get access?
* Information about the source: which copy of the text source was used and where it is kept (library, shelfmark, PPN, URN) --> should be provided for each issue and each collection
* Digitization procedure: Which OCR software was used? Which version, with which parameters ...? Were quality-improvement methods applied to the text after OCR (if so, which)? etc.
* Information about the corpus in general: Is it complete, i.e. does it consist of all issues the newspaper has ever published, or only of a certain part? If the latter, which subcollection was chosen from which total?
* (nice to have:) more information about the respective issue, i.e. (in addition to the publication date, which is already given in the current MD records) the issue number, publication place, publisher ...

==== General remark ====
* Documentation of the metadata records in general would be good to have, further explaining the purpose of those metadata fields that might be difficult to understand. For instance, in http://data.theeuropeanlibrary.org/download/newspapers/berliner_tageblatt/metadata_berliner_tageblatt.dc.xml what is the purpose of the link in //tel:BibliographicResource/dc:identifier[3] ?
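For readers following up on that question, the element the XPath points at can be pulled out like this. The record structure and the `tel` namespace URI below are invented from the quoted XPath; only the `dc` namespace URI is the standard Dublin Core one:

```python
import xml.etree.ElementTree as ET

# Invented miniature of the DC record; the real file has more fields.
sample = """\
<tel:BibliographicResource xmlns:tel="http://example.org/tel"
                           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>urn:issue:1</dc:identifier>
  <dc:identifier>http://example.org/landing-page</dc:identifier>
  <dc:identifier>http://example.org/unexplained-link</dc:identifier>
</tel:BibliographicResource>"""

ids = ET.fromstring(sample).findall(
    "dc:identifier", {"dc": "http://purl.org/dc/elements/1.1/"})
# The third identifier is the //dc:identifier[3] the remark asks about.
print(ids[2].text)
```

Without documentation, nothing in the record itself says what role each of the repeated `dc:identifier` values plays, which is exactly the problem raised above.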

== more information ==