Changes between Version 13 and Version 14 of Taskforces/FCS/Advanced DataView Example


Timestamp: 10/12/15 06:45:14
Author: hanna.hedeland@uni-hamburg.de


== General Discussion ==

Who: Hanna (HZSK)

What: Just one (late, sorry) comment on the rather wide question of segments, words/tokens(?), "orth", etc.

I think the two main differences between this dataview and the very similar, but more restricted, TCF of CLARIN-D are, firstly, the use of segments that are not defined as tokens/words and can thus be used to model different, overlapping segmentations of the data (e.g. linguistic vs. time/space based), and secondly, the generic modelling of custom annotation layers, which of course still allows us to agree on basic layers that endpoints should aim to provide.
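
A minimal sketch of how two overlapping segmentations could be expressed, assuming an XML structure along the lines of the segment/layer examples above; all element and attribute names here are illustrative, not the normative proposal:

{{{
#!xml
<!-- Illustrative only: two segmentations over the same primary data.
     Segments carry offsets; layers refer to segments by id. -->
<Segments>
  <!-- linguistic segmentation: one segment per word/token -->
  <Segment id="t1" start="0" end="4"/>
  <Segment id="t2" start="5" end="9"/>
  <!-- time-based segmentation: one segment for the whole stretch,
       overlapping the two token segments above -->
  <Segment id="u1" start="0" end="9"/>
</Segments>
<Layers>
  <Layer id="orth">
    <Span ref="t1">dann</Span>
    <Span ref="t2">halt</Span>
  </Layer>
  <Layer id="utterance">
    <Span ref="u1">dann halt</Span>
  </Layer>
</Layers>
}}}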

Whereas in resources with multiple segmentations the segments wouldn't always be words/tokens, as they are in Proposal 5, transcriptions of spoken language, even of dialogue, can behave very much like texts if you force them to and disregard some aspects that are very important when modelling the data, but less so for one dataview within a common basic search such as the FCS. The transcribed text has often been tokenized, and parallel events can be roughly serialized. We then just have to remember that serialized complex data lack accuracy, and that context ("within three words" etc.) isn't really clear-cut for spoken data (and might e.g. be limited to one speaker's utterances at a time).

So maybe we should at least consider, if not just ditching the implementation of non-token segments for now, then adding a token layer and using it as the base for the annotations, explicitly stating this for the layers, since this is what it in fact is right now? The "orth" and the "words" layers wouldn't then have to be redundant in the same way, and we could agree on some level of standardization for the token layer and then maybe add a layer like "diplomatic" or "non-standard" for the original transcription (as found in both transcriptions of spoken language and historical manuscripts). With explicit reference to the annotation base we could still theoretically support non-token segments and segment-based layers in some distant future. And talking about the distant future: since most resources do not have any aligned audio, video or other sources, it might be more convenient to add separate (annotation) layers with time labels and links for source alignment, instead of adding this to the segments of some of the resources. If a better grasp of search results in complex data is still needed, maybe the endpoints could also at some point provide something like a custom HTML visualization, where e.g. the dialogue context would become clear to the user, at least when browsing the search results.
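
To make the token-base and alignment-layer ideas concrete, here is a rough sketch under the same assumptions as above (purely illustrative names, not the proposal's actual syntax): a token segmentation serving as the declared annotation base, a "diplomatic" layer for the original transcription, and time alignment expressed as a separate layer rather than on the segments:

{{{
#!xml
<!-- Illustrative only: tokens as the single segmentation; every layer
     explicitly declares the token segmentation as its annotation base. -->
<Segments id="tokens">
  <Segment id="tok1" start="0" end="3"/>
  <Segment id="tok2" start="4" end="8"/>
</Segments>
<Layers>
  <!-- standardized token layer -->
  <Layer id="token" base="tokens">
    <Span ref="tok1">und</Span>
    <Span ref="tok2">dann</Span>
  </Layer>
  <!-- original, non-standardized transcription -->
  <Layer id="diplomatic" base="tokens">
    <Span ref="tok1">un</Span>
    <Span ref="tok2">denn</Span>
  </Layer>
  <!-- source alignment as an ordinary annotation layer:
       time labels plus a link into the recording -->
  <Layer id="audio-alignment" base="tokens">
    <Span ref="tok1" start-time="12.3" end-time="12.5"
          source="recording.wav#t=12.3,12.5"/>
  </Layer>
</Layers>
}}}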

...