[[PageOutline(1-3)]]

= Advanced FCS Data View =

Add your examples and proposals for the Advanced FCS DataView here.

== Proposals ==

=== Proposal 1 ===
 * Who: Oliver (IDS)
 * What: Initial proposal: ad-hoc stand-off representation of multiple annotation layers.
 * Code:
{{{
#!xml
}}}
 * Issues/Discussion:
   * a lot

=== Proposal for transcriptions of spoken language ===
 * Who: Hanna (HZSK), trying to squeeze transcription data into Olli's dataview proposal (here 1, should work for 2 too?)
 * What: A first draft of a small excerpt from a transcription, rendered according to Olli's Dataview Proposal 1. The transcription follows the HIAT system, which is widely used for discourse transcription, at least in Germany. It is a representation of the transcription of the conversation, i.e. of a text, and not of the conversation itself, and it has been linearized even though it was originally created for musical score notation. A common issue (which also applies to diachronic corpora) not represented here is the use of deviations from standard orthography ("i wanna kinda...").
 * Code:
{{{
#!xml
Als o Sch ön . Auf W iedersehen . Ne ? Äh ((0,3s)) Öh ˙
}}}
 * Issues/Discussion:
   * a (normalized?) word/token layer?
   * sometimes another attribute on the layer element might be good, to describe what's in the value-info (e.g. the original tagset name for the pos layer)
   * everything concerning spoken language...

=== Proposal 2 ===
 * Who: ljo (Språkbanken)
 * What: Multi-layered, anchor-based annotations allowing parallel annotations of the same kind/type
 * Code: This was merged to become Proposal 5.
{{{
#!xml
}}}
 * Issues/Discussion:
   * This can look a bit crowded, so I don't expect anyone to get too excited about it, but bits and pieces might spur some discussion at least. We use it for text corpora, and anchors are only added -- never removed -- for persistency, so this differs a bit from our dataview perspective, which is for transport only and quite short-lived.

=== Proposal 3 ===
 * Who: Peter Beinema (MPI)
 * What: Attempt to map the EAF description of CGN utterance nl000783.1 to a DataView example
 * Code:
{{{
#!xml
LINGUISTIC_TYPE_REF="Words" PARTICIPANT="N01999">
t da 's de enige echte hoop voor ons mensen
t da 's de enige echte hoop voor ons mensen
}}}

=== Proposal 4 ===
 * Who: Oliver (IDS), Leif-Jöran (Språkbanken)
 * What: Stand-off representation of the MPI example
 * Code:
{{{
#!xml
fcs>
t da 's de enige echte hoop voor ons mensen
word word word word word word word word word word
X PRON VERB DET DET ADJ NOUN ADP PRON NOUN
_ dat zijn de enig echt hoop voor ons mens
t@ dAz dAz d@ en@G@ Ext@ hop for Ons mEns@
}}}
 * Discussion:
   * No preferred/primary layer; all layers are equal.
   * Each {{{<layer>}}} has a {{{@type}}}, which denotes the type of the layer (closed controlled vocabulary).
   * If more layers of a certain type are available, the Aggregator would initially display the first.
   * Each {{{<layer>}}} has an {{{@id}}}, which uniquely identifies the layer (with regard to the XML snippet); URIs are used to foster uniqueness.
   * Each {{{<span>}}} has a {{{@start}}} and {{{@end}}} offset to establish the relations to other spans.
   * The actual "content" is stored as PCDATA in {{{<span>}}}. Additional data, e.g. the original tags in the case of pos, can be supplied in the {{{@alt-value}}} attribute. Information about what is contained there must be present in {{{@alt-value-info}}} on {{{<layer>}}}.
   * Optionally, a {{{<span>}}} may carry {{{@time-start}}} and {{{@time-end}}} attributes; the value type is denoted by {{{@time-unit}}} on {{{<layer>}}}. (A sketch combining these pieces follows below.)
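To make the pieces above concrete, here is a minimal, hand-made sketch of how such a stand-off serialization could look. It is an illustration only, not text from the proposal: the {{{<layers>}}} wrapper, the id URIs, the tagset annotation and the time values are invented placeholders, and the character offsets are simply computed over the text layer; most spans are elided.
{{{
#!xml
<!-- Illustrative sketch only; ids, times and alt-values are placeholders -->
<layers>
  <!-- the text layer; offsets index into its character stream -->
  <layer id="http://endpoint.example/nl000783.1/text" type="text">
    <span start="0" end="1">t</span>
    <span start="2" end="4">da</span>
    <span start="5" end="7">'s</span>
    <!-- ... remaining tokens, up to offset 43 ... -->
  </layer>
  <!-- a pos layer; PCDATA holds the harmonized tag, @alt-value the original one -->
  <layer id="http://endpoint.example/nl000783.1/pos" type="pos"
         alt-value-info="original CGN tagset">
    <span start="0" end="1">X</span>
    <span start="2" end="4" alt-value="VNW">PRON</span>
    <!-- ... -->
  </layer>
  <!-- a phonetic layer carrying the optional time info -->
  <layer id="http://endpoint.example/nl000783.1/phonetic" type="phonetic"
         time-unit="seconds">
    <span start="0" end="1" time-start="0.12" time-end="0.18">t@</span>
    <!-- ... -->
  </layer>
</layers>
}}}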
   * Additionally, a reference to an audio file can be supplied by {{{@audio-file-ref}}} on {{{<layer>}}}; this can be used to generate links for playback.
 * Issues:
   * Quite a lot of redundancy (especially when all segments are more or less the same across layers)
   * Time info adds even more redundancy. Furthermore, it's unclear how this information is supposed to be combined into a direct playback link.
     * Possible solution: endpoints need to provide a complete and valid playback link?
   * "Marker layers" (e.g. "words") only mark intervals but contain useless PCDATA
     * Possible solution: allow empty PCDATA (or better, empty {{{<span/>}}} elements) in that case?
   * What is the Aggregator supposed to display if it wants/needs to display offsets? The artificial character offsets make no real sense in the spoken-data example. Nice start/end offsets (e.g. 00:01:12.22/00:01:23.42) would probably be nicer.
     * Possible solution: add a display label?
   * Hit markers (highlights) are missing

=== Proposal 5 ===
 * Who: Oliver (IDS)
 * What: Stand-off with common segments (derived from the preceding proposal)
 * Code:
{{{
#!xml
t da 's de enige echte hoop voor ons mensen
word word word word word word word word word word
X PRON VERB DET DET ADJ NOUN ADP PRON NOUN
_ dat zijn de enig echt hoop voor ons mens
t@ dAz dAz d@ en@G@ Ext@ hop for Ons mEns@
}}}
 * Discussion:
   * Segments are listed in {{{<segments>}}} and may carry an optional {{{@time-label}}} and {{{@ref}}}.
   * The format of {{{@ref}}} is deliberately not specified in more detail, because it is highly endpoint-dependent. The endpoint must make sure to supply the correct URI here. Of course, handles are preferred.
   * Time labels shall be converted to the proposed time format by the endpoint. Thus, if the Aggregator displays the times, they are consistent across endpoints.
   * Segments are considered "a bag of segments", so they may freely overlap. Their only purpose is to reduce redundancy in the XML serialization.
 * Issues:
   * I guess still a pretty complicated format?
   * Useless PCDATA in the "marker layer"
   * Highlights still missing

=== Proposal N ===
 * Who: Name (Center)
 * What: short description
 * Code:
{{{
#!xml
}}}
 * !Discussion/Issues:
   * ...

== General Discussion ==

Who: Hanna (HZSK)

What: Just one (late, sorry) comment on the rather wide question of segments, words/tokens(?), "orth" etc. I think the two main differences between this dataview and the very similar -- but more restricted -- TCF of CLARIN-D are, firstly, the use of segments that are not defined as tokens/words and can thus be used to model different, overlapping segmentations (e.g. linguistic vs. time/space-based) of the data, and secondly, the generic modelling of custom annotation layers, which of course still allows us to agree on basic layers that endpoints should aim to provide. Whereas in resources with multiple segmentations the segments wouldn't always be words/tokens, as they are in Proposal 5, transcriptions of spoken language, even of dialogue, can behave very much like texts if you force them to and disregard some aspects that are very important when modelling the data, but less so for one dataview within a common basic search such as the FCS. The transcribed text has often been tokenized, and parallel events can be roughly serialized. We then just have to remember that complex data lose accuracy this way, and that context ("within three words" etc.) isn't really clear-cut for spoken data (and, e.g., might be limited to one speaker's utterances at a time).
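To illustrate the first point, here is a minimal sketch of what overlapping, non-token segmentations could look like with Proposal 5-style shared segments. It is an assumption-laden illustration, not part of any proposal: the {{{@seg-ref}}} attribute linking spans to segments, the segment ids and the handle are invented for this example.
{{{
#!xml
<!-- Sketch only: @seg-ref, the ids and the handle are hypothetical -->
<segments>
  <!-- a token-based segmentation -->
  <segment id="seg1" start="0" end="1"/>
  <segment id="seg2" start="2" end="4"/>
  <!-- an overlapping, time-based segmentation, e.g. one speaker contribution -->
  <segment id="seg10" start="0" end="4" time-label="00:00:01.20"
           ref="http://hdl.handle.net/0000/example-utterance"/>
</segments>
<layer type="pos">
  <!-- annotations point at segments instead of repeating offsets -->
  <span seg-ref="seg1">X</span>
  <span seg-ref="seg2">PRON</span>
</layer>
}}}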
So maybe we should at least consider, if not just ditching the implementation of non-token segments for now, then maybe adding a token layer and using it as the base for the annotations, explicitly stating this for the layers, since this is what it in fact is right now? The "orth" and the "words" layers wouldn't have to be redundant in the same way, and we could agree on some level of standardization for the token layer and then maybe add a layer like "diplomatic" or "non-standard" for the original transcription (as found both in transcriptions of spoken language and in historic manuscripts). With an explicit reference to the annotation base, we could still theoretically support non-token segments and segment-based layers in some distant future. And talking about the distant future: since most resources do not have any aligned audio, video or other sources, it might be more convenient to add separate (annotation) layers with time labels and links for source alignment, instead of adding this to the segments of only some of the resources. If a better grasp of the search results in complex data is still needed, maybe the endpoints could also at some point provide something like a custom HTML visualization, where e.g. the dialogue context would become clear to the user, at least when browsing the search results. ...
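A rough sketch of what this token-layer-as-base idea could look like, reusing the layer/span vocabulary of Proposal 4; the {{{@base}}} attribute, the "audio-alignment" type name, the ids and the file URL are hypothetical, invented only for this illustration.
{{{
#!xml
<!-- Sketch only: @base, the "audio-alignment" type, ids and URL are hypothetical -->
<layer id="tok-1" type="token">
  <span start="0" end="1">t</span>
  <span start="2" end="4">da</span>
  <!-- ... -->
</layer>
<!-- annotation layers explicitly state the token layer as their base -->
<layer id="pos-1" type="pos" base="tok-1">
  <span start="0" end="1">X</span>
  <span start="2" end="4">PRON</span>
</layer>
<!-- source alignment as a separate, optional layer with time labels and a link,
     supplied only by resources that actually have aligned media -->
<layer id="align-1" type="audio-alignment" base="tok-1"
       audio-file-ref="http://endpoint.example/audio/recording.wav">
  <span start="0" end="4" time-start="00:00:01.20" time-end="00:00:01.45"/>
</layer>
}}}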