wiki:Taskforces/FCS/Advanced DataView Example

Advanced FCS Data View

Add your examples and proposals for the Advanced FCS DataView here.

Proposals

Proposal 1

Who: Oliver (IDS)
What: Initial Proposal 2. Ad-Hoc Stand-Off representation of multiple annotation layers.

Code:

<fcs>
    <!-- Classic Example: Peter and Paul and the hammer -->
    <data><![CDATA[Hey Paul! Would you give me the hammer?]]></data>
          <!--     1234567890123456789012345678901234567890
                   0        1         2         3         4 -->
    <layers>
        <!-- part of speech -->
        <layer type="pos">
            <span start="1"  end="3"  value="INTJ"  value-info="UH"  />
            <span start="5"  end="8"  value="PROPN" value-info="NNP" />
            <span start="9"  end="9"  value="PUNCT" value-info="."   />
            <span start="11" end="15" value="AUX"   value-info="MD"  />
            <span start="17" end="19" value="PRON"  value-info="PRP" />
            <span start="21" end="24" value="VERB"  value-info="VB"  />
            <span start="26" end="27" value="PRON"  value-info="PRP" />
            <span start="29" end="31" value="DET"   value-info="DT"  />
            <span start="33" end="38" value="NOUN"  value-info="NN"  />
            <span start="39" end="39" value="PUNCT" value-info="."   />
        </layer>

        <!-- lemma -->
        <layer type="lemma">
            <span start="1"  end="3"  value="hey"    />
            <span start="5"  end="8"  value="paul"   />
            <span start="9"  end="9"  value="!"      />
            <span start="11" end="15" value="would"  />
            <span start="17" end="19" value="you"    />
            <span start="21" end="24" value="give"   />
            <span start="26" end="27" value="me"     />
            <span start="29" end="31" value="the"    />
            <span start="33" end="38" value="hammer" />
            <span start="39" end="39" value="?"      />
        </layer>

        <!-- turns or utterances -->
        <layer type="turns">
            <span start="1"  end="27" value="turn" value-info="Peter" />
            <span start="29" end="39" value="turn" value-info="Paul"  />
        </layer>
        
         <!-- sentences -->
        <layer type="sentences">
            <span start="1"  end="9"  value="sentence" />
            <span start="11" end="39" value="sentence" />
        </layer>
    </layers>
</fcs>

Issues/Discussion:
- a lot
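
The character offsets in Proposal 1 can be checked mechanically. The following is a minimal sketch in Python, assuming the offsets are 1-based and inclusive (as the ruler comment under `<data>` suggests). The helper name `span_text` and the shortened sample are illustrative only and not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the Proposal 1 format (first three tokens only).
SAMPLE = """<fcs>
    <data><![CDATA[Hey Paul! Would you give me the hammer?]]></data>
    <layers>
        <layer type="lemma">
            <span start="1" end="3" value="hey" />
            <span start="5" end="8" value="paul" />
            <span start="9" end="9" value="!" />
        </layer>
    </layers>
</fcs>"""


def span_text(data, span):
    """Return the text covered by a span (assumption: 1-based, inclusive offsets)."""
    start = int(span.get("start"))
    end = int(span.get("end"))
    # Convert to Python's 0-based, end-exclusive slicing.
    return data[start - 1:end]


root = ET.fromstring(SAMPLE)
data = root.findtext("data")
covered = [span_text(data, span) for span in root.iter("span")]
print(covered)
```

Comparing the covered text with the lemma values (case-insensitively) is a cheap way to spot off-by-one offsets before an endpoint ships such a snippet.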

Proposal for transcriptions of spoken language

Who: Hanna (HZSK) trying to squeeze transcription data into Olli's dataview proposal (here 1, should work for 2 too?)
What: A first draft of a small transcription excerpt in the format of Olli's DataView Proposal 1. The transcription follows the HIAT system, which is widely used for discourse transcription, at least in Germany. It represents the transcription of the conversation, i.e. a text, and not the conversation itself, and it has been linearized even though it was originally created in musical score notation. A common issue not represented here (one that also applies to diachronic corpora) is the use of deviations from standard orthography ("i wanna kinda...").

Code:

<!-- a small excerpt from a transcription according to the HIAT system --> 
<fcs>
    <segments>
        <seg id="s1">Als</seg>
        <seg id="s2">o</seg>
        <seg id="s3">…</seg>
        <seg id="s4"><![CDATA[ ]]></seg>
        <seg id="s5">Sch</seg>
        <seg id="s6">ön</seg>
        <seg id="s7">.</seg>
        <seg id="s8"><![CDATA[ ]]></seg>
        <seg id="s9">Auf</seg>
        <seg id="s10"><![CDATA[ ]]></seg>
        <seg id="s11">W</seg>
        <seg id="s12">iedersehen</seg>
        <seg id="s13">.</seg>
        <seg id="s14"><![CDATA[ ]]></seg>
        <seg id="s15">Ne</seg>
        <seg id="s16">?</seg>
        <seg id="s17"><![CDATA[ ]]></seg>
        <seg id="s18">Äh</seg>
        <seg id="s19">…</seg>
        <seg id="s20"><![CDATA[ ]]></seg>
        <seg id="s21">((0,3s))</seg>
        <seg id="s22"><![CDATA[ ]]></seg>
        <seg id="s23">Öh</seg>
        <seg id="s24">˙</seg>
        <seg id="s25"><![CDATA[ ]]></seg>
    </segments>
    <layers>
        <layer type="words">
        <!-- tokens, words, ... ? -->
           <span segs="s1 s2" value="Also"/>
           <span segs="s5 s6" value="Schön"/>
           <span segs="s9" value="Auf"/>
           <span segs="s11 s12" value="Wiedersehen"/>
           <span segs="s15" value="Ne"/>
           <span segs="s18" value="Äh"/>
           <span segs="s23" value="Öh"/>
        </layer>
       <!-- part of speech -->
        <layer type="pos">
            <span segs="s1 s2" value="ADV" value-info="ADV"/>
            <span segs="s5 s6" value="ADJ" value-info="ADJD"/>
            <span segs="s9" value="ADP" value-info="APPR"/>
            <span segs="s11 s12" value="NOUN" value-info="NN"/>
            <span segs="s15" value="PART" value-info="PTKANT"/>
            <span segs="s18" value="INTJ" value-info="ITJ"/>
            <span segs="s23" value="INTJ" value-info="ITJ"/>
        </layer>
        <!-- some lemma info -->
        <layer type="lemma">
            <span segs="s1 s2" value="also"/>
            <span segs="s5 s6" value="schön"/>
            <span segs="s9" value="auf"/>
            <span segs="s11 s12" value="Wiedersehen"/>
            <span segs="s15" value="&lt;unknown&gt;"/>
            <span segs="s18" value="&lt;unknown&gt;"/>
            <span segs="s23" value="&lt;unknown&gt;"/>
        </layer>
        <!-- non-words/non-verbal - silence, noises etc. - do we have to agree on vocabulary for the types? this could be left out, 
             since the word/token layer doesn't contain non-verbal segments, but if we're serious about speech we might need it... -->
        <layer type="non-words">
            <span segs="s21 s22" value="pause" value-info="0.3"/>
        </layer>
        <!-- transcription systems differ regarding the segmentation units and many users are very rigid about using their own definitions and terms, 
             for visualization one could maybe simply display the unit boundaries and "utterance: interrupted" without interpreting it -->
        <layer type="segmentation-units">
            <span segs="s1 s2 s3 s4" value="utterance" value-info="interrupted"/>
            <span segs="s5 s6 s7 s8" value="utterance" value-info="declarative"/>
            <span segs="s9 s10 s11 s12 s13 s14" value="utterance" value-info="declarative"/>
            <span segs="s15 s16 s17" value="utterance" value-info="interrogative"/>
            <span segs="s18 s19 s20" value="utterance" value-info="interrupted"/>
            <span segs="s23 s24 s25" value="utterance" value-info="undefined"/>
        </layer>
        <!-- for some transcription conventions, there is no further segmentation into specific units, but still speakers should be recognized 
             - these spans might get huge, since pauses are often considered part of the contribution -->
        <layer type="speaker-contributions">
            <span segs="s1 s2 s3 s4" value="KLA"/>
            <span segs="s5 s6 s7 s8 s9 s10 s11 s12 s13 s14" value="AMT"/>
            <span segs="s15 s16 s17 s18 s19 s20 s21 s22 s23 s24 s25" value="KLA"/>
        </layer>
        <!-- not all segments are time-aligned, to enable audio replay there's a handle too, looks a bit repetitive though... -->
        <layer type="timed-intervals">
            <span segs="s1" value="00:02:15.000/00:02:15.400" value-info="http://hdl.handle.net/11022/0000-0000-506A-F@WAV"/>
            <span segs="s2 s3 s4" value="00:02:15.400/00:02:15.500" value-info="http://hdl.handle.net/11022/0000-0000-506A-F@WAV"/>
            <span segs="s5" value="00:02:15.400/00:02:15.500" value-info="http://hdl.handle.net/11022/0000-0000-506A-F@WAV"/>
            <span segs="s6 s7 s8" value="00:02:15.500/00:02:15.800" value-info="http://hdl.handle.net/11022/0000-0000-506A-F@WAV"/>
            <span segs="s9 s10 s11" value="00:02:15.800/00:02:16.100" value-info="http://hdl.handle.net/11022/0000-0000-506A-F@WAV"/>
            <span segs="s12 s13 s14" value="00:02:16.100/00:02:16.600" value-info="http://hdl.handle.net/11022/0000-0000-506A-F@WAV"/>
            <span segs="s15 s16 s17" value="00:02:15.800/00:02:16.100" value-info="http://hdl.handle.net/11022/0000-0000-506A-F@WAV"/>
            <span segs="s18 s19 s20" value="00:02:16.100/00:02:16.700" value-info="http://hdl.handle.net/11022/0000-0000-506A-F@WAV"/>
            <span segs="s21 s22" value="00:02:16.700/00:02:17.000" value-info="http://hdl.handle.net/11022/0000-0000-506A-F@WAV"/>
            <span segs="s23 s24 s25" value="00:02:17.000/00:02:17.469" value-info="http://hdl.handle.net/11022/0000-0000-506A-F@WAV"/>
        </layer>
  </layers>
</fcs>

Issues/Discussion:
- a (normalized?) word/token layer?
- sometimes an additional attribute on the layer element would be useful to describe what the value-info contains (e.g. the name of the original tagset for the pos layer)
- everything concerning spoken language...
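
To illustrate how a consumer might resolve the `segs` references in this proposal, here is a rough Python sketch. It assumes that `segs` is a whitespace-separated list of segment ids and that the covered text is the concatenation of the referenced segments in the listed order. The function name `resolve_spans` and the shortened sample are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the format of the transcription proposal.
SAMPLE = """<fcs>
    <segments>
        <seg id="s1">Als</seg>
        <seg id="s2">o</seg>
        <seg id="s3">…</seg>
        <seg id="s4"><![CDATA[ ]]></seg>
        <seg id="s5">Sch</seg>
        <seg id="s6">ön</seg>
    </segments>
    <layers>
        <layer type="pos">
            <span segs="s1 s2" value="ADV" value-info="ADV"/>
            <span segs="s5 s6" value="ADJ" value-info="ADJD"/>
        </layer>
    </layers>
</fcs>"""


def resolve_spans(root):
    """Return (layer type, covered text, value) for every span."""
    # Map segment ids to their text content.
    segments = {seg.get("id"): seg.text or "" for seg in root.iter("seg")}
    resolved = []
    for layer in root.iter("layer"):
        for span in layer.iter("span"):
            ids = span.get("segs").split()
            text = "".join(segments[seg_id] for seg_id in ids)
            resolved.append((layer.get("type"), text, span.get("value")))
    return resolved


root = ET.fromstring(SAMPLE)
resolved = resolve_spans(root)
print(resolved)
```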

Proposal 2

Who: ljo (Språkbanken)
What: Multilayered anchor-based annotations allowing parallel annotations of the same kind/type
Code: This was merged to become proposal 5.
Issues/Discussion:
- This can look a bit crowded, so I don't expect anyone to get too excited about it, but bits and pieces might at least spur some discussion. We use it for text corpora, and anchors are only ever added, never removed, for the sake of persistence. This differs a bit from our DataView perspective, which is for transport only and quite short-lived.

Proposal 3

Who: Peter Beinema (MPI)
What: attempt to map the EAF description of CGN utterance fn000783.1 to a DataView example

Code:

<!-- a transcription from Spoken Dutch Corpus (Corpus Gesproken Nederlands / CGN) utterance fn000783.1 containing Words, PoS, Lemma, and Phonetics. -->
<!--
     NB 1: A <data> part as mentioned by Oliver in proposal 1 can certainly be generated; something like it is readily available.
           Indexing on character/codepoint level is not used in CGN, however; CGN's basic unit is the Word (almost, but not exactly, a token).
     NB 2: Approx. 10% of the utterances in CGN are phonetically transcribed in a variant of X-SAMPA (cf. phonetics layer).
           For each Word/Token that is phonetically transcribed, start and end time in milliseconds are given in a separate TIME_ORDER 'layer'
           that contains TIME_SLOT_IDs that are referenced elsewhere.
           It is probably most elegant to integrate start/end times in the phonetics layer and remove the TIME_ORDER layer (see below).
     NB 3: Currently, the phonetic transcriptions map to the tokens, but not the other way around.
           Searching for literal X-SAMPA strings will miss most occurrences of a specific orthographic token.
     NB 4: Information on pauses etc. is not available.
     NB 5: Start and end times of subsequent tokens can overlap.
     NB 6: CGN has its own PoS tag set. It can be mapped to and from UD either on a very detailed level (including attributes - thank you, Dan),
           or on a less fine-grained level (tags with as few attributes as possible to make some necessary distinctions) - see below.
-->
<fcs>
     <data><![CDATA[t da 's de enige echte hoop voor ons mensen.]]></data>
           <!--     12345678901234567890123456789012345678901234 -->
           <!--     0        1         2         3         4     -->
     <segments LINGUISTIC_TYPE_REF="Words" PARTICIPANT="N01999">
         <seg id="fn000783.1.1" start="1" end="1" start_time="0" end-time="173">t</seg>
         <seg id="fn000783.1.2" start="3" end="4" start_time="173" end-time="304">da</seg>
         <seg id="fn000783.1.3" start="6" end="7" start_time="173" end-time="304">'s</seg>
         <seg id="fn000783.1.4" start="9" end="10" start_time="304" end-time="480">de</seg>
         <seg id="fn000783.1.5" start="12" end="16" start_time="480" end-time="1119">enige</seg>
         <seg id="fn000783.1.6" start="18" end="22" start_time="1339" end-time="1901">echte</seg>
         <seg id="fn000783.1.7" start="24" end="27" start_time="1901" end-time="2427">hoop</seg>
         <seg id="fn000783.1.8" start="29" end="32" start_time="3084" end-time="3493">voor</seg>
         <seg id="fn000783.1.9" start="34" end="36" start_time="3493" end-time="3754">ons</seg>
         <seg id="fn000783.1.10" start="38" end="43" start_time="3754" end-time="4274">mensen</seg>
     </segments>
     <layers>
         <layer type="words">
             <span segs="fn000783.1.1">t</span>
             <span segs="fn000783.1.2">da</span>
             <span segs="fn000783.1.3">'s</span>
             <span segs="fn000783.1.4">de</span>
             <span segs="fn000783.1.5">enige</span>
             <span segs="fn000783.1.6">echte</span>
             <span segs="fn000783.1.7">hoop</span>
             <span segs="fn000783.1.8">voor</span>
             <span segs="fn000783.1.9">ons</span>
             <span segs="fn000783.1.10">mensen</span>
         </layer>
         <layer type="pos">                                                              <!-- CGN <=> UD: bijective translation table -->
             <span segs="fn000783.1.1" value="X"/>                                       <!-- CGN: "SPEC(afgebr)"/> -->
             <span segs="fn000783.1.2" value="PRON"/>                                    <!-- CGN: "VNW(aanw,pron,stan,vol,3o,ev)"/> -->
             <span segs="fn000783.1.3" value="VERB"/>                                    <!-- CGN: "WW(pv,tgw,ev)"/> -->
             <span segs="fn000783.1.4" value="DET(PronType==ART)"/>                      <!-- CGN: "LID(bep,stan,rest)"/> -->
             <span segs="fn000783.1.5" value="DET(NumType undefined, PronType != Art)"/> <!-- CGN: "VNW(onbep,det,stan,prenom,met-e,rest)"/> -->
             <span segs="fn000783.1.6" value="ADJ(NumType != Ordinal)"/>                 <!-- CGN: "ADJ(prenom,basis,met-e,stan)"/> -->
             <span segs="fn000783.1.7" value="NOUN"/>                                    <!-- CGN: "N(soort,ev,basis,zijd,stan)"/> -->
             <span segs="fn000783.1.8" value="ADP"/>                                     <!-- CGN: "VZ(init)"/> -->
             <span segs="fn000783.1.9" value="PRON"/>                                    <!-- CGN: "VNW(pr,pron,obl,vol,1,mv)"/> -->
             <span segs="fn000783.1.10" value="NOUN"/>                                   <!-- CGN: "N(soort,mv,basis)"/> -->
         </layer>
         <layer type="lemma">
             <span segs="fn000783.1.1" value="_"/>
             <span segs="fn000783.1.2" value="dat"/>
             <span segs="fn000783.1.3" value="zijn"/>
             <span segs="fn000783.1.4" value="de"/>
             <span segs="fn000783.1.5" value="enig"/>
             <span segs="fn000783.1.6" value="echt"/>
             <span segs="fn000783.1.7" value="hoop"/>
             <span segs="fn000783.1.8" value="voor"/>
             <span segs="fn000783.1.9" value="ons"/>
             <span segs="fn000783.1.10" value="mens"/>
         </layer>
         <layer type="phonetics">
             <span segs="fn000783.1.1" value="t@" start_time="0" end-time="173"/>
             <span segs="fn000783.1.2" value="dAz" start_time="173" end-time="304"/>
             <span segs="fn000783.1.3" value="dAz" start_time="173" end-time="304"/>
             <span segs="fn000783.1.4" value="d@" start_time="304" end-time="480"/>
             <span segs="fn000783.1.5" value="en@G@" start_time="480" end-time="1119"/>
             <span segs="fn000783.1.6" value="Ext@" start_time="1339" end-time="1901"/>
             <span segs="fn000783.1.7" value="hop" start_time="1901" end-time="2427"/>
             <span segs="fn000783.1.8" value="for" start_time="3084" end-time="3493"/>
             <span segs="fn000783.1.9" value="Ons" start_time="3493" end-time="3754"/>
             <span segs="fn000783.1.10" value="mEns@" start_time="3754" end-time="4274"/>
         </layer>
     </layers>
</fcs>

Proposal 4

Who: Oliver (IDS), Leif-Jöran (Språkbanken)
What: Stand-off representation of MPI example

Code:

<fcs>
    <layers time-unit="millis" audio-file-ref="http://hdl.handle.net/8015/1234567890">
        <layer id="http://endpoint.example.org/layers/orth1" type="orth">
            <span start="1"  end="1"  time-start="0"    time-end="173">t</span>
            <span start="3"  end="4"  time-start="173"  time-end="304">da</span> 
            <span start="6"  end="7"  time-start="173"  time-end="304">'s</span> 
            <span start="9"  end="10" time-start="304"  time-end="480">de</span> 
            <span start="12" end="16" time-start="480"  time-end="1119">enige</span> 
            <span start="18" end="22" time-start="1339" time-end="1901">echte</span> 
            <span start="24" end="27" time-start="1901" time-end="2427">hoop</span> 
            <span start="29" end="32" time-start="3084" time-end="3493">voor</span> 
            <span start="34" end="36" time-start="3493" time-end="3754">ons</span> 
            <span start="38" end="43" time-start="3754" time-end="4274">mensen</span> 
        </layer>
        <layer id="http://endpoint.example.org/layers/words1" type="words">
            <span start="1"  end="1"  time-start="0"    time-end="173">word</span>
            <span start="3"  end="4"  time-start="173"  time-end="304">word</span>
            <span start="6"  end="7"  time-start="173"  time-end="304">word</span>
            <span start="9"  end="10" time-start="304"  time-end="480">word</span>
            <span start="12" end="16" time-start="480"  time-end="1119">word</span>
            <span start="18" end="22" time-start="1339" time-end="1901">word</span>
            <span start="24" end="27" time-start="1901" time-end="2427">word</span>
            <span start="29" end="32" time-start="3084" time-end="3493">word</span>
            <span start="34" end="36" time-start="3493" time-end="3754">word</span>
            <span start="38" end="43" time-start="3754" time-end="4274">word</span>
        </layer>
        <layer id="http://endpoint.example.org/layers/pos1" type="pos" alt-value-info="CGN POS tagset">
            <span start="1"  end="1"  time-start="0"    time-end="173"  alt-value="SPEC(afgebr)">X</span>
            <span start="3"  end="4"  time-start="173"  time-end="304"  alt-value="VNW(aanw,pron,stan,vol,3o,ev)">PRON</span>
            <span start="6"  end="7"  time-start="173"  time-end="304"  alt-value="WW(pv,tgw,ev)">VERB</span>
            <span start="9"  end="10" time-start="304"  time-end="480"  alt-value="LID(bep,stan,rest)">DET</span>
            <span start="12" end="16" time-start="480"  time-end="1119" alt-value="VNW(onbep,det,stan,prenom,met-e,rest)">DET</span>
            <span start="18" end="22" time-start="1339" time-end="1901" alt-value="ADJ(prenom,basis,met-e,stan)">ADJ</span>
            <span start="24" end="27" time-start="1901" time-end="2427" alt-value="N(soort,ev,basis,zijd,stan)">NOUN</span>
            <span start="29" end="32" time-start="3084" time-end="3493" alt-value="VZ(init)">ADP</span>
            <span start="34" end="36" time-start="3493" time-end="3754" alt-value="VNW(pr,pron,obl,vol,1,mv)">PRON</span>
            <span start="38" end="43" time-start="3754" time-end="4274" alt-value="N(soort,mv,basis)">NOUN</span>
        </layer>
        <layer id="http://endpoint.example.org/layers/lemma1" type="lemma">
            <span start="1"  end="1"  time-start="0"    time-end="173">_</span>
            <span start="3"  end="4"  time-start="173"  time-end="304">dat</span>
            <span start="6"  end="7"  time-start="173"  time-end="304">zijn</span>
            <span start="9"  end="10" time-start="304"  time-end="480">de</span>
            <span start="12" end="16" time-start="480"  time-end="1119">enig</span>
            <span start="18" end="22" time-start="1339" time-end="1901">echt</span>
            <span start="24" end="27" time-start="1901" time-end="2427">hoop</span>
            <span start="29" end="32" time-start="3084" time-end="3493">voor</span>
            <span start="34" end="36" time-start="3493" time-end="3754">ons</span>
            <span start="38" end="43" time-start="3754" time-end="4274">mens</span>
        </layer>
        <layer id="http://endpoint.example.org/layers/phonetics1" type="phonetics">
            <span start="1"  end="1"  time-start="0"    time-end="173">t@</span>
            <span start="3"  end="4"  time-start="173"  time-end="304">dAz</span>
            <span start="6"  end="7"  time-start="173"  time-end="304">dAz</span>
            <span start="9"  end="10" time-start="304"  time-end="480">d@</span>
            <span start="12" end="16" time-start="480"  time-end="1119">en@G@</span>
            <span start="18" end="22" time-start="1339" time-end="1901">Ext@</span>
            <span start="24" end="27" time-start="1901" time-end="2427">hop</span>
            <span start="29" end="32" time-start="3084" time-end="3493">for</span>
            <span start="34" end="36" time-start="3493" time-end="3754">Ons</span>
            <span start="38" end="43" time-start="3754" time-end="4274">mEns@</span>
        </layer>
    </layers>
</fcs>

Discussion:
- No preferred/primary layer; all layers are equal.
- Each <layer> has a @type, which denotes the type of the layer (closed controlled vocabulary).
  - If several layers of the same type are available, the Aggregator would initially display the first.
- Each <layer> has an @id, which uniquely identifies the layer (within the XML snippet); URIs are used to foster uniqueness.
- Each <span> has @start and @end offsets, which establish its relation to the spans in other layers.
- The actual "content" is stored as PCDATA in <span>. Additional data, e.g. the original tags in the case of pos, can be supplied in the @alt-value attribute. Information about what is contained there must be present in @alt-value-info on <layer>.
- Optionally, a <span> may carry @time-start and @time-end attributes; their unit is denoted by @time-unit on <layers>. Additionally, a reference to an audio file can be supplied by @audio-file-ref on <layers>; this can be used to generate links for playback.
Issues:
- Quite a lot of redundancy (especially when the segments are more or less the same across layers).
- Time info adds even more redundancy. Furthermore, it's unclear how this information is supposed to be combined into a direct playback link.
  - Possible solution: endpoints need to provide a complete and valid playback link?
- "Marker layers" (e.g. "words") only mark intervals but contain useless PCDATA.
  - Possible solution: allow empty PCDATA (or, better, empty <span> elements in that case)?
- What is the Aggregator supposed to display if it wants/needs to display offsets? The artificial character offsets make no real sense in the spoken data example. Readable start/end times (e.g. 00:01:12.22/00:01:23.42) would probably be nicer.
  - Possible solution: add a display label?
- Hit markers (highlights) are missing.
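
As an illustration of the "display label" idea, an endpoint could derive a readable label from the millisecond values it already has. The sketch below is only one possible reading: it assumes `time-unit="millis"` and a `HH:MM:SS.mmm/HH:MM:SS.mmm` label format. The function names `format_millis` and `time_label` are hypothetical.

```python
def format_millis(ms):
    """Format a non-negative millisecond offset as HH:MM:SS.mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def time_label(start_ms, end_ms):
    """Build a start/end display label from two millisecond offsets."""
    return f"{format_millis(start_ms)}/{format_millis(end_ms)}"


# Values taken from the span for "echte" (time-start="1339", time-end="1901").
label = time_label(1339, 1901)
print(label)
```

If endpoints produced such labels themselves, the Aggregator would not need to know the time unit at all.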

Proposal 5

Who: Oliver (IDS)
What: stand-off with common segments (derived from preceding proposal)

Code:

<fcs>
    <segments>
        <!-- NB: times are probably bogus; values should be considered as examples -->
        <segment id="s1"  start="1"  end="1" time-label="00:00:00.000/00:00:02.053"
                 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=0:173"/>
        <segment id="s2"  start="3"  end="4" time-label="00:00:02.053/00:00:05.004"
                 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=173:304"/>
        <segment id="s3"  start="6"  end="7" time-label="00:00:02.053/00:00:05.004"
                 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=173:304"/> 
        <segment id="s4"  start="9"  end="10" time-label="00:00:05.004/00:00:07.000"
                 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=304:480"/>
        <segment id="s5"  start="12" end="16" time-label="00:00:07.000/00:00:18.039"
                 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=480:1119"/>
        <segment id="s6"  start="18" end="22" time-label="00:00:22.019/00:00:31.041"
                 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=1339:1901"/>
        <segment id="s7"  start="24" end="27" time-label="00:00:31.041/00:00:40.027"
                 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=1901:2427"/>
        <segment id="s8"  start="29" end="32" time-label="00:00:51.024/00:00:58.013"
                 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=3084:3493"/>
        <segment id="s9"  start="34" end="36" time-label="00:00:58.013/00:01:02.034"
                 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=3493:3754"/>
        <segment id="s10" start="38" end="43" time-label="00:01:02.034/00:01:11.074"
                 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=3754:4274"/> 
    </segments>
    <layers>
        <layer id="http://endpoint.example.org/layers/orth1" type="orth">
            <span id="s1">t</span>
            <span id="s2">da</span> 
            <span id="s3">'s</span> 
            <span id="s4">de</span> 
            <span id="s5">enige</span> 
            <span id="s6">echte</span> 
            <span id="s7">hoop</span> 
            <span id="s8">voor</span> 
            <span id="s9">ons</span> 
            <span id="s10">mensen</span> 
        </layer>
        <layer id="http://endpoint.example.org/layers/words1" type="words">
            <span id="s1">word</span>
            <span id="s2">word</span>
            <span id="s3">word</span>
            <span id="s4">word</span>
            <span id="s5">word</span>
            <span id="s6">word</span>
            <span id="s7">word</span>
            <span id="s8">word</span>
            <span id="s9">word</span>
            <span id="s10">word</span>
        </layer>
        <layer id="http://endpoint.example.org/layers/pos1" type="pos" alt-value-info="CGN POS tagset">
            <span id="s1"  alt-value="SPEC(afgebr)">X</span>
            <span id="s2"  alt-value="VNW(aanw,pron,stan,vol,3o,ev)">PRON</span>
            <span id="s3"  alt-value="WW(pv,tgw,ev)">VERB</span>
            <span id="s4"  alt-value="LID(bep,stan,rest)">DET</span>
            <span id="s5"  alt-value="VNW(onbep,det,stan,prenom,met-e,rest)">DET</span>
            <span id="s6"  alt-value="ADJ(prenom,basis,met-e,stan)">ADJ</span>
            <span id="s7"  alt-value="N(soort,ev,basis,zijd,stan)">NOUN</span>
            <span id="s8"  alt-value="VZ(init)">ADP</span>
            <span id="s9"  alt-value="VNW(pr,pron,obl,vol,1,mv)">PRON</span>
            <span id="s10" alt-value="N(soort,mv,basis)">NOUN</span>
        </layer>
        <layer id="http://endpoint.example.org/layers/lemma1" type="lemma">
            <span id="s1">_</span>
            <span id="s2">dat</span>
            <span id="s3">zijn</span>
            <span id="s4">de</span>
            <span id="s5">enig</span>
            <span id="s6">echt</span>
            <span id="s7">hoop</span>
            <span id="s8">voor</span>
            <span id="s9">ons</span>
            <span id="s10">mens</span>
        </layer>
        <layer id="http://endpoint.example.org/layers/phonetics1" type="phonetics">
            <span id="s1">t@</span>
            <span id="s2">dAz</span>
            <span id="s3">dAz</span>
            <span id="s4">d@</span>
            <span id="s5">en@G@</span>
            <span id="s6">Ext@</span>
            <span id="s7">hop</span>
            <span id="s8">for</span>
            <span id="s9">Ons</span>
            <span id="s10">mEns@</span>
        </layer>
    </layers>
</fcs>

Discussion:
- Segments are listed in <segments> and may carry an optional @time-label and @ref.
  - The format of @ref is deliberately not specified in more detail, because it is highly endpoint-dependent. The endpoint must make sure to supply the correct URI here. Of course, handles are preferred.
  - The endpoint shall convert the time label to the proposed time format. Thus, if the Aggregator displays the times, they are consistent across endpoints.
- Segments are considered "a bag of segments", so they may freely overlap. Their only purpose is to reduce redundancy in the XML serialization.
Issues:
- Still a pretty complicated format, I guess?
- Useless PCDATA in "marker layer"
- Highlights still missing
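
To make the "common segments" idea more concrete, the following Python sketch joins the layers of a Proposal 5 snippet into one row per segment. It assumes that a span's `id` refers to a segment `id`, and it keeps only the first layer of each type (mirroring the "display the first" behaviour mentioned for Proposal 4). The function name `rows_by_segment` and the shortened sample are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the Proposal 5 format (two segments, two layers).
SAMPLE = """<fcs>
    <segments>
        <segment id="s1" start="1" end="1"/>
        <segment id="s2" start="3" end="4"/>
    </segments>
    <layers>
        <layer id="http://endpoint.example.org/layers/orth1" type="orth">
            <span id="s1">t</span>
            <span id="s2">da</span>
        </layer>
        <layer id="http://endpoint.example.org/layers/pos1" type="pos" alt-value-info="CGN POS tagset">
            <span id="s1" alt-value="SPEC(afgebr)">X</span>
            <span id="s2" alt-value="VNW(aanw,pron,stan,vol,3o,ev)">PRON</span>
        </layer>
    </layers>
</fcs>"""


def rows_by_segment(root):
    """Collect, per segment id, the span content of each layer type."""
    rows = {seg.get("id"): {} for seg in root.iter("segment")}
    for layer in root.iter("layer"):
        for span in layer.iter("span"):
            # setdefault keeps the value from the first layer of a given type.
            rows[span.get("id")].setdefault(layer.get("type"), span.text)
    return rows


root = ET.fromstring(SAMPLE)
rows = rows_by_segment(root)
print(rows)
```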

Proposal N

Who: Name (Center)
What: short description
Code:
```

```
Discussion/Issues:
- ...

General Discussion

Who: Hanna (HZSK)

What: Just one (late, sorry) comment on the rather wide question of segments, words/tokens(?), "orth" etc.

I think the two main differences between this dataview and the very similar, but more restricted, TCF of CLARIN-D are, firstly, the use of segments that are not defined as tokens/words and can thus be used to model different, overlapping segmentations (e.g. linguistic vs. time/space-based) of the data, and secondly, the generic modelling of custom annotation layers, which of course still allows us to agree on basic layers that endpoints should aim to provide.

In resources with multiple segmentations, the segments wouldn't always be words/tokens as they are in Proposal 5. However, transcriptions of spoken language, even of dialogue, can behave very much like texts if you force them to and disregard some aspects that are very important when modelling the data but matter less for one dataview within a common basic search such as the FCS. The transcribed text has often been tokenized, and parallel events can be roughly serialized. We then just have to remember that such a representation of complex data lacks accuracy, and that context ("within three words" etc.) isn't really clear-cut for spoken data (and e.g. might be limited to one speaker's utterances at a time).

So maybe we should at least consider, if not ditching the implementation of non-token segments for now altogether, then adding a token layer and using it as the base for the annotations, stating this explicitly for the layers, since this is in fact what we have right now. The "orth" and the "words" layers wouldn't have to be redundant in the same way, and we could agree on some level of standardization for the token layer and then maybe add a layer like "diplomatic" or "non-standard" for the original transcription (as found in both transcriptions of spoken language and historic manuscripts). With an explicit reference to the annotation base, we could still theoretically support non-token segments and segment-based layers in some distant future.

And talking about the distant future: since most resources do not have any aligned audio, video or other sources, it might be more convenient to add separate (annotation) layers with time labels and links for source alignment instead of adding this to the segments of only some of the resources. If a better grasp of the search results in complex data is still needed, maybe the endpoints could at some point also provide something like a custom HTML visualization, where e.g. the dialogue context would become clear to the user, at least when browsing the search results.

...

Last modified on 10/15/15 09:48:33
