wiki:MIME format variants

This is a proposal for a common catalogue of MIME format variants to be recognized by WebLicht and eventually other CLARIN services.

It originated from a discussion across the Dev, Standards and TEIweblicht mailing lists (cf. June and July). The original proposal is due to Thomas Schmidt with feedback from Bryan Jurish.

  • ISO/TEI transcriptions of spoken language will be identified by the MIME type application/tei+xml;format-variant=tei-iso-spoken. A parameter tokenized=0/1 can be added to indicate whether (=1) or not (=0) the respective TEI file is tokenized (i.e. has <w> markup).
  • DTA/TEI files will be identified by the MIME type application/tei+xml;format-variant=tei-dta. A parameter tokenized=0/1 can be added to indicate whether (=1) or not (=0) the respective TEI file is tokenized (i.e. has <w> markup).
  • EXMARaLDA Basic Transcriptions will be identified by the MIME type application/xml;format-variant=exmaralda-exb
  • FOLKER/OrthoNormal transcription files will be identified by the MIME type application/xml;format-variant=folker-fln
  • Transcriber transcription files will be identified by the MIME type application/xml;format-variant=transcriber-trs
  • CLAN/CHAT transcription files will be identified by the MIME type text/plain;format-variant=clan-cha
  • TCF will be identified by the MIME type application/xml;format-variant=weblicht-tcf
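
As an illustration only, the catalogue above could be checked in client code roughly as follows. This is a minimal Python sketch and not part of the proposal: the names parse_mime, identify and KNOWN_VARIANTS are hypothetical, the parser deliberately ignores quoted parameter values that contain semicolons, and it assumes that parameter names are matched case-insensitively and that whitespace after ";" is insignificant.

```python
# Hypothetical lookup table: format-variant value -> expected base MIME type,
# copied from the list above.
KNOWN_VARIANTS = {
    "tei-iso-spoken": "application/tei+xml",
    "tei-dta": "application/tei+xml",
    "exmaralda-exb": "application/xml",
    "folker-fln": "application/xml",
    "transcriber-trs": "application/xml",
    "clan-cha": "text/plain",
    "weblicht-tcf": "application/xml",
}


def parse_mime(value):
    """Split a MIME type string into (base type, parameter dict).

    Simplified: does not handle quoted values containing ';'.
    """
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue
        name, _, val = part.partition("=")
        params[name.strip().lower()] = val.strip().strip('"')
    return media_type, params


def identify(value):
    """Return (format-variant, tokenized) or None if the type is not in the catalogue.

    tokenized is True for "1", False for "0", and None if absent or unrecognized.
    The proposal only defines tokenized for the TEI variants; this sketch does
    not enforce that restriction.
    """
    media_type, params = parse_mime(value)
    variant = params.get("format-variant")
    if KNOWN_VARIANTS.get(variant) != media_type:
        return None
    tokenized = {"1": True, "0": False}.get(params.get("tokenized"))
    return variant, tokenized
```

For example, identify("application/tei+xml;format-variant=tei-dta;tokenized=1") yields ("tei-dta", True), while a format-variant that is unknown or paired with the wrong base type yields None.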

Notes by Thomas:

  • It would have to be checked (note the passive, I don't know who could be in charge of this) whether competing MIME types for these file types are already registered somewhere. I know that WebLicht already seems to have two variants of EXMARaLDA transcriptions. The mechanisms specifying those would probably have to be deprecated. Transcriber and CHAT are also not unlikely to have been given some kind of MIME type elsewhere in CLARIN.
  • Further relevant formats will be ELAN/EAF and PRAAT/TextGrid (the latter being a text, not an XML format). Both are also likely to have been registered somewhere already, so "someone" (again, I wouldn't know who) should check if MIME types have been defined for those.

The standard "charset" parameter can be used in addition to format-variant where a character encoding needs to be declared, e.g. for plain text files.
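
A small sketch of how a charset parameter might sit next to format-variant in one type string. The helper name build_mime is hypothetical, and both the charset value utf-8 and the parameter order are assumptions made for the example; the proposal does not prescribe either.

```python
def build_mime(media_type, params):
    """Join a base MIME type and its parameters into a single string.

    Parameters are emitted in the insertion order of the dict; values are
    not quoted or escaped, so this only suits simple token values.
    """
    return ";".join([media_type] + [f"{name}={value}" for name, value in params.items()])


# Assumed example: a CLAN/CHAT file declared as UTF-8 encoded plain text.
chat_type = build_mime("text/plain", {"charset": "utf-8", "format-variant": "clan-cha"})
print(chat_type)
```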

The above is intended to address points 1-3 in Marie Hinrichs' list below:

From WebLicht’s side, there are several places where some work/coordination needs to happen:

  1. TCF: agree on the textsource.type attribute and make sure that the encoder services set it properly
  2. Agree on type names (cf. format-variants above)
  3. Make sure the CMDI for encoder and decoder services reflect outcomes of 1 and 2
  4. Add new mappings to WebLicht for TEI.

Inventory of MIME types as currently used in CLARIN centres

See https://docs.google.com/spreadsheets/d/1KeuZ-0sKiLuguJtKqRfTp0SwHYMME9cTkbuvJD_vRIg/edit?usp=sharing and feel free to comment

Last modified on 05/17/17 08:37:00