
Part of Speech tag sets for FCS

In search of a simple part-of-speech tag set for the CLARIN Federated Content Search (FCS)

The POS tag set shall be used by the human users of the FCS aggregator.

Sources

EAGLES: EAGLES Recommendations for the morphosyntactic annotation of corpora, Obligatory attributes/values, Major Categories, §4.2.1 on page 7

STTS/HW: Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset), Hauptwortarten, Tabelle 2.1 on page 4

TANL: Tanl Coarse Grained Tags, http://medialab.di.unipi.it/wiki/Tanl_POS_Tagset#Coarse-grained_tags

UD-17: Universal Dependencies

UP-12: Universal POS, 2nd paragraph in the right-hand column of page 2

Summary table

The summary table shows the tags drawn from the tag sets quoted above (in their original spelling).

| EAGLES | STTS/HW | TANL | UP-12 | UD-17 | Notes |
|--------|---------|------|-------|-------|-------|
| N | N | S | NOUN | NOUN | noun |
| | | | | PROPN | proper noun/named entity |
| V | V | V | VERB | VERB | verb |
| | | | | AUX | auxiliary verb (includes modal auxiliaries like should or must) |
| AJ | ADJ | A | ADJ | ADJ | adjective |
| PD | P | | | | pronoun/determiner (as one single class) |
| AT | ART | R | | | article |
| | | D | DET | DET | determiner |
| | | T | | | predeterminer (e.g., tutto il giorno) |
| | | P | PRON | PRON | pronoun |
| AV | ADV | B | ADV | ADV | adverb |
| AP | AP | E | ADP | ADP | adposition (circum-, pre-, postposition) |
| C | KO | C | CONJ | CONJ | conjunction |
| | | | | SCONJ | subordinating conjunction |
| NU | CARD | N | NUM | NUM | numeral; cardinal numeral (ordinals are tagged as adjectives or adverbs) |
| I | ITJ | I | | INTJ | interjection |
| U | PTK | | PRT | PART | unique; particle |
| R | | X | X | X | residual; other |
| | | | | SYM | symbol ($, %, §, ©, +, −, 😝, http://example.org, a@example.org) |
| PU | | F | . | PUNCT | punctuation |
| 13 | 11 | 14 | 12 | 17 | total number of tags |
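
The summary table can also be written down as data, which makes it possible to re-check the totals row and to look up the tag that shares a row with a given tag. The following Python sketch is illustrative only: the names TAGSETS, ROWS, tag_counts and translate are hypothetical, and the assignment of tags to columns is a reading of the table that relies on the notes below and on the totals row.

```python
# Column order follows the summary table.
TAGSETS = ("EAGLES", "STTS/HW", "TANL", "UP-12", "UD-17")

# One tuple per table row; None means the tag set has no tag for that row.
# Assumption: the column placement is inferred from the notes and the totals.
ROWS = [
    ("N", "N", "S", "NOUN", "NOUN"),       # noun
    (None, None, None, None, "PROPN"),     # proper noun
    ("V", "V", "V", "VERB", "VERB"),       # verb
    (None, None, None, None, "AUX"),       # auxiliary verb
    ("AJ", "ADJ", "A", "ADJ", "ADJ"),      # adjective
    ("PD", "P", None, None, None),         # pronoun/determiner as one class
    ("AT", "ART", "R", None, None),        # article
    (None, None, "D", "DET", "DET"),       # determiner
    (None, None, "T", None, None),         # predeterminer
    (None, None, "P", "PRON", "PRON"),     # pronoun
    ("AV", "ADV", "B", "ADV", "ADV"),      # adverb
    ("AP", "AP", "E", "ADP", "ADP"),       # adposition
    ("C", "KO", "C", "CONJ", "CONJ"),      # conjunction
    (None, None, None, None, "SCONJ"),     # subordinating conjunction
    ("NU", "CARD", "N", "NUM", "NUM"),     # numeral
    ("I", "ITJ", "I", None, "INTJ"),       # interjection
    ("U", "PTK", None, "PRT", "PART"),     # unique; particle
    ("R", None, "X", "X", "X"),            # residual; other
    (None, None, None, None, "SYM"),       # symbol
    ("PU", None, "F", ".", "PUNCT"),       # punctuation
]


def tag_counts(rows=ROWS):
    """Count the tags each tag set contributes to the table."""
    return {
        name: sum(1 for row in rows if row[i] is not None)
        for i, name in enumerate(TAGSETS)
    }


def translate(tag, source, target, rows=ROWS):
    """Return the target tag in the same row as `tag`, or None if the
    target tag set has no tag in that row."""
    s, t = TAGSETS.index(source), TAGSETS.index(target)
    for row in rows:
        if row[s] == tag:
            return row[t]
    raise KeyError(f"{tag!r} is not a {source} tag in the table")


print(tag_counts())
print(translate("KO", "STTS/HW", "UD-17"))  # CONJ
```

A lookup must always name its source tag set, because the same letters mean different things in different sets: N is the noun in EAGLES but the numeral in TANL.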

Notes:

STTS/HW lacks generic tags for punctuation and other; they could be supplemented as $ (for punctuation) and X (for other).

UP-12 lacks a tag for interjection; interjections are mapped to the class X (cf. https://code.google.com/p/universal-pos-tags/source/browse/trunk/nl-alpino.map).

UP-12 and UD-17 lack a tag for article; it is absorbed into the determiner class.

UP-12 and UD-17 have a determiner class that is mostly split off from the pronoun class; this separation is completely unnatural to German speakers (compare the translation from STTS to UD-17).

The use of DET is inconsistent in UD-17; e.g., Latin seems to have no determiners at all (cf. Tagset la::conll).

TANL lacks the category particle; the Italian negation particle non is classified as an adverb.

TANL considers article, determiner, predeterminer, and pronoun as first-class citizens in parts-of-speech.

TANL has one-letter class names: elegant, but not necessarily mnemonic.
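
To illustrate how these gaps play out when converting between tag sets, here is a minimal Python sketch that maps STTS/HW tags to UP-12. It is not an official mapping: the name STTS_HW_TO_UP12 and the function stts_hw_to_up12 are hypothetical, the tags $ and X are the supplements suggested above, and returning both DET and PRON for the coarse class P is an assumption made here because the coarse tag alone cannot decide between them.

```python
# Each STTS/HW tag maps to a tuple of UP-12 candidates.
# Plain entries follow the rows of the summary table.
STTS_HW_TO_UP12 = {
    "N": ("NOUN",),
    "V": ("VERB",),
    "ADJ": ("ADJ",),
    "ADV": ("ADV",),
    "AP": ("ADP",),
    "KO": ("CONJ",),
    "CARD": ("NUM",),
    "PTK": ("PRT",),
    # UP-12 has no article tag; articles are absorbed into DET.
    "ART": ("DET",),
    # UP-12 has no interjection tag; interjections go to X.
    "ITJ": ("X",),
    # Assumption: P covers pronouns and determiners as one class, so the
    # coarse tag cannot be resolved and both candidates are returned.
    "P": ("DET", "PRON"),
    # Supplemented tags that STTS/HW itself lacks.
    "$": (".",),
    "X": ("X",),
}


def stts_hw_to_up12(tag):
    """Return the UP-12 candidates for a (supplemented) STTS/HW tag."""
    try:
        return STTS_HW_TO_UP12[tag]
    except KeyError:
        raise ValueError(f"unknown STTS/HW tag: {tag!r}") from None


print(stts_hw_to_up12("ITJ"))  # ('X',)
print(stts_hw_to_up12("P"))    # ('DET', 'PRON')
```

Resolving P to a single UP-12 tag would need the fine-grained STTS tag or the token's context, neither of which this sketch attempts.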

Decision

A poll was taken in the video meeting on 2015-03-09; for the result, see VidConf20150309.

Last modified on 03/11/15 17:16:11