Part of Speech tag sets for FCS
In search of a simple part-of-speech tag set for CLARIN Federated Content Search (FCS)
The POS tag set shall be used by the human users of the FCS aggregator.
Sources
EAGLES: EAGLES Recommendations for the morphosyntactic annotation of corpora, Obligatory attributes/values, Major Categories, §4.2.1 on page 7
STTS/HW: Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset), Hauptwortarten, Tabelle 2.1 on page 4
TANL: Tanl Coarse Grained Tags, http://medialab.di.unipi.it/wiki/Tanl_POS_Tagset#Coarse-grained_tags
UD-17: Universal Dependencies
UP-12: Universal POS, 2nd paragraph in the right-hand column of page 2
Summary table
The summary table shows the tags drawn from the tag sets quoted above (in their original spelling).
| EAGLES | STTS/HW | TANL | UP-12 | UD-17 | Notes |
|---|---|---|---|---|---|
| N | N | S | NOUN | NOUN | noun |
| | | | | PROPN | proper noun/named entity |
| V | V | V | VERB | VERB | verb |
| | | | | AUX | auxiliary verb (includes modal auxiliaries like should or must) |
| AJ | ADJ | A | ADJ | ADJ | adjective |
| PD | P | | | | pronoun/determiner (as one single class) |
| AT | ART | R | | | article |
| | | D | DET | DET | determiner |
| | | T | | | predeterminer (e.g., tutto il giorno) |
| | | P | PRON | PRON | pronoun |
| AV | ADV | B | ADV | ADV | adverb |
| AP | AP | E | ADP | ADP | adposition (circum-, pre-, postposition) |
| C | KO | C | CONJ | CONJ | conjunction |
| | | | | SCONJ | subordinating conjunction |
| NU | CARD | N | NUM | NUM | numeral; cardinal numeral (ordinals are tagged as adjectives or adverbs) |
| I | ITJ | I | | INTJ | interjection |
| U | PTK | | PRT | PART | unique; particle |
| R | | X | X | X | residual; other |
| | | | | SYM | symbol ($, %, §, ©, +, −, 😝, http://example.org, a@example.org) |
| PU | | F | . | PUNCT | punctuation |
| 13 | 11 | 14 | 12 | 17 | total number of tags |
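For tooling, the summary table can also be held as plain data. The following is only a sketch: the names `TAG_SETS`, `ROWS`, `tag_counts`, and `translate` are illustrative and not part of any FCS component, and the placement of each tag in its column is an assumption checked against the column totals and the notes on this page.

```python
# Column order of the summary table.
TAG_SETS = ("EAGLES", "STTS/HW", "TANL", "UP-12", "UD-17")

# One tuple per table row: five tags followed by the note.
# None marks a tag set that has no tag for that class (assumed placement).
ROWS = [
    ("N", "N", "S", "NOUN", "NOUN", "noun"),
    (None, None, None, None, "PROPN", "proper noun/named entity"),
    ("V", "V", "V", "VERB", "VERB", "verb"),
    (None, None, None, None, "AUX", "auxiliary verb"),
    ("AJ", "ADJ", "A", "ADJ", "ADJ", "adjective"),
    ("PD", "P", None, None, None, "pronoun/determiner"),
    ("AT", "ART", "R", None, None, "article"),
    (None, None, "D", "DET", "DET", "determiner"),
    (None, None, "T", None, None, "predeterminer"),
    (None, None, "P", "PRON", "PRON", "pronoun"),
    ("AV", "ADV", "B", "ADV", "ADV", "adverb"),
    ("AP", "AP", "E", "ADP", "ADP", "adposition"),
    ("C", "KO", "C", "CONJ", "CONJ", "conjunction"),
    (None, None, None, None, "SCONJ", "subordinating conjunction"),
    ("NU", "CARD", "N", "NUM", "NUM", "numeral"),
    ("I", "ITJ", "I", None, "INTJ", "interjection"),
    ("U", "PTK", None, "PRT", "PART", "unique; particle"),
    ("R", None, "X", "X", "X", "residual; other"),
    (None, None, None, None, "SYM", "symbol"),
    ("PU", None, "F", ".", "PUNCT", "punctuation"),
]


def tag_counts(rows=ROWS):
    """Count the tags each tag set contributes (the table's last row)."""
    return {
        name: sum(1 for row in rows if row[i] is not None)
        for i, name in enumerate(TAG_SETS)
    }


def translate(tag, source, target, rows=ROWS):
    """Return the tag of `target` in the same row as `tag` of `source`.

    Returns None when the target tag set has no tag in that row.
    """
    s, t = TAG_SETS.index(source), TAG_SETS.index(target)
    for row in rows:
        if row[s] == tag:
            return row[t]
    raise KeyError(f"{tag!r} is not a {source} tag")
```

Looking up by column avoids clashes between tag sets that reuse a letter for different classes, e.g. `N` is the noun in EAGLES but the numeral in TANL. With this data, `translate("KO", "STTS/HW", "UD-17")` gives `"CONJ"`, and `tag_counts()` reproduces the totals 13, 11, 14, 12, and 17.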
Notes:
STTS/HW lacks generic tags for punctuation and other; they could be supplemented as $ (for punctuation) and X (for other).
UP-12 lacks a tag for interjection; interjections are mapped to the class X (cf. https://code.google.com/p/universal-pos-tags/source/browse/trunk/nl-alpino.map).
UP-12 and UD-17 lack a tag for article; it is absorbed into the determiner class.
UP-12 and UD-17 have a determiner class that is mostly separated out of the pronoun class; this separation is completely unnatural to German speakers (compare the translation from STTS to UD-17).
The use of DET is inconsistent in UD-17; e.g., Latin seems to have no determiners at all (cf. Tagset la::conll).
TANL lacks the category particle; the Italian negation particle non is classified as an adverb.
TANL considers article, determiner, predeterminer, and pronoun as first-class citizens in parts-of-speech.
TANL has one-letter class names: elegant, but not necessarily mnemonic.
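The gaps described in these notes amount to fallback rules: when a tag set has no tag of its own for a class, a substitute is used. A minimal sketch of that lookup follows. The names `NATIVE`, `FALLBACKS`, and `tag_for` are hypothetical, only the classes mentioned in the notes are covered, and the STTS/HW entries are the supplements proposed above rather than tags of STTS itself.

```python
# Tags that a tag set defines itself (excerpt; only classes from the notes).
NATIVE = {
    "STTS/HW": {"interjection": "ITJ", "article": "ART"},
    "UP-12": {"punctuation": ".", "other": "X"},
    "UD-17": {"interjection": "INTJ", "punctuation": "PUNCT", "other": "X"},
}

# Substitutes taken from the notes, keyed by (tag set, class).
FALLBACKS = {
    ("STTS/HW", "punctuation"): "$",  # proposed supplement, not in STTS/HW
    ("STTS/HW", "other"): "X",        # proposed supplement, not in STTS/HW
    ("UP-12", "interjection"): "X",   # interjections are mapped to X
    ("UP-12", "article"): "DET",      # article absorbed into determiner
    ("UD-17", "article"): "DET",      # article absorbed into determiner
}


def tag_for(tag_set, word_class):
    """Return (tag, is_fallback) for a word class in a tag set."""
    native = NATIVE.get(tag_set, {}).get(word_class)
    if native is not None:
        return native, False
    fallback = FALLBACKS.get((tag_set, word_class))
    if fallback is not None:
        return fallback, True
    raise KeyError(f"no tag known for {word_class!r} in {tag_set}")
```

Returning the `is_fallback` flag alongside the tag lets a caller show the user that a result such as `X` for an interjection in UP-12 is a substitute and not a dedicated tag.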
Decision
A poll was taken in the video meeting on 2015-03-09; for the result, see VidConf20150309.