Version 5 (modified 9 years ago)
CLARIN Federated Content Search Query Language
A working draft of the CQP flavor of the query language for CLARIN Federated Content Search (FCS).
EBNF
[1]  query               ::= main-query within-part?
[2]  main-query          ::= simple-query
                           | "(" main-query ")"           /* grouping */
                           | main-query "|" main-query    /* or */
                           | main-query main-query        /* sequence */
                           | main-query quantifier        /* quantification */
[3]  simple-query        ::= implicit-query
                           | segment-query
[4]  implicit-query      ::= flagged-regexp
[5]  segment-query       ::= "[" expression? "]"
[6]  within-part         ::= simple-within-part
[7]  simple-within-part  ::= "within" simple-within-scope
[8]  simple-within-scope ::= "sentence" | "s" | "utterance" | "u"
                           | "paragraph" | "p" | "turn" | "t"
                           | "text" | "session"
[11] expression          ::= basic-expression
                           | expression "|" expression    /* or */
                           | expression "&" expression    /* and */
                           | "(" expression ")"           /* grouping */
                           | "!" expression               /* not */
[12] basic-expression    ::= attribute operator flagged-regexp
[13] operator            ::= "="                          /* equals */
                           | "!="                         /* non-equals */
[14] quantifier          ::= "+"                          /* one-or-more */
                           | "*"                          /* zero-or-more */
                           | "?"                          /* zero-or-one */
                           | "{" integer "}"              /* exactly n-times */
                           | "{" integer? "," integer "}" /* at most */
                           | "{" integer "," integer? "}" /* min-max */
[15] flagged-regexp      ::= regexp
                           | regexp "/" regexp-flag+
[16] regexp-flag         ::= "i"   /* case-insensitive; Poliqarp/Perl compat */
                           | "I"   /* case-sensitive; Poliqarp compat */
                           | "c"   /* case-insensitive, CQP compat */
                           | "C"   /* case-sensitive */
                           | "l"   /* literal matching, CQP compat */
                           | "d"   /* diacritic agnostic matching, CQP compat */
[17] regexp              ::= quoted-string
[18] attribute           ::= simple-attribute
                           | qualified-attribute
[19] simple-attribute    ::= identifier
[20] qualified-attribute ::= identifier ":" identifier
[21] identifier          ::= identifier-char identifier-char*
[22] identifier-char     ::= [a-zA-Z0-9\-]
[24] integer             ::= [0-9]+
[26] quoted-string       ::= "'" (char | ws)* "'"         /* single-quotes */
                           | """ (char | ws)* """         /* double-quotes */
[27] char                ::= <any unicode codepoint excluding whitespace codepoints>
                           | "\" escaped-char
[28] ws                  ::= <any whitespace codepoint>
[29] escaped-char        ::= "\"   /* backslash (\) */
                           | "'"   /* single quote (') */
                           | """   /* double quote (") */
                           | "n"   /* generic newline, i.e. "\n", "\r", etc. */
                           | "t"   /* character tabulation (U+0009) */
                           | "x" hex hex                          /* Unicode codepoint with hex value hh */
                           | "u" hex hex hex hex                  /* Unicode codepoint with hex value hhhh */
                           | "U" hex hex hex hex hex hex hex hex  /* Unicode codepoint with hex value hhhhhhhh */
[30] hex                 ::= [0-9a-fA-F]
Notes
- based on Poliqarp with inspiration from others
- "attribute": the annotation layer to be used, e.g. "word", "lemma", "pos" or the qualified "pos:stts". The supported values for this construct are beyond the scope of the grammar and need to be defined in supplementary documents.
- "simple-within-scope": possible values for scope
  - "sentence", "s", "utterance", "u": denote a matching scope of something like a sentence or utterance. Provides compatibility with FCS 1.0 ("Generic Hits", "Each hit SHOULD be presented within the context of a complete sentence.")
  - "paragraph" | "p" | "turn" | "t": denote the next larger unit, e.g. something like a paragraph
  - "article" | "session": something like a whole document
- [27] and [28]: "any $SOMETHING codepoint" is a pain to get done easily in at least ANTLR and JavaCC, especially in combination with [29] :/
- regexes are not defined/guarded by this grammar :/
- the gaps in the rule numbering are currently intended; we've already removed some rules. Rules will be renumbered when the grammar is fixed.
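The lexical notes above can be illustrated with a few small helpers. This is only a sketch under stated assumptions, not part of the draft: the function names (`split_attribute`, `scope_level`, `unquote`) and the level labels are hypothetical, the scope keywords follow rule [8], the escape "n" is mapped to a plain "\n", and backslash sequences not listed in rule [29] are rejected (the grammar itself leaves that case open).

```python
import re

IDENTIFIER = r"[a-zA-Z0-9\-]+"  # rules [21] and [22]
ATTRIBUTE = re.compile("(" + IDENTIFIER + ")(?::(" + IDENTIFIER + "))?")  # rules [18]-[20]

# Scope keywords from rule [8], grouped as the notes describe them.
# The labels on the right are hypothetical, chosen for this sketch only.
SCOPE_LEVEL = {
    "sentence": "sentence", "s": "sentence", "utterance": "sentence", "u": "sentence",
    "paragraph": "paragraph", "p": "paragraph", "turn": "paragraph", "t": "paragraph",
    "text": "document", "session": "document",
}

# Rule [29]; mapping "n" to "\n" is a simplifying assumption.
SIMPLE_ESCAPES = {"\\": "\\", "'": "'", '"': '"', "n": "\n", "t": "\t"}
HEX_LENGTH = {"x": 2, "u": 4, "U": 8}


def split_attribute(text):
    """Split "pos:stts" into ("pos", "stts"); a simple attribute yields (name, None)."""
    match = ATTRIBUTE.fullmatch(text)
    if match is None:
        raise ValueError("not a valid attribute: " + repr(text))
    return match.group(1), match.group(2)


def scope_level(keyword):
    """Map a within-scope keyword to the coarse level it denotes."""
    if keyword not in SCOPE_LEVEL:
        raise ValueError("unknown scope: " + repr(keyword))
    return SCOPE_LEVEL[keyword]


def unquote(literal):
    """Strip the quotes of a quoted-string (rule [26]) and resolve its escapes."""
    if len(literal) < 2 or literal[0] not in "'\"" or literal[-1] != literal[0]:
        raise ValueError("not a quoted-string: " + repr(literal))
    quote, body = literal[0], literal[1:-1]
    out, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch == quote:
            # Assumption: the delimiting quote must be escaped inside the string.
            raise ValueError("unescaped quote inside string")
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        esc = body[i + 1 : i + 2]  # empty if the backslash is the last character
        if esc in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[esc])
            i += 2
        elif esc in HEX_LENGTH:
            width = HEX_LENGTH[esc]
            digits = body[i + 2 : i + 2 + width]
            if not re.fullmatch("[0-9a-fA-F]{%d}" % width, digits):
                raise ValueError("bad hex escape")
            out.append(chr(int(digits, 16)))
            i += 2 + width
        else:
            raise ValueError("unknown or dangling escape")
    return "".join(out)
```

Rejecting unknown escapes is the strict reading; a regex-friendly implementation might instead pass sequences such as `\d` through unchanged, which the draft neither requires nor forbids.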
Discussion
Peter Beinema, MPI, proposes some minor changes to the grammar:
- main query [2] / simple query [3]: the definitions above generate structural ambiguity. This is not a problem for ANTLR (which selects the right-recursive solution), but some other parser generators generate all solutions, whose number is exponential in the number of main queries. I propose using the alternative rules given below.
- rule [2] above can generate an unbounded sequence of quantifiers: "word" +*{23}{,17}? would be legal.
- rule [5]: the option marker '?' makes "[]" a valid query. I propose removing the question mark.
- expression [11] / basic expression [12]: structural ambiguity. See the proposed alternative below.
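To make the first two points concrete, the following sketch counts the parse trees that the sequence alternative of rule [2] admits for n juxtaposed simple queries, and splits a run of stacked quantifiers according to rule [14]. The function names are hypothetical, and only the `main-query main-query` alternative is considered; mixing in "|" adds further ambiguity that is not counted here.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def sequence_parses(n):
    """Parse trees for n juxtaposed simple queries under
    main-query ::= simple-query | main-query main-query."""
    if n == 1:
        return 1
    # Choose where the top-level split falls, then parse both halves.
    return sum(sequence_parses(k) * sequence_parses(n - k) for k in range(1, n))


# The six alternatives of rule [14], written as one regular expression.
QUANTIFIER = r"\+|\*|\?|\{\d+\}|\{\d*,\d+\}|\{\d+,\d*\}"


def split_quantifiers(text):
    """Split a run of quantifiers; rule [2] accepts any number of them in a row."""
    if not re.fullmatch("(?:" + QUANTIFIER + ")*", text):
        raise ValueError("not a run of quantifiers: " + repr(text))
    return re.findall(QUANTIFIER, text)


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 10):
        print(n, sequence_parses(n))
    print(split_quantifiers("+*{23}{,17}?"))
```

The counts are the Catalan numbers (1, 1, 2, 5, 14, ...), which is the growth a parser that enumerates all derivations would face. A right-recursive rule such as `simple_query main_query` leaves exactly one tree per sequence.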
// rules presented in ANTLR format

//--- [2v2] no left recursion or ambiguity, 1 (optional) quantifier
main_query
    : simple_query
    | simple_query '|' main_query    // 'or'
    | simple_query main_query        // sequence
    | simple_query quantifier        // quantification
    ;

//--- [3v2]
simple_query
    : '(' main_query ')'
    | implicit_query
    | segment_query
    ;

//--- [5v2] 'expression' no longer optional
segment_query
    : '[' expression ']'
    ;

//--- [11v2]
expression
    : basic_expression
    | basic_expression '|' expression    // or
    | basic_expression '&' expression    // and
    ;

//--- [12v2]
basic_expression
    : '(' expression ')'                 // grouping
    | '!' expression                     // not
    | attribute operator flagged_regexp
    ;
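Because the proposed rules [5v2], [11v2] and [12v2] are free of left recursion, they can be followed almost literally by a hand-written recursive-descent parser. The sketch below is one possible reading, not the attached ANTLR grammar: the token pattern, the tuple-shaped tree and all function names are assumptions made for illustration, and regexps are kept as raw quoted text without resolving escapes.

```python
import re

# Hypothetical token pattern: operators, punctuation, flagged regexps, attributes.
TOKEN = re.compile(r"""\s*(?:
    (?P<op>!=|=)
  | (?P<punct>[()|&!])
  | (?P<regexp>(?:'(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*")(?:/[iIcCld]+)?)
  | (?P<ident>[a-zA-Z0-9\-]+(?::[a-zA-Z0-9\-]+)?)
)""", re.VERBOSE)


def tokenize(text):
    """Turn the inside of a segment query into (kind, value) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError("unexpected input at offset %d" % pos)
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def parse_expression(tokens, i):
    """[11v2] expression : basic_expression (('|' | '&') expression)?"""
    left, i = parse_basic(tokens, i)
    if i < len(tokens) and tokens[i] in (("punct", "|"), ("punct", "&")):
        right, end = parse_expression(tokens, i + 1)
        return (tokens[i][1], left, right), end
    return left, i


def parse_basic(tokens, i):
    """[12v2] basic_expression : '(' expression ')' | '!' expression | attribute operator flagged_regexp"""
    if i >= len(tokens):
        raise SyntaxError("unexpected end of expression")
    kind, value = tokens[i]
    if (kind, value) == ("punct", "("):
        inner, i = parse_expression(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ("punct", ")"):
            raise SyntaxError("expected ')'")
        return inner, i + 1
    if (kind, value) == ("punct", "!"):
        inner, i = parse_expression(tokens, i + 1)
        return ("!", inner), i
    if (kind == "ident" and i + 2 < len(tokens)
            and tokens[i + 1][0] == "op" and tokens[i + 2][0] == "regexp"):
        return (tokens[i + 1][1], value, tokens[i + 2][1]), i + 3
    raise SyntaxError("unexpected token %r" % value)


def parse_segment(text):
    """[5v2] segment_query : '[' expression ']' (the expression is mandatory)."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise SyntaxError("segment query must be enclosed in [ ]")
    tokens = tokenize(text[1:-1])
    tree, end = parse_expression(tokens, 0)
    if end != len(tokens):
        raise SyntaxError("trailing input after expression")
    return tree
```

Two consequences of the proposed rules become visible when trying it out: `[]` is rejected, and because [12v2] defines negation as `'!' expression`, a leading `!` applies to everything to its right unless parentheses limit it. Likewise, "|" and "&" simply group to the right, with no precedence between them.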
Attachments (1)
- FCS_QL_2.g4 (3.7 KB), added 9 years ago: ANTLR (version 4.5) grammar for FCS-QL query