CLARIN Federated Content Search Query Language
A working draft for the CQP flavor for CLARIN Federated Content Search (FCS).
EBNF
[1] query ::= main-query within-part?
[2] main-query ::= simple-query
    | "(" main-query ")" /* grouping */
    | main-query "|" main-query /* or */
    | main-query main-query /* sequence */
    | main-query quantifier /* quantification */
[2.11] main-query ::= simple-query
    | simple-query "|" main-query /* or */
    | simple-query main-query /* sequence */
    | simple-query quantifier /* quantification */
[3] simple-query ::= implicit-query
    | segment-query
[3.11] simple-query ::= '(' main-query ')' /* grouping */
    | implicit-query
    | segment-query
[4] implicit-query ::= flagged-regexp
[5] segment-query ::= "[" expression? "]"
[6] within-part ::= simple-within-part
[7] simple-within-part ::= "within" simple-within-scope
[8] simple-within-scope ::= "sentence" | "s" | "utterance" | "u" | "paragraph" | "p" | "turn" | "t" | "text" | "session"
[11] expression ::= basic-expression
    | expression "|" expression /* or */
    | expression "&" expression /* and */
    | "(" expression ")" /* grouping */
    | "!" expression /* not */
[11.11] expression ::= basic-expression
    | expression "|" expression /* or */
    | expression "&" expression /* and */
[12] basic-expression ::= attribute operator flagged-regexp
[12.11] basic-expression ::= '(' expression ')' /* grouping */
    | "!" expression /* not */
    | attribute operator flagged-regexp
[13] operator ::= "=" /* equals */
    | "!=" /* non-equals */
[14] quantifier ::= "+" /* one-or-more */
    | "*" /* zero-or-more */
    | "?" /* zero-or-one */
    | "{" integer "}" /* exactly n-times */
    | "{" integer? "," integer "}" /* at most */
    | "{" integer "," integer? "}" /* min-max */
[15] flagged-regexp ::= regexp
    | regexp "/" regexp-flag+
[16] regexp-flag ::= "i" /* case-insensitive; Poliqarp/Perl compat */
    | "I" /* case-sensitive; Poliqarp compat */
    | "c" /* case-insensitive, CQP compat */
    | "C" /* case-sensitive */
    | "l" /* literal matching, CQP compat */
    | "d" /* diacritic agnostic matching, CQP compat */
[17] regexp ::= quoted-string
[18] attribute ::= simple-attribute
    | qualified-attribute
[19] simple-attribute ::= identifier
[20] qualified-attribute ::= identifier ":" identifier
[21] identifier ::= identifier-char identifier-char*
[21.11] identifier ::= identifier-first-char identifier-char*
[21.12] identifier-first-char ::= [a-zA-Z]
[22] identifier-char ::= [a-zA-Z0-9\-]
[24] integer ::= [0-9]+
[26] quoted-string ::= "'" (char | ws)* "'" /* single-quotes */
    | '"' (char | ws)* '"' /* double-quotes */
[27] char ::= <any unicode codepoint excluding whitespace codepoints>
    | "\" escaped-char
[28] ws ::= <any whitespace codepoint>
[29] escaped-char ::= "\" /* backslash (\) */
    | "'" /* single quote (') */
    | '"' /* double quote (") */
    | "n" /* generic newline, i.e. "\n", "\r", etc. */
    | "t" /* character tabulation (U+0009) */
    | "x" hex hex /* Unicode codepoint with hex value hh */
    | "u" hex hex hex hex /* Unicode codepoint with hex value hhhh */
    | "U" hex hex hex hex hex hex hex hex /* Unicode codepoint with hex value hhhhhhhh */
[30] hex ::= [0-9a-fA-F]
Notes
- based on Poliqarp with inspiration from others
- "attribute": the annotation layer to be used, e.g. "word", "lemma", "pos" or the qualified "pos:stts". The supported values for this construct are beyond the grammar and need to be defined in supplementary documents.
- "simple-within-scope": possible values for scope
- "sentence", "s", "utterance", "u": denote a matching scope of something like a sentence or utterance. This provides compatibility with FCS 1.0 ("Generic Hits", "Each hit SHOULD be presented within the context of a complete sentence.")
- "paragraph" | "p" | "turn" | "t": denote the next larger unit, e.g. something like a paragraph
- "text" | "session": something like a whole document
- [27] and [28]: "any $SOMETHING codepoint" is a pain to get done easily in at least ANTLR and JavaCC, especially in combination with [29].
- regular expressions are not defined or guarded by this grammar.
- non-contiguous rule numbers are currently intended; we've already removed some rules. Rules will be renumbered when the grammar is fixed.
- Integrated Peter B's suggestions 2v2 and 3v2 together with 11v2 and 12v2 for resolving structural ambiguity, even though ANTLR handles this perfectly fine.
- Changed "identifier" [21] so that it may only start with a letter (i.e. not with a digit or a hyphen), to more closely resemble XML names.
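To make the lexical rules discussed in these notes more concrete, here is a minimal Python sketch of two of them: attribute checking per rules [18] to [22] (using the revised [21.11]) and escape decoding per rule [29]. The function names `parse_attribute` and `unescape` are hypothetical and not part of any FCS library. The sketch follows [29] literally, so regex escapes such as `\d` are rejected; whether an endpoint should instead pass them through to its regex engine is an open question that this draft does not settle.

```python
import re

# Rules [21.11], [21.12], [22]: a letter first, then letters, digits or hyphens.
IDENTIFIER = r"[a-zA-Z][a-zA-Z0-9\-]*"
# Rules [18]-[20]: one identifier, optionally followed by ":" and a second one.
ATTRIBUTE_RE = re.compile(rf"{IDENTIFIER}(?::{IDENTIFIER})?")

# Rule [29]: single characters that may follow a backslash.
SIMPLE_ESCAPES = {"\\": "\\", "'": "'", '"': '"', "n": "\n", "t": "\t"}
# Rule [29]: hex escapes take exactly 2 (x), 4 (u) or 8 (U) hex digits.
ESCAPE_RE = re.compile(
    r"\\(?:x([0-9a-fA-F]{2})|u([0-9a-fA-F]{4})|U([0-9a-fA-F]{8})|(.?))",
    re.DOTALL,
)


def parse_attribute(text: str) -> tuple:
    """Return the identifier(s) of a simple or qualified attribute."""
    if ATTRIBUTE_RE.fullmatch(text) is None:
        raise ValueError(f"not a valid attribute: {text!r}")
    return tuple(text.split(":"))


def unescape(body: str) -> str:
    """Decode the text between the quotes of a quoted-string."""

    def replace(match):
        hex_digits = match.group(1) or match.group(2) or match.group(3)
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        char = match.group(4)
        if char not in SIMPLE_ESCAPES:
            # Covers unknown escapes and a trailing lone backslash.
            raise ValueError(f"unsupported escape: \\{char}")
        return SIMPLE_ESCAPES[char]

    return ESCAPE_RE.sub(replace, body)
```

For example, `parse_attribute("pos:stts")` yields the two identifiers of the qualified attribute, while `parse_attribute("1word")` fails because of the letter-first restriction in [21.11].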
Discussion
Peter Beinema, MPI, proposes some minor changes to the grammar:
- main query [2] / simple query [3]: the above definitions generate structural ambiguity. This is not a problem for ANTLR (which selects the right-recursive solution), but some other parser generators generate all solutions, and their number is exponential with respect to the number of main queries. I propose to use the alternative rules given below.
- the above rule [2] can generate an arbitrarily long sequence of quantifiers: "word" +*{23}{,17}{2,}?+ would be legal.
- rule [5]: the option marker '?' makes "[]" a valid query. I propose to remove the question mark.
- expression [11] / basic expression [12]: structural ambiguity. See the proposed alternative below.
- the attached file contains an ANTLR4 grammar, including comments on how to use it on Mac / Unix / Linux platforms.
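The growth mentioned in the first point can be illustrated with a small Python sketch. It counts the parse trees that the sequence alternative of rule [2] (main-query ::= main-query main-query) admits for a plain sequence of n simple queries; the other alternatives (or, grouping, quantification) are deliberately ignored, so this is a simplification rather than an analysis of the full grammar. The function name `count_parses` is made up for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_parses(n: int) -> int:
    """Count parse trees for a sequence of n simple queries under
    main-query ::= simple-query | main-query main-query."""
    if n < 1:
        raise ValueError("a query needs at least one simple query")
    if n == 1:
        return 1  # a single simple-query has exactly one derivation
    # Split the sequence into a left part of k items and a right part of n - k items.
    return sum(count_parses(k) * count_parses(n - k) for k in range(1, n))


for n in (2, 3, 4, 5, 10):
    print(n, count_parses(n))
```

The counts are the Catalan numbers (1, 2, 5, 14, ... and 4862 for ten simple queries), whereas a right-recursive rule of the form simple-query main-query leaves exactly one tree per sequence.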
EBNF
[2 v2] /* in combination with [3 v2] no more left recursion or ambiguity, max 1 quantifier per simple-query */
main-query ::= simple-query
    | simple-query '|' main-query /* or */
    | simple-query main-query /* sequence */
    | simple-query quantifier /* quantification */
[3 v2] simple-query ::= '(' main-query ')' /* embedding moved from main-query to simple-query */
    | implicit-query
    | segment-query
[5 v2] /* expression no longer optional */
segment-query ::= '[' expression ']'
[11 v2] /* in combination with [12 v2] no longer left recursive or ambiguous */
expression ::= basic-expression
    | basic-expression '|' expression /* or */
    | basic-expression '&' expression /* and */
[12 v2] basic-expression ::= '(' expression ')' /* grouping */
    | '!' expression /* not */
    | attribute operator flagged-regexp
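Because [11 v2] and [12 v2] are neither left-recursive nor ambiguous, they map directly onto a hand-written recursive-descent parser. The following Python sketch shows this for the expression inside a segment query only. It is an illustration under assumptions, not the attached ANTLR grammar: the tokenizer is simplified (regexp flags from rule [15] are omitted, any backslash escape is accepted inside quotes), and the names `tokenize`, `Parser` and `parse_expression` as well as the tuple-based tree shape are invented here.

```python
import re

TOKEN_RE = re.compile(r"""
    \s*(?:
        (?P<regexp>"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')
      | (?P<attribute>[a-zA-Z][a-zA-Z0-9\-]*(?::[a-zA-Z][a-zA-Z0-9\-]*)?)
      | (?P<op>!=|=)
      | (?P<punct>[()!&|])
    )""", re.VERBOSE)


def tokenize(text):
    """Split the expression text into (kind, value) pairs."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def take(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise ValueError(f"expected {value or kind}, got {tok[1]!r}")
        self.pos += 1
        return tok[1]

    def expression(self):
        # [11 v2]: basic-expression, optionally followed by '|' or '&' and an expression.
        left = self.basic_expression()
        kind, value = self.peek()
        if kind == "punct" and value in "|&":
            self.pos += 1
            return (value, left, self.expression())
        return left

    def basic_expression(self):
        # [12 v2]: grouping, negation, or attribute operator flagged-regexp.
        kind, value = self.peek()
        if (kind, value) == ("punct", "("):
            self.pos += 1
            inner = self.expression()
            self.take("punct", ")")
            return inner
        if (kind, value) == ("punct", "!"):
            self.pos += 1
            return ("!", self.expression())
        attribute = self.take("attribute")
        operator = self.take("op")
        regexp = self.take("regexp")
        return (operator, attribute, regexp)


def parse_expression(text):
    parser = Parser(tokenize(text))
    tree = parser.expression()
    if parser.pos != len(parser.tokens):
        raise ValueError("trailing input after expression")
    return tree
```

Two consequences of the rules as written become visible in this sketch: '|' and '&' have equal precedence and associate to the right, and '!' negates the whole expression that follows it, so `!word="x" & lemma="y"` is read as the negation of the conjunction. An empty input is rejected, which matches [5 v2] making the expression mandatory.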
Attachments (1)
- FCS_QL_2.g4 (3.7 KB): ANTLR (version 4.5) grammar for FCS-QL queries