= CLARIN Federated Content Search Query Language =

A working draft of the CQP flavor for CLARIN Federated Content Search (FCS).

== EBNF ==
{{{#!comment
Please keep the EBNF nicely formatted. Thanks!
}}}
{{{
[1]     query                 ::= main-query within-part?

[2]     main-query            ::= simple-query
                                | "(" main-query ")"           /* grouping */
                                | main-query "|" main-query    /* or */
                                | main-query main-query        /* sequence */
                                | main-query quantifier        /* quantification */

[2.11]  main-query            ::= simple-query
                                | simple-query "|" main-query  /* or */
                                | simple-query main-query      /* sequence */
                                | simple-query quantifier      /* quantification */

[3]     simple-query          ::= implicit-query
                                | segment-query

[3.11]  simple-query          ::= '(' main-query ')'
                                | implicit-query
                                | segment-query

[4]     implicit-query        ::= flagged-regexp

[5]     segment-query         ::= "[" expression? "]"

[6]     within-part           ::= simple-within-part

[7]     simple-within-part    ::= "within" simple-within-scope

[8]     simple-within-scope   ::= "sentence" | "s" | "utterance" | "u"
                                | "paragraph" | "p" | "turn" | "t"
                                | "text" | "session"

[11]    expression            ::= basic-expression
                                | expression "|" expression    /* or */
                                | expression "&" expression    /* and */
                                | "(" expression ")"           /* grouping */
                                | "!" expression               /* not */

[11.11] expression            ::= basic-expression
                                | expression "|" expression    /* or */
                                | expression "&" expression    /* and */

[12]    basic-expression      ::= attribute operator flagged-regexp

[12.11] basic-expression      ::= '(' expression ')'           /* grouping */
                                | "!" expression               /* not */
                                | attribute operator flagged-regexp

[13]    operator              ::= "="                          /* equals */
                                | "!="                         /* non-equals */

[14]    quantifier            ::= "+"                          /* one-or-more */
                                | "*"                          /* zero-or-more */
                                | "?"                          /* zero-or-one */
                                | "{" integer "}"              /* exactly n-times */
                                | "{" integer? "," integer "}" /* at most */
                                | "{" integer "," integer? "}" /* min-max */

[15]    flagged-regexp        ::= regexp
                                | regexp "/" regexp-flag+

[16]    regexp-flag           ::= "i"   /* case-insensitive; Poliqarp/Perl compat */
                                | "I"   /* case-sensitive; Poliqarp compat */
                                | "c"   /* case-insensitive, CQP compat */
                                | "C"   /* case-sensitive */
                                | "l"   /* literal matching, CQP compat */
                                | "d"   /* diacritic agnostic matching, CQP compat */

[17]    regexp                ::= quoted-string

[18]    attribute             ::= simple-attribute
                                | qualified-attribute

[19]    simple-attribute      ::= identifier

[20]    qualified-attribute   ::= identifier ":" identifier

[21]    identifier            ::= identifier-char identifier-char*

[21.11] identifier            ::= identifier-first-char identifier-char*

[21.12] identifier-first-char ::= [a-zA-Z]

[22]    identifier-char       ::= [a-zA-Z0-9\-]

[24]    integer               ::= [0-9]+

[26]    quoted-string         ::= "'" (char | ws)* "'"         /* single-quotes */
                                | '"' (char | ws)* '"'         /* double-quotes */

[27]    char                  ::= | "\" escaped-char

[28]    ws                    ::=

[29]    escaped-char          ::= "\"   /* backslash (\) */
                                | "'"   /* single quote (') */
                                | '"'   /* double quote (") */
                                | "n"   /* generic newline, i.e. "\n", "\r", etc. */
                                | "t"   /* character tabulation (U+0009) */
                                | "x" hex hex                          /* Unicode codepoint with hex value hh */
                                | "u" hex hex hex hex                  /* Unicode codepoint with hex value hhhh */
                                | "U" hex hex hex hex hex hex hex hex  /* Unicode codepoint with hex value hhhhhhhh */

[30]    hex                   ::= [0-9a-fA-F]
}}}

== Notes ==
 * Based on Poliqarp, with inspiration from others.
 * "attribute": the annotation layer to be used, e.g. "word", "lemma", "pos", or qualified "pos:stts". The supported values for this construct are beyond the scope of the grammar and need to be defined in supplementary documents.
 * "simple-within-scope": possible values for the scope:
   * "sentence", "s", "utterance", "u": denote a matching scope of something like a sentence or utterance. Provides compatibility with FCS 1.0 ("Generic Hits", "Each hit SHOULD be presented within the context of a complete sentence.")
   * "paragraph" | "p" | "turn" | "t": denote the next larger unit, e.g. something like a paragraph
   * "text" | "session": something like a whole document
 * {{{[27]}}} and {{{[28]}}} "any $SOMETHING codepoint" are a pain to get done easily in at least ANTLR and JavaCC, especially in combination with {{{[29]}}} :/
 * Regular expressions are not defined or guarded by this grammar :/
 * The non-contiguous rule numbers are intentional for now; some rules have already been removed. Rules will be renumbered once the grammar is fixed.

== Discussion ==
Peter Beinema, MPI, proposes some minor changes to the grammar:
 * main query {{{[2]}}} / simple query {{{[3]}}}: the definitions above generate structural ambiguity. This is not a problem for ANTLR (which selects the right-recursive solution), but some other parser generators generate all solutions, and their number is exponential in the number of main queries. I propose to use the alternative rules given below.
 * Rule {{{[2]}}} above can generate an unbounded sequence of quantifiers: {{{"word" +*{23}{,17}{2,}?+}}} would be legal.
 * Rule {{{[5]}}}: the option marker '?' makes "[]" a valid query. I propose to remove the question mark.
 * expression {{{[11]}}} / basic expression {{{[12]}}}: structural ambiguity. See the proposed alternative below.
 * The attached file contains an ANTLR4 grammar, including comments on how to use it on Mac, Unix, and Linux platforms.

== EBNF ==
{{{
[2 v2]  /* in combination with [3 v2]: no more left recursion or ambiguity, max 1 quantifier per simple-query */
        main-query       ::= simple-query
                           | simple-query '|' main-query       /* or */
                           | simple-query main-query           /* sequence */
                           | simple-query quantifier           /* quantification */

[3 v2]  simple-query     ::= '(' main-query ')'                /* embedding moved from main-query to simple-query */
                           | implicit-query
                           | segment-query

[5 v2]  /* expression no longer optional */
        segment-query    ::= '[' expression ']'

[11 v2] /* in combination with [12 v2]: no longer left recursive or ambiguous */
        expression       ::= basic-expression
                           | basic-expression '|' expression   /* or */
                           | basic-expression '&' expression   /* and */

[12 v2] basic-expression ::= '(' expression ')'                /* grouping */
                           | '!' expression                    /* not */
                           | attribute operator flagged-regexp
}}}
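
The ambiguity concern raised in the discussion can be illustrated numerically. The following is a minimal Python sketch, not part of the draft: it counts the parse trees for a plain sequence of n simple queries, first under the sequence alternative of rule [2] (main-query main-query), then under the right-recursive alternative of [2 v2] (simple-query main-query). The function names `trees_rule_2` and `trees_rule_2_v2` are hypothetical, and only the sequence alternative is modelled; the "or" and quantifier alternatives are ignored.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def trees_rule_2(n: int) -> int:
    """Count parse trees for n simple queries under
    main-query ::= simple-query | main-query main-query (assumed subset of [2])."""
    if n < 1:
        raise ValueError("need at least one simple query")
    if n == 1:
        return 1  # only one alternative applies: main-query ::= simple-query
    # Split the sequence into a left part of k queries and a right part of n - k;
    # every split point and every sub-parse gives a distinct tree.
    return sum(trees_rule_2(k) * trees_rule_2(n - k) for k in range(1, n))


def trees_rule_2_v2(n: int) -> int:
    """Count parse trees under
    main-query ::= simple-query | simple-query main-query (assumed subset of [2 v2])."""
    if n < 1:
        raise ValueError("need at least one simple query")
    # The first simple-query is fixed, so the only freedom is in parsing the rest.
    return 1 if n == 1 else trees_rule_2_v2(n - 1)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 10):
        print(n, trees_rule_2(n), trees_rule_2_v2(n))
```

Under these assumptions the counts for rule [2] follow the Catalan numbers (1, 1, 2, 5, 14, ...), which grow exponentially, whereas the right-recursive rule always yields exactly one tree.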