= CLARIN Federated Content Search Query Language =

A working draft of the CQP-flavored query language for CLARIN Federated Content Search (FCS).

== EBNF ==
{{{#!comment
Please keep the EBNF nicely formatted. Thanks!
}}}

{{{
[1]  query ::= main-query within-part? sort-part?

[2]  main-query ::= simple-query
                  | "(" main-query ")"          /* grouping */
                  | main-query "|" main-query   /* or */
                  | main-query main-query       /* sequence */
                  | main-query quantifier       /* quantification */

[3]  simple-query ::= implicit-query
                    | segment-query

[4]  implicit-query ::= flagged-regexp

[5]  segment-query ::= "[" expression? "]"

[6]  within-part ::= simple-within-part
                   | complex-within-part

[7]  simple-within-part ::= "within" simple-within-scope

[8]  simple-within-scope ::= "sentence"
                           | "s"
                           | "utterance"
                           | "u"
                           | "paragraph"
                           | "p"
                           | "turn"
                           | "t"
                           | "article"
                           | "session"

[9]  complex-within-part ::= "within" "[" expression "]"   /* TBD: allow more complex stuff? */

[10] sort-part ::= /* TBD: do we want sorting */

[11] expression ::= basic-expression
                  | expression "|" expression   /* or */
                  | expression "&" expression   /* and */
                  | "(" expression ")"          /* grouping */
                  | "!" expression              /* not */

[12] basic-expression ::= attribute operator flagged-regexp

[13] operator ::= "="    /* equals */
                | "!="   /* non-equals */
                | "~"    /* TBD: fuzzy match? */
                | "!~"   /* TBD: fuzzy not? */

[14] quantifier ::= "+"                            /* one-or-more */
                  | "*"                            /* zero-or-more */
                  | "?"                            /* zero-or-one */
                  | "{" integer "}"                /* exactly n-times */
                  | "{" integer? "," integer "}"   /* at most */
                  | "{" integer "," integer? "}"   /* min-max */

[15] flagged-regexp ::= regexp
                      | regexp "/" regexp-flag+

[16] regexp-flag ::= "i"   /* case-insensitive; Poliqarp/Perl compat */
                   | "I"   /* case-sensitive; Poliqarp compat */
                   | "c"   /* case-insensitive, CQP compat */
                   | "C"   /* case-sensitive */
                   | "l"   /* literal matching, CQP compat */
                   | "d"   /* diacritic agnostic matching, CQP compat */
                           /* TBD: more? */

[17] regexp ::= string

[18] attribute ::= simple-attribute
                 | qualified-attribute

[19] simple-attribute ::= identifier

[20] qualified-attribute ::= identifier ":" identifier

[21] identifier ::= identifier-char identifier-char*

[22] identifier-char ::= [a-zA-Z0-9\-]

[23] string ::= plain-string
              | quoted-string

[24] integer ::= [0-9]+

[25] plain-string ::= char*

[26] quoted-string ::= "'" (char | ws)* "'"   /* single-quotes */
                     | """ (char | ws)* """   /* double-quotes */

[27] char ::= | "\" escaped-char

[28] ws ::=

[29] escaped-char ::= "\"                                   /* backslash (\) */
                    | "'"                                   /* single quote (') */
                    | """                                   /* double quote (") */
                    | "n"                                   /* generic newline, i.e. "\n", "\r", etc. */
                    | "t"                                   /* character tabulation (U+0009) */
                    | "x" hex hex                           /* Unicode codepoint with hex value hh */
                    | "u" hex hex hex hex                   /* Unicode codepoint with hex value hhhh */
                    | "U" hex hex hex hex hex hex hex hex   /* Unicode codepoint with hex value hhhhhhhh */

[30] hex ::= [0-9a-fA-F]
}}}

== Notes ==
 * Based on Poliqarp, with inspiration from other query languages.
 * Contains some "TBD"s (to be determined), e.g. do we want to add a sort-clause (or a meta-clause)?
 * "attribute": the annotation layer to be used, e.g. "word", "lemma", "pos", or the qualified form "pos:stts". The supported values are outside the scope of the grammar and need to be defined in supplementary documents.
 * "simple-within-scope": possible values for the scope:
   * "sentence", "s", "utterance", "u": denote a matching scope of something like a sentence or an utterance. This provides compatibility with FCS 1.0 ("Generic Hits": "Each hit SHOULD be presented within the context of a complete sentence.").
   * "paragraph", "p", "turn", "t": denote the next larger unit, e.g. something like a paragraph.
   * "article", "session": denote something like a whole document.
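
The quantifier forms in rule [14] can be read as repetition bounds. The following Python sketch is not part of the draft; it is one possible reading. The function name `parse_quantifier` is hypothetical, and treating a missing minimum as 0 and a missing maximum as unbounded is an assumption that the grammar itself does not spell out. Whitespace inside the braces is not handled, and the sketch does not check that the minimum is less than or equal to the maximum.

```python
import re
from typing import Optional, Tuple

# (minimum, maximum) repetitions; a maximum of None means "unbounded".
Bounds = Tuple[int, Optional[int]]

# Matches "{n}", "{n,}", "{,m}" and "{n,m}"; invalid combinations such as
# "{}" and "{,}" also match here and are rejected explicitly below.
_BRACES = re.compile(r"\{([0-9]*)(,?)([0-9]*)\}")


def parse_quantifier(token: str) -> Bounds:
    """Map a quantifier token (rule [14]) to (min, max) repetition bounds.

    Hypothetical helper for illustration only.
    """
    simple = {"+": (1, None), "*": (0, None), "?": (0, 1)}
    if token in simple:
        return simple[token]

    match = _BRACES.fullmatch(token)
    if match is None:
        raise ValueError(f"not a quantifier: {token!r}")

    low, comma, high = match.groups()
    if not comma:
        # "{" integer "}": exactly n times, so the integer is mandatory.
        if not low:
            raise ValueError(f"missing integer in quantifier: {token!r}")
        n = int(low)
        return (n, n)

    # With a comma, rule [14] requires at least one of the two integers.
    if not low and not high:
        raise ValueError(f"quantifier needs a minimum or a maximum: {token!r}")

    # Assumption: an omitted minimum means 0, an omitted maximum means unbounded.
    minimum = int(low) if low else 0
    maximum = int(high) if high else None
    return (minimum, maximum)
```

Under these assumptions, `parse_quantifier("{2,5}")` yields `(2, 5)` and `parse_quantifier("{2,}")` yields `(2, None)`, while `"{,}"` is rejected because neither alternative of rule [14] permits it.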