Version 8 (modified by Leif-Jöran, 9 years ago)

--

CLARIN Federated Content Search Query Language

A working draft for the CQP flavor for CLARIN Federated Content Search (FCS).

EBNF

 [1] query                ::= main-query within-part?
  	 
 [2] main-query           ::= simple-query
                            | "(" main-query ")"            /* grouping */
                            | main-query "|" main-query     /* or */
                            | main-query main-query         /* sequence */
                            | main-query quantifier         /* quantification */

 [3] simple-query         ::= implicit-query
                            | segment-query 	 
 
 [4] implicit-query       ::= flagged-regexp 	 
 
 [5] segment-query        ::= "[" expression? "]" 	 

 [6] within-part          ::= simple-within-part

 [7] simple-within-part   ::= "within" simple-within-scope

 [8] simple-within-scope  ::= "sentence"
                            | "s"
                            | "utterance"
                            | "u"
                            | "paragraph"
                            | "p"
                            | "turn"
                            | "t"
                            | "text"
                            | "session"  	 

[11] expression           ::= basic-expression
                            | expression "|" expression     /* or */
                            | expression "&" expression     /* and */
                            | "(" expression ")"            /* grouping */
                            | "!" expression                /* not */ 	 

[12] basic-expression     ::= attribute operator flagged-regexp 	 

[13] operator             ::= "="                           /* equals */
                            | "!="                          /* non-equals */

[14] quantifier           ::= "+"                           /* one-or-more */
                            | "*"                           /* zero-or-more */
                            | "?"                           /* zero-or-one */
                            | "{" integer "}"               /* exactly n-times */
                            | "{" integer? "," integer "}"  /* at most, or min-max */
                            | "{" integer "," integer? "}"  /* at least, or min-max */

[15] flagged-regexp       ::= regexp
                            | regexp "/" regexp-flag+ 	 

[16] regexp-flag          ::= "i"  /* case-insensitive; Poliqarp/Perl compat */
                            | "I"  /* case-sensitive; Poliqarp compat */
                            | "c"  /* case-insensitive, CQP compat */
                            | "C"  /* case-sensitive */
                            | "l"  /* literal matching, CQP compat */
                            | "d"  /* diacritic agnostic matching, CQP compat */ 
       
[17] regexp               ::= quoted-string

[18] attribute            ::= simple-attribute
                            | qualified-attribute

[19] simple-attribute     ::= identifier

[20] qualified-attribute  ::= identifier ":" identifier  

[21] identifier           ::= identifier-char identifier-char*

[21.11] identifier           ::= identifier-first-char identifier-char*

[21.12] identifier-first-char      ::= [a-zA-Z]

[22] identifier-char      ::= [a-zA-Z0-9\-]

[24] integer              ::= [0-9]+ 

[26] quoted-string        ::= "'" (char | ws)* "'"  /* single-quotes */
                            | '"' (char | ws)* '"'  /* double-quotes */

[27] char                 ::= <any unicode codepoint excluding whitespace codepoints>
                            | "\" escaped-char

[28] ws                   ::= <any whitespace codepoint>

[29] escaped-char         ::= "\"                                  /* backslash (\) */
                            | "'"                                  /* single quote (') */
                            | '"'                                  /* double quote (") */
                            | "n"                                  /* generic newline, e.g. "\n", "\r", etc. */
                            | "t"                                  /* character tabulation (U+0009) */
                            | "x" hex hex                          /* Unicode codepoint with hex value hh */
                            | "u" hex hex hex hex                  /* Unicode codepoint with hex value hhhh */
                            | "U" hex hex hex hex hex hex hex hex  /* Unicode codepoint with hex value hhhhhhhh */ 

[30] hex                  ::= [0-9a-fA-F]
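
Rules [26] to [30] can be illustrated with a small decoder for the body of a quoted-string. This is only a sketch under assumptions that the draft does not state: the function name `unescape` is hypothetical, the "generic newline" escape is mapped to U+000A, and any escape not listed in rule [29] is rejected.

```python
import re

# Hypothetical helper, not part of the draft grammar.
# Single-character escapes from rule [29]; mapping "n" to U+000A is an assumption.
_SIMPLE = {"\\": "\\", "'": "'", '"': '"', "n": "\n", "t": "\t"}
# Number of hex digits expected after \x, \u and \U (rule [29]).
_HEX_LEN = {"x": 2, "u": 4, "U": 8}
_HEX = re.compile(r"[0-9a-fA-F]+")  # rule [30]


def unescape(body: str) -> str:
    """Decode the text between the quotes of a quoted-string."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of string")
        esc = body[i + 1]
        if esc in _SIMPLE:
            out.append(_SIMPLE[esc])
            i += 2
        elif esc in _HEX_LEN:
            n = _HEX_LEN[esc]
            digits = body[i + 2 : i + 2 + n]
            if len(digits) != n or not _HEX.fullmatch(digits):
                raise ValueError(f"expected {n} hex digits after \\{esc}")
            out.append(chr(int(digits, 16)))  # chr() rejects values above U+10FFFF
            i += 2 + n
        else:
            # Strict reading of rule [29]: unknown escapes are errors.
            raise ValueError(f"unknown escape \\{esc}")
    return "".join(out)
```

Under this strict reading, regular-expression escapes such as `\d` would have to be written as `\\d` inside a quoted-string; whether that is intended is not settled by the draft.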

Notes

  • based on Poliqarp with inspiration from others
  • "attribute": the annotation layer to be used, e.g. "word", "lemma", "pos", or the qualified form "pos:stts". The supported values for this construct are beyond the scope of the grammar and need to be defined in supplementary documents.
  • "simple-within-scope": possible values for scope
    • "sentence", "s", "utterance", "u": denote a matching scope of something like a sentence or utterance. This provides compatibility with FCS 1.0 ("Generic Hits", "Each hit SHOULD be presented within the context of a complete sentence.")
    • "paragraph", "p", "turn", "t": denote the next larger unit, e.g. something like a paragraph
    • "text", "session": something like a whole document
  • [27] and [28]: "any $SOMETHING codepoint" is hard to implement, at least in ANTLR and JavaCC, especially in combination with [29] :/
  • regular expressions are not defined/guarded by this grammar :/
  • non-contiguous rule numbers are intentional for now; some rules have already been removed. Rules will be renumbered once the grammar is fixed.
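
The three groups of scope keywords described in the notes could be folded into a lookup table in an endpoint implementation. The following is a minimal sketch using the keywords of rule [8]; the level names "sentence", "paragraph" and "document" and the function `scope_level` are hypothetical and not defined by the draft.

```python
# Hypothetical mapping from the keywords of rule [8] to three scope levels.
# The level names on the right are illustrative, not part of the draft.
SCOPE_LEVELS = {
    "sentence": "sentence",
    "s": "sentence",
    "utterance": "sentence",
    "u": "sentence",
    "paragraph": "paragraph",
    "p": "paragraph",
    "turn": "paragraph",
    "t": "paragraph",
    "text": "document",
    "session": "document",
}


def scope_level(keyword: str) -> str:
    """Return the scope level for a simple-within-scope keyword."""
    try:
        return SCOPE_LEVELS[keyword]
    except KeyError:
        raise ValueError(f"unknown within-scope: {keyword!r}") from None
```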

Discussion

Peter Beinema, MPI, proposes some minor changes to the grammar:

  • main-query [2] / simple-query [3]: the above definitions generate structural ambiguity. This is not a problem for ANTLR (which selects the right-recursive solution), but some other parser generators produce all solutions, whose number is exponential in the number of main queries. I propose using the alternative rules given below.
  • the above rule [2] can generate an unbounded chain of quantifiers: "word" +*{23}{,17}{2,}?+ would be legal.
  • rule [5]: the option marker '?' makes "[]" a valid query. I propose removing the question mark.
  • expression [11] / basic-expression [12]: structural ambiguity. See the proposed alternative below.
  • the attached file contains an ANTLR4 grammar, including comments on how to use it on Mac / Unix / Linux platforms
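
The growth mentioned in the first point can be made concrete by counting parse trees. The sketch below considers only the alternatives `main-query ::= simple-query | main-query main-query` of rule [2] and counts the distinct parse trees for a sequence of n simple queries; the function name `sequence_parses` is hypothetical. The counts are the Catalan numbers, which grow exponentially.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def sequence_parses(n: int) -> int:
    """Count parse trees for n simple queries in sequence, using only
    main-query ::= simple-query | main-query main-query."""
    if n < 1:
        raise ValueError("need at least one simple query")
    if n == 1:
        return 1  # a single simple-query has exactly one parse
    # Split the sequence into a left part of k queries and a right part of n - k.
    return sum(sequence_parses(k) * sequence_parses(n - k) for k in range(1, n))


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, sequence_parses(n))
```

Adding the "or" and quantifier alternatives of rule [2] can only increase these counts.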

EBNF

[2 v2]   /* in combination with [3 v2] no more left recursion or ambiguity, max 1 quantifier per simple-query */
main-query ::=
      simple-query
    | simple-query '|' main-query       /* 'or' */
    | simple-query main-query           /* sequence */
    | simple-query quantifier           /* quantification */

[3 v2]
simple-query ::=
      '(' main-query ')'  /* embedding moved from main-query to simple-query */
    | implicit-query
    | segment-query

[5 v2]  /* expression no longer optional */
segment-query ::=
      '[' expression ']'

[11 v2]  /* in combination with [12 v2] no longer left recursive or ambiguous */
expression ::=
      basic-expression
    | basic-expression '|' expression   /* or */
    | basic-expression '&' expression   /* and */

[12 v2]
basic-expression ::=
      '(' expression ')'        /* grouping */
    | '!' expression            /* not */
    | attribute operator flagged-regexp
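
Because rules [11 v2] and [12 v2] are not left recursive, they map directly onto a recursive-descent parser. The following is a minimal sketch, not the attached ANTLR4 grammar. Its assumptions: the tokenizer is simplified (whitespace between tokens is skipped, quoted strings are kept undecoded), identifiers follow rule [21.11], "!" is read greedily so that it applies to the whole following expression, and the names `tokenize`, `Parser` and `parse_expression` are hypothetical.

```python
import re

# Simplified tokenizer; "!=" is tried before "!" so the operator wins.
TOKEN = re.compile(r"""
    \s*(?:
        (?P<op>!=|=)
      | (?P<punct>[()|&!])
      | (?P<str>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
      | (?P<flags>/[iIcCld]+)
      | (?P<attr>[a-zA-Z][a-zA-Z0-9-]*(?::[a-zA-Z][a-zA-Z0-9-]*)?)
    )""", re.VERBOSE)


def tokenize(text):
    """Split the input into (kind, value) pairs."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else (None, None)

    def take(self, kind, value=None):
        k, v = self.peek()
        if k != kind or (value is not None and v != value):
            raise ValueError(f"expected {value or kind}, got {v!r}")
        self.i += 1
        return v

    def expression(self):  # rule [11 v2]
        left = self.basic_expression()
        kind, value = self.peek()
        if kind == "punct" and value in "|&":
            self.i += 1
            # Right-nested: "|" and "&" have no relative precedence here.
            return (value, left, self.expression())
        return left

    def basic_expression(self):  # rule [12 v2]
        kind, value = self.peek()
        if (kind, value) == ("punct", "("):
            self.i += 1
            inner = self.expression()
            self.take("punct", ")")
            return inner
        if (kind, value) == ("punct", "!"):
            self.i += 1
            return ("!", self.expression())
        attr = self.take("attr")
        op = self.take("op")
        regexp = self.take("str")
        flags = ""
        if self.peek()[0] == "flags":
            flags = self.take("flags")[1:]  # drop the leading "/"
        return (op, attr, regexp, flags)


def parse_expression(text):
    """Parse the content of a segment-query (the part between "[" and "]")."""
    parser = Parser(tokenize(text))
    tree = parser.expression()
    if parser.peek()[0] is not None:
        raise ValueError("trailing input after expression")
    return tree
```

The result is a nested tuple, e.g. an "&" node holding two basic expressions for `pos = "NOUN" & lemma != "be" /c`. A real implementation would also need the operator precedence and the scope of "!" to be fixed by the grammar.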
