Version 1 (modified by Oliver Schonefeld, 9 years ago)

--

CLARIN Federated Content Search Query Language

A working draft for the CQP flavor for CLARIN Federated Content Search (FCS).

EBNF

 [1] query                ::= main-query within-part? sort-part?
  	 
 [2] main-query           ::= simple-query
                            | "(" main-query ")"            /* grouping */
                            | main-query "|" main-query     /* or */
                            | main-query main-query         /* sequence */
                            | main-query quantifier         /* quantification */

 [3] simple-query         ::= implicit-query
                            | segment-query 	 
 
 [4] implicit-query       ::= flagged-regexp 	 
 
 [5] segment-query        ::= "[" expression? "]" 	 

 [6] within-part          ::= simple-within-part
                            | complex-within-part 	 

 [7] simple-within-part   ::= "within" simple-within-scope

 [8]  simple-within-scope ::= "sentence"
                            | "s"
                            | "utterance"
                            | "u"
                            | "paragraph"
                            | "p"
                            | "turn"
                            | "t"
                            | "article"
                            | "session"  	 

 [9] complex-within-part  ::= "within" "[" expression "]"   /* TBD: allow more complex stuff? */

[10] sort-part            ::= /* TBD: do we want sorting */

[11] expression           ::= basic-expression
                            | expression "|" expression     /* or */
                            | expression "&" expression     /* and */
                            | "(" expression ")"            /* grouping */
                            | "!" expression                /* not */ 	 

[12] basic-expression     ::= attribute operator flagged-regexp 	 

[13] operator	          ::= "="                           /* equals */
                            | "!="                          /* non-equals */
                            | "~"                           /* TBD: fuzzy match? */
                            | "!~"                          /* TBD: fuzzy not? */

[14] quantifier           ::= "+"                           /* one-or-more */
                            | "*"                           /* zero-or-more */
                            | "?"                           /* zero-or-one */
                            | "{" integer "}"               /* exactly n-times */
                            | "{" integer? "," integer "}"  /* at most */
                            | "{" integer "," integer? "}"  /* at least, or min-max */

[15] flagged-regexp       ::= regexp
                            | regexp "/" regexp-flag+ 	 

[16] regexp-flag          ::= "i"  /* case-insensitive; Poliqarp/Perl compat */
                            | "I"  /* case-sensitive; Poliqarp compat */
                            | "c"  /* case-insensitive, CQP compat */
                            | "C"  /* case-sensitive */
                            | "l"  /* literal matching, CQP compat */
                            | "d"  /* diacritic agnostic matching, CQP compat */ 
                            /* TBD: more? */
       
[17] regexp               ::= string

[18] attribute            ::= simple-attribute
                            | qualified-attribute

[19] simple-attribute     ::= identifier

[20] qualified-attribute  ::= identifier ":" identifier  

[21] identifier           ::= identifier-char identifier-char*

[22] identifier-char      ::= [a-zA-Z0-9\-]

[23] string               ::= plain-string
                            | quoted-string  	 

[24] integer              ::= [0-9]+ 

[25] plain-string         ::= char*

[26] quoted-string        ::= "'" (char | ws)* "'"  /* single-quotes */
                            | '"' (char | ws)* '"'  /* double-quotes */

[27] char                 ::= <any Unicode codepoint excluding whitespace codepoints>
                            | "\" escaped-char

[28] ws                   ::= <any whitespace codepoint>

[29] escaped-char         ::= "\"                                  /* backslash (\) */
                            | "'"                                  /* single quote (') */
                            | '"'                                  /* double quote (") */
                            | "n"                                  /* generic newline, i.e. "\n", "\r", etc. */
                            | "t"                                  /* character tabulation (U+0009) */
                            | "x" hex hex                          /* Unicode codepoint with hex value hh */
                            | "u" hex hex hex hex                  /* Unicode codepoint with hex value hhhh */
                            | "U" hex hex hex hex hex hex hex hex  /* Unicode codepoint with hex value hhhhhhhh */ 

[30] hex                  ::= [0-9a-fA-F]
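
The quantifier production [14] and the escape production [29] are the most mechanical parts of the grammar, so a small example may help implementers. The following Python sketch is not part of the draft: the function names parse_quantifier and decode_escapes are hypothetical, quantifier tokens are assumed to contain no whitespace, and the "generic newline" escape is assumed to decode to a single line feed (U+000A).

```python
import re

# Hypothetical helpers for illustration; none of these names come from the draft.
_SIMPLE_ESCAPES = {"\\": "\\", "'": "'", '"': '"', "n": "\n", "t": "\t"}
_HEX_LENGTHS = {"x": 2, "u": 4, "U": 8}  # number of hex digits per escape kind


def parse_quantifier(token):
    """Map a quantifier token from [14] to (min, max); max is None if unbounded."""
    if token == "+":
        return (1, None)
    if token == "*":
        return (0, None)
    if token == "?":
        return (0, 1)
    m = re.fullmatch(r"\{([0-9]*)(,?)([0-9]*)\}", token)
    if m is None:
        raise ValueError("not a quantifier: %r" % token)
    low, comma, high = m.groups()
    if not comma:
        # Only "{n}" is allowed without a comma; "{}" is rejected.
        if not low:
            raise ValueError("not a quantifier: %r" % token)
        return (int(low), int(low))
    if not low and not high:
        # "{,}" cannot be derived: each alternative requires one bound.
        raise ValueError("not a quantifier: %r" % token)
    return (int(low) if low else 0, int(high) if high else None)


def decode_escapes(text):
    """Decode the backslash escapes listed in [29]; other characters pass through."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling backslash at end of string")
        kind = text[i + 1]
        if kind in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[kind])
            i += 2
        elif kind in _HEX_LENGTHS:
            n = _HEX_LENGTHS[kind]
            digits = text[i + 2:i + 2 + n]
            if len(digits) != n or not re.fullmatch(r"[0-9a-fA-F]+", digits):
                raise ValueError("bad hex escape at offset %d" % i)
            out.append(chr(int(digits, 16)))
            i += 2 + n
        else:
            raise ValueError("unknown escape at offset %d" % i)
    return "".join(out)
```

Under these assumptions, parse_quantifier("{2,}") yields (2, None) and parse_quantifier("{,3}") yields (0, 3). Whether a missing lower bound should really default to zero is a choice made for the sketch; the draft does not say.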

Notes

  • based on Poliqarp with inspiration from others
  • contains some "TBD"s (to be determined), e.g. do we want to add a sort-clause (or a meta-clause)?
  • "attribute": the annotation layer to be used, e.g. "word", "lemma", "pos", or the qualified form "pos:stts". The supported values for this construct are beyond the scope of the grammar and need to be defined in supplementary documents.
  • "simple-within-scope": possible values for scope
    • "sentence", "s", "utterance", "u": denote a matching scope of something like a sentence or utterance. This provides compatibility with FCS 1.0 ("Generic Hits", "Each hit SHOULD be presented within the context of a complete sentence.")
    • "paragraph", "p", "turn", "t": denote the next larger unit, e.g. something like a paragraph
    • "article", "session": denote something like a whole document
