Version 9 (modified by vronk, 14 years ago)

added Lucene in the introduction

CLARIN/CMDI Query Language

This page discusses and describes the Query Language, which is to be used primarily for querying MDService. The aspiration is that it also covers content search, so that metadata and content search can be combined and the requirements of EDC are met.

There are at present two main existing candidate query languages:

Lucene
http://lucene.apache.org/java/3_0_0/queryparsersyntax.html an Apache project: pure Java full-text search engine
SRU/CQL
http://www.loc.gov/standards/sru/specs/cql.html a successor of Z39.50 proposed by Library of Congress - see below

Both would most probably need certain extensions to suit our needs. One option is to try to support both. While this is dangerous given our limited development resources, it would be not only "nice" but also an important message.

At the moment the rest of this article focuses on the CQL version. As the discussion continues, we should add analogous information about Lucene and also a separate, direct feature comparison of the two.

General info about SRU/CQL

Syntax/Grammar? (CQL)

main clause from the grammar of CQL:

 searchClause ::= index relation[/modifier] searchTerm

trying to fit in the existing syntax, i.e. only adding indices (and possibly relations, modifiers) in own context sets:

cmd
CLARIN Metadata - proposed CLARIN's own context set in SRU for metadata
ccs
CLARIN Content Search - proposed CLARIN's own context set in SRU for content

syntax extension (restriction) for CLARIN indices:

 index           ::= ['cmd.']cmdIndex | ['ccs.']contentIndex
 cmdIndex        ::= cmdIndex '.' cmdComponent
                     | cmdComponent
 cmdComponent    ::= {componentName} | {componentID}
 contentIndex    ::= wordLevelIndex | annotationIndex
 wordLevelIndex  ::= 'word' | 'w' | 'pos' | 'p' | 'lemma' | 'l' 
                     | 'thesaurus' | 't' | {...}
 annotationIndex ::= annotationPath ['.' annotationAttr]
 annotationPath  ::= annotationPath '.' {annotationElement}
                     | {annotationElement}
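
The index grammar above can be illustrated with a minimal Python sketch. The function name `parse_index` and the shape of the returned tuple are assumptions made for illustration; they are not part of MDService or of the CQL specification. The sketch treats a leading `cmd.` or `ccs.` as the context set. When the prefix is omitted, it can only classify the word-level names from the grammar; anything else is returned with an unknown context set, because deciding between a metadata component and an annotation element would need knowledge of the repository. Component IDs that themselves contain dots (such as clarin.eu:2625) are not handled.

```python
# Word-level index names taken from the wordLevelIndex rule above.
WORD_LEVEL = {"word", "w", "pos", "p", "lemma", "l", "thesaurus", "t"}


def parse_index(index):
    """Split a CLARIN index into (context_set, path_steps).

    context_set is 'cmd', 'ccs', or None when the optional prefix is
    omitted and the index is not one of the word-level names.
    """
    steps = index.split(".")
    if any(step == "" for step in steps):
        raise ValueError(f"empty path step in index: {index!r}")
    # An explicit context-set prefix must be followed by at least one step.
    if steps[0] in ("cmd", "ccs") and len(steps) > 1:
        return steps[0], steps[1:]
    # A bare word-level name can only be a content index.
    if len(steps) == 1 and steps[0] in WORD_LEVEL:
        return "ccs", steps
    # Hypothetical fallback: context set unknown without repository knowledge.
    return None, steps
```

For example, `parse_index("cmd.Actor.Contact.Phone")` yields the context set `cmd` and the three path steps, while `parse_index("Actor.gender")` leaves the context set undecided.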

syntax extension for modifier to accommodate binding expressions:

 modifier ::= sruModifier | '/' var | {...?}
 var      ::= 'var=' varName
 varName  ::= [A-Z]

proposal for a syntax extension for relations:

 /* already defined in SRU/CQL: */ 
 relation        ::= comparitor [modifierList] 
 comparitor      ::= comparitorSymbol | namedComparitor 

 /* add: */
 namedComparitor ::= 'regexp' | 'pat' | '~'  /* regular expression search */
 namedComparitor ::= 'oneof' /* search for alternatives */

 /* allows using generic terms - Semantic search: */
 namedComparitor ::= sru.namedComparitor | 'isa'

sort clause
CQL 1.2 already provides a sort clause in the form: 'sortby {index list}' at the end of the whole query

term-only query

  • by definition resolves to cql.serverChoice = term
  • we could extend this and first try to interpret the term as a cmdIndex and, only if no such index exists, continue to treat it as a term.
  • But still we would have to decide if the term should be searched in (all) metadata indices or rather in the content.
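
The fallback proposed in the list above can be sketched as follows. This is only an illustration: `resolve_term_only` and `known_cmd_indices` are hypothetical names, and the tuples returned are an assumed internal representation, not something CQL or MDService defines. The open question of whether the term should be searched in metadata or in content is deliberately left out.

```python
def resolve_term_only(term, known_cmd_indices):
    """Interpret a bare term under the proposed extension.

    known_cmd_indices is a hypothetical set of index names that the
    repository reports as existing.
    """
    if term in known_cmd_indices:
        # The term names an existing cmdIndex, so treat it as an index.
        return ("index", term)
    # Standard CQL behaviour: a term-only query means cql.serverChoice = term.
    return ("searchClause", "cql.serverChoice", "=", term)
```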

Metadata Query

MDService will provide two main methods/operations:

 .queryModel()
 .searchRetrieve()

queryModel() provides information about the metadata (meta-model?), i.e. which components/elements/values are used in the repository. As its query parameter it only needs to accept/understand cmdIndex.

searchRetrieve() is the core operation to retrieve the actual MDRecords. In its query parameter it has to accept any query on metadata. It probably should not ignore the content part (ccs.) of the query either: there it should check whether the indices used in the query are available in the given collections.

Examples of cmdIndex

componentId
 clarin.eu:2625
componentName
 Actor
componentPath
 Actor.Contact.Phone

Examples of MD query

basic
 dc.title any "open access"
 dc.date > 1900 and dc.date < 1910
boolean operators
 Organisation any University and (dc.language=de or cmd.Country=Austria)
 and (dc.title any Liebe or cmd.Author any Trakl)
alternatives
 cmd.genre = (opera or novel or fantasy)
 /* proposed new relation: */
 cmd.genre oneof "opera novel fantasy"
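
One way to read the proposed `oneof` relation is as shorthand for a chain of `or` clauses in standard CQL. The sketch below shows that rewriting in Python. The function name `expand_oneof` is hypothetical, and the assumption that alternatives are single words separated by whitespace is ours; multi-word alternatives are not handled.

```python
def expand_oneof(index, term_list):
    """Rewrite `index oneof "a b c"` as an equivalent chain of `or` clauses."""
    terms = term_list.split()
    if not terms:
        raise ValueError("oneof needs at least one alternative")
    # One equality clause per alternative, joined with the CQL boolean `or`.
    clauses = [f'{index} = "{term}"' for term in terms]
    return "(" + " or ".join(clauses) + ")"
```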

What if multiple conditions should apply to the same Component/Element?

bind-variables
 Actor.gender =/var=X f
 and Actor.age >/var=X 15

would probably select fewer records than the query:

 Actor.gender = f
 and Actor.age > 15 

because both conditions must apply to the same Actor.
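
The difference between the two queries can be made concrete with a small Python sketch. The record layout (a dictionary with a list of Actor entries) and both function names are assumptions for illustration only; real MD records are XML and would be evaluated by the repository.

```python
# Hypothetical record with two Actor components.
record = {
    "Actor": [
        {"gender": "f", "age": 12},
        {"gender": "m", "age": 40},
    ]
}


def matches_unbound(rec):
    """Actor.gender = f and Actor.age > 15, each condition on any Actor."""
    actors = rec["Actor"]
    return any(a["gender"] == "f" for a in actors) and any(a["age"] > 15 for a in actors)


def matches_bound(rec):
    """Same conditions with /var=X: both must hold for the same Actor."""
    return any(a["gender"] == "f" and a["age"] > 15 for a in rec["Actor"])
```

The sample record satisfies the unbound query, since one Actor is female and another is older than 15, but not the bound one, since no single Actor meets both conditions.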

Content Query

This chapter only describes pure content queries. The content search engines, however, will have to understand the metadata part of a query to some extent as well. For this, see #CombinedQuery.

Examples

basic
 Liebe
 "not to be"
 pos=NN
 w regexp ^C.*N$
boolean
 ccs.word = "liebe" and ccs.pos ~ "V.*"
 
 ccs.word = Baum or  ccs.word = tree or ccs.word = arbre
 ccs.word = (Baum or tree or arbre)
 ccs.word oneof "Baum tree arbre"
multiple conditions for same word
 ccs.word =/ci "liebe" prox/unit=word/distance=0 ccs.pos="V.*"
 ccs.word =/ci "liebe" prox/0 ccs.pos="V.*"
 ccs.word =/ci/var=X "liebe" and ccs.pos =/var=X "V.*"
restrict to special parts of text annotation (in special elements)
 Heading regexp ^A.*
 Heading = A*

 Speaker any "I didn't say that"
 
 Speaker.gender =/var=X m
 and Speaker.gender =/var=Y f
 and Speaker any/var=X yes
 and Speaker any/var=Y no

However see notice in #OpenIssues!

thesaurus
 ccs.thes =   Person prox/s/0 ccs.thes =   work 
 ccs.word isa Person prox/s/0 ccs.word isa work

Combined Query

Although from the user's point of view there is a clear distinction between the MD part and the content part of a query, technically this is not so clear. The reason is that MDService in general returns the MD records of individual items in a collection that match the given metadata query, whereas the subsequent content search is submitted to the endpoint of the whole collection. It is therefore necessary to pass the metadata query to the search engine of the given collection as well, so that the engine can apply it as a filter for the content search, effectively querying an implicit subcorpus.

On the other hand, MDService probably should also understand the content part of the query: it should check whether the indices used in the query are available in the given collections.
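
The availability check mentioned above could look roughly like the following Python sketch. It is a simplification under stated assumptions: `missing_content_indices` is a hypothetical helper, only indices written with an explicit `ccs.` prefix are detected, and the query is scanned with a regular expression rather than parsed, so a `ccs.` string inside a quoted search term would be picked up as well.

```python
import re

# Matches explicitly prefixed content indices such as ccs.word or ccs.Speaker.gender.
CCS_INDEX = re.compile(r"\bccs\.([A-Za-z_][\w.]*)")


def missing_content_indices(query, available):
    """Return the ccs indices used in `query` that a collection does not offer.

    `available` is a hypothetical collection of index names reported by the
    collection's search engine.
    """
    used = set(CCS_INDEX.findall(query))
    return sorted(used - set(available))
```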

Mapping CLARIN Query Language to other Syntaxes

Metadata - XPath

Metadata queries will primarily have to be converted to XPath for search in MDRepository.

CLARIN QL                 XPath                              CLARIN QL Example                                    XPath Example
{cmdComponent}            {cmdComponent}                     Actor                                                Actor
{cmdPath}.{cmdComponent}  {cmdPath}/{cmdComponent}           Actor.Contact.Phone                                  Actor/Contact/Phone
{cmdIndex} {rel} {term}   {cmdIndex}[. {rel} '{term}']       Actors.Actor.Sex=f                                   Actors/Actor/Sex[.='f']
{cmdIndex} any {term}     {cmdIndex}[contains(., '{term}')]  Organisation.Name any University                     Organisation/Name[contains(.,'University')]
and, or, and not          ?!                                 Organisation.Name any University and Actor.gender=m  ?!
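
The rows of the mapping that are already decided can be expressed as a short Python sketch. The function names are hypothetical, and the sketch only produces XPath strings; it does not evaluate them. Boolean operators are left out because their mapping is still open (marked ?! above), and terms containing an apostrophe are rejected rather than escaped.

```python
COMPARISONS = {"=", "<", ">", "<=", ">="}


def index_to_path(cmd_index):
    """Actor.Contact.Phone -> Actor/Contact/Phone"""
    return cmd_index.replace(".", "/")


def clause_to_xpath(cmd_index, relation, term):
    """Convert a single `cmdIndex relation term` clause to an XPath string."""
    if "'" in term:
        raise ValueError("terms containing an apostrophe are not handled in this sketch")
    path = index_to_path(cmd_index)
    if relation == "any":
        # Follows the mapping above: `any` becomes a contains() predicate.
        return f"{path}[contains(.,'{term}')]"
    if relation in COMPARISONS:
        return f"{path}[.{relation}'{term}']"
    raise ValueError(f"no XPath mapping defined for relation {relation!r}")
```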

Candidate content search engines / target syntaxes

ddc
A powerful open-source corpus indexer / search engine http://www.ddc-concordance.org/
cqp
http://bulba.sdsu.edu/cqphelp.html
manatee
http://www.textforge.cz/querylanguage
cosmas?
http://www.ids-mannheim.de/cosmas2/uebersicht.html
leipzig-wortschatz?
http://corpora.uni-leipzig.de/

DDC

description                 CLARIN QL                                           DDC                                     remarks
word-level:
any word-form               [ccs].w
just that word-form         [ccs].w={word-form}                                 @{word-form}
                                                                                $w={word-form}
lemma                       [ccs].l={lemma}                                     $l={lemma}
                            [ccs].lemma={lemma}                                 %{lemma}
pos-tag                     [ccs].pos={pos}                                     $p={pos}
                            [ccs].p={pos}                                       [{pos}]
morphological features      [ccs].morph={list of morph features}                [{list of morph features}]
thesaurus                   [ccs].thes={superconcept}                           { {list of morph features} }
multiple criteria           {index1} =/var=X {term1}                            {index1}={term1} with {index2}={term2}
                            and {index2} =/var=X {term2}
patterns:
word starts with            [ccs].w = {word-start}*                             {word-start}*
                            [ccs].w = ^{word-start}
                            [ccs].w =^ {word-start}
word ends with              [ccs].w = *{word-end}                               *{word-end}
                            [ccs].w = {word-end}^
                            [ccs].w ^= {word-end}
contains                    [ccs].w any {word-part}                             /.*{word-part}.*/
                            [ccs].w = *{word-part}*
boolean-operators:
and                         and, &, &&                                          &&
or                          or, |, ||                                           ||
and not                     not, !                                              !
distance-operators:
exact sequence, phrase      "{phrase}"                                          "{phrase}"
maximum distance ordered    prox/unit=word/distance < {max-distance}            #{max-distance}
                            prox/w/<{max-distance}
exact distance ordered      prox/unit=word/distance = {distance}                ?: NEAR({Q1};{Q2};{distance})           no direct support in DDC
                            prox/w/{distance}                                   && ! NEAR({Q1};{Q2};{distance - 1})
maximum distance unordered  prox/unit=word/bidirectional/distance = {distance}  NEAR({Q1};{Q2};{max-distance})
                            prox/w/bi/>={distance}
window                      ? see #OpenIssues                                   near({Q1};{Q2};{Q3};{max-distance})
term within annotation      ccs.{annotationIndex} any {term}                    {term} #within {annotationIndex}        not fully supported in ddc!?
bibliographic-metadata:
bib-field                   {index} any {term}                                  #has_field[{index},{term}]
date-range                  dc.date > {date_from} and dc.date < {date_to}       #less_by_date[{date_from}, {date_to}]
further query options:
case-sensitive              ?                                                   {corpus option}
sort clause                 {whole-query} sortBy {index-list}                   #greater_by[{bib-field}]
                                                                                #less_by[{bib-field}]
restrict to subcorpus       ?                                                   {query} :{defined-subcorpus-list,}
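
The word-level rows of the table above lend themselves to a simple lookup. The following Python sketch translates a single word-level clause into the `$w=`, `$l=` and `$p=` forms listed in the DDC column. The function name and the lookup table are assumptions for illustration; the long form `word` is taken from the index grammar rather than from the table, and no claim is made that the output has been tested against a DDC server.

```python
# CLARIN QL word-level index names mapped to the DDC prefixes from the table.
DDC_PREFIX = {
    "w": "$w", "word": "$w",
    "l": "$l", "lemma": "$l",
    "p": "$p", "pos": "$p",
}


def word_clause_to_ddc(index, term):
    """Translate `[ccs.]index = term` into the corresponding DDC expression."""
    # The ccs. context-set prefix is optional in the proposed grammar.
    name = index[4:] if index.startswith("ccs.") else index
    if name not in DDC_PREFIX:
        raise ValueError(f"no DDC mapping for index {index!r} in this sketch")
    return f"{DDC_PREFIX[name]}={term}"
```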

Open Issues

Proximity

Proximity queries take a few parameters, which inflates the space of possible combinations. We have:

  1. number of regarded terms (usually two, but may be more)
  2. unit of distance (word, sentence, paragraph, page)
  3. distance (0-n, zero meaning in the same unit)
  4. comparitor ('=' exactly, '<' less than (no more than), '>' more than (at least))
  5. in element, meaning the terms shall both occur in the same containing element.
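
Parameters 2 to 4 of the list above map directly onto the long form of the CQL `prox` operator used in the examples on this page. The sketch below builds that string in Python. The class name `Proximity` is hypothetical, and the accepted units and comparitors are restricted to the ones named in the list; parameters 1 and 5 are not covered.

```python
from dataclasses import dataclass

UNITS = {"word", "sentence", "paragraph", "page"}  # units named in the list above
COMPARITORS = {"=", "<", ">"}                      # comparitors named in the list above


@dataclass
class Proximity:
    unit: str = "word"
    distance: int = 0
    comparitor: str = "="

    def to_cql(self):
        """Render the long form, e.g. prox/unit=word/distance=0."""
        if self.unit not in UNITS:
            raise ValueError(f"unknown unit: {self.unit!r}")
        if self.comparitor not in COMPARITORS:
            raise ValueError(f"unknown comparitor: {self.comparitor!r}")
        if self.distance < 0:
            raise ValueError("distance must not be negative")
        return f"prox/unit={self.unit}/distance{self.comparitor}{self.distance}"
```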

Parameters 2. to 4. are well supported in CQL (and Z39.50). However, there are two distinct problems with respect to proximity, same element and window, both of which are recognised by the authors of CQL and addressed in version 2.0 of CQL. They are discussed by Ray Denenberg in a DLib article (January 09), in chapter 4.2 Proximity.

same element

example:

 Find the name "adam smith" and date "1965" in the same author element.

window

example:

 Find "cat", "hat", and "rat" within a 10-word window.
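
To make the intended semantics of the window example concrete, here is a minimal Python sketch that checks a tokenised text directly. It says nothing about how a window would be expressed in CQL; the function name and the plain list-of-tokens input are assumptions for illustration.

```python
def within_window(tokens, terms, window):
    """True if all terms occur inside some span of `window` consecutive tokens."""
    wanted = set(terms)
    for start in range(len(tokens)):
        # Slide a fixed-size window over the text and test set inclusion.
        if wanted <= set(tokens[start:start + window]):
            return True
    return False
```

With the tokens of "the cat in the hat saw a rat", the three terms "cat", "hat" and "rat" fit in a 10-word window but not in a 6-word one.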

Faceted search

SRU/CQL 2.0 shall support conveying faceted results in a response.

Result format

In SRU 1.2 the format of the response is fixed. In SRU 2.0 the response may be supplied in an alternative schema, but the server still has to support the SRU response schema.

It has yet to be evaluated how far this accommodates our needs.
