Version 9 (modified by vronk, 14 years ago)
added Lucene in the introduction

CLARIN/CMDI Query Language

This page discusses and describes the Query Language that will be used primarily for querying MDService. It is also meant to cover content search, so that metadata and content search can be combined and the language meets the requirements of EDC.

There are at present two main existing candidate query languages:

Lucene: http://lucene.apache.org/java/3_0_0/queryparsersyntax.html an Apache project: pure Java full-text search engine
SRU/CQL: http://www.loc.gov/standards/sru/specs/cql.html a successor of Z39.50 proposed by Library of Congress - see below

Both would most probably need certain extensions to suit our needs. One option is to try to support both. This is risky given the limited development resources, but it would be more than just "nice": it would also send an important message.

At the moment the rest of this article focuses on the CQL version. As the discussion continues, we should add analogous information about Lucene and a separate direct feature comparison of the two.

General info about SRU/CQL

SRU is a protocol suite for search in information retrieval systems
with a user-friendly query language CQL = Contextual Query Language
comes also in WS flavor: SRW
proposed by LoC as successor of Z39.50 http://www.loc.gov/standards/sru/index.html
submitted to OASIS (~ 2007?) for standardization process as part of the OASIS Search Web Services TC
activity seems rather low (mostly one author: Ray Denenberg @loc.gov)
but nevertheless: version 1.2 Committee Draft and version 2.0 Draft (2009-07-22; latest draft: 2010-01-06)
seems fairly widespread in the Anglo-Saxon world (most prominent implementor: OCLC/Worldcat), but also used by The European Library
software supporting the protocol: http://www.indexdata.com/zebra
main operation/verb of the protocol: searchRetrieve
allows defining custom Context Sets (~ namespaces) to accommodate custom indices, relations and modifiers
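
As an illustration of the searchRetrieve operation, the following minimal Python sketch builds a request URL that carries a CQL query. The endpoint `http://example.org/sru` and the function name `build_search_retrieve_url` are assumptions made for this example only; the parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`) are those of SRU 1.2.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_retrieve_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU 1.2 searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    # urlencode takes care of escaping spaces and quotes in the CQL query.
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint, used only to show the shape of the request.
url = build_search_retrieve_url("http://example.org/sru", 'dc.title any "open access"')
print(url)

# Decoding the query string again returns the original CQL query.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```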

Syntax/Grammar (CQL)

main clause from the grammar of CQL:

 searchClause ::= index relation[/modifier] searchTerm

trying to fit in the existing syntax, i.e. only adding indices (and possibly relations, modifiers) in own context sets:

cmd: CLARIN Metadata - proposed CLARIN's own context set in SRU for metadata
ccs: CLARIN Content Search - proposed CLARIN's own context set in SRU for content

syntax extension (restriction) for CLARIN indices:

 index           ::= ['cmd.']cmdIndex | ['ccs.']contentIndex
 cmdIndex        ::= cmdIndex '.' cmdComponent
                     | cmdComponent
 cmdComponent    ::= {componentName} | {componentID}
 contentIndex    ::= wordLevelIndex | annotationIndex
 wordLevelIndex  ::= 'word' | 'w' | 'pos' | 'p' | 'lemma' | 'l' 
                     | 'thesaurus' | 't' | {...}
 annotationIndex ::= annotationPath ['.' annotationAttr]
 annotationPath  ::= annotationPath '.' {annotationElement}
                     | {annotationElement}
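
To make the index grammar above more concrete, here is a rough Python sketch that classifies an index string. The function name `parse_index` and the returned tuple layout are assumptions for illustration. The sketch splits on dots, so it does not handle component IDs that themselves contain a dot (such as `clarin.eu:2625`), and an index without a prefix that is not a word-level index is reported as ambiguous, because the grammar alone cannot tell a cmdComponent from an annotationElement.

```python
# Word-level index names taken from the grammar (the open-ended {...} part is not covered).
WORD_LEVEL = {"word", "w", "pos", "p", "lemma", "l", "thesaurus", "t"}


def parse_index(index):
    """Return (kind, context_set, path) for a CLARIN index string."""
    parts = index.split(".")
    if not all(parts):
        raise ValueError(f"empty path step in index: {index!r}")
    # The context set prefix ('cmd' or 'ccs') is optional.
    prefix = parts[0] if parts[0] in ("cmd", "ccs") and len(parts) > 1 else None
    path = parts[1:] if prefix else parts
    if prefix == "cmd":
        kind = "cmdIndex"
    elif len(path) == 1 and path[0] in WORD_LEVEL:
        kind = "wordLevelIndex"
    elif prefix == "ccs":
        kind = "annotationIndex"
    else:
        kind = "ambiguous"  # could be a cmdIndex or an annotationIndex
    return kind, prefix, path


print(parse_index("cmd.Actor.Contact.Phone"))
print(parse_index("ccs.pos"))
print(parse_index("Actor.age"))
```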

syntax extension for modifier to accommodate binding expressions:

	modifier ::= sruModifier | '/' var | {...?}
	var      ::= 'var=' varName
	varName  ::= [A-Z]

proposal for a syntax extension for relations:

 /* already defined in SRU/CQL: */ 
 relation        ::= comparitor [modifierList] 
 comparitor      ::= comparitorSymbol | namedComparitor 

 /* add: */
 namedComparitor ::= 'regexp' | 'pat' | '~'  /* regular expression search */
 namedComparitor ::= 'oneof' /* search for alternatives */

 /* allows using generic terms - Semantic search: */
 namedComparitor ::= sru.namedComparitor | 'isa'

sort clause: CQL 1.2 already provides a sort clause in the form: 'sortby {index list}' at the end of the whole query

term-only query

by definition resolves to cql.serverChoice = term
we could extend this and first try to interpret the term as a cmdIndex, and only if no such index exists continue to treat it as a term.
But we would still have to decide whether the term should be searched in (all) metadata indices or rather in the content.
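
The proposed two-step resolution could look roughly like the following Python sketch. The function name `resolve_term_only`, the return format and the set of known indices are hypothetical; the open question of whether the fallback should search metadata or content is deliberately left unresolved here.

```python
def resolve_term_only(term, known_cmd_indices):
    """Resolve a term-only query: prefer a matching cmdIndex, else fall back to serverChoice."""
    if term in known_cmd_indices:
        # The term names an existing index, so treat it as a cmdIndex.
        return {"type": "cmdIndex", "index": term}
    # Default CQL behaviour: the term-only query resolves to cql.serverChoice = term.
    return {"type": "searchClause", "index": "cql.serverChoice", "relation": "=", "term": term}


# Hypothetical set of indices known to the repository.
known = {"Actor", "Actor.Contact.Phone", "Organisation"}

print(resolve_term_only("Actor", known))
print(resolve_term_only("Liebe", known))
```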

Metadata Query

MDService will provide two main methods/operations:

 .queryModel()
 .searchRetrieve()

queryModel() provides information about the metadata (meta-model?), i.e. which components/elements/values are used in the repository. As its query parameter it only needs to accept/understand cmdIndex.

searchRetrieve() is the core operation to retrieve the actual MDRecords. In its query parameter it has to accept any query on metadata. It probably should not ignore the content part (ccs.) of the query either: there it should check whether the indices used in the query are available in the given collections.

Examples of `cmdIndex`

componentId

 clarin.eu:2625

componentName

 Actor

componentPath

 Actor.Contact.Phone

Examples of MD query

basic

 dc.title any "open access"
 dc.date > 1900 and dc.date < 1910

boolean operators

 Organisation any University and (dc.language=de or cmd.Country=Austria)
 and (dc.title any Liebe or cmd.Author any Trakl)

alternatives

 cmd.genre = (opera or novel or fantasy)
 /* proposed new relation: */
 cmd.genre oneof "opera novel fantasy"
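
Assuming the proposed `oneof` relation is meant as shorthand for the parenthesised form, a preprocessor could rewrite it into standard CQL along these lines. This is only a sketch: the function name `expand_oneof` is hypothetical, and alternatives consisting of several words are not handled.

```python
def expand_oneof(index, terms):
    """Rewrite 'index oneof "a b c"' into the equivalent 'index = (a or b or c)'."""
    alternatives = terms.split()
    if not alternatives:
        raise ValueError("oneof needs at least one alternative")
    return f"{index} = ({' or '.join(alternatives)})"


print(expand_oneof("cmd.genre", "opera novel fantasy"))
# prints: cmd.genre = (opera or novel or fantasy)
```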

if multiple conditions should apply to the same Component/Element:

bind-variables

 Actor.gender =/var=X f
 and Actor.age >/var=X 15

would probably select fewer records than the query:

 Actor.gender = f
 and Actor.age > 15

because with the bind variable both conditions have to apply to the same Actor
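
The difference can be illustrated with a small Python sketch over made-up records. The record structure and the function names are assumptions for this example, and the unbound reading (each condition may be satisfied by a different Actor) is one plausible interpretation, not a confirmed one.

```python
# Made-up metadata records, each with a list of Actor components.
records = [
    {"id": "r1", "actors": [{"gender": "f", "age": 30}]},
    {"id": "r2", "actors": [{"gender": "f", "age": 10}, {"gender": "m", "age": 40}]},
]


def match_unbound(record):
    """Without binding: each condition may be met by a different Actor."""
    actors = record["actors"]
    return any(a["gender"] == "f" for a in actors) and any(a["age"] > 15 for a in actors)


def match_bound(record):
    """With /var=X: both conditions must be met by the same Actor."""
    return any(a["gender"] == "f" and a["age"] > 15 for a in record["actors"])


unbound = [r["id"] for r in records if match_unbound(r)]
bound = [r["id"] for r in records if match_bound(r)]
print(unbound)  # ['r1', 'r2']
print(bound)    # ['r1']
```

Record r2 matches only the unbound query: it has a female Actor and an Actor older than 15, but they are different Actors.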

Content Query

This chapter only describes pure content queries. The content search engines will, however, have to understand the metadata part of a query to some extent as well. For this, see #CombinedQuery.

Examples

basic

 Liebe
 "not to be"
 pos=NN
 w regexp ^C.*N$

boolean

 ccs.word = "liebe" and ccs.pos ~ "V.*"
 
 ccs.word = Baum or  ccs.word = tree or ccs.word = arbre
 ccs.word = (Baum or tree or arbre)
 ccs.word oneof "Baum tree arbre"

multiple conditions for same word

 ccs.word =/ci "liebe" prox/unit=word/distance=0 ccs.pos="V.*"
 ccs.word =/ci "liebe" prox/0 ccs.pos="V.*"
 ccs.word =/ci/var=X "liebe" and ccs.pos =/var=X "V.*"

restrict to special parts of text annotation (in special elements)

 Heading regexp ^A.*
 Heading = A*

 Speaker any "I didn't say that"
 
 Speaker.gender =/var=X m
 and Speaker.gender =/var=Y f
 and Speaker any/var=X yes
 and Speaker any/var=Y no

However see notice in #OpenIssues!

thesaurus

 ccs.thes =   Person prox/s/0 ccs.thes =   work 
 ccs.word isa Person prox/s/0 ccs.word isa work

Combined Query

Although from the user's point of view there is a clear distinction between the MD part and the content part of a query, technically the distinction is less clear. The reason is that MDService in general returns the MD records of the individual items in a collection that match the given metadata query, whereas the subsequent content search is submitted against the endpoint of the whole collection. It is therefore necessary to pass the metadata query to the search engine of the given collection as well, so that it can apply it as a filter for the content search, effectively querying an implicit subcorpus.

Conversely, MDService should probably also understand the content part of the query, in order to check whether the indices used in the query are available in the given collections.

Mapping CLARIN Query Language to other Syntaxes

Metadata - XPath

Metadata queries will primarily have to be converted to XPath for searching in the MDRepository

CLARIN QL	XPath	CLARIN QL Example	XPath Example
{cmdComponent}	{cmdComponent}	Actor	Actor
{cmdPath}.{cmdComponent}	{cmdPath}/{cmdComponent}	Actor.Contact.Phone	Actor/Contact/Phone
{cmdIndex} {rel} {term}	{cmdIndex}[. {rel} '{term}']	Actors.Actor.Sex=f	Actors/Actor/Sex[.='f']
{cmdIndex} any {term}	{cmdIndex}[contains(., '{term}')]	Organisation.Name any University	Organisation/Name[contains(.,'University')]
and, or, and not	?!	Organisation.Name any University and Actor.gender=m	?!
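
The single-clause rows of the table above can be sketched in Python as follows. The function names are hypothetical, and the sketch makes several simplifying assumptions: component names contain no dots, terms contain no single quotes, and only relations that are written the same way in CQL and XPath are accepted. Boolean operators are not covered, since their mapping is still open in the table.

```python
# Relations written identically in CQL and XPath.
SHARED_RELATIONS = {"=", "<", ">", "<=", ">="}


def index_to_xpath(cmd_index):
    """Map a dotted cmdIndex such as Actor.Contact.Phone to Actor/Contact/Phone."""
    return cmd_index.replace(".", "/")


def clause_to_xpath(cmd_index, relation, term):
    """Map a single search clause (index, relation, term) to an XPath expression."""
    if "'" in term:
        raise ValueError("terms containing single quotes are not handled in this sketch")
    path = index_to_xpath(cmd_index)
    if relation == "any":
        return f"{path}[contains(.,'{term}')]"
    if relation not in SHARED_RELATIONS:
        raise ValueError(f"relation not covered by this sketch: {relation!r}")
    return f"{path}[.{relation}'{term}']"


print(clause_to_xpath("Actors.Actor.Sex", "=", "f"))
# prints: Actors/Actor/Sex[.='f']
print(clause_to_xpath("Organisation.Name", "any", "University"))
# prints: Organisation/Name[contains(.,'University')]
```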

Candidate content search engines / target syntaxes

ddc: A powerful open-source corpus indexer / search engine http://www.ddc-concordance.org/
cqp: http://bulba.sdsu.edu/cqphelp.html
manatee: http://www.textforge.cz/querylanguage
cosmas?: http://www.ids-mannheim.de/cosmas2/uebersicht.html
leipzig-wortschatz?: http://corpora.uni-leipzig.de/

DDC

description	CLARIN QL	DDC	remarks
word-level:
any word-form	[ccs].w
just that word-form	[ccs].w={word-form}	@{word-form} $w={word-form}
lemma	[ccs].l={lemma} [ccs].lemma={lemma}	$l={lemma} %{lemma}
pos-tag	[ccs].pos={pos} [ccs].p={pos}	$p={pos} [{pos}]
morphological features	[ccs].morph={list of morph features}	[{list of morph features}]
thesaurus	[ccs].thes={superconcept}	{ {list of morph features} }
multiple criteria	{index1} =/var=X {term1} and {index2} =/var=X {term2}	{index1}={term1} with {index2}={term2}
patterns:
word starts with	[ccs].w = {word-start}* [ccs].w = ^{word-start} [ccs].w =^ {word-start}	{word-start}*
word ends with	[ccs].w = *{word-end} [ccs].w = {word-end}^ [ccs].w ^= {word-end}	*{word-end}
contains	[ccs].w any {word-part} [ccs].w = {word-part}	/.{word-part}./
boolean-operators:
and	and, &, &&	&&
or	or, |, ||	||
and not	not, !	!
distance-operators:
exact sequence, phrase	"{phrase}"	"{phrase}"
maximum distance ordered	prox/unit=word/distance < {max-distance} prox/w/<{max-distance}	#{max-distance}
exact distance ordered	prox/unit=word/distance = {distance} prox/w/{distance}	?: NEAR({Q1};{Q2};{distance}) && ! NEAR({Q1};{Q2};{distance - 1})	no direct support in DDC
maximum distance unordered	prox/unit=word/bidirectional/distance = {distance} prox/w/bi/>={distance}	NEAR({Q1};{Q2};{max-distance})
window	? see #OpenIssues	near({Q1};{Q2};{Q3};{max-distance})
term within annotation	ccs.{annotationIndex} any {term}	{term} #within {annotationIndex}	not fully supported in ddc!?
bibliographic-metadata:
bib-field	{index} any {term}	#has_field[{index},{term}]
date-range	dc.date > {date_from} and dc.date < {date_to}	#less_by_date[{date_from}, {date_to}]
further query options:
case-sensitive	?	{corpus option}
sort clause	{whole-query} sortBy {index-list}	#greater_by[{bib-field}] #less_by[{bib-field}]
restrict to subcorpus	?	{query} :{defined-subcorpus-list,}
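
As a rough illustration, the word-level rows and the "multiple criteria" row of the table above could be translated as in the following Python sketch. It follows the table only and has not been checked against DDC itself; the function names and the lookup dictionary are assumptions made for this example.

```python
# CLARIN QL word-level index names mapped to the DDC index prefixes from the table.
DDC_WORD_LEVEL = {
    "w": "$w", "word": "$w",
    "l": "$l", "lemma": "$l",
    "p": "$p", "pos": "$p",
}


def clause_to_ddc(index, term):
    """Translate a word-level clause such as ccs.lemma=lieben to $l=lieben."""
    name = index[len("ccs."):] if index.startswith("ccs.") else index
    if name not in DDC_WORD_LEVEL:
        raise ValueError(f"index not covered by this sketch: {index!r}")
    return f"{DDC_WORD_LEVEL[name]}={term}"


def bound_clauses_to_ddc(clauses):
    """Translate clauses bound to the same variable into a DDC 'with' expression."""
    return " with ".join(clause_to_ddc(index, term) for index, term in clauses)


print(clause_to_ddc("ccs.lemma", "lieben"))
print(bound_clauses_to_ddc([("ccs.w", "liebe"), ("ccs.pos", "VVFIN")]))
```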

Open Issues

Proximity

Proximity queries take several parameters, which inflates the combinatorial space. We have:

1. number of regarded terms (usually two, but may be more)
2. unit of distance (word, sentence, paragraph, page)
3. distance (0-n, zero meaning in the same unit)
4. comparitor ('=' exactly, '<' less than (no more than), '>' more than (at least))
5. in element, meaning the terms shall both occur in the same containing element.

Parameters 2 to 4 are well supported in CQL (and Z39.50). However, there are two distinct problems with respect to proximity, same element and window, both of which are recognised by the authors of CQL and addressed in version 2.0 of CQL. They are discussed by Ray Denenberg in his D-Lib article of January 09, in chapter 4.2 Proximity.
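
The well-supported parameters (unit, distance, comparitor) can be captured in a small structure and rendered as a CQL `prox` operator with modifiers. The following Python sketch uses a hypothetical class `Proximity`; the list of units is the one given above, and the same-element and window cases are intentionally left out.

```python
from dataclasses import dataclass

UNITS = ("word", "sentence", "paragraph", "page")
COMPARITORS = ("=", "<", ">")


@dataclass(frozen=True)
class Proximity:
    unit: str = "word"
    distance: int = 0
    comparitor: str = "="

    def to_cql(self) -> str:
        """Render the parameters as a CQL prox operator with modifiers."""
        if self.unit not in UNITS:
            raise ValueError(f"unknown unit: {self.unit!r}")
        if self.comparitor not in COMPARITORS:
            raise ValueError(f"unknown comparitor: {self.comparitor!r}")
        if self.distance < 0:
            raise ValueError("distance must not be negative")
        return f"prox/unit={self.unit}/distance{self.comparitor}{self.distance}"


print(Proximity().to_cql())
# prints: prox/unit=word/distance=0
print(Proximity(unit="sentence", distance=2, comparitor="<").to_cql())
# prints: prox/unit=sentence/distance<2
```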

same element

example:

 Find the name "adam smith" and date "1965" in the same author element.

window