
CLARIN/CMDI Query Language

This page discusses and describes the Query Language (or, more broadly, the search interface/protocol) that is to be used primarily for querying MDService. The aspiration is to cover content search as well, allowing combined metadata and content search and thus meeting the requirements of EDC.

As for standards or existing protocols to base the work on, the focus currently lies on

SRU/CQL
an HTTP-based successor of Z39.50 proposed by the Library of Congress (see below).

But there are other existing proposals/protocols that need to be investigated, such as

OpenSearch
http://www.opensearch.org/
interesting article about OpenSearch and SRU/SRW integration: http://dlib.org/dlib/july10/hammond/07hammond.html
Lucene
http://lucene.apache.org/java/3_0_0/queryparsersyntax.html
an Apache project: pure Java full-text search engine
YQL
Yahoo Query Language
http://developer.yahoo.com/yql/

At the moment the rest of this article focuses on CQL. As the discussion continues, we should add analogous information about the other protocols, as well as a separate direct feature comparison.

General info about SRU/CQL

Syntax/Grammar (CQL)

The main clause of the CQL grammar:

 searchClause ::= index relation[/modifier] searchTerm

We try to fit into the existing syntax, i.e. we only add indices (and possibly relations and modifiers) in our own context sets:

cmd
CLARIN Metadata - proposed CLARIN context set in SRU for metadata
ccs
CLARIN Content Search - proposed CLARIN context set in SRU for content

syntax extension (restriction) for CLARIN indices:

 index           ::= ['cmd.']cmdIndex | ['ccs.']contentIndex
 cmdIndex        ::= cmdIndex '.' cmdComponent
                     | cmdComponent
 cmdComponent    ::= {componentName} | {componentID}
 contentIndex    ::= wordLevelIndex | annotationIndex
 wordLevelIndex  ::= 'word' | 'w' | 'pos' | 'p' | 'lemma' | 'l' 
                     | 'thesaurus' | 't' | {...}
 annotationIndex ::= annotationPath ['.' annotationAttr]
 annotationPath  ::= annotationPath '.' {annotationElement}
                     | {annotationElement}
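
As a rough illustration of the index grammar above, the following Python sketch splits an index into its context set and path steps. The function name classify_index, the characters allowed in a step, and the rule that an unprefixed word-level name falls into ccs are assumptions made for this example; component IDs that themselves contain dots are not handled.

```python
import re

# Word-level index names listed in the grammar above (the open-ended
# '{...}' placeholder is left out).
WORD_LEVEL = {"word", "w", "pos", "p", "lemma", "l", "thesaurus", "t"}

# One path step. The allowed characters are an assumption; the grammar
# only says {componentName}, {componentID} or {annotationElement}.
STEP = re.compile(r"[\w:-]+")


def classify_index(index: str) -> tuple[str, list[str]]:
    """Return (context_set, path_steps) for a CLARIN index string."""
    steps = index.split(".")
    if not all(STEP.fullmatch(step) for step in steps):
        raise ValueError(f"not a valid index: {index!r}")
    # Explicit context-set prefix: 'cmd.' or 'ccs.'
    if steps[0] in ("cmd", "ccs") and len(steps) > 1:
        return steps[0], steps[1:]
    # No prefix: guess the context set from the first step (assumption).
    context = "ccs" if steps[0] in WORD_LEVEL else "cmd"
    return context, steps


print(classify_index("cmd.Actor.Contact.Phone"))
print(classify_index("pos"))
```

A real implementation would have to consult the list of known indices instead of guessing the context set from the name.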

syntax extension for modifiers to accommodate binding expressions:

 modifier ::= sruModifier | '/' var | {...?}
 var      ::= 'var=' varName
 varName  ::= [A-Z]

proposal for a syntax extension for relations:

 /* already defined in SRU/CQL: */ 
 relation        ::= comparitor [modifierList] 
 comparitor      ::= comparitorSymbol | namedComparitor 

 /* add: */
 namedComparitor ::= 'regexp' | 'pat' | '~'  /* regular expression search */
 namedComparitor ::= 'oneof' /* search for alternatives */

 /* allows using generic terms - Semantic search: */
 namedComparitor ::= sru.namedComparitor | 'isa'   
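
To see how the proposed modifiers and named comparitors would look to a parser, here is a minimal Python sketch that splits a single searchClause into index, relation, modifiers and term. It uses a regular expression rather than a full CQL parser and ignores boolean operators and brackets; the name parse_clause and the layout of the returned dictionary are made up for illustration.

```python
import re

CLAUSE = re.compile(r"""
    ^\s*(?P<index>[\w.:-]+)                # index, e.g. Actor.gender
    (?:\s*(?P<sym><=|>=|<>|=|<|>)          # symbolic comparitor, spaces optional
      |\s+(?P<name>[A-Za-z]+)(?=[/\s]))    # named comparitor, must be its own word
    (?P<mods>(?:/[^\s/]+)*)                # modifiers such as /ci or /var=X
    \s*(?P<term>"[^"]*"|\S+)\s*$           # quoted or bare search term
""", re.VERBOSE)


def parse_clause(clause: str):
    """Split one searchClause into its parts; return None if it does not match."""
    match = CLAUSE.match(clause)
    if match is None:
        return None
    modifiers = {}
    for modifier in match.group("mods").split("/")[1:]:
        key, _, value = modifier.partition("=")
        # A modifier without a value (e.g. /ci) is stored as a flag.
        modifiers[key] = value or True
    return {
        "index": match.group("index"),
        "relation": match.group("sym") or match.group("name"),
        "modifiers": modifiers,
        "term": match.group("term").strip('"'),
    }


print(parse_clause("Actor.gender =/var=X f"))
print(parse_clause("w regexp ^C.*N$"))
```

A term-only query such as Liebe is deliberately not matched by this pattern, since it has no index and no relation.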

sort clause
CQL 1.2 already provides a sort clause of the form 'sortby {index list}' at the end of the whole query.

term-only query

  • by definition resolves to cql.serverChoice = term
  • we could extend this and first try to interpret the term as a cmdIndex, and only if no such index exists, continue to treat it as a term.
  • But still we would have to decide if the term should be searched in (all) metadata indices or rather in the content.
  • Currently MDService treats simple term queries as a simple full-text search in all (fields of all) MDRecords.
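
The proposed resolution order for term-only queries could be sketched as follows in Python. The function name, the (kind, value) return convention and the set of known indices are hypothetical; whether the fallback should search the metadata or the content is left open, as discussed above.

```python
def resolve_term_only(term: str, known_cmd_indices: set[str]) -> tuple[str, str]:
    """Decide how a bare term is interpreted (illustrative policy only)."""
    if term in known_cmd_indices:
        # The term names an existing cmdIndex: treat it as an index.
        return ("index", term)
    # Otherwise fall back to the CQL default for term-only queries.
    return ("fulltext", f'cql.serverChoice = "{term}"')


known = {"Actor", "Collection.GeneralInfo.Title"}
print(resolve_term_only("Actor", known))
print(resolve_term_only("Peter", known))
```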

Metadata Query

The interface specification for MDRepository/MDService was developed over a longer period. See MDService#Interface definition for more details.

Although MDService cannot at the moment claim conformance with the SRU/CQL standard, the core parts do conform to the specification: MDService accepts queries in CQL format (it even parses them, ensuring syntactic validity) and returns a <searchRetrieveResponse> result as defined by the protocol.
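
For orientation, a request to such an endpoint could be assembled like this in Python. The parameter names operation, version, query, startRecord and maximumRecords are those of SRU 1.2 searchRetrieve; the base URL is a placeholder, and which of these parameters MDService actually honours is not stated here.

```python
from urllib.parse import urlencode


def build_sru_url(base_url: str, cql_query: str,
                  start_record: int = 1, maximum_records: int = 10) -> str:
    """Build an SRU 1.2 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,          # urlencode takes care of escaping
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"


# 'http://example.org/mdservice' is a placeholder, not the real endpoint.
print(build_sru_url("http://example.org/mdservice", 'dc.title any "open access"'))
```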

The following describes the specific usage of CQL for formulating MD queries:

Types of cmdIndex

componentPath
can be used as an index for querying MDRepository. A partial path can still be ambiguous; only full paths (starting from the root component of a given profile) are guaranteed to be unambiguous (are they? At least relative to distinct profiles/schemas). A path can of course be spelled out down to the level of individual CMD_Elements (where the values actually reside); in fact MDRepository cannot distinguish between CMD_Components and CMD_Elements (other than the latter being the leaves of the XML tree).
 Actor.Contact.Phone
 Collection.GeneralInfo.Title
 Collection.Project.Title
componentName, elementName
a special case of the componentPath with only one component/element. It can also be used for querying MDRepository and is even the preferable form of index, as it should be the most efficient. It will usually be ambiguous (relative to the profiles).
 Actor
componentId
needs to be resolved to a componentPath before it can be sent to MDRepository, which is unaware of componentIDs
  clarin.eu:cr1:c_1280305685207
datacategory-identifier
also has to be translated to the appropriate componentPaths, as MDRepository is not aware of data categories either
  isocat:DC-2462: 
  isocat:annotationLevelType
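
The resolution step for componentIds and data category identifiers might look roughly like this in Python. The lookup tables, the prefix tests and the identifiers in the usage example are purely illustrative; in practice the mappings would have to come from the component registry and the data category registry.

```python
def resolve_cmd_index(index, id_to_paths, datcat_to_paths):
    """Return the componentPaths that MDRepository should be queried with."""
    if index.startswith("isocat:"):
        # A data category may expand to several componentPaths.
        return list(datcat_to_paths.get(index, []))
    if index.startswith("clarin.eu:"):
        # A componentId is looked up in the (hypothetical) registry table.
        return list(id_to_paths.get(index, []))
    # componentPath or componentName: usable as it is.
    return [index]


# Made-up identifiers and mappings, for illustration only.
ID_TO_PATHS = {"clarin.eu:cr1:c_EXAMPLE": ["Actor"]}
DATCAT_TO_PATHS = {
    "isocat:DC-EXAMPLE": ["Collection.GeneralInfo.Title", "Collection.Project.Title"],
}

print(resolve_cmd_index("isocat:DC-EXAMPLE", ID_TO_PATHS, DATCAT_TO_PATHS))
print(resolve_cmd_index("Actor.Contact.Phone", ID_TO_PATHS, DATCAT_TO_PATHS))
```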

Types of MD query

basic
 dc.title any "open access"
 dc.date > 1900 and dc.date < 1910
boolean operators
 Organisation any University and (dc.language=de or cmd.Country=Austria)
 and (dc.title any Liebe or cmd.Author any Trakl)
alternatives
 cmd.genre = (opera or novel or fantasy)
 /* proposed new relation: */
 cmd.genre oneof "opera novel fantasy"
collections
We propose to encode the collections-filter in the query as a special index (as opposed to a separate parameter)
   collection = {Collection-identifier}
  /* or: */
   partof oneof "{[Collection-identifiers]}"
Nevertheless, collections need special handling that allows searching transparently over the collection closure.
  lang=de AND Organisation any ... /* in Collection */
  AND Title any the /* in Item */
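
Coming back to the proposed oneof relation listed under "alternatives": a server that only understands standard CQL could rewrite it into a chain of or-clauses. The following Python sketch shows one such rewriting; the function name and the choice to quote every alternative are assumptions, and terms containing spaces or quotes are not handled.

```python
def expand_oneof(index: str, terms: str) -> str:
    """Rewrite 'index oneof "a b c"' into standard CQL or-clauses."""
    values = terms.split()
    if not values:
        raise ValueError("oneof needs at least one alternative")
    clauses = [f'{index} = "{value}"' for value in values]
    return "(" + " or ".join(clauses) + ")"


print(expand_oneof("cmd.genre", "opera novel fantasy"))
# (cmd.genre = "opera" or cmd.genre = "novel" or cmd.genre = "fantasy")
```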

If multiple conditions should apply to the same Component/Element, several variants are being discussed.

bind-variables (IMDI-style)
 Actor.gender =/var=X f
 and Actor.age >/var=X 15
 /* allows for distinguishing multiple contexts */
 Actor.gender =/var=X f  and Actor.age >/var=X 15 or
 Actor.gender =/var=Y m  and Actor.age </var=Y 10
bind-elements
simplified version, closer to the CQL2 proposal, but how to distinguish multiple contexts?
 Actor.gender =/var=Actor f
 and Actor.age >/var=Actor 15
same element (CQL2-style)
proposed by Denenberg in an article about CQL2
 bib.name ="adam smith" PROX/element=bib.author dc.date =1965 

 gender = f PROX/element=Actor age > 15
bracketing (XPath-style)
criticized by Denenberg as structured querying, which is a no-go, but wonderfully intuitive and clear:
  Actor[.gender=f and .age > 15]

  Actor ( gender=f and age > 15) or 
  Actor ( gender=m and age < 10)

All these variants are opposed to the simple:

 Actor.gender = f
 and Actor.age > 15 

where the latter would probably match more records, because the two conditions don't have to apply to the same Actor.
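
The difference can be made concrete with a small Python sketch. The record structure and both matching functions are invented for illustration and do not reflect how MDRepository actually evaluates queries.

```python
# Two toy metadata records, each with a list of actors (made-up data).
records = [
    {"id": "r1", "actors": [{"gender": "f", "age": 30}]},
    {"id": "r2", "actors": [{"gender": "f", "age": 10},
                            {"gender": "m", "age": 40}]},
]


def match_unbound(record):
    """Actor.gender = f and Actor.age > 15, conditions evaluated independently."""
    actors = record["actors"]
    return (any(a["gender"] == "f" for a in actors)
            and any(a["age"] > 15 for a in actors))


def match_bound(record):
    """Both conditions must hold for the same Actor (bind-variable semantics)."""
    return any(a["gender"] == "f" and a["age"] > 15 for a in record["actors"])


unbound = [r["id"] for r in records if match_unbound(r)]
bound = [r["id"] for r in records if match_bound(r)]
print(unbound)  # ['r1', 'r2']
print(bound)    # ['r1']
```

Record r2 only matches the unbound form: it has a female actor and an actor older than 15, but they are two different actors.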

Content Query

This chapter describes pure content queries only. The content search engines, however, will also have to understand the metadata part of a query to some extent. See #CombinedQuery for this.

Textual Query

basic
 Liebe
 "not to be"
 pos=NN
 w regexp ^C.*N$
boolean
 ccs.word = "liebe" and ccs.pos ~ "V.*"
 
 ccs.word = Baum or  ccs.word = tree or ccs.word = arbre
 ccs.word = (Baum or tree or arbre)
 ccs.word oneof "Baum tree arbre"
multiple conditions for same word
 ccs.word =/ci "liebe" prox/unit=word/distance=0 ccs.pos="V.*"
 ccs.word =/ci "liebe" prox/0 ccs.pos="V.*"
 ccs.word =/ci/var=X "liebe" and ccs.pos =/var=X "V.*"
restrict to special parts of text annotation (in special elements)
 Heading regexp ^A.*
 Heading = A*

 Speaker any "I didn't say that"
 
 Speaker.gender =/var=X m
 and Speaker.gender =/var=Y f
 and Speaker any/var=X yes
 and Speaker any/var=Y no

However, see the note in #OpenIssues!

thesaurus
 ccs.thes =   Person prox/s/0 ccs.thes =   work 
 ccs.word isa Person prox/s/0 ccs.word isa work

Sequential Tier Query

proximity

 Actor.1.w = s prox/s/0 ccs.thes =   work 
 ccs.word isa Person prox/s/0 ccs.word isa work

Combined Query

Although from the user's point of view there is a clear distinction between the MD part and the content part of a query, technically this is not so clear. The reason is that MDService in general returns the MD records of individual items in a collection that match a given metadata query, whereas the subsequent content search is submitted to the endpoint of the whole collection. It is therefore necessary to pass the metadata query to the collection's search engine as well, so that the engine can apply it as a filter for the content search, effectively querying an implicit subcorpus.

On the other hand, MDService should probably also understand the content part of the query, so that it can check whether the indices used in the query are available in the given collections.

Mapping CLARIN Query Language to other Syntaxes

Metadata - XPath

Metadata queries will primarily have to be converted to XPath for searching in MDRepository. The following is tentative, not normative!

Each entry gives the CLARIN QL pattern, the XPath pattern, a CLARIN QL example and the corresponding XPath example.

 CLARIN QL:          {term}   (simple search)
 XPath:              //*[ft:query(.,{term})]
 CLARIN QL example:  Peter
 XPath example:      //*[ft:query(.,'Peter')]

 CLARIN QL:          {cmdComponent}
 XPath:              //{cmdComponent}
 CLARIN QL example:  Actor
 XPath example:      //Actor

 CLARIN QL:          {cmdPath}.{cmdComponent}
 XPath:              //{cmdPath}/{cmdComponent}
 CLARIN QL example:  Actor.Contact.Phone
 XPath example:      //Actor/Contact/Phone

 CLARIN QL:          searchClause := {cmdIndex} {rel} {term}
 XPath:              //{cmdIndex}[. {rel} '{term}']
 CLARIN QL example:  Actors.Actor.Sex=f
 XPath example:      //Actors/Actor/Sex[.='f']

 CLARIN QL:          {cmdIndex} any {term}
 XPath:              //{cmdIndex}[contains(.,'{term}')]
 CLARIN QL example:  Organisation.Name any University
 XPath example:      //Organisation/Name[contains(.,'University')]

 CLARIN QL:          AND
 XPath:              //CMD[.//{Q1}][.//{Q2}]
 CLARIN QL example:  Organisation.Name any University and Actor.gender=m
 XPath example:      //CMD[.//Organisation/Name[contains(.,'University')]][.//Actor/gender[.='m']]

 CLARIN QL:          AND NOT
 XPath:              //CMD[.//{Q1}][not(.//{Q2})]
 CLARIN QL example:  Organisation.Name any University and_not Actor.gender=m
 XPath example:      //CMD[.//Organisation/Name[contains(.,'University')]][not(.//Actor/gender[.='m'])]

 CLARIN QL:          OR
 XPath:              //CMD[.//{Q1} or .//{Q2}]
 CLARIN QL example:  Organisation.Name any University or Actor.gender=m
 XPath example:      //CMD[.//Organisation/Name[contains(.,'University')] or .//Actor/gender[.='m']]

 CLARIN QL:          query expansion (SemanticMapping) := {datcat} {rel} {term}
 XPath:              //({cmdIndex1}|{cmdIndex2}|{cmdIndexN})[. {rel} '{term}']
 CLARIN QL example:  dc:title any Peter
 XPath example:      //(olac-title | teiHeader//titleStmt/title | teiHeader//monogr/title)[contains(.,'Peter')]
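
A minimal Python sketch of the searchClause, any and AND patterns from this mapping is given below. It is as tentative as the mapping itself: the function names are invented, only dot-separated component paths are handled, and terms containing a single quote are rejected instead of being escaped.

```python
def clause_to_xpath(index: str, relation: str, term: str) -> str:
    """Translate one '{cmdIndex} {rel} {term}' clause to XPath."""
    if "'" in term:
        raise ValueError("terms containing a single quote are not handled here")
    # Actor.Contact.Phone -> //Actor/Contact/Phone
    path = "//" + index.replace(".", "/")
    if relation == "any":
        return f"{path}[contains(.,'{term}')]"
    return f"{path}[.{relation}'{term}']"


def and_xpath(*clause_xpaths: str) -> str:
    """Combine clause XPaths following the pattern //CMD[.//{Q1}][.//{Q2}]."""
    # Each clause XPath starts with '//'; the leading '.' makes it relative to CMD.
    return "//CMD" + "".join(f"[.{xpath}]" for xpath in clause_xpaths)


name = clause_to_xpath("Organisation.Name", "any", "University")
sex = clause_to_xpath("Actors.Actor.Sex", "=", "f")
print(name)
print(sex)
print(and_xpath(name, sex))
```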

Candidate content search engines / target syntaxes

ddc
A powerful open-source corpus indexer / search engine http://www.ddc-concordance.org/
cqp
http://bulba.sdsu.edu/cqphelp.html
manatee
http://www.textforge.cz/querylanguage
cosmas?
http://www.ids-mannheim.de/cosmas2/uebersicht.html
leipzig-wortschatz?
http://corpora.uni-leipzig.de/

DDC

Each entry gives a description, the CLARIN QL form, the DDC form and, where present, remarks. Alternative spellings are listed one per line.

word-level:
any word-form
 CLARIN QL: [ccs].w
just that word-form
 CLARIN QL: [ccs].w={word-form}
 DDC:       @{word-form}
            $w={word-form}
lemma
 CLARIN QL: [ccs].l={lemma}
            [ccs].lemma={lemma}
 DDC:       $l={lemma}
            %{lemma}
pos-tag
 CLARIN QL: [ccs].pos={pos}
            [ccs].p={pos}
 DDC:       $p={pos}
            [{pos}]
morphological features
 CLARIN QL: [ccs].morph={list of morph features}
 DDC:       [{list of morph features}]
thesaurus
 CLARIN QL: [ccs].thes={superconcept}
 DDC:       { {list of morph features} }
multiple criteria
 CLARIN QL: {index1} =/var=X {term1} and {index2} =/var=X {term2}
 DDC:       {index1}={term1} with {index2}={term2}

patterns:
word starts with
 CLARIN QL: [ccs].w = {word-start}*
            [ccs].w = ^{word-start}
            [ccs].w =^ {word-start}
 DDC:       {word-start}*
word ends with
 CLARIN QL: [ccs].w = *{word-end}
            [ccs].w = {word-end}^
            [ccs].w ^= {word-end}
 DDC:       *{word-end}
contains
 CLARIN QL: [ccs].w any {word-part}
            [ccs].w = *{word-part}*
 DDC:       /.*{word-part}.*/

boolean-operators:
and
 CLARIN QL: and, &, &&
 DDC:       &&
or
 CLARIN QL: or, |, ||
 DDC:       ||
and not
 CLARIN QL: not, !
 DDC:       !

distance-operators:
exact sequence, phrase
 CLARIN QL: "{phrase}"
 DDC:       "{phrase}"
maximum distance ordered
 CLARIN QL: prox/unit=word/distance < {max-distance}
            prox/w/<{max-distance}
 DDC:       #{max-distance}
exact distance ordered
 CLARIN QL: prox/unit=word/distance = {distance}
            prox/w/{distance}
 DDC:       ?: NEAR({Q1};{Q2};{distance}) && ! NEAR({Q1};{Q2};{distance - 1})
 remarks:   no direct support in DDC
maximum distance unordered
 CLARIN QL: prox/unit=word/bidirectional/distance = {distance}
            prox/w/bi/>={distance}
 DDC:       NEAR({Q1};{Q2};{max-distance})
window
 CLARIN QL: ? see #OpenIssues
 DDC:       near({Q1};{Q2};{Q3};{max-distance})
term within annotation
 CLARIN QL: ccs.{annotationIndex} any {term}
 DDC:       {term} #within {annotationIndex}
 remarks:   not fully supported in ddc!?

bibliographic-metadata:
bib-field
 CLARIN QL: {index} any {term}
 DDC:       #has_field[{index},{term}]
date-range
 CLARIN QL: dc.date < {date_from} and dc.date > {date_to}
 DDC:       #less_by_date[{date_from}, {date_to}]

further query options:
case-sensitive
 CLARIN QL: ?
 DDC:       {corpus option}
sort clause
 CLARIN QL: {whole-query} sortBy {index-list}
 DDC:       #greater_by[{bib-field}]
            #less_by[{bib-field}]
restrict to subcorpus
 CLARIN QL: ?
 DDC:       {query} :{defined-subcorpus-list,}
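
To illustrate how the word-level and boolean rows of this mapping could be applied, here is a small Python sketch. It only uses the $w, $l, $p, && and || forms listed above; the function names are hypothetical, accepting the long form word next to w follows the grammar rather than this table, and no escaping of terms is attempted.

```python
# CLARIN QL word-level index (short or long form) -> DDC index prefix,
# taken from the word-level rows of the mapping above.
DDC_INDEX = {
    "w": "$w", "word": "$w",
    "l": "$l", "lemma": "$l",
    "p": "$p", "pos": "$p",
}

DDC_BOOLEAN = {"and": "&&", "or": "||"}


def clause_to_ddc(index: str, term: str) -> str:
    """Translate '[ccs].{index}={term}' into the DDC '$x={term}' form."""
    name = index.removeprefix("ccs.")
    if name not in DDC_INDEX:
        raise ValueError(f"no DDC mapping sketched for index {index!r}")
    return f"{DDC_INDEX[name]}={term}"


def join_ddc(operator: str, *queries: str) -> str:
    """Join already translated DDC queries with a boolean operator."""
    return f" {DDC_BOOLEAN[operator]} ".join(queries)


print(join_ddc("and", clause_to_ddc("ccs.lemma", "lieben"), clause_to_ddc("pos", "NN")))
```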

Open Issues

Proximity

Proximity queries take several parameters, which bloats the combinatorial space. We have:

  1. number of regarded terms (usually two, but may be more)
  2. unit of distance (word, sentence, paragraph, page)
  3. distance (0-n, zero meaning in the same unit)
  4. comparitor ('=' exactly, '<' less than (no more than), '>' more than (at least))
  5. in element, meaning the terms shall both occur in the same containing element.

Parameters 2 to 4 are well supported in CQL (and Z39.50). However, there are two distinct problems with proximity, same element and window, both of which are recognised by the authors of CQL and addressed in version 2.0 of CQL. They are discussed by Ray Denenberg in his D-Lib article of January 2009, in chapter 4.2 Proximity.
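
Unit, distance and comparitor map directly onto the modifiers of the CQL prox operator, as in the prox/unit=word/distance=0 examples earlier on this page. The Python helper below is a small sketch of that mapping; its name and defaults are hypothetical, and the same-element and window cases are not covered.

```python
VALID_COMPARITORS = {"=", "<", ">", "<=", ">=", "<>"}


def prox_operator(unit: str = "word", comparitor: str = "<=", distance: int = 1) -> str:
    """Build a two-term CQL prox operator with unit and distance modifiers."""
    if comparitor not in VALID_COMPARITORS:
        raise ValueError(f"unknown comparitor: {comparitor!r}")
    if distance < 0:
        raise ValueError("distance must not be negative")
    return f"prox/unit={unit}/distance{comparitor}{distance}"


# Two terms in the same sentence (distance 0 means 'in the same unit').
operator = prox_operator("sentence", "=", 0)
print(f"ccs.word = Baum {operator} ccs.word = tree")
```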

same element

example: Find the name "adam smith" and date "1965" in the same author element.

  bib.name ="adam smith" PROX/element=bib.author dc.date =1965 

window

example:

 Find "cat", "hat", and "rat" within a 10-word window.
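
One possible reading of the window condition, written out in Python: all terms must occur within some run of consecutive words no longer than the window. This only illustrates the intended semantics on a tokenised text; it says nothing about how CQL 2.0 or any search engine expresses or evaluates it, and the function name is made up.

```python
def within_window(tokens: list[str], terms: list[str], window: int) -> bool:
    """True if all terms occur inside some run of `window` consecutive tokens."""
    wanted = set(terms)
    for start in range(len(tokens)):
        # Tokens covered by the window that begins at position `start`.
        if wanted <= set(tokens[start:start + window]):
            return True
    return False


tokens = "the cat in the hat chased a rat".split()
print(within_window(tokens, ["cat", "hat", "rat"], 10))  # True
print(within_window(tokens, ["cat", "hat", "rat"], 6))   # False
```

In the example the three words span seven tokens (from cat to rat), so a 10-word window matches and a 6-word window does not.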

Faceted search

SRU/CQL 2.0 is to support conveying faceted results in a response.

Result format

In SRU 1.2 the format of the response is fixed. In SRU 2.0 the response may be supplied in an alternative schema, but the server still has to support the SRU response schema.

It has yet to be evaluated how far this accommodates our needs.

Last modified on 01/22/11 17:43:23
