CLARIN/CMDI Query Language

This page discusses and describes the query language (or, more broadly, the search interface/protocol) that will be used primarily for querying MDService. The aspiration is that it also covers content search, allowing combined metadata and content search, and is thus applicable to the requirements of EDC.

As for standards or existing protocols to base the work on, the current focus is on:

SRU/CQL: an HTTP-based successor of Z39.50 proposed by the Library of Congress (see below).

Other existing proposals/protocols still need to be investigated, such as:

OpenSearch: http://www.opensearch.org/
interesting article about OpenSearch and SRU/SRW integration: http://dlib.org/dlib/july10/hammond/07hammond.html
Lucene: http://lucene.apache.org/java/3_0_0/queryparsersyntax.html
an Apache project: a pure-Java full-text search engine
YQL: Yahoo Query Language
http://developer.yahoo.com/yql/

At the moment the rest of this article focuses on CQL. As the discussion continues, we should add analogous information about the other protocols, along with a separate direct feature comparison.

General info about SRU/CQL

http://www.loc.gov/standards/sru/specs/cql.html
SRU is a protocol suite for searching information retrieval systems
with a user-friendly query language, CQL (Contextual Query Language)
also comes in a web-service (SOAP) flavor: SRW
proposed by LoC as the successor of Z39.50: http://www.loc.gov/standards/sru/index.html
submitted to OASIS (~ 2007?) for standardization as part of the OASIS Search Web Services TC
activity seems rather low (mostly one author: Ray Denenberg @loc.gov)
but nevertheless: version 1.2 Committee Draft and version 2.0 Draft (2009-07-22; the latest draft even later: 2010-01-06)
seems fairly widespread in the Anglo-Saxon world (most prominent implementor: OCLC/WorldCat), but is also used by The European Library
software supporting the protocol: Zebra suite, especially: http://www.indexdata.com/cql-java
main operation/verb of the protocol: searchRetrieve
allows defining custom Context Sets (~ namespaces) to accommodate custom indices, relations and modifiers

Syntax/Grammar (CQL)

main clause from the grammar of CQL:

 searchClause ::= index relation[/modifier] searchTerm

We try to fit into the existing syntax, i.e. we only add indices (and possibly relations and modifiers) in our own context sets:

cmd: CLARIN Metadata - proposed CLARIN's own context set in SRU for metadata
ccs: CLARIN Content Search - proposed CLARIN's own context set in SRU for content

syntax extension (restriction) for CLARIN indices:

 index           ::= ['cmd.']cmdIndex | ['ccs.']contentIndex
 cmdIndex        ::= cmdIndex '.' cmdComponent
                     | cmdComponent
 cmdComponent    ::= {componentName} | {componentID}
 contentIndex    ::= wordLevelIndex | annotationIndex
 wordLevelIndex  ::= 'word' | 'w' | 'pos' | 'p' | 'lemma' | 'l' 
                     | 'thesaurus' | 't' | {...}
 annotationIndex ::= annotationPath ['.' annotationAttr]
 annotationPath  ::= annotationPath '.' {annotationElement}
                     | {annotationElement}
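
The index grammar above could be handled roughly as follows. This is a minimal sketch, not part of any existing implementation: the function name `parse_index`, the `WORD_LEVEL` set and the fallback to the `cmd` context set are assumptions made for illustration. It also ignores componentIDs, which themselves contain dots.

```python
# Word-level content indices (long and short forms) taken from the grammar above.
WORD_LEVEL = {"word", "w", "pos", "p", "lemma", "l", "thesaurus", "t"}


def parse_index(index, default_set="cmd"):
    """Split an index into (context_set, path_parts).

    Assumption: an index without an explicit prefix belongs to 'ccs' if it
    is a known word-level index, otherwise to `default_set`.
    """
    parts = index.split(".")
    # Explicit context-set prefix, e.g. 'cmd.Actor.Contact.Phone'.
    if parts[0] in ("cmd", "ccs") and len(parts) > 1:
        return parts[0], parts[1:]
    # No prefix: a bare word-level name is treated as a content index.
    if len(parts) == 1 and parts[0] in WORD_LEVEL:
        return "ccs", parts
    return default_set, parts


print(parse_index("cmd.Actor.Contact.Phone"))
print(parse_index("lemma"))
```

The prefix is optional in the grammar, so some rule like the fallback above is needed to decide between the two context sets; which rule is appropriate is still open.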

syntax extension for modifiers to accommodate binding expressions:

	modifier ::= sruModifier | '/' var | {...?}
	var      ::= 'var=' varName
	varName  ::= [A-Z]

proposal for a syntax extension for relations:

 /* already defined in SRU/CQL: */ 
 relation        ::= comparitor [modifierList] 
 comparitor      ::= comparitorSymbol | namedComparitor 

 /* add: */
 namedComparitor ::= 'regexp' | 'pat' | '~'  /* regular expression search */
 namedComparitor ::= 'oneof' /* search for alternatives */

 /* allows using generic terms - Semantic search: */
 namedComparitor ::= sru.namedComparitor | 'isa'
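
To illustrate how the proposed relations and the `var` modifier fit into a single searchClause, here is a hedged sketch of a clause parser based on a regular expression. It is not a full CQL parser (no boolean operators, no nesting), and the names `CLAUSE` and `parse_clause` are hypothetical.

```python
import re

# One simple searchClause: index, relation (symbol or name), modifiers, term.
CLAUSE = re.compile(r"""
    ^\s*(?P<index>[\w.:-]+)                 # index, optionally with context set
    (?:\s*(?P<sym><=|>=|<>|==|=|<|>|~)      # comparitorSymbol, plus proposed '~'
      |\s+(?P<named>any|all|exact|regexp|pat|oneof|isa)\b)  # namedComparitor
    (?P<mods>(?:/\w+(?:=\w+)?)*)            # modifier list, e.g. /ci/var=X
    \s*(?P<term>.+?)\s*$                    # search term, possibly quoted
""", re.VERBOSE)


def parse_clause(clause):
    """Parse a single searchClause into its parts; raise ValueError otherwise."""
    match = CLAUSE.match(clause)
    if match is None:
        raise ValueError(f"not a simple searchClause: {clause!r}")
    modifiers = {}
    for mod in match.group("mods").split("/")[1:]:
        name, _, value = mod.partition("=")
        modifiers[name] = value or True  # valueless modifiers become flags
    return {
        "index": match.group("index"),
        "relation": match.group("sym") or match.group("named"),
        "modifiers": modifiers,
        "term": match.group("term").strip('"'),
    }


print(parse_clause('ccs.word =/ci/var=X "liebe"'))
print(parse_clause("w regexp ^C.*N$"))
```

A production implementation would more likely extend an existing CQL parser (such as cql-java mentioned above) than rely on a regular expression.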

sort clause: CQL 1.2 already provides a sort clause in the form: 'sortby {index list}' at the end of the whole query

term-only query

by definition resolves to cql.serverChoice = term
We could extend this: first try to interpret the term as a cmdIndex and, only if no such index exists, treat it as a search term.
We would still have to decide whether the term should be searched in (all) metadata indices or rather in the content.
Currently MDService treats simple term queries as a simple full-text search in all (fields of all) MDRecords.
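
The proposed resolution order for term-only queries could look like this. It is only a sketch: `resolve_term_only` and the `known_indices` argument are hypothetical, and the question of metadata versus content search is deliberately left out.

```python
def resolve_term_only(term, known_indices):
    """Resolve a query consisting of a single term.

    Returns ("index", term) if the term names a known cmdIndex, otherwise the
    standard CQL resolution to cql.serverChoice as a ("clause", ...) pair.
    """
    if term in known_indices:
        return ("index", term)
    return ("clause", f'cql.serverChoice = "{term}"')


# Assumed inventory of indices, for illustration only.
indices = {"Actor", "Collection.Project.Title"}
print(resolve_term_only("Actor", indices))
print(resolve_term_only("Peter", indices))
```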

Metadata Query

The interface specification for MDRepository/MDService went through a longer development. For more details see MDService#Interface definition.

Although MDService cannot currently claim conformance with the SRU/CQL standard, its core parts do conform to the specification: MDService accepts queries in CQL format (it parses them, ensuring syntactic validity) and returns a <searchRetrieveResponse> result as defined by the protocol.
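
As an illustration of what such a request might look like, the following sketch builds a searchRetrieve URL with the standard SRU 1.2 parameters. The base URL is a placeholder, and whether MDService accepts exactly these parameter names is an assumption, not something stated on this page.

```python
from urllib.parse import urlencode


def search_retrieve_url(base_url, cql, start_record=1, maximum_records=10):
    """Build an SRU searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql,                      # the CQL query, URL-encoded below
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"


# 'http://example.org/mdservice' is a hypothetical endpoint.
print(search_retrieve_url("http://example.org/mdservice", 'dc.title any "open access"'))
```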

The following describes the specific usage of CQL for formulating MD queries:

Types of `cmdIndex`

componentPath

can be used as an index for querying MDRepository. A partial path can still be ambiguous; only full paths (starting from the root component of a given profile) are guaranteed to be unambiguous (are they? At least relative to distinct profiles/schemas). A path can of course extend down to the level of individual CMD_Elements (where the values actually reside). In fact, MDRepository cannot distinguish between CMD_Components and CMD_Elements (other than the latter being the leaves of the XML tree).

 Actor.Contact.Phone
 Collection.GeneralInfo.Title
 Collection.Project.Title

componentName, elementName

a special case of the componentPath, with only one component/element. It can also be used for querying MDRepository and is even the preferable form of index, as it should be the most efficient. It will usually be ambiguous (relative to the profiles).

 Actor

componentId

needs to be resolved to a componentPath before it can be sent to MDRepository, as MDRepository is unaware of componentIDs

  clarin.eu:cr1:c_1280305685207

datacategory-identifier

also has to be translated to the appropriate componentPaths, as MDRepository is not aware of data categories either

  isocat:DC-2462: 
  isocat:annotationLevelType
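
Since componentIds and data category identifiers must be resolved before a query reaches MDRepository, a service first has to recognise which kind of index it was given. The sketch below does this purely by the prefixes seen in the examples above (`clarin.eu:` and `isocat:`); that these prefixes are sufficient is an assumption, and `classify_cmd_index` is a hypothetical name.

```python
def classify_cmd_index(index):
    """Guess the type of a cmdIndex from its surface form."""
    if index.startswith("clarin.eu:"):
        return "componentId"        # must be resolved to a componentPath
    if index.startswith("isocat:"):
        return "datacategory"       # must be translated to componentPaths
    if "." in index:
        return "componentPath"      # can be sent to MDRepository directly
    return "componentName"          # single component or element name


for idx in ("Actor.Contact.Phone", "Actor",
            "clarin.eu:cr1:c_1280305685207", "isocat:DC-2462"):
    print(idx, "->", classify_cmd_index(idx))
```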

Types of MD query

basic

 dc.title any "open access"
 dc.date > 1900 and dc.date < 1910

boolean operators

 Organisation any University and (dc.language=de or cmd.Country=Austria)
 and (dc.title any Liebe or cmd.Author any Trakl)

alternatives

 cmd.genre = (opera or novel or fantasy)
 /* proposed new relation: */
 cmd.genre oneof "opera novel fantasy"
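
A server that does not natively support the proposed oneof relation could rewrite it into plain boolean CQL before evaluation. This is a sketch under the assumption that alternatives are separated by whitespace and contain no spaces themselves; `expand_oneof` is a hypothetical helper.

```python
def expand_oneof(index, term):
    """Rewrite 'index oneof "a b c"' into an OR of equality clauses."""
    alternatives = term.strip('"').split()
    return "(" + " or ".join(f"{index} = {alt}" for alt in alternatives) + ")"


print(expand_oneof("cmd.genre", '"opera novel fantasy"'))
```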

collections

We propose to encode the collections-filter in the query as a special index (as opposed to a separate parameter)

   collection = {Collection-identifier}
  /* or: */
   partof oneof "{[Collection-identifiers]}"

Nevertheless, collections need special handling that allows searching transparently over the collection closure.

  lang=de AND Organisation any ... /* in Collection */
  AND Title any the /* in Item */
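
Searching over the collection closure presupposes that the closure can be computed, i.e. the set containing a collection and all of its direct and indirect sub-collections. A minimal sketch, assuming the hierarchy is available as a mapping from a collection identifier to its child identifiers (the identifiers and the function name are made up for illustration):

```python
def collection_closure(root, children):
    """Return the set of `root` and all its direct and indirect sub-collections."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:        # guards against cycles and shared children
            seen.add(node)
            stack.extend(children.get(node, []))
    return seen


# Hypothetical hierarchy: A contains B and C, B contains D.
hierarchy = {"A": ["B", "C"], "B": ["D"]}
print(sorted(collection_closure("A", hierarchy)))
```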

If multiple conditions should apply to the same Component/Element, several variants are being discussed.

bind-variables (IMDI-style)

 Actor.gender =/var=X f
 and Actor.age >/var=X 15
 /* allows for distinguishing multiple contexts */
 Actor.gender =/var=X f  and Actor.age >/var=X 15 or
 Actor.gender =/var=Y m  and Actor.age </var=Y 10

bind-elements

simplified version, closer to the CQL2 proposal, but how to distinguish multiple contexts?

 Actor.gender =/var=Actor f
 and Actor.age >/var=Actor 15

same element (CQL2-style)

proposed by Denenberg in an article about CQL2

 bib.name ="adam smith" PROX/element=bib.author dc.date =1965 

 gender = f PROX/element=Actor age > 15

bracketing (XPath-style)

criticized by Denenberg as structured querying, which is a no-go. But wonderfully intuitive and clear:

  Actor[.gender=f and .age > 15]

  Actor ( gender=f and age > 15) or 
  Actor ( gender=m and age < 10)

All these variants are opposed to a simple:

 Actor.gender = f
 and Actor.age > 15

where the latter would probably match more records, because the two conditions do not have to apply to the same Actor.
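
The difference between the bound and the unbound reading can be made concrete with a toy record. The record structure and both function names below are invented for illustration; they do not reflect how MDRepository stores or evaluates records.

```python
# A record with two Actors: a 10-year-old female and a 40-year-old male.
record = {"Actor": [{"gender": "f", "age": 10}, {"gender": "m", "age": 40}]}


def matches_unbound(rec):
    """Actor.gender = f and Actor.age > 15: conditions may hit different Actors."""
    actors = rec["Actor"]
    return (any(a["gender"] == "f" for a in actors)
            and any(a["age"] > 15 for a in actors))


def matches_bound(rec):
    """Actor.gender =/var=X f and Actor.age >/var=X 15: same Actor required."""
    return any(a["gender"] == "f" and a["age"] > 15 for a in rec["Actor"])


print(matches_unbound(record))  # True: one Actor is female, another is over 15
print(matches_bound(record))    # False: no single Actor satisfies both
```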

Content Query

This chapter only describes pure content queries. The content search engines will, however, also have to understand the metadata part of a query to some extent. For this, see #CombinedQuery.

Textual Query

basic

 Liebe
 "not to be"
 pos=NN
 w regexp ^C.*N$

boolean

 ccs.word = "liebe" and ccs.pos ~ "V.*"
 
 ccs.word = Baum or  ccs.word = tree or ccs.word = arbre
 ccs.word = (Baum or tree or arbre)
 ccs.word oneof "Baum tree arbre"

multiple conditions for same word

 ccs.word =/ci "liebe" prox/unit=word/distance=0 ccs.pos="V.*"
 ccs.word =/ci "liebe" prox/0 ccs.pos="V.*"
 ccs.word =/ci/var=X "liebe" and ccs.pos =/var=X "V.*"

restrict to special parts of text annotation (in special elements)

 Heading regexp ^A.*
 Heading = A*

 Speaker any "I didn't say that"
 
 Speaker.gender =/var=X m
 and Speaker.gender =/var=Y f
 and Speaker any/var=X yes
 and Speaker any/var=Y no

However, see the note in #OpenIssues!

thesaurus

 ccs.thes =   Person prox/s/0 ccs.thes =   work 
 ccs.word isa Person prox/s/0 ccs.word isa work

Sequential Tier Query

proximity

 Actor.1.w = s prox/s/0 ccs.thes =   work 
 ccs.word isa Person prox/s/0 ccs.word isa work

Combined Query

Although from the user's point of view there is a clear distinction between the MD part and the content part of a query, technically this is not so clear. MDService generally returns the MD records of individual items in a collection that match a given metadata query, but the subsequent content search is submitted to the endpoint of the whole collection. It is therefore necessary to pass the metadata query to the collection's search engine as well, so that it can apply it as a filter for the content search, effectively querying an implicit subcorpus.

Conversely, MDService should probably also understand the content part of the query, so that it can check whether the indices used in the query are available in the given collections.
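
One simple way for either side to see "its" part of a combined query is to partition the clauses by context set. The sketch below assumes the query is a flat conjunction of clauses and that content indices always carry the explicit `ccs.` prefix; both are simplifications, and `split_combined` is a hypothetical helper.

```python
def split_combined(clauses):
    """Partition AND-ed clauses into (metadata_query, content_query)."""
    md, content = [], []
    for clause in clauses:
        index = clause.split()[0]   # first token of the clause is the index
        (content if index.startswith("ccs.") else md).append(clause)
    return " and ".join(md), " and ".join(content)


md_query, content_query = split_combined(
    ["cmd.Country = Austria", "ccs.word = Baum", "dc.language = de"]
)
print(md_query)       # metadata filter, defines the implicit subcorpus
print(content_query)  # content search within that subcorpus
```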

Mapping CLARIN Query Language to other Syntaxes

Metadata - XPath

Metadata queries will primarily have to be converted to XPath for searching in MDRepository. The following is tentative, not normative!

CLARIN QL	XPath	CLARIN QL Example	XPath Example
simple search: {term}	`//*[ft:query(.,'{term}')]`	Peter	`//*[ft:query(.,'Peter')]`
{cmdComponent}	`//{cmdComponent}`	Actor	`//Actor`
{cmdPath}.{cmdComponent}	`//{cmdPath}/{cmdComponent}`	Actor.Contact.Phone	`//Actor/Contact/Phone`
searchClause := {cmdIndex} {rel} {term}	`//{cmdIndex}[. {rel} '{term}']`	Actors.Actor.Sex=f	`//Actors/Actor/Sex[.='f']`
{cmdIndex} any {term}	`//{cmdIndex}[contains(.,'{term}')]`	Organisation.Name any University	`//Organisation/Name[contains(.,'University')]`
AND	`//CMD[.//{Q1}][.//{Q2}]`	Organisation.Name any University and Actor.gender=m	`//CMD[.//Organisation/Name[contains(.,'University')]][.//Actor/gender='m']`
AND NOT	`//CMD[.//{Q1}][not(.//{Q2})]`	Organisation.Name any University and_not Actor.gender=m	`//CMD[.//Organisation/Name[contains(.,'University')]][not(.//Actor/gender='m')]`
OR	`//CMD[.//{Q1} or .//{Q2}]`	Organisation.Name any University or Actor.gender=m	`//CMD[.//Organisation/Name[contains(.,'University')] or .//Actor/gender='m']`
query expansion (SemanticMapping) := {datcat} {rel} {term}	`//({cmdIndex1}|{cmdIndex2}|{cmdIndexN})[. {rel} '{term}']`	dc:title any Peter	`//(olac-title | teiHeader//titleStmt/title | teiHeader//monogr/title)[contains(.,'Peter')]`
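
The mapping rows above can be turned into a small translation routine. The following is a tentative sketch covering only simple clauses and conjunctions; the function names are hypothetical, and terms are inserted without escaping, so a term containing an apostrophe would produce invalid XPath.

```python
def clause_to_xpath(index, relation, term):
    """Translate one searchClause into an XPath expression."""
    path = "//" + "/".join(index.split("."))    # Actor.Contact -> //Actor/Contact
    if relation == "any":
        return f"{path}[contains(.,'{term}')]"
    return f"{path}[.{relation}'{term}']"


def conjunction_to_xpath(*clauses):
    """Translate AND-ed clauses into predicates on the CMD root element."""
    # Prefixing '.' turns '//X' into the relative './/X' inside the predicate.
    return "//CMD" + "".join(f"[.{clause_to_xpath(*c)}]" for c in clauses)


print(clause_to_xpath("Actors.Actor.Sex", "=", "f"))
print(conjunction_to_xpath(("Organisation.Name", "any", "University"),
                           ("Actor.gender", "=", "m")))
```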

Candidate content search engines / target syntaxes

ddc: A powerful open-source corpus indexer / search engine http://www.ddc-concordance.org/
cqp: http://bulba.sdsu.edu/cqphelp.html
manatee: http://www.textforge.cz/querylanguage
cosmas?: http://www.ids-mannheim.de/cosmas2/uebersicht.html
leipzig-wortschatz?: http://corpora.uni-leipzig.de/

DDC

description	CLARIN QL	DDC	remarks
word-level:
any word-form	[ccs].w
just that word-form	[ccs].w={word-form}	@{word-form} $w={word-form}
lemma	[ccs].l={lemma} [ccs].lemma={lemma}	$l={lemma} %{lemma}
pos-tag	[ccs].pos={pos} [ccs].p={pos}	$p={pos} [{pos}]
morphological features	[ccs].morph={list of morph features}	[{list of morph features}]
thesaurus	[ccs].thes={superconcept}	{ {list of morph features} }
multiple criteria	{index1} =/var=X {term1} and {index2} =/var=X {term2}	{index1}={term1} with {index2}={term2}
patterns:
word starts with	[ccs].w = {word-start}* [ccs].w = ^{word-start} [ccs].w =^ {word-start}	{word-start}*
word ends with	[ccs].w = *{word-end} [ccs].w = {word-end}^ [ccs].w ^= {word-end}	*{word-end}
contains	[ccs].w any {word-part} [ccs].w = {word-part}	/.{word-part}./
boolean-operators:
and	and, &, &&	&&
or	or, |, ||	||
and not	not, !	!
distance-operators:
exact sequence, phrase	"{phrase}"	"{phrase}"
maximum distance ordered	prox/unit=word/distance < {max-distance} prox/w/<{max-distance}	#{max-distance}
exact distance ordered	prox/unit=word/distance = {distance} prox/w/{distance}	?: NEAR({Q1};{Q2};{distance}) && ! NEAR({Q1};{Q2};{distance - 1})	no direct support in DDC
maximum distance unordered	prox/unit=word/bidirectional/distance = {distance} prox/w/bi/>={distance}	NEAR({Q1};{Q2};{max-distance})
window	? see #OpenIssues	near({Q1};{Q2};{Q3};{max-distance})
term within annotation	ccs.{annotationIndex} any {term}	{term} #within {annotationIndex}	not fully supported in ddc!?
bibliographic-metadata:
bib-field	{index} any {term}	#has_field[{index},{term}]
date-range	dc.date > {date_from} and dc.date < {date_to}	#less_by_date[{date_from}, {date_to}]
further query options:
case-sensitive	?	{corpus option}
sort clause	{whole-query} sortBy {index-list}	#greater_by[{bib-field}] #less_by[{bib-field}]
restrict to subcorpus	?	{query} :{defined-subcorpus-list,}
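
For the word-level rows of the table, the translation to DDC is a simple lookup. The sketch below follows the `$w`, `$l` and `$p` forms and the `with` operator exactly as listed in the table; it has not been checked against a DDC installation, and the function names are hypothetical.

```python
# CLARIN word-level index (long or short form) -> DDC index, per the table above.
DDC_INDEX = {"w": "$w", "word": "$w", "l": "$l", "lemma": "$l",
             "p": "$p", "pos": "$p"}


def clause_to_ddc(index, term):
    """Translate a word-level equality clause into DDC syntax."""
    name = index.split(".", 1)[1] if index.startswith("ccs.") else index
    return f"{DDC_INDEX[name]}={term}"


def bound_clauses_to_ddc(clauses):
    """Translate clauses bound to the same token (=/var=X) using DDC 'with'."""
    return " with ".join(clause_to_ddc(i, t) for i, t in clauses)


print(clause_to_ddc("ccs.lemma", "gehen"))
print(bound_clauses_to_ddc([("ccs.word", "liebe"), ("ccs.pos", "VVFIN")]))
```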

Open Issues

Proximity

Proximity queries take a number of parameters, which inflates the combinatorial space. We have:

1. number of regarded terms (usually two, but may be more)
2. unit of distance (word, sentence, paragraph, page)
3. distance (0-n, zero meaning in the same unit)
4. comparitor ('=' exactly, '<' less than (no more than), '>' more than (at least))
5. in element, meaning both terms shall occur in the same containing element.

Parameters 2 to 4 are well supported in CQL (and Z39.50). However, there are two distinct problems with respect to proximity: same element and window. Both are recognised by the authors of CQL and addressed in version 2.0 of CQL. They are discussed by Ray Denenberg in his D-Lib article of January 2009, in chapter 4.2 Proximity.

same element

example: Find the name "adam smith" and date "1965" in the same author element.

  bib.name ="adam smith" PROX/element=bib.author dc.date =1965

window

example:

 Find "cat", "hat", and "rat" within a 10-word window.
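
What a window query has to compute can be shown on a tokenised text: all terms must occur within some run of at most N consecutive words. This is a naive sketch for illustration only (exact token matching, no index); `within_window` is a hypothetical name and says nothing about how CQL 2.0 expresses the query.

```python
def within_window(tokens, terms, size):
    """Return True if all `terms` occur within some `size`-word window."""
    wanted = set(terms)
    for start in range(len(tokens)):
        # Set of tokens in the window starting at position `start`.
        if wanted <= set(tokens[start:start + size]):
            return True
    return False


tokens = "the cat in the hat chased a rat".split()
print(within_window(tokens, ["cat", "hat", "rat"], 10))
print(within_window(tokens, ["cat", "hat", "rat"], 5))
```

Unlike pairwise proximity, the window condition cannot in general be expressed as a conjunction of two-term prox clauses, which is why it is listed as a separate problem.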

Faceted search

SRU/CQL 2.0 is to support conveying faceted results in a response.

Result format

In SRU 1.2 the format of the response is fixed. In SRU 2.0 the response may be supplied in an alternative schema, but the server still has to support the SRU response schema.

It has yet to be evaluated to what extent this accommodates our needs.

Last modified on 01/22/11 17:43:23

Attachments (1)

EDC_combinedquery.png (18.4 KB) - added by vronk 14 years ago.
