CLARIN/CMDI Query Language
This page discusses and describes the query language that will be used primarily for querying MDService. The aim is for it to also cover content search, allowing combined metadata and content queries, and thus to meet the requirements of EDC.
There are currently two main candidate query languages:
- Lucene
- http://lucene.apache.org/java/3_0_0/queryparsersyntax.html an Apache project: pure Java full-text search engine
- SRU/CQL
- http://www.loc.gov/standards/sru/specs/cql.html a successor of Z39.50 proposed by Library of Congress - see below
Both would most probably need certain extensions to suit our needs. One option is to try to support both. Although this is risky given the limited development resources, it would be more than just "nice to have": it would also send an important message.
At the moment, the rest of this article focuses on CQL. As the discussion continues, we should add analogous information about Lucene, as well as a separate direct feature comparison of the two.
General info about SRU/CQL
- SRU is a protocol suite for search in information retrieval systems
- with a user-friendly query language CQL = Contextual Query Language
- also comes in a web-service flavour: SRW
- proposed by LoC as successor of Z39.50 http://www.loc.gov/standards/sru/index.html
- submitted to OASIS (~ 2007?) for standardization as part of the OASIS Search Web Services TC
- activity seems rather low (mostly one author: Ray Denenberg @loc.gov)
- but nevertheless: version 1.2 Committee Draft and version 2.0 Draft (2009-07-22; latest draft: 2010-01-06)
- seems rather widespread in the English-speaking world (most prominent implementor: OCLC/WorldCat), but is also used by The European Library
- software supporting the protocol: http://www.indexdata.com/zebra
- main operation/verb of the protocol:
searchRetrieve
- allows defining custom Context Sets (~ namespaces) to accommodate custom indices, relations and modifiers
Syntax/Grammar (CQL)
main clause from the grammar of CQL:
searchClause ::= index relation[/modifier] searchTerm
We try to fit into the existing syntax, i.e. we only add indices (and possibly relations and modifiers) in our own context sets:
- cmd
- CLARIN Metadata - proposed CLARIN's own context set in SRU for metadata
- ccs
- CLARIN Content Search - proposed CLARIN's own context set in SRU for content
syntax extension (restriction) for CLARIN indices:
index ::= ['cmd.']cmdIndex | ['ccs.']contentIndex
cmdIndex ::= cmdIndex '.' cmdComponent | cmdComponent
cmdComponent ::= {componentName} | {componentID}
contentIndex ::= wordLevelIndex | annotationIndex
wordLevelIndex ::= 'word' | 'w' | 'pos' | 'p' | 'lemma' | 'l' | 'thesaurus' | 't' | {...}
annotationIndex ::= annotationPath ['.' annotationAttr]
annotationPath ::= annotationPath '.' {annotationElement} | {annotationElement}
syntax extension for modifiers to accommodate binding expressions:
modifier ::= sruModifier | '/' var | {...?}
var ::= 'var=' varName
varName ::= [A-Z]
proposal for a syntax extension for relations:
/* already defined in SRU/CQL: */
relation ::= comparitor [modifierList]
comparitor ::= comparitorSymbol | namedComparitor
/* add: */
namedComparitor ::= 'regexp' | 'pat' | '~' /* regular expression search */
namedComparitor ::= 'oneof' /* search for alternatives */
/* allows using generic terms - Semantic search: */
namedComparitor ::= sru.namedComparitor | 'isa'
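To make the clause structure `index relation[/modifier] searchTerm` more concrete, here is a minimal Python sketch that splits a single search clause into its parts. It is not a full CQL parser: it handles one clause only (no boolean operators, no parentheses), the list of named relations is an illustrative subset plus the relations proposed above, and the names `SearchClause` and `parse_clause` are invented for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SearchClause:
    index: str
    relation: str
    modifiers: list = field(default_factory=list)
    term: str = ""


# Symbolic relations are listed longest-first so that ">=" is not read as ">".
# Named relations must be separated from the index by whitespace.
_CLAUSE = re.compile(
    r"^\s*(?P<index>[\w.:]+)"
    r"(?:\s*(?P<symbol><=|>=|<>|==|=|<|>|~)"
    r"|\s+(?P<named>any|all|exact|regexp|pat|oneof|isa)\b)"
    r"(?P<modifiers>(?:/[^\s/]+)*)"
    r"\s*(?P<term>.+?)\s*$"
)


def parse_clause(text):
    """Split 'index relation[/modifier...] term' into a SearchClause."""
    match = _CLAUSE.match(text)
    if match is None:
        raise ValueError(f"not a search clause: {text!r}")
    term = match["term"]
    # Remove one pair of surrounding double quotes, if present.
    if len(term) >= 2 and term[0] == term[-1] == '"':
        term = term[1:-1]
    modifiers = [m for m in match["modifiers"].split("/") if m]
    relation = match["symbol"] or match["named"]
    return SearchClause(match["index"], relation, modifiers, term)


clause = parse_clause('ccs.word =/ci/var=X "liebe"')
print(clause.index, clause.relation, clause.modifiers, clause.term)
```

A term-only query such as `Liebe` is deliberately rejected by this sketch, since it contains no relation and needs separate handling.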
- sort clause
- CQL 1.2 already provides a sort clause in the form: 'sortby {index list}' at the end of the whole query
term-only query
- by definition resolves to
cql.serverChoice = term
- we could extend this and first try to interpret the term as a cmdIndex, and only if no such index exists continue to treat it as a term.
- But we would still have to decide whether the term should be searched in (all) metadata indices or rather in the content.
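The proposed fallback for term-only queries could look roughly like the following Python sketch. The set of known indices is hypothetical sample data (in practice it would come from the repository), and `resolve_term_only` is a name made up for this illustration.

```python
# Hypothetical set of cmdIndex values known to the repository.
KNOWN_CMD_INDICES = {"Actor", "Actor.Contact.Phone", "Organisation"}


def resolve_term_only(term, known_indices=KNOWN_CMD_INDICES):
    """Interpret a bare term as a cmdIndex if one exists, else as a search term."""
    if term in known_indices:
        return ("index", term)
    # Standard CQL behaviour: a bare term resolves to cql.serverChoice.
    return ("clause", f'cql.serverChoice = "{term}"')


print(resolve_term_only("Actor"))
print(resolve_term_only("Liebe"))
```

The sketch leaves open the question raised above, namely whether the fallback clause should target the metadata indices or the content.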
Metadata Query
MDService will provide two main methods/operations:
.queryModel()
.searchRetrieve()
queryModel()
provides information about the metadata (meta-model?), i.e. which components/elements/values are used in the repository. As its query parameter it only needs to accept/understand cmdIndex.
searchRetrieve()
is the core operation for retrieving the actual MDRecords. In its query parameter it has to accept any query on metadata. It probably should not ignore the content part (ccs.) of the query either: it should check whether the indices used in the query are available in the given collections.
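If MDService exposes searchRetrieve as a plain SRU request over HTTP GET (an assumption, the page does not fix the transport), a request URL could be assembled as in the sketch below. `operation`, `version`, `query` and `maximumRecords` are standard SRU 1.2 request parameters; the base URL and the function name are placeholders.

```python
from urllib.parse import urlencode


def search_retrieve_url(base_url, query, maximum_records=10, version="1.2"):
    """Build an SRU searchRetrieve GET URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": query,  # the CQL query, URL-encoded by urlencode
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, not the real MDService address.
print(search_retrieve_url("http://example.org/mdservice", "dc.date > 1900"))
```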
Examples of cmdIndex
- componentId
-
clarin.eu:2625
- componentName
-
Actor
- componentPath
-
Actor.Contact.Phone
Examples of MD query
- basic
-
dc.title any "open access"
dc.date > 1900 and dc.date < 1910
- boolean operators
-
Organisation any University and (dc.language=de or cmd.Country=Austria) and (dc.title any Liebe or cmd.Author any Trakl)
- alternatives
-
cmd.genre = (opera or novel or fantasy)
/* proposed new relation: */
cmd.genre oneof "opera novel fantasy"
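Since the proposed `oneof` relation is meant as shorthand for a list of alternatives, a server that only understands standard CQL could rewrite it before evaluation. The following Python sketch assumes the alternatives are separated by whitespace, as in the example above; `expand_oneof` is a hypothetical helper.

```python
def expand_oneof(index, term_list):
    """Rewrite 'index oneof "a b c"' as 'index = (a or b or c)'."""
    alternatives = term_list.split()
    if not alternatives:
        raise ValueError("oneof needs at least one alternative")
    return f"{index} = ({' or '.join(alternatives)})"


print(expand_oneof("cmd.genre", "opera novel fantasy"))
```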
If multiple conditions should apply to the same Component/Element:
- bind-variables
-
Actor.gender =/var=X f and Actor.age >/var=X 15
would probably select fewer records than the query:
Actor.gender = f and Actor.age > 15
because both conditions must apply to the same Actor
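The difference between the bound and the unbound query can be illustrated with a small Python sketch over made-up sample records. The record layout and both function names are assumptions for this example only.

```python
# Made-up records: r1 has a 12-year-old female and a 40-year-old male actor.
RECORDS = [
    {"id": "r1", "actors": [{"gender": "f", "age": 12}, {"gender": "m", "age": 40}]},
    {"id": "r2", "actors": [{"gender": "f", "age": 30}]},
]


def match_unbound(record):
    """Actor.gender = f and Actor.age > 15: conditions may hit different Actors."""
    actors = record["actors"]
    return any(a["gender"] == "f" for a in actors) and any(a["age"] > 15 for a in actors)


def match_bound(record):
    """Actor.gender =/var=X f and Actor.age >/var=X 15: same Actor for both."""
    return any(a["gender"] == "f" and a["age"] > 15 for a in record["actors"])


print([r["id"] for r in RECORDS if match_unbound(r)])
print([r["id"] for r in RECORDS if match_bound(r)])
```

Record r1 matches only the unbound query: it contains a female Actor and an Actor older than 15, but they are not the same Actor.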
Content Query
This chapter describes only pure content queries. However, the content search engines will also have to understand the metadata part of a query to some extent. For this, see #CombinedQuery.
Examples
- basic
-
Liebe
"not to be"
pos=NN
w regexp ^C.*N$
- boolean
-
ccs.word = "liebe" and ccs.pos ~ "V.*"
ccs.word = Baum or ccs.word = tree or ccs.word = arbre
ccs.word = (Baum or tree or arbre)
ccs.word oneof "Baum tree arbre"
- multiple conditions for same word
-
ccs.word =/ci "liebe" prox/unit=word/distance=0 ccs.pos="V.*"
ccs.word =/ci "liebe" prox/0 ccs.pos="V.*"
ccs.word =/ci/var=X "liebe" and ccs.pos =/var=X "V.*"
- restrict to special parts of text annotation (in special elements)
-
Heading regexp ^A.*
Heading = A*
Speaker any "I didn't say that"
Speaker.gender =/var=X m and Speaker.gender =/var=Y f and Speaker any/var=X yes and Speaker any/var=Y no
However see notice in #OpenIssues!
- thesaurus
-
ccs.thes = Person prox/s/0 ccs.thes = work
ccs.word isa Person prox/s/0 ccs.word isa work
Combined Query
Although from the user's point of view there is a clear distinction between the MD part and the content part of a query, technically the distinction is less clear. MDService in general returns the MD records of individual items in a collection that match a given metadata query, but the subsequent content search is submitted to the endpoint of the whole collection. It is therefore necessary to pass the metadata query to the collection's search engine as well, so that it can apply it as a filter for the content search, effectively querying an implicit subcorpus.
Conversely, MDService should probably also understand the content part of the query, so that it can check whether the indices used in the query are available in the given collections.
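As a rough illustration of this split, the Python sketch below partitions a list of already separated, and-connected clauses by context set. It assumes that content clauses always carry the explicit `ccs.` prefix and treats everything else as metadata, which is a simplification, since the prefixes are optional in the proposed grammar. Both function names are hypothetical.

```python
def split_combined(clauses):
    """Partition clauses into metadata and content parts by their prefix."""
    parts = {"metadata": [], "content": []}
    for clause in clauses:
        clause = clause.strip()
        key = "content" if clause.startswith("ccs.") else "metadata"
        parts[key].append(clause)
    return parts


def content_endpoint_query(clauses):
    """Query for the collection endpoint: metadata clauses act as subcorpus filter."""
    parts = split_combined(clauses)
    return " and ".join(parts["metadata"] + parts["content"])


clauses = ['ccs.word = "liebe"', "cmd.Country = Austria", "dc.date > 1900"]
print(split_combined(clauses))
print(content_endpoint_query(clauses))
```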
Mapping CLARIN Query Language to other Syntaxes
Metadata - XPath
Metadata queries will primarily have to be converted to XPath for searching in the MDRepository.
CLARIN QL | XPath | CLARIN QL Example | XPath Example |
{cmdComponent} | {cmdComponent} | Actor | Actor |
{cmdPath}.{cmdComponent} | {cmdPath}/{cmdComponent} | Actor.Contact.Phone | Actor/Contact/Phone |
{cmdIndex} {rel} {term} | {cmdIndex}[. {rel} '{term}'] | Actors.Actor.Sex=f | Actors/Actor/Sex[.='f'] |
{cmdIndex} any {term} | {cmdIndex}[contains(., '{term}')] | Organisation.Name any University | Organisation/Name[contains(.,'University')] |
and, or, and not | ?! | Organisation.Name any University and Actor.gender=m | ?! |
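The single-clause rows of the table can be turned into a small translation function. The Python sketch below follows the table literally (including mapping `any` to `contains()`), covers only component names and paths (a component ID such as clarin.eu:2625 contains a dot and would need separate handling), and leaves the boolean operators open, as the table does. All function names are invented for this example.

```python
# CQL comparison symbols and their XPath counterparts.
_XPATH_RELATIONS = {"=": "=", "<>": "!=", "<": "<", ">": ">", "<=": "<=", ">=": ">="}


def index_to_xpath(cmd_index):
    """Actor.Contact.Phone -> Actor/Contact/Phone (optional 'cmd.' prefix dropped)."""
    if cmd_index.startswith("cmd."):
        cmd_index = cmd_index[len("cmd."):]
    return cmd_index.replace(".", "/")


def xpath_literal(value):
    """Quote a term as an XPath 1.0 string literal."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    raise ValueError("term contains both quote characters")


def clause_to_xpath(cmd_index, relation, term):
    """Translate one metadata clause to an XPath expression with a predicate."""
    path = index_to_xpath(cmd_index)
    literal = xpath_literal(term)
    if relation == "any":
        return f"{path}[contains(.,{literal})]"
    if relation in _XPATH_RELATIONS:
        return f"{path}[.{_XPATH_RELATIONS[relation]}{literal}]"
    raise ValueError(f"unsupported relation: {relation}")


print(clause_to_xpath("Actors.Actor.Sex", "=", "f"))
print(clause_to_xpath("Organisation.Name", "any", "University"))
```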
Candidate content search engines / target syntaxes
- ddc
- A powerful open-source corpus indexer / search engine http://www.ddc-concordance.org/
- cqp
- http://bulba.sdsu.edu/cqphelp.html
- manatee
- http://www.textforge.cz/querylanguage
- cosmas?
- http://www.ids-mannheim.de/cosmas2/uebersicht.html
- leipzig-wortschatz?
- http://corpora.uni-leipzig.de/
DDC
description | CLARIN QL | DDC | remarks |
word-level: | |||
any word-form | [ccs].w | ||
just that word-form | [ccs].w={word-form} | @{word-form} $w={word-form} | |
lemma | [ccs].l={lemma} [ccs].lemma={lemma} | $l={lemma} %{lemma} | |
pos-tag | [ccs].pos={pos} [ccs].p={pos} | $p={pos} [{pos}] | |
morphological features | [ccs].morph={list of morph features} | [{list of morph features}] | |
thesaurus | [ccs].thes={superconcept} | { {list of morph features} } | |
multiple criteria | {index1} =/var=X {term1} and {index2} =/var=X {term2} | {index1}={term1} with {index2}={term2} | |
patterns: | |||
word starts with | [ccs].w = {word-start}* [ccs].w = ^{word-start} [ccs].w =^ {word-start} | {word-start}* | |
word ends with | [ccs].w = *{word-end} [ccs].w = {word-end}^ [ccs].w ^= {word-end} | *{word-end} | |
contains | [ccs].w any {word-part} [ccs].w = *{word-part}* | /.*{word-part}.*/ | |
boolean-operators: | |||
and | and, &, && | && | |
or | or, \|, \|\| | \|\| | |
and not | not, ! | ! | ||
distance-operators: | |||
exact sequence, phrase | "{phrase}" | "{phrase}" | |
maximum distance ordered | prox/unit=word/distance < {max-distance} prox/w/<{max-distance} | #{max-distance} | |
exact distance ordered | prox/unit=word/distance = {distance} prox/w/{distance} | ?: NEAR({Q1};{Q2};{distance}) && ! NEAR({Q1};{Q2};{distance - 1}) | no direct support in DDC |
maximum distance unordered | prox/unit=word/bidirectional/distance = {distance} prox/w/bi/>={distance} | NEAR({Q1};{Q2};{max-distance}) | |
window | ? see #OpenIssues | near({Q1};{Q2};{Q3};{max-distance}) | |
term within annotation | ccs.{annotationIndex} any {term} | {term} #within {annotationIndex} | not fully supported in ddc!? |
bibliographic-metadata: | |||
bib-field | {index} any {term} | #has_field[{index},{term}] | |
date-range | dc.date > {date_from} and dc.date < {date_to} | #less_by_date[{date_from}, {date_to}] | |
further query options: | |||
case-sensitive | ? | {corpus option} | |
sort clause | {whole-query} sortBy {index-list} | #greater_by[{bib-field}] #less_by[{bib-field}] | |
restrict to subcorpus | ? | {query} :{defined-subcorpus-list,} |
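A few of the word-level rows of this table can be expressed as a lookup, sketched below in Python. The DDC notation is taken solely from the table above and has not been checked against a DDC installation; the function names are hypothetical, and only the word, lemma and pos indices are covered.

```python
# CLARIN QL word-level index -> DDC index, as listed in the table above.
DDC_INDEX = {"w": "$w", "word": "$w", "l": "$l", "lemma": "$l", "p": "$p", "pos": "$p"}


def to_ddc(index, term):
    """Translate one word-level clause 'index = term' to DDC notation."""
    name = index[len("ccs."):] if index.startswith("ccs.") else index
    if name not in DDC_INDEX:
        raise ValueError(f"no DDC mapping for index: {index}")
    return f"{DDC_INDEX[name]}={term}"


def to_ddc_bound(clauses):
    """Clauses bound to the same word (/var=X) are joined with DDC's 'with'."""
    return " with ".join(to_ddc(index, term) for index, term in clauses)


print(to_ddc("ccs.lemma", "lieben"))
print(to_ddc_bound([("ccs.word", "liebe"), ("ccs.pos", "VVFIN")]))
```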
Open Issues
Proximity
Proximity queries take several parameters, which inflates the combinatorial space. We have:
1. number of regarded terms (usually two, but may be more)
2. unit of distance (word, sentence, paragraph, page)
3. distance (0-n, zero meaning in the same unit)
4. comparitor ('=' exactly, '<' less than (no more than), '>' more than (at least))
5. in element, meaning the terms shall all occur in the same containing element.
Parameters 2 to 4 are well supported in CQL (and Z39.50). However, there are two distinct problems with respect to proximity, same element and window, both of which are recognised by the authors of CQL and addressed in version 2.0 of CQL. They are discussed by Ray Denenberg in his D-Lib article (January 09), in the chapter 4.2 Proximity.
same element
example:
Find the name "adam smith" and date "1965" in the same author element.
window
example:
Find "cat", "hat", and "rat" within a 10-word window.
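What a window query is expected to compute can be stated in a few lines of Python. This sketch works on an already tokenised text and uses exact token matching; it only illustrates the semantics and says nothing about how CQL 2.0 or a particular search engine expresses or implements it. The function name is made up.

```python
def within_window(tokens, terms, size):
    """True if all terms occur together in some run of `size` consecutive tokens."""
    wanted = set(terms)
    for start in range(len(tokens)):
        # Slide a window of `size` tokens over the text.
        if wanted <= set(tokens[start:start + size]):
            return True
    return False


tokens = "the cat in the hat chased a rat today".split()
print(within_window(tokens, ["cat", "hat", "rat"], 10))
print(within_window(tokens, ["cat", "hat", "rat"], 6))
```

In the sample sentence the three terms span seven words, so a 10-word window matches while a 6-word window does not.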
Faceted search
SRU/CQL 2.0 shall support the capability to convey the faceted results in a response.
Result format
In SRU 1.2 the format of the response is fixed. In SRU 2.0 the response may be supplied in an alternative schema, but the server still has to support the SRU response schema.
It remains to be evaluated how far this accommodates our needs.
Attachments (1)
- EDC_combinedquery.png (18.4 KB)