
FCS - Feature Matrix

This is intended to be a comprehensive list of the individual features that every FCS-compliant search engine must or can implement. However, it is currently out of sync with the specification, which is discussed in detail in FCS-specification.

SRU/CQL proposes conformance levels, but these are too coarse-grained and fuzzy. We try to be clearer here and go down to the level of individual features.

Obviously the list is not complete. TODO: complete

explain response
Provide information about the configuration of the service, mainly about available indices and defaults.

We have to examine whether the explain record can be the single authoritative source of information about a repository, carrying all the configuration information discussed below.

Query

simple term search (Conformance level 0)
accept a simple term query: a word or a phrase.
 query=fish
 query="system"
 query="language acquisition"
 query="She said \"Yes\""

It is a search for occurrences of a full word or a phrase. In the example queries above fish and "system" are instances of a word, and "language acquisition" and "She said \"Yes\"" are instances of a phrase.
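
A minimal sketch of how a client might send such a query. The endpoint URL and the helper names are hypothetical, and version 1.2 is an assumption; only the SRU parameters operation, version and query are standard.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
ENDPOINT = "http://example.org/fcs"


def quote_phrase(phrase):
    """Wrap a phrase in double quotes, backslash-escaping embedded quotes."""
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def search_url(query, endpoint=ENDPOINT):
    """Build a searchRetrieve URL; urlencode percent-encodes the query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",  # assumed SRU version
        "query": query,
    }
    return endpoint + "?" + urlencode(params)


print(search_url("fish"))
print(search_url(quote_phrase("language acquisition")))
print(quote_phrase('She said "Yes"'))
```

Quoting is only needed for phrases and for terms containing special characters; a single word like fish can be sent as is.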

TODO: decide when searching in lexica search only in the lemma-list or in the explanations as well (full-text)

honor search context (Conformance level 0)
Allow restricting the search to specific collections/resources specified as a parameter (see SearchContext)
 ?x-cmd-context={list[Resource-PID]}
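
On the server side, honoring the context could look roughly like the sketch below. The comma as list separator, the PIDs and the in-memory resource table are assumptions made for illustration; the serialization of the list is not fixed here.

```python
def parse_context(value):
    """Split an x-cmd-context value into PIDs (comma-separated: an assumption)."""
    return [pid.strip() for pid in value.split(",") if pid.strip()]


def restrict(resources, context=None):
    """Keep only the resources named in the context.

    `resources` maps a resource PID to its searchable content.
    A missing or empty context means: search everything.
    """
    if not context:
        return dict(resources)
    wanted = set(parse_context(context))
    return {pid: data for pid, data in resources.items() if pid in wanted}


# Hypothetical PIDs and contents.
resources = {
    "hdl:1234/a": ["fish and chips"],
    "hdl:1234/b": ["language acquisition"],
    "hdl:1234/c": ["operating system"],
}
print(sorted(restrict(resources, "hdl:1234/a, hdl:1234/c")))
```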
wildcards (Conformance level 1-2)
support wildcards like
 wor*  *ove   *oo*  ?ool   b?t
There could also be a version encoded with the help of a relation, like: index startsWith wor
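
One way an engine without native wildcard support could handle these patterns is to translate them into regular expressions. This is a sketch, assuming the CQL reading of * (zero or more characters) and ? (exactly one character); backslash-escaped wildcards are not handled, and the function names are made up.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matches(pattern, word):
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


for pattern, word in [("wor*", "word"), ("*ove", "love"), ("?ool", "school")]:
    print(pattern, word, matches(pattern, word))
```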
index search (Conformance level 1-2)
support searchClause-queries:
 index relation searchTerm

Examples are

dc.creator = anderson
title adj "wonderful feelings"
bib.dateIssued < 1998

See more about indices in next chapter.
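
To make the shape of a searchClause concrete, here is a small sketch that splits one clause into its three parts. It is not a CQL parser: it handles a single clause only, and the names SearchClause and parse_clause are invented for this example.

```python
import re
from typing import NamedTuple


class SearchClause(NamedTuple):
    index: str
    relation: str
    term: str


CLAUSE = re.compile(
    r'''^\s*(?P<index>[\w.]+)\s*
        (?P<relation><=|>=|<>|==|=|<|>|\b\w+\b)\s*
        (?P<term>"(?:[^"\\]|\\.)*"|\S+)\s*$''',
    re.VERBOSE,
)


def parse_clause(text):
    """Return a SearchClause, or None if text is not 'index relation term'."""
    m = CLAUSE.match(text)
    if m is None:
        return None
    term = m.group("term")
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        # Drop the surrounding quotes and undo backslash escapes.
        term = re.sub(r"\\(.)", r"\1", term[1:-1])
    return SearchClause(m.group("index"), m.group("relation"), term)


print(parse_clause("dc.creator = anderson"))
print(parse_clause('title adj "wonderful feelings"'))
print(parse_clause("bib.dateIssued < 1998"))
```

A bare term such as fish yields None here, which a caller could treat as a simple term search.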

supported relations single term (Conformance level 1-2)
which relations are allowed in a query of the form index rel single-search-term:
  • = - exact match (on token? on annotation?)
  • <, >, >=, <=, ==
  • regex? - regular expression
supported relations multi-term (Conformance level 1-2)
for a search clause with multiple terms (like "rat hat cat") SRU proposes:
  • any - any of the terms (OR)
  • all - all of the terms (AND)
  • adj - terms in that order - a phrase
  • within
  • encloses
  • all/window/#N - terms within a given window (#N) (SRU/CQL 2.0 proposal)
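
The intended semantics of the first three relations can be sketched over a tokenized text as follows. Matching on exact, case-sensitive tokens is an assumption; within, encloses and the window variant are left out.

```python
def match_relation(relation, terms, tokens):
    """Check a multi-term relation against a list of tokens.

    any: at least one term occurs
    all: every term occurs, in any order
    adj: the terms occur next to each other, in the given order
    """
    if relation == "any":
        return any(t in tokens for t in terms)
    if relation == "all":
        return all(t in tokens for t in terms)
    if relation == "adj":
        n = len(terms)
        return any(tokens[i:i + n] == terms
                   for i in range(len(tokens) - n + 1))
    raise ValueError("unsupported relation: " + relation)


tokens = "the rat sat on the hat".split()
print(match_relation("any", ["rat", "cat"], tokens))
print(match_relation("all", ["rat", "cat"], tokens))
print(match_relation("adj", ["rat", "sat"], tokens))
```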

TODO: discuss the lists of relations above in more detail: what do we want exactly for indices?

supported relation modifiers (Conformance level 1-2)
  • /stem
  • /relevant
  • /fuzzy
  • /respectCase
  • /isoDate
  • /oid

honor VC as search context (Conformance level 1-2)
able to process a virtual collection as a means of restricting the search context.
boolean search (Conformance level 1-2)
 AND OR NOT PROX

Examples:

system AND language
system OR language
system AND (language OR acquisition)
system NOT language /*read: AND NOT; it is not a unary operator. */
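
In terms of result sets, these operators correspond to plain set operations. A sketch with made-up record ids, assuming each term has already been resolved to the set of matching records:

```python
# Hypothetical hits per term: record ids in which the term occurs.
hits = {
    "system": {1, 2, 3},
    "language": {2, 3, 4},
    "acquisition": {4, 5},
}


def combine(operator, left, right):
    """Combine two sets of record ids with a CQL boolean operator."""
    if operator == "AND":
        return left & right
    if operator == "OR":
        return left | right
    if operator == "NOT":      # binary: left AND NOT right
        return left - right
    raise ValueError("unsupported operator: " + operator)


# system AND (language OR acquisition)
inner = combine("OR", hits["language"], hits["acquisition"])
print(sorted(combine("AND", hits["system"], inner)))
# system NOT language
print(sorted(combine("NOT", hits["system"], hits["language"])))
```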

PROX is a special boolean proximity operator “allowing for the relative locations of the terms to be used in order to determine the resulting set of records”. It is also the only Boolean operator to take (Boolean) modifiers:

PROX /unit = {unit}
  /distance {comparison_operator} {number}
  /ordered|unordered

Examples for PROX:

cat prox/unit=word/distance>2/ordered hat
cat prox/unit=paragraph hat
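
A sketch of how PROX with unit=word could be evaluated on a tokenized text. It assumes that the distance is the difference of token positions (adjacent words have distance 1); the default values in the signature are assumptions, and other units such as paragraph are not covered.

```python
import operator

COMPARATORS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    ">": operator.gt, ">=": operator.ge, "<>": operator.ne,
}


def prox(tokens, left, right, comparison="<=", distance=1, ordered=False):
    """True if some occurrence pair of left/right satisfies the distance test."""
    compare = COMPARATORS[comparison]
    left_positions = [i for i, t in enumerate(tokens) if t == left]
    right_positions = [i for i, t in enumerate(tokens) if t == right]
    for i in left_positions:
        for j in right_positions:
            if i == j:
                continue              # same token, not a pair
            if ordered and j < i:
                continue              # right term must follow left term
            if compare(abs(j - i), distance):
                return True
    return False


tokens = "the cat in the big hat".split()
# cat prox/unit=word/distance>2/ordered hat
print(prox(tokens, "cat", "hat", ">", 2, ordered=True))
```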
sorting (Conformance level 1-2)

A dedicated context set defines the sort clause sortBy, which is used at the end of a CQL query:

"dinosaur" sortBy dc.date/sort.descending dc.title/sort.ascending

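A sketch of how the keys of such a sort clause could be applied to records. The records are plain dictionaries keyed by index name, which is an assumption for illustration, and only the sort.ascending / sort.descending modifiers are interpreted.

```python
def parse_sort_keys(clause):
    """Turn 'dc.date/sort.descending dc.title/sort.ascending' into
    a list of (index, descending) pairs."""
    keys = []
    for spec in clause.split():
        index, *modifiers = spec.split("/")
        keys.append((index, "sort.descending" in modifiers))
    return keys


def sort_records(records, clause):
    """Sort by several keys, relying on the stability of list.sort."""
    result = list(records)
    # Apply the least significant key first, the most significant last.
    for index, descending in reversed(parse_sort_keys(clause)):
        result.sort(key=lambda record: record[index], reverse=descending)
    return result


records = [
    {"dc.date": "1990", "dc.title": "B"},
    {"dc.date": "2001", "dc.title": "Z"},
    {"dc.date": "2001", "dc.title": "A"},
]
clause = "dc.date/sort.descending dc.title/sort.ascending"
print([r["dc.title"] for r in sort_records(records, clause)])
```
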
named queries (Conformance level 2)

The server can provide a unique identifier for a result set by means of the header element <resultSetId>. This id can be used in subsequent requests to reference the result set within a query via the index cql.resultSetId. Thus, after receiving two result sets (with the ids a and b), one could request the intersection of the two via:

cql.resultSetId = "a" AND cql.resultSetId = "b"

or continue restricting the result with:

cql.resultSetId = "a" AND dc.title=cat

Along with <resultSetId>, the server may supply <resultSetIdleTime>: the number of seconds for which, by the server's good-faith estimate, the result set will remain available and unchanged (both in content and order).
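
On the server side this implies some bookkeeping of result sets and their expiry. A minimal in-memory sketch; the class name, the default idle time of 300 seconds and the explicit now parameter (used to keep the example deterministic) are all assumptions.

```python
import time


class ResultSetStore:
    """Keeps result sets for a limited idle time (in seconds)."""

    def __init__(self, idle_time=300):
        self.idle_time = idle_time
        self._sets = {}

    def save(self, result_set_id, records, now=None):
        now = time.monotonic() if now is None else now
        self._sets[result_set_id] = (now + self.idle_time, list(records))

    def get(self, result_set_id, now=None):
        """Return the stored records; raise KeyError if unknown or expired."""
        now = time.monotonic() if now is None else now
        expires, records = self._sets[result_set_id]
        if now >= expires:
            del self._sets[result_set_id]
            raise KeyError(result_set_id)
        return records

    def intersect(self, first_id, second_id, now=None):
        """cql.resultSetId = first AND cql.resultSetId = second"""
        second = set(self.get(second_id, now))
        return [r for r in self.get(first_id, now) if r in second]


store = ResultSetStore(idle_time=300)
store.save("a", [1, 2, 3], now=0)
store.save("b", [3, 2, 5], now=0)
print(store.intersect("a", "b", now=10))
```

The intersection keeps the order of the first result set, in line with the expectation that order remains unchanged.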

Indices

The following operations are possible on indices:

  • search
  • scan (/aggregate)
  • output (/group, /sort)

The description of the repository needs to state explicitly which operations are supported on which indices.

In SRU/CQL, indices are defined in context sets. We plan to introduce the following context sets:

ccs
content indices like:
 kwic, pos, lemma
TODO: we need some distinction between a lemma as one annotation layer in the full-text content and as head of a lexicon entry.
isocat
supporting index-search on isocat data categories. (Mapping internal indices to isocat.)
cmd
content search supporting also MD-filters. ("intensional filter" - see CDMDC)

We should agree on a basic set of MD-indices, at least for output, that "should" be implemented to make life easier for software and users. Hot candidates are:

  • title/name (default string representation of the resource)
  • language
  • resource-type
  • creation/publication date
  • or simply take inspiration from the VLO facets

Additionally, especially regarding metadata indices (the cmd context set), we have to agree on how to use existing context sets like dublincore.

Search Result

provide result in ccs:Resource-format
FederatedSearch/ccsResource.xsd
provide resource reference
Resource@pid
provide DataView@type=X
In which formats can you provide the results?
provide DataView@type=kwic
Can you provide a basic keyword-in-context view?
provide DataView@type=metadata
Do you provide information about the Resource (and/or ResourceFragment)? What kind of information (what metadata-fields - describe by @schema parameter?)
provide link to CMD-metadata
Fill ccs:DataView@pid/@ref with reference to CMD-record.
provide CMD-metadata
alternatively resolve and embed the CMD-record
   <ccs:DataView type="metadata" schema="{cmd-profile specific schema}"><cmd:CMD>

supporting operations

scan indices
implement the scan operation, which allows querying the terms available in individual indices
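
What scan returns can be sketched as an ordered list of index terms with their frequencies. The parameter names loosely mirror the SRU scan parameters (scanClause, maximumTerms); the in-memory list of index values and the simple string ordering are assumptions.

```python
from collections import Counter


def scan(index_values, start_term="", maximum_terms=10):
    """List the terms of one index, starting at start_term.

    Returns (term, number of occurrences) pairs in alphabetical order.
    """
    counts = Counter(index_values)
    terms = sorted(t for t in counts if t >= start_term)
    return [(t, counts[t]) for t in terms[:maximum_terms]]


# Hypothetical values of a lemma index.
lemma_index = ["cat", "dog", "cat", "ant", "eel"]
print(scan(lemma_index, start_term="b", maximum_terms=2))
```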
Last modified on 04/22/13 15:04:18