FCS - Feature Matrix
This page is intended as a comprehensive list of the individual features that every FCS-compliant search engine must or can implement. However, it is currently out of sync with the current specification, which is discussed in detail in FCS-specification.
SRU/CQL proposes conformance levels, but these are too coarse-grained and fuzzy. We try to be clearer here and go down to the level of individual features.
Obviously the list is not complete. TODO: complete
- explain response
- Provide information about the configuration of the service, mainly about available indices and defaults.
We have to examine whether the explain-record can be the single authoritative source of information about a repository, one that would carry all the configuration information discussed below.
Query
- simple term search (Conformance level 0)
- Accept a simple term query: a word or a phrase.
query=fish
query="system"
query="language acquisition"
query="She said \"Yes\""
It is a search for occurrences of a full word or a phrase. In the example queries above fish and "system" are instances of a word, and "language acquisition" and "She said \"Yes\"" are instances of a phrase.
TODO: decide whether a search in lexica covers only the lemma list or the explanations as well (full-text).
- honor search context (Conformance level 0)
- Allow restricting the search to specific collections/resources specified as a parameter (see SearchContext):
?x-cmd-context={list[Resource-PID]}
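As a minimal sketch of how a client might combine a query with a search context, the Python snippet below assembles a request URL. The endpoint URL, the helper name build_search_url, the example PIDs, and the comma as PID separator are assumptions made for illustration; operation, version and query are the usual SRU request parameters.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_search_url(endpoint, query, context_pids=()):
    """Build a searchRetrieve URL, optionally restricted to some resources.

    Assumption: the PID list in x-cmd-context is comma-separated.
    """
    params = {"operation": "searchRetrieve", "version": "1.2", "query": query}
    if context_pids:
        params["x-cmd-context"] = ",".join(context_pids)
    # urlencode percent-encodes quotes and spaces in the query.
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint and PIDs, for illustration only.
url = build_search_url(
    "https://example.org/fcs",
    '"language acquisition"',
    ["hdl:1234/abc", "hdl:1234/def"],
)
decoded = parse_qs(urlparse(url).query)
print(decoded["query"][0])
print(decoded["x-cmd-context"][0])
```

Decoding the URL again, as done at the end of the sketch, is a cheap way to check that quoted phrases survive the encoding.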
- wildcards (Conformance level 1-2)
- Support wildcards like
wor* *ove *oo* ?ool b?t
There could also be a version encoded with the help of a relation, like:
index startsWith wor
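One way an endpoint could evaluate such patterns against a term list is to translate them into regular expressions. The sketch below assumes the usual CQL reading of the wildcards (* for any sequence of characters including none, ? for exactly one character); the function names and the sample word list are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any sequence of characters, possibly empty
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def match_terms(pattern, terms):
    """Return the terms that match the whole wildcard pattern."""
    rx = wildcard_to_regex(pattern)
    return [t for t in terms if rx.fullmatch(t)]


words = ["word", "world", "love", "move", "wool", "tool", "bat", "bit", "boot"]
for pattern in ["wor*", "*ove", "*oo*", "?ool", "b?t"]:
    print(pattern, match_terms(pattern, words))
```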
- index search (Conformance level 1-2)
- Support searchClause queries:
index relation searchTerm
Examples are
dc.creator = anderson
title adj "wonderful feelings"
bib.dateIssued < 1998
See more about indices in next chapter.
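To illustrate the three-part structure of a search clause, the following sketch splits a clause into index, relation and search term. It is not a CQL parser: it assumes a single clause without boolean operators or relation modifiers, and the names SearchClause and parse_search_clause are hypothetical.

```python
import shlex
from typing import NamedTuple


class SearchClause(NamedTuple):
    index: str
    relation: str
    term: str


def parse_search_clause(text):
    """Split 'index relation searchTerm' into its three parts.

    shlex.split keeps a double-quoted phrase together as one token.
    """
    tokens = shlex.split(text)
    if len(tokens) != 3:
        raise ValueError(f"expected 'index relation searchTerm', got: {text!r}")
    return SearchClause(*tokens)


for clause in ['dc.creator = anderson',
               'title adj "wonderful feelings"',
               'bib.dateIssued < 1998']:
    print(parse_search_clause(clause))
```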
- supported relations single term (Conformance level 1-2)
- What are the allowed relations in a query of the form
index rel single-search-term:
- =: exact match (on token? on annotation?)
- <, >=, <=, ==
- regex (?): regular expression
- supported relations multi-term (Conformance level 1-2)
- For a search clause with multiple terms (like "rat hat cat") SRU proposes:
- any: any of the terms (OR)
- all: all of the terms (AND)
- adj: terms in that order, i.e. a phrase
- within
- encloses
- all/window/#N: terms within a given window (#N) (SRU/CQL 2.0 proposal)
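The difference between any, all and adj can be sketched over a tokenized text as follows. This is an illustration only: it assumes exact token matching on a flat token list, leaves out within, encloses and the window variant, and the function name matches is hypothetical.

```python
def matches(relation, terms, tokens):
    """Check a multi-term relation against a list of tokens."""
    if relation == "any":
        # OR: at least one term occurs somewhere.
        return any(t in tokens for t in terms)
    if relation == "all":
        # AND: every term occurs somewhere, in any order.
        return all(t in tokens for t in terms)
    if relation == "adj":
        # Phrase: the terms occur consecutively and in the given order.
        n = len(terms)
        return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))
    raise ValueError(f"unsupported relation: {relation}")


terms = "rat hat cat".split()
text_a = "the rat wore a hat".split()
text_b = "a cat in a hat chased a rat".split()
text_c = "rat hat cat sat".split()
for name, tokens in [("a", text_a), ("b", text_b), ("c", text_c)]:
    print(name, [matches(rel, terms, tokens) for rel in ("any", "all", "adj")])
```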
TODO: discuss the lists of relations above in more detail: what do we want exactly for indices?
- supported relation modifiers (Conformance level 1-2)
/stem
/relevant
/fuzzy
/respectCase
/isoDate
/oid
- honor VC as search context (Conformance level 1-2)
- Be able to process a virtual collection (VC) as a means of restricting the search context.
- boolean search (Conformance level 1-2)
- AND OR NOT PROX
Examples:
system AND language
system OR language
system AND (language OR acquisition)
system NOT language /* read: AND NOT; it is not a unary operator. */
PROX is a special boolean proximity operator “allowing for the relative locations of the terms to be used in order to determine the resulting set of records”. It is also the only Boolean operator to take (Boolean) modifiers:
PROX /unit = {unit} /distance {comparison_operator} {number} /ordered|unordered
Examples for PROX:
cat prox/unit=word/distance>2/ordered hat
cat prox/unit=paragraph hat
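A minimal sketch of how the distance and ordered/unordered modifiers could be evaluated for unit=word is given below. It assumes a flat list of word tokens, measures distance as the difference in token positions, and uses hypothetical names (prox, COMPARISONS); the default arguments are choices of this sketch, not a statement about the specification.

```python
import operator

COMPARISONS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    ">=": operator.ge, ">": operator.gt,
}


def prox(tokens, left, right, comparison="<=", distance=1, ordered=False):
    """True if some occurrence of `left` and `right` satisfies the distance test."""
    compare = COMPARISONS[comparison]
    left_pos = [i for i, t in enumerate(tokens) if t == left]
    right_pos = [i for i, t in enumerate(tokens) if t == right]
    for i in left_pos:
        for j in right_pos:
            gap = j - i
            # Skip the same position, and reversed pairs when order matters.
            if gap == 0 or (ordered and gap < 0):
                continue
            if compare(abs(gap), distance):
                return True
    return False


tokens = "the cat sat on the hat".split()
# cat prox/unit=word/distance>2/ordered hat
print(prox(tokens, "cat", "hat", ">", 2, ordered=True))
# hat prox/unit=word/distance>2/ordered cat
print(prox(tokens, "hat", "cat", ">", 2, ordered=True))
```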
- sorting (Conformance level 1-2)
A dedicated context set defines the sort clause sortBy, which is used at the end of a CQL query:
"dinosaur" sortBy dc.date/sort.descending dc.title/sort.ascending
- named queries (Conformance level 2)
The server can provide a unique identifier for a result set by means of the header element <resultSetId>. This id can be used in subsequent requests to reference the result set within a query via the index cql.resultSetId. Thus, after receiving two result sets (with the ids a and b), one could request the intersection of the two via:
cql.resultSetId = "a" AND cql.resultSetId = "b"
or continue restricting the result with:
cql.resultSetId = "a" AND dc.title=cat
Along with <resultSetId>, the server may supply <resultSetIdleTime>: a good-faith estimate of how long the result set will remain available and unchanged (both in content and order).
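On the server side, named result sets amount to keeping earlier results addressable by id. The sketch below shows one possible in-memory approach; the class ResultSetStore, its id scheme and the record identifiers are hypothetical, and expiry according to the idle time is left out.

```python
import itertools


class ResultSetStore:
    """Keeps result sets (ordered lists of record ids) under generated ids."""

    def __init__(self):
        self._sets = {}
        self._counter = itertools.count(1)

    def store(self, record_ids):
        """Remember a result set and return its id."""
        result_set_id = f"rs{next(self._counter)}"
        self._sets[result_set_id] = list(record_ids)
        return result_set_id

    def intersect(self, first_id, second_id):
        """Records present in both sets, in the order of the first set."""
        keep = set(self._sets[second_id])
        return [r for r in self._sets[first_id] if r in keep]


store = ResultSetStore()
a = store.store(["rec3", "rec1", "rec7"])
b = store.store(["rec7", "rec3", "rec9"])
# Corresponds to: cql.resultSetId = "a" AND cql.resultSetId = "b"
print(store.intersect(a, b))
```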
Indices
The following operations on indices exist:
- search
- scan (/aggregate)
- output (/group, /sort)
The description of the repository needs to state explicitly which operations are supported on which indices.
In SRU/CQL, indices are defined in context sets. We plan to introduce the following context sets:
- ccs
- content indices like: kwic, pos, lemma
TODO: we need some distinction between a lemma as one annotation layer in the full-text content and as head of a lexicon entry.
- isocat
- supporting index-search on isocat data categories. (Mapping internal indices to isocat.)
- cmd
- content search supporting also MD-filters. ("intensional filter" - see CDMDC)
We should agree on some basic set of MD-indices, at least for output, that "should" be implemented to make life easier for software and users. Hot candidates are:
- title/name (default string representation of the resource)
- language
- resource-type
- creation/publication date
- or simply take inspiration from the VLO facets
Additionally, especially regarding metadata indices (the cmd context set), we have to agree on how to use existing context sets like dublincore.
Search Result
- provide result in ccs:Resource-format
- FederatedSearch/ccsResource.xsd
- provide resource reference
- Resource@pid
- provide DataView@type=X
- In which formats can you provide the results?
- provide DataView@type=kwic
- Can you provide basic keyword in context view?
- provide DataView@type=metadata
- Do you provide information about the Resource (and/or ResourceFragment)? What kind of information (what metadata-fields - describe by @schema parameter?)
- provide link to CMD-metadata
- Fill ccs:DataView@pid/@ref with reference to CMD-record.
- provide CMD-metadata
- Alternatively, resolve and embed the CMD-record:
<ccs:DataView type="metadata" schema="{cmd-profile specific schema}"><cmd:CMD>
Supporting Operations
- scan indices
- Implement the scan operation, allowing clients to query the terms available in individual indices.
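The core of a scan over one index can be sketched as listing the distinct terms in sorted order, with their frequencies, starting at a given position. The sketch assumes the index is available as a flat list of values and that terms sort by plain string comparison; the function name scan, its parameters and the sample values are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def scan(values, start_term, maximum_terms=5):
    """Return up to maximum_terms (term, count) pairs, starting at start_term."""
    counts = Counter(values)
    terms = sorted(counts)
    # First term that sorts at or after the requested start term.
    start = bisect_left(terms, start_term)
    return [(term, counts[term]) for term in terms[start:start + maximum_terms]]


# Hypothetical values of a pos index.
pos_values = ["noun", "verb", "noun", "adj", "det", "noun", "verb"]
print(scan(pos_values, "n", maximum_terms=2))
print(scan(pos_values, ""))
```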