Version 13 (modified by oschonef, 10 years ago)

FCS Specification Scrapbook

Issues with current document

  1. Incomprehensible and not well structured :(
  2. Resource enumeration (aka scan on fcs.resource) rather complex and unintuitive
  3. Basic KWIC records have no provision for multiple "highlight" hits
  4. No clear recommendation for using Resource and ResourceFragment

General ideas / design goals towards better specification

  1. Define FCS conformance levels independently of what SRU/CQL does. Don't call them "levels", but maybe something like "profiles" to avoid confusion.
    1. Do a basic profile first
    2. Do an advanced/extended profile later in a separate specification or specification amendment (which must, of course, be compatible with the basic profile)
    3. Add provisions, e.g. to the explain output, that allow endpoints to indicate the profile they support
  2. Better structure of document (and don't include aggregation stuff; that's a different specification; implementors of endpoints should not need to worry about aggregator implementation)
  3. Keep XML sanity always in mind (so there are no namespace issues as in CMDI)
  4. Honor and use extension hooks provided by SRU/CQL

Proposal for new specification

The following is a proposal for a revised federated content search specification. When done, cut and paste to the appropriate section of the Wiki and publish on the CLARIN web page.

CLARIN Federated Content Search (CLARIN-FCS)

Introduction

The main goal of CLARIN federated content search (CLARIN-FCS) is to introduce an interface specification that decouples the search engine functionality from its exploitation, i.e. user interfaces and third-party applications, and allows services to access search engines in a uniform way.

Terminology

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC2119.

Glossary

Aggregator
A module or service to dispatch queries to repositories and collect results.
CLARIN-FCS, FCS
CLARIN federated content search, an interface specification to allow searching within resource content of repositories.
Client
A software component that implements the interface specification to query endpoints, i.e. an aggregator or a user interface.
CQL
Contextual Query Language, previously known as Common Query Language, is a formal language for representing queries to information retrieval systems such as search engines, bibliographic catalogs and museum collection information.
Endpoint
A software component that implements the CLARIN-FCS interface specification and translates between CLARIN-FCS and a search engine.
Interface Specification
Common harmonized interface and suite of protocols that repositories need to implement.
Search Engine
A software component within a repository that allows for searching within the repository contents.
SRU
Search/Retrieve via URL, a protocol for Internet search queries.
Data View
A data view is a mechanism to support different representations of search results, e.g. a Keyword-In-Context view, an image or a KML-encoded geolocation.
PID
A persistent identifier is a long-lasting reference to a digital object.
Repository
A software component at a CLARIN center that stores resources (= data) and information about these resources (= metadata).
Repository Registry
A separate service that allows registering endpoints and provides information about these to other components, e.g. an aggregator. The CLARIN Center Registry is an implementation of such a repository registry.

Normative References

RFC2119
Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119, March 1997,
http://www.ietf.org/rfc/rfc2119.txt
OASIS-SRU-Overview
searchRetrieve: Part 0. Overview Version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.doc (HTML) (PDF)
OASIS-SRU-APD
searchRetrieve: Part 1. Abstract Protocol Definition Version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.doc (HTML) (PDF)
OASIS-SRU12
searchRetrieve: Part 2. SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.doc (HTML) (PDF)
OASIS-CQL
searchRetrieve: Part 5. CQL: The Contextual Query Language version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.doc (HTML) (PDF)
SRU-Explain
searchRetrieve: Part 7. SRU Explain Operation version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.doc (HTML) (PDF)
SRU-Scan
searchRetrieve: Part 6. SRU Scan Operation version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.doc (HTML) (PDF)
LOC-SRU12
SRU VERSION 1.2: SRU Search/Retrieve Operation, Library of Congress,
http://www.loc.gov/standards/sru/sru-1-2.html

CLARIN-FCS Interface Specification

The CLARIN-FCS interface specification defines two profiles, an extensible result format and a set of required operations. CLARIN-FCS is built on the SRU/CQL standard, and additional functionality required for CLARIN-FCS is added through SRU/CQL's extension mechanisms.

The following sections describe the profiles, how SRU/CQL is used in the context of CLARIN-FCS and the CLARIN-FCS specific extensions to SRU.

Profiles

CLARIN-FCS supports two profiles:

Basic profile
Endpoints MUST support term-only queries.
Endpoints SHOULD support queries that combine terms with the Boolean operators AND and OR, including sub-queries. Endpoints MAY support the NOT or PROX operators. If an endpoint does not support such a query, it MUST report this using the appropriate SRU diagnostic.
Examples of valid CQL queries:
cat
"cat"
cat AND dog
"grumpy cat"
"grumpy cat" AND dog
"grumpy cat" OR "lazy dog"
cat AND (mouse OR "lazy dog")
The endpoint MUST perform the query on the annotation tier that makes the most sense for the user, e.g. the textual content for a text corpus resource or the orthographic transcription for a spoken language corpus. Endpoints are RECOMMENDED to perform the query case-sensitively.
Endpoints MUST NOT silently accept queries that include CQL features other than term-only queries and terms combined with Boolean operators, e.g. queries involving context sets.
Extended profile
This profile will support more sophisticated queries such as selecting annotation tiers, expanding of tags, or mapping of data categories.
NOTE: the extended profile is not yet defined and will be part of a future CLARIN-FCS specification.

Endpoints and Clients MUST support the basic profile. Endpoints and Clients MUST NOT claim to support the extended profile.
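
The basic profile restrictions can be illustrated with a small checker. The following is a minimal sketch, not part of the specification: the function names and the tokenizer are hypothetical, it treats any relation character (=, <, >, /) as an unsupported feature, and it rejects NOT and PROX even though endpoints MAY choose to support them.

```python
import re

# Tokens: quoted strings, parentheses, relation characters, or bare words.
_TOKEN = re.compile(r'\s*("(?:[^"\\]|\\.)*"|[()]|[=<>/]+|[^\s()"=<>/]+)')
_BOOLEANS = {"AND", "OR"}
_REJECTED = {"NOT", "PROX"}  # optional in the basic profile; rejected in this sketch


class UnsupportedQuery(ValueError):
    """Raised when a query uses features outside the basic profile."""


def tokenize(query):
    tokens, pos = [], 0
    query = query.strip()
    while pos < len(query):
        match = _TOKEN.match(query, pos)
        if match is None:  # e.g. an unterminated quoted string
            raise UnsupportedQuery(f"cannot tokenize at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def check_basic_profile(query):
    """Raise UnsupportedQuery unless the query is terms joined by AND/OR."""
    tokens = tokenize(query)
    pos = 0

    def parse_clause():
        nonlocal pos
        parse_operand()
        while pos < len(tokens) and tokens[pos].upper() in _BOOLEANS:
            pos += 1
            parse_operand()

    def parse_operand():
        nonlocal pos
        if pos >= len(tokens):
            raise UnsupportedQuery("unexpected end of query")
        token = tokens[pos]
        if token == "(":
            pos += 1
            parse_clause()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise UnsupportedQuery("missing closing parenthesis")
            pos += 1
        elif (token == ")" or token[0] in "=<>/"
              or token.upper() in _BOOLEANS | _REJECTED):
            raise UnsupportedQuery(f"unexpected token {token!r}")
        else:
            pos += 1  # a bare or quoted search term

    parse_clause()
    if pos != len(tokens):
        raise UnsupportedQuery(f"unsupported feature near {tokens[pos]!r}")


def is_basic_profile_query(query):
    """Return True if the query stays within the basic profile."""
    try:
        check_basic_profile(query)
    except UnsupportedQuery:
        return False
    return True
```

A real endpoint would typically use a full CQL parser and translate the failure into an SRU diagnostic rather than a boolean; the sketch only shows where the line between accepted and rejected queries falls.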

SRU/CQL

SRU (Search/Retrieve via URL) specifies a general communication protocol for searching and retrieving records, and CQL (Contextual Query Language) specifies an extensible query language. CLARIN-FCS is built on SRU 1.2. A subsequent specification may be built on SRU 2.0.

Endpoints and Clients MUST implement the SRU/CQL protocol suite as defined in OASIS-SRU-Overview, OASIS-SRU-APD, OASIS-CQL, SRU-Explain, SRU-Scan, especially with respect to:

  • Data Model,
  • Query Model,
  • Processing Model,
  • Result Set Model, and
  • Diagnostics Model

Endpoints and Clients MUST implement the APD Binding for SRU 1.2, as defined in OASIS-SRU12. Endpoints and Clients MAY implement the APD Binding for version 1.1 or version 2.0.

Endpoints and Clients MUST use the following namespace URIs for serializing responses:

  • http://www.loc.gov/zing/srw/ for SRU response documents, and
  • http://www.loc.gov/zing/srw/diagnostic/ for diagnostics within SRU response documents.
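
As an illustration of how a client might use these two namespace URIs, the following sketch reads the record count and any diagnostics from a response. The sample response is hand-written for this example and the function name is hypothetical; only the namespace URIs come from the text above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as required above; the prefixes are arbitrary.
NS = {
    "sru": "http://www.loc.gov/zing/srw/",
    "diag": "http://www.loc.gov/zing/srw/diagnostic/",
}

# Hypothetical, hand-written response used only for illustration.
SAMPLE_RESPONSE = """\
<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:numberOfRecords>0</sru:numberOfRecords>
  <sru:diagnostics>
    <diag:diagnostic xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/">
      <diag:uri>info:srw/diagnostic/1/48</diag:uri>
      <diag:details>PROX</diag:details>
      <diag:message>Query feature unsupported</diag:message>
    </diag:diagnostic>
  </sru:diagnostics>
</sru:searchRetrieveResponse>"""


def summarize_response(response_xml):
    """Return (number of records, list of (diagnostic uri, message))."""
    root = ET.fromstring(response_xml)
    count = root.findtext("sru:numberOfRecords", default="0", namespaces=NS)
    diagnostics = [
        (d.findtext("diag:uri", namespaces=NS),
         d.findtext("diag:message", namespaces=NS))
        for d in root.iterfind("sru:diagnostics/diag:diagnostic", NS)
    ]
    return int(count), diagnostics


print(summarize_response(SAMPLE_RESPONSE))
```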

CLARIN-FCS deviates from the OASIS specifications OASIS-SRU-Overview and OASIS-SRU12 to ensure backwards compatibility with SRU 1.2 services as they were defined by LOC-SRU12.

Endpoints and Clients MUST support CQL conformance Level 2 (as defined in OASIS-CQL, section 6), i.e. be able to parse (Endpoints) or serialize (Clients) all of CQL and respond with appropriate error messages to the search/retrieve protocol interface.

NOTE: this does not imply that Endpoints are required to support all of CQL, but rather that they are able to parse all of CQL and generate the appropriate error message if a query includes a feature they do not support.
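
On the endpoint side, "generate the appropriate error message" amounts to emitting an SRU diagnostic. The sketch below builds one such element; the function name is hypothetical, and the diagnostic number 48 ("Query feature unsupported") is taken from the standard SRU diagnostics list rather than from this document.

```python
import xml.etree.ElementTree as ET

DIAG_NS = "http://www.loc.gov/zing/srw/diagnostic/"
ET.register_namespace("diag", DIAG_NS)


def make_diagnostic(code, message, details=None):
    """Build a <diag:diagnostic> element (hypothetical helper)."""
    diagnostic = ET.Element(f"{{{DIAG_NS}}}diagnostic")
    uri = ET.SubElement(diagnostic, f"{{{DIAG_NS}}}uri")
    uri.text = f"info:srw/diagnostic/1/{code}"
    if details is not None:
        # Optional: the offending feature, e.g. the operator name.
        ET.SubElement(diagnostic, f"{{{DIAG_NS}}}details").text = details
    ET.SubElement(diagnostic, f"{{{DIAG_NS}}}message").text = message
    return diagnostic


# Example: the query was parsed, but it uses an operator the endpoint lacks.
unsupported = make_diagnostic(48, "Query feature unsupported", "PROX")
print(ET.tostring(unsupported, encoding="unicode"))
```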

Result Format

Resources and ResourceFragments

WIP: will be reformulated

In CLARIN-FCS, each <sru:record> element represents one hit within a resource, which is encoded as a <fcs:Resource> element. Each resource shall be identified by a persistent identifier (or, less preferably, an endpoint-unique URI). The correct resource to return here is the most precise unit of data that is directly addressable as a "whole". The hit may contain a resource fragment, which is encoded as a <fcs:ResourceFragment> element. The resource fragment shall be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using resource fragments is optional, but encouraged.

The actual hit within a resource is encoded in a data view format and serialized as a <fcs:DataView> element inside either the <fcs:Resource> or the <fcs:ResourceFragment> element. Each hit may be serialized as multiple data views; however, the keyword-in-context (KWIC) data view is mandatory and is placed within the resource fragment (if applicable) or otherwise within the resource (if there is no reasonable resource fragment). Other data views should be put in a place that is logical for their content (as determined by the endpoint). For example, a metadata data view would most likely be put directly under a resource, whereas a data view representing some annotation layers directly around the hit more likely belongs within the resource fragment.

Each entity (i.e. <fcs:Resource>, <fcs:ResourceFragment> or <fcs:DataView> element) contains a ref attribute, which points as precisely as possible to the original data represented by the resource, resource fragment, or data view. It should always be possible to link directly to the resource itself. In the worst case this will be a web page describing a corpus or collection (including instructions on how to obtain it). In the best case it links directly to a specific file or part of a resource in which the hit was obtained. The latter is not always possible and, when possible, is often constrained by licensing issues. Endpoints should provide links that are as specific as possible and logical.

For CLARIN-FCS, a custom record schema has been defined. The record schema identifier for this schema is http://clarin.eu/fcs/1.0 and the appropriate XML Schema can be found at Resource.xsd (download).
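
The nesting described in this section can be sketched as follows. This is an illustration under assumptions, not a normative serialization: the XML namespace is assumed to equal the record schema identifier, the pid attribute name is assumed (the text only names the ref attribute), and the helper name and all identifiers in the usage lines are made up.

```python
import xml.etree.ElementTree as ET

FCS_NS = "http://clarin.eu/fcs/1.0"  # assumption: same as the record schema identifier
KWIC_TYPE = "application/x-clarin-fcs-kwic+xml"
ET.register_namespace("fcs", FCS_NS)


def build_hit(pid, ref, fragment_ref=None):
    """Build a Resource element holding one (empty) KWIC data view.

    The data view goes inside the ResourceFragment if one is given,
    otherwise directly inside the Resource.
    """
    # "pid" as an attribute name is an assumption made for this sketch.
    resource = ET.Element(f"{{{FCS_NS}}}Resource", {"pid": pid, "ref": ref})
    parent = resource
    if fragment_ref is not None:
        parent = ET.SubElement(
            resource, f"{{{FCS_NS}}}ResourceFragment", {"ref": fragment_ref}
        )
    ET.SubElement(parent, f"{{{FCS_NS}}}DataView", {"type": KWIC_TYPE})
    return resource


# Hypothetical identifiers, for illustration only.
hit = build_hit(
    "hdl:0000/example",
    "https://example.org/corpus/doc1",
    fragment_ref="https://example.org/corpus/doc1#s42",
)
print(ET.tostring(hit, encoding="unicode"))
```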

Data Views

WIP: will be reformulated

The data views are designed to allow for different representations of search results within CLARIN-FCS. They are deliberately kept open to allow further extensions in the future with more supported data view formats.

The type of each data view is identified by the type attribute of the <fcs:DataView> element. The value is defined to be a MIME type. If no existing MIME type can be used, implementors are encouraged to define a proper private MIME type. The following formats are currently being considered:

Keyword-In-Context (KWIC)
Description: a keyword-in-context view, where each hit should be presented within the context of a complete sentence (if possible) or any other reasonable unit of context (e.g. if sentences cannot be determined by the endpoint). The keyword-in-context data view is mandatory for all endpoints. The appropriate XML schema can be found at Resource-KWIC.xsd (download).
Type: application/x-clarin-fcs-kwic+xml
Example of a keyword-in-context data view (formatted for readability):
<fcs:DataView type="application/x-clarin-fcs-kwic+xml">
  <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
    <kwic:c type="left">Some text with the </kwic:c>
    <kwic:kw>keyword</kwic:kw>
    <kwic:c type="right">highlighted.</kwic:c>
  </kwic:kwic>
</fcs:DataView>
CMDI metadata
Description: a CMDI metadata record applicable to the specific context, i.e., if contained inside a resource element it should apply to the entire resource, and if contained inside a resource fragment it should apply specifically to the resource fragment (i.e., be more specialized than the metadata for the encompassing resource). The minimal XML schema for CMDI records can be found at minimal-cmdi.xsd (download). For further information refer to the Component Metadata pages.
Type: application/x-cmdi+xml
Geolocation (KML)
Description: A geographic location encoded in the Keyhole Markup Language. Please refer to the KML standard for further information and the current KML XML schema.
Type: application/vnd.google-earth.kml+xml
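
A client consuming the keyword-in-context data view shown above might extract the left context, keyword and right context as in the following sketch. The function name is hypothetical, and the sample is the payload of the example above; the sketch handles a single keyword only.

```python
import xml.etree.ElementTree as ET

KWIC_NS = {"kwic": "http://clarin.eu/fcs/1.0/kwic"}

# Payload of the keyword-in-context example above.
SAMPLE_KWIC = """\
<kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
  <kwic:c type="left">Some text with the </kwic:c>
  <kwic:kw>keyword</kwic:kw>
  <kwic:c type="right">highlighted.</kwic:c>
</kwic:kwic>"""


def parse_kwic(kwic_xml):
    """Return (left context, keyword, right context) as strings."""
    root = ET.fromstring(kwic_xml)

    def context(side):
        node = root.find(f"kwic:c[@type='{side}']", KWIC_NS)
        if node is None or node.text is None:
            return ""  # a context side may be absent or empty
        return node.text

    keyword = root.findtext("kwic:kw", default="", namespaces=KWIC_NS)
    return context("left"), keyword, context("right")


left, keyword, right = parse_kwic(SAMPLE_KWIC)
print(f"{left}[{keyword}] {right}")
```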

CLARIN-FCS to SRU/CQL binding

Yada yada yada ...

Endpoint Identification

Is mapped to SRU explain operation. Yada yada ...

Performing Queries and returning results

Is mapped to SRU SearchRetrieve operation. Yada yada ...
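
Until this binding is written down, the two mappings above can be pictured as plain SRU 1.2 requests. The sketch below only assembles request URLs from standard SRU parameters (operation, version, query, maximumRecords); the helper name and the base URL https://example.org/fcs are placeholders, and nothing here describes CLARIN-FCS specific extensions.

```python
from urllib.parse import urlencode


def build_sru_url(base_url, operation, **params):
    """Assemble an SRU 1.2 request URL (hypothetical helper)."""
    query = {"operation": operation, "version": "1.2"}
    query.update(params)
    return f"{base_url}?{urlencode(query)}"


# Endpoint identification: an explain request.
explain_url = build_sru_url("https://example.org/fcs", "explain")

# Performing a query: a searchRetrieve request with a CQL query.
search_url = build_sru_url(
    "https://example.org/fcs",
    "searchRetrieve",
    query='"grumpy cat" AND dog',
    maximumRecords=10,
)

print(explain_url)
print(search_url)
```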