Version 5 (modified by Oliver Schonefeld, 9 years ago)

NOTE: This page is work-in-progress. Final draft is scheduled to be delivered by 2015-10-31.

CLARIN Federated Content Search (CLARIN-FCS) - Core 2.0

The goal of the CLARIN Federated Content Search (CLARIN-FCS) - Core specification is to introduce an interface specification that decouples the search engine functionality from its exploitation, i.e. by user interfaces and third-party applications, and to allow services to access heterogeneous search engines in a uniform way.

Introduction

All following sub-sections to be updated as required.

Terminology

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC2119.

Glossary

Aggregator
A module or service to dispatch queries to repositories and collect results.
CLARIN-FCS, FCS
CLARIN federated content search, an interface specification to allow searching within resource content of repositories.
Client
A software component that implements the interface specification to query Endpoints, e.g. an aggregator or a user interface.
CQL
Contextual Query Language, previously known as Common Query Language, is a domain-specific language for representing queries to information retrieval systems such as search engines, bibliographic catalogs and museum collection information.
Data View
A Data View is a mechanism to support different representations of search results, e.g. a "hits with highlights" view, an image or a geolocation.
Data View Payload, Payload
The actual content encoded within a Data View, e.g. a CMDI metadata record or a KML encoded geolocation.
Endpoint
A software component that implements the CLARIN-FCS interface specification and translates between CLARIN-FCS and a search engine.
FCS-QL
Federated Content Search Query Language is the query language used in the advanced CLARIN-FCS profile. It is derived from the CQP query language of the Corpus Workbench (see CQP-TUTORIAL).
Hit
A piece of data returned by a Search Engine that matches the search criterion. What is considered a Hit depends heavily on the Search Engine.
Interface Specification
Common harmonized interface and suite of protocols that repositories need to implement.
PID
A persistent identifier is a long-lasting reference to a digital object.
Repository
A software component at a CLARIN center that stores resources (= data) and information about these resources (= metadata).
Repository Registry
A separate service that allows registering Repositories and their Endpoints and provides information about these to other components, e.g. an Aggregator. The CLARIN Center Registry is an implementation of such a repository registry.
Resource
A searchable and addressable entity at an Endpoint, such as a text corpus or a multi-modal corpus.
Resource Fragment
A smaller unit in a Resource, e.g. a sentence in a text corpus or a time interval in an audio transcription.
Result Set
An (ordered) set of hits that match a search criterion produced by a search engine as the result of processing a query.
Search Engine
A software component within a repository that allows for searching within the repository contents.
SRU
Search and Retrieve via URL is a protocol for Internet search queries. It was originally introduced by the Library of Congress (LOC-SRU12); the standardization process later moved to OASIS (OASIS-SRU12).

Normative References

RFC2119
Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119, March 1997,
http://www.ietf.org/rfc/rfc2119.txt
XML-Namespaces
Namespaces in XML 1.0 (Third Edition), W3C, 8 December 2009,
http://www.w3.org/TR/2009/REC-xml-names-20091208/
OASIS-SRU-Overview
searchRetrieve: Part 0. Overview Version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.doc (HTML), (PDF)
OASIS-SRU-APD
searchRetrieve: Part 1. Abstract Protocol Definition Version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.doc (HTML) (PDF)
OASIS-SRU12
searchRetrieve: Part 2. SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.doc (HTML) (PDF)
OASIS-CQL
searchRetrieve: Part 5. CQL: The Contextual Query Language version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.doc (HTML) (PDF)
SRU-Explain
searchRetrieve: Part 7. SRU Explain Operation version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.doc (HTML) (PDF)
SRU-Scan
searchRetrieve: Part 6. SRU Scan Operation version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.doc (HTML) (PDF)
LOC-SRU12
SRU Version 1.2: SRU Search/Retrieve Operation, Library of Congress,
http://www.loc.gov/standards/sru/sru-1-2.html
LOC-DIAG
SRU Version 1.2: SRU Diagnostics List, Library of Congress,
http://www.loc.gov/standards/sru/diagnostics/diagnosticsList.html
CLARIN-FCS-DataViews
CLARIN Federated Content Search (CLARIN-FCS) - Data Views, SCCTC FCS Task-Force, April 2014,
https://trac.clarin.eu/wiki/FCS/Dataviews

Non-Normative References

CQP-TUTORIAL
Evert et al.: The IMS Open Corpus Workbench (CWB) CQP Query Language Tutorial, CWB Version 3.0, February 2010,
http://cwb.sourceforge.net/files/CQP_Tutorial/
RFC6838
Media Type Specifications and Registration Procedures, IETF RFC 6838, January 2013,
http://www.ietf.org/rfc/rfc6838.txt
RFC3023
XML Media Types, IETF RFC 3023, January 2001,
http://www.ietf.org/rfc/rfc3023.txt

Typographic and XML Namespace conventions

The following typographic conventions for XML fragments will be used throughout this specification:

  • <prefix:Element>
    An XML element with the Generic Identifier Element that is bound to an XML namespace denoted by the prefix prefix.
  • @attr
    An XML attribute with the name attr
  • string
    The literal string must be used either as element content or attribute value.

Endpoints and Clients MUST adhere to the XML-Namespaces specification. The CLARIN-FCS interface specification generally does not dictate whether XML elements should be serialized in their prefixed or non-prefixed syntax, but Endpoints MUST ensure that the correct XML namespace is used for elements and that XML namespaces are declared correctly. Clients MUST be agnostic regarding the syntax used to serialize XML elements, i.e. whether the prefixed or non-prefixed variant was used, and SHOULD operate solely on expanded names, i.e. pairs of namespace name and local name.
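The requirement that Clients operate on expanded names can be illustrated with a minimal, non-normative sketch. It assumes Python's standard xml.etree.ElementTree parser and uses the SRU namespace name listed in the namespace table of this section; the helper name expanded_name and the two sample documents are made up for illustration only.

```python
import xml.etree.ElementTree as ET

# SRU namespace name as listed in the namespace table of this section.
SRU_NS = "http://www.loc.gov/zing/srw/"

# The same element serialized twice: once prefixed, once via a default namespace.
PREFIXED = (
    '<sru:numberOfRecords xmlns:sru="http://www.loc.gov/zing/srw/">'
    "3</sru:numberOfRecords>"
)
UNPREFIXED = (
    '<numberOfRecords xmlns="http://www.loc.gov/zing/srw/">'
    "3</numberOfRecords>"
)


def expanded_name(element):
    """Return the (namespace name, local name) pair of an element.

    ElementTree stores qualified tags as "{namespace}local", so the prefix
    chosen by the Endpoint is already discarded at this point.
    """
    tag = element.tag
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag


# Both serializations yield the same expanded name.
names = [expanded_name(ET.fromstring(doc)) for doc in (PREFIXED, UNPREFIXED)]
```

A Client written this way never inspects the prefix, so it treats both serializations identically.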

The following XML namespace names and prefixes are used throughout this specification. The column "Recommended Syntax" indicates which syntax variant SHOULD be used by the Endpoint to serialize the XML response.

Prefix | Namespace Name                            | Comment                           | Recommended Syntax
fcs    | http://clarin.eu/fcs/resource             | CLARIN-FCS Resources              | prefixed
ed     | http://clarin.eu/fcs/endpoint-description | CLARIN-FCS Endpoint Description   | prefixed
hits   | http://clarin.eu/fcs/dataview/hits        | CLARIN-FCS Generic Hits Data View | prefixed
adv    | http://clarin.eu/fcs/dataview/advanced    | CLARIN-FCS Advanced Data View     | prefixed
sru    | http://www.loc.gov/zing/srw/              | SRU                               | prefixed
diag   | http://www.loc.gov/zing/srw/diagnostic/   | SRU Diagnostics                   | prefixed
zr     | http://explain.z3950.org/dtd/2.0/         | SRU/ZeeRex Explain                | prefixed
Careful with the SRU namespaces; they probably need to be adjusted for SRU 2.0 (=> OASIS).

CLARIN-FCS Interface Specification

The CLARIN-FCS Interface Specification defines a set of capabilities, an extensible result format and a set of required operations. CLARIN-FCS is built on the SRU/CQL standard and additional functionality required for CLARIN-FCS is added through SRU/CQL's extension mechanisms.

Specifically, the CLARIN-FCS Interface Specification consists of two major parts: a set of formats and a transport protocol. The Endpoint is a software component that acts as a bridge between a Client and a Search Engine and passes the requests sent by the Client to the Search Engine. The Search Engine is a custom software component that allows the search of language resources in a Repository. The Endpoint implements the Transport Protocol and acts as a mediator between the CLARIN-FCS specific formats and the idiosyncrasies of Search Engines of the individual Repositories. The following figure illustrates the overall architecture:

                   +---------+
                   |  Client |
                   +---------+
                       /|\
                        |
                        | SRU / CQL
                        | w/CLARIN-FCS extensions
                        |
                       \|/
 +----------------------------------------------+
 |           |      Endpoint       /|\          |
 |           |                      |           |
 |  -------------------    -------------------  |
 | | translate request |  | translate result  | |
 |  -------------------    -------------------  |
 |           |                      |           |
 |          \|/                     |           |
 +----------------------------------------------+
                       /|\
                        |
                        | Search Engine specific protocols/formats
                        |
                       \|/
          +---------------------------+
          |       Search Engine       |
          +---------------------------+

In general, the workflow in CLARIN-FCS is as follows: a Client submits a query to an Endpoint. The Endpoint translates the query from CQL or FCS-QL to the query dialect used by the Search Engine and submits the translated query to the Search Engine. The Search Engine processes the query and generates a result set, i.e. it compiles a set of hits that match the search criterion. The Endpoint then translates the results from the Search Engine-specific result set format to the CLARIN-FCS result format and sends them to the Client.
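The workflow above can be sketched roughly as follows. This is a non-normative toy, not an implementation of the specification: the class names ToySearchEngine, Endpoint and Hit, their methods, and the sample PIDs are all hypothetical, the "query dialect" of the toy engine is a plain substring match, and the query translation only handles a single (optionally quoted) term rather than real CQL or FCS-QL.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for one entry of the CLARIN-FCS result format."""
    resource_pid: str
    text: str


class ToySearchEngine:
    """Hypothetical Search Engine whose native query is a plain substring."""

    def __init__(self, documents):
        self.documents = documents  # mapping of PID -> text

    def find(self, substring):
        # Engine-specific result format: a list of (pid, text) tuples.
        return [(pid, text) for pid, text in self.documents.items()
                if substring in text]


class Endpoint:
    """Mediates between the Client-facing query and the Search Engine."""

    def __init__(self, engine):
        self.engine = engine

    def translate_request(self, query):
        # Assumption: the query is a single term, optionally double-quoted.
        return query.strip().strip('"')

    def translate_result(self, raw_results):
        return [Hit(resource_pid=pid, text=text) for pid, text in raw_results]

    def search(self, query):
        native_query = self.translate_request(query)
        raw_results = self.engine.find(native_query)
        return self.translate_result(raw_results)


# A Client would reach the Endpoint over SRU; here it is called directly.
engine = ToySearchEngine({"hdl:example/1": "the cat sat",
                          "hdl:example/2": "a dog barked"})
endpoint = Endpoint(engine)
hits = endpoint.search('"cat"')
```

The two translate methods correspond to the "translate request" and "translate result" boxes in the architecture figure; keeping them separate confines the Search Engine-specific code to one place.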

"Discovery Phase"

Capabilities

Add and describe advanced capability.

Endpoint Description

Add stuff required for advanced capability.

"Search Phase"

"FCS-QL"

New Section.
More subsections for this section?

Result Format

Resource and ResourceFragment

Data View

Generic Hits (HITS)
Advanced (ADV)

New section.

"Versioning and Extensions"

"Backwards compatibility statements"

Say something about backwards compatibility with "basic-search".
Clients should also be compatible with FCS 1.0 (= SRU 1.2) and use a heuristic to determine whether an endpoint is still using FCS 1.0.
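This draft does not yet define the heuristic. One possible approach, shown here purely as an assumption, is to inspect the namespace of the root element of an SRU response: the Library of Congress namespace suggests SRU 1.2 (and thus FCS 1.0), while the OASIS sruResponse namespace suggests SRU 2.0. The function name guess_fcs_version and the mapping from SRU version to FCS version are illustrative and not part of the specification.

```python
import xml.etree.ElementTree as ET

# Namespace names of SRU responses (LOC for SRU 1.2, OASIS for SRU 2.0).
SRU12_NS = "http://www.loc.gov/zing/srw/"
SRU20_NS = "http://docs.oasis-open.org/ns/search-ws/sruResponse"


def guess_fcs_version(response_xml):
    """Guess the FCS version from the root element namespace of a response.

    Returns "1.0", "2.0", or None if the namespace is not recognized.
    This is an assumed heuristic, not one mandated by the specification.
    """
    root = ET.fromstring(response_xml)
    namespace = None
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
    if namespace == SRU12_NS:
        return "1.0"
    if namespace == SRU20_NS:
        return "2.0"
    return None


legacy = '<sru:explainResponse xmlns:sru="http://www.loc.gov/zing/srw/"/>'
modern = ('<explainResponse '
          'xmlns="http://docs.oasis-open.org/ns/search-ws/sruResponse"/>')
versions = (guess_fcs_version(legacy), guess_fcs_version(modern))
```

Because the check works on the namespace name rather than the prefix, it is unaffected by how the Endpoint serializes its elements.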

Endpoint Custom Extensions

Talk about extensions in general; this section needs to stay in normative part due to the namespace stuff

CLARIN-FCS to SRU/CQL binding

SRU/CQL

SRU 2.0 requirement

Operation explain

Basically stays the same, but adjust for advanced stuff.

Operation scan

Basically stays the same, but adjust for advanced stuff (if required).

Operation searchRetrieve

Align with newly introduced section "Search Phase"
Define string for SRU query-language parameter ("fcs"? "clarin-fcs"?)

Normative Appendix

List of extra request parameters

Revisit and update as required; don't forget to add the new request parameter ("allow rewriting" => allow endpoint to trade precision in favor of recall).

List of diagnostics

Revisit and update as required; don't forget to add the 4 new diagnostics.

Non-normative Appendix

Syntax variant for Handle system Persistent Identifier URIs

Referring to an Endpoint from a CMDI record

Endpoint custom extensions

Endpoint highlight hints for repositories

All sections to be updated as required / maybe check if something should be deleted.
