Version 36 (modified by oschonef, 10 years ago)

--

FCS Specification Scrapbook

Issues with current document

  1. Incomprehensible and not well structured :(
  2. Resource enumeration (aka scan on fcs.resource) is rather complex and unintuitive
  3. Basic KWIC records have no provision for multiple "highlight" hits
  4. No clear recommendation for using Resource and ResourceFragment
  5. What about recursiveness in Resource (see current schema)? What is the use case?

General ideas / design goals towards better specification

  1. Define FCS conformance level independent of what SRU/CQL do. Don't call them "level", but maybe something like profile to avoid confusion.
    1. Do a basic profile first
    2. Do an advanced/extended profile later in a separate specification or specification amendment (which must, of course, be compatible with the basic profile)
    3. Add provisions, e.g. to the explain output, to allow endpoints to indicate the profile they support
  2. Better structure of document (and don't include aggregation stuff; that's a different specification; implementors of endpoints should not need to worry about aggregator implementation)
  3. Keep XML sanity always in mind (so there are no namespace issues as in CMDI)
  4. Drop resource enumeration in favor of endpoint resource description
  5. Drop the recursiveness of Resource, content models should be: Resource (DataView*, ResourceFragment*) and ResourceFragment (DataView*)
  6. Drop the KWIC data view in favor of HITS data view; the latter will allow for multiple hit highlights
  7. Honor and use extension hooks provided by SRU/CQL
  8. Non-normative stuff
    1. Endpoint specific extension hooks, e.g. to avoid tag abuse of DataView. Resource.xsd could provide an extension hook, so arbitrary XML could also be embedded.
    2. Clients can put query parameters at @ref to allow hit highlighting on their systems

Proposal for new specification

The following is a proposal for a revisited federated content search specification. When done, cut and paste to the appropriate section of the Wiki and publish on the CLARIN web page.


CLARIN Federated Content Search (CLARIN-FCS)

Introduction

The main goal of CLARIN federated content search (CLARIN-FCS) is to introduce an interface specification that decouples the search engine functionality from its exploitation, i.e. user interfaces and third-party applications, and allows services to access search engines in a uniform way.

Terminology

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC2119.

Glossary

Aggregator
A module or service to dispatch queries to repositories and collect results.
CLARIN-FCS, FCS
CLARIN federated content search, an interface specification to allow searching within resource content of repositories.
Client
A software component that implements the interface specification to query endpoints, e.g. an aggregator or a user interface.
CQL
Contextual Query Language, previously known as Common Query Language, is a formal language for representing queries to information retrieval systems such as search engines, bibliographic catalogs and museum collection information.
Endpoint
A software component that implements the CLARIN-FCS interface specification and translates between CLARIN-FCS and a search engine.
Interface Specification
Common harmonized interface and suite of protocols that repositories need to implement.
Search Engine
A software component within a repository that allows for searching within the repository contents.
SRU
Search and Retrieve via URL, is a protocol for Internet search queries.
Data View
A Data View is a mechanism to support different representations of search results, e.g. a "hits with highlights" view, an image or a geolocation.
Data View Payload, Payload
The actual content encoded within a Data View, e.g. a CMDI metadata record or a KML encoded geolocation.
PID
A Persistent identifier is a long-lasting reference to a digital object.
Repository
A software component at a CLARIN center that stores resources (= data) and information about these resources (= metadata).
Repository Registry
A separate service that allows registering endpoints and provides information about these to other components, e.g. an aggregator. The CLARIN Center Registry is an implementation of such a repository registry.

Normative References

RFC2119
Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119, March 1997,
http://www.ietf.org/rfc/rfc2119.txt
OASIS-SRU-Overview
searchRetrieve: Part 0. Overview Version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.doc (HTML) (PDF)
OASIS-SRU-APD
searchRetrieve: Part 1. Abstract Protocol Definition Version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.doc (HTML) (PDF)
OASIS-SRU12
searchRetrieve: Part 2. SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.doc (HTML) (PDF)
OASIS-CQL
searchRetrieve: Part 5. CQL: The Contextual Query Language version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.doc (HTML) (PDF)
SRU-Explain
searchRetrieve: Part 7. SRU Explain Operation version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.doc (HTML) (PDF)
SRU-Scan
searchRetrieve: Part 6. SRU Scan Operation version 1.0, OASIS, January 2013,
http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.doc (HTML) (PDF)
LOC-SRU12
SRU Version 1.2: SRU Search/Retrieve Operation, Library of Congress,
http://www.loc.gov/standards/sru/sru-1-2.html
LOC-DIAG
SRU Version 1.2: SRU Diagnostics List, Library of Congress,
http://www.loc.gov/standards/sru/diagnostics/diagnosticsList.html

Non-Normative References

RFC6838
Media Type Specifications and Registration Procedures, IETF RFC 6838, January 2013,
http://www.ietf.org/rfc/rfc6838.txt
RFC3023
XML Media Types, IETF RFC 3023, January 2001,
http://www.ietf.org/rfc/rfc3023.txt
KML
Keyhole Markup Language (KML), Open Geospatial Consortium, 2008,
http://www.opengeospatial.org/standards/kml

CLARIN-FCS Interface Specification

The CLARIN-FCS interface specification defines two profiles, an extensible result format and a set of required operations. CLARIN-FCS is built on the SRU/CQL standard, and additional functionality required for CLARIN-FCS is added through SRU/CQL's extension mechanisms.

Generally, the CLARIN-FCS Interface Specification consists of two components, a set of formats and a transport protocol. The Endpoint is a software component that acts as a bridge between the formats, which are sent by a Client using the transport protocol, and a Search Engine. The Search Engine is a custom software component that allows searching in the language resources of a CLARIN center. The Endpoint basically implements the transport protocol and acts as a mediator between the CLARIN-FCS specific formats and the idiosyncrasies of Search Engines. The following figure illustrates the overall architecture.

                 +---------+
                 |  Client |
                 +---------+
                     /|\
                      |
          -------------------------
         |        SRU / CQL        |
         | w/CLARIN-FCS extensions |
          -------------------------
                      |
                     \|/
 +-----------------------------------------+
 |        |      Endpoint     /|\          |
 |        |                    |           |
 |  ---------------    ------------------  |
 | | Translate CQL |  | Translate Result | |
 |  ---------------    ------------------  |
 |        |                    |           |
 |       \|/                   |           |
 +-----------------------------------------+
                     /|\
                      |
                     \|/
        +---------------------------+
        |       Search Engine       |
        +---------------------------+

The following sections describe the CLARIN-FCS profiles and query and result formats, how SRU/CQL is used as a transport protocol in the context of CLARIN-FCS and the required CLARIN-FCS specific extensions to SRU.

Profiles

CLARIN-FCS defines two profiles:

Basic profile
Endpoints MUST support term-only queries.
Endpoints SHOULD support terms combined with boolean operator queries (AND and OR), including subqueries. Endpoints MAY also support NOT or PROX operator queries. If an endpoint does not support a query, i.e. the used operators are not supported by the Endpoint, it MUST return an appropriate error message using the appropriate SRU diagnostic.
Examples for valid CQL queries for the basic profile:
cat
"cat"
cat AND dog
"grumpy cat"
"grumpy cat" AND dog
"grumpy cat" OR "lazy dog"
cat AND (mouse OR "lazy dog")
The Endpoint MUST perform the query on the annotation tier that makes the most sense for the user, e.g. the textual content for a text corpus resource or the orthographic transcription of a spoken language corpus. Endpoints SHOULD perform the query case-sensitive.
Endpoints MUST NOT silently accept queries that include CQL features beyond term-only queries and terms combined with boolean operators, e.g. queries involving context sets.
Extended profile
This profile will support more sophisticated queries such as selecting annotation tiers, expanding of tags, or mapping of data categories.
NOTE: the extended profile is not yet defined and will be part of a future CLARIN-FCS specification.

Endpoints and Clients MUST support the basic profile. For now, Endpoints and Clients MUST NOT claim to support the extended profile.
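
As a non-normative illustration, the restriction to term-only and boolean queries can be approximated with a small check. The following Python sketch is not a CQL parser (Endpoints still need to be able to parse all of CQL, see the SRU/CQL binding below); the function name is_basic_profile_query and its token rules are assumptions made for this example only.

```python
import re

# Hypothetical helper, not defined by this specification: a rough test whether
# a query uses only terms, boolean operators and parentheses.
_TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[()]|[^\s()"]+')
_BOOLEANS = {"AND", "OR", "NOT", "PROX"}


def tokenize(query):
    """Split a query into quoted terms, parentheses and bare words."""
    tokens, pos = [], 0
    while pos < len(query):
        if query[pos].isspace():
            pos += 1
            continue
        match = _TOKEN.match(query, pos)
        if match is None:  # e.g. an unclosed quote
            raise ValueError("cannot tokenize query at offset %d" % pos)
        tokens.append(match.group())
        pos = match.end()
    return tokens


def is_basic_profile_query(query):
    """Return True if the query looks like: clause (BOOLEAN clause)*."""
    try:
        tokens = tokenize(query)
    except ValueError:
        return False
    pos = 0

    def parse_clause():
        nonlocal pos
        if pos >= len(tokens):
            return False
        token = tokens[pos]
        if token == "(":
            pos += 1
            if not parse_query() or pos >= len(tokens) or tokens[pos] != ")":
                return False
            pos += 1
            return True
        if token == ")" or token.upper() in _BOOLEANS:
            return False
        # Bare words carrying relation or modifier characters are not plain terms.
        if not token.startswith('"') and any(c in token for c in "=<>/"):
            return False
        pos += 1
        return True

    def parse_query():
        nonlocal pos
        if not parse_clause():
            return False
        while pos < len(tokens) and tokens[pos].upper() in _BOOLEANS:
            pos += 1
            if not parse_clause():
                return False
        return True

    return parse_query() and pos == len(tokens)
```

With this sketch, all example queries listed for the basic profile are accepted, while a query such as dc.title = cat (index and relation from a context set) is rejected. A real Endpoint would instead report the unsupported feature through an SRU diagnostic.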

Result Format

CLARIN-FCS uses a customized format for returning results. Resources and Resource Fragments serve as containers for hit results, which are presented in one or more Data Views. The following sections describe the result and Data View formats; the section Operation searchRetrieve will describe how hits are embedded within SRU responses.

Resource and ResourceFragment

To encode search results, CLARIN-FCS supports two building blocks:

Resources
A Resource is a searchable entity at an Endpoint, such as a text corpus or a multi-modal corpus. A Resource SHOULD be a self-contained unit, i.e. not a sentence in a text corpus or a time interval in an audio transcription.
Resource Fragments
A Resource Fragment is a smaller unit in a Resource, e.g. a sentence in a text corpus or a time interval in an audio transcription.

A Resource SHOULD be the most precise unit of data that is directly addressable as a "whole". A Resource SHOULD contain a Resource Fragment if the hit consists of just a part of the Resource unit, e.g. if the hit is a sentence within a large text. A Resource Fragment SHOULD be addressable within a Resource, i.e. it has an offset or a resource-internal identifier. Using Resource Fragments is OPTIONAL, but Endpoints are encouraged to use them. If an Endpoint encodes a hit with a Resource Fragment, the actual hit SHOULD be encoded as a Data View within that Resource Fragment.

Endpoints SHOULD always provide a link to the resource itself, i.e. each Resource or Resource Fragment SHOULD be identified by a persistent identifier or an Endpoint-unique URI. Even if direct linking is not possible, e.g. due to licensing issues, Endpoints SHOULD provide a URI that links to a web page describing the corpus or collection, including instructions on how to obtain it. Endpoints SHOULD provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed directly, the Resource Fragment SHOULD NOT contain a persistent identifier or a URI.

If an Endpoint can provide both a persistent identifier and a URI for either a Resource or a Resource Fragment, it SHOULD provide both. When working with results, Clients SHOULD prefer persistent identifiers over regular URIs.

Resource and Resource Fragment are serialized in XML and Endpoints MUST generate responses, that are valid according to the XML schema "Resource.xsd" (download). A Resource is encoded in the form of a <fcs:Resource> element, a Resource Fragment in the form of a <fcs:ResourceFragment> element. The content of a Data View is wrapped in a <fcs:DataView> element. <fcs:Resource> is the top-level element and MAY contain zero or more <fcs:DataView> elements and MAY contain zero or more <fcs:ResourceFragment> elements. A <fcs:ResourceFragment> element MUST contain one or more <fcs:DataView> elements.

The elements <fcs:Resource>, <fcs:ResourceFragment> and <fcs:DataView> MAY carry a @pid and/or a @ref attribute, which allows linking to the original data represented by the Resource, Resource Fragment, or Data View. A @pid attribute MUST contain a valid persistent identifier; a @ref attribute MUST contain a valid URI, i.e. a "plain" URI without the additional semantics of being a persistent reference.

Endpoints MUST use the identifier http://clarin.eu/fcs/1.0 for the responseItemType (= content for the <sru:recordSchema> element) in SRU responses.

Endpoints MAY serialize hits as multiple Data Views, however they MUST provide the Generic Hits (HITS) Data View either encoded as a Resource Fragment (if applicable), or otherwise within the Resource (if there is no reasonable resource fragment). Other Data Views SHOULD be put in a place that is logical for their content (as is to be determined by the Endpoint), e.g. a metadata data view would most likely be put directly below Resource and a Data View representing some annotation layers directly around the hit is more likely to belong within a Resource Fragment.

Example 1:

<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="http://hdl.handle.net/4711/08-15">
  <fcs:DataView type="application/x-clarin-fcs-hits+xml">
      <!-- data view payload omitted -->
  </fcs:DataView>
</fcs:Resource>

This example shows a simple hit, which is encoded in one Data View of type Generic Hits embedded within a Resource. The type of the Data View is identified by the MIME type application/x-clarin-fcs-hits+xml. The Resource is referenceable by the persistent identifier http://hdl.handle.net/4711/08-15.

Example 2:

<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="http://hdl.handle.net/4711/08-15">
  <fcs:ResourceFragment>
    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
      <!-- data view payload omitted -->
    </fcs:DataView>
  </fcs:ResourceFragment>
</fcs:Resource>

This example shows a hit encoded as a Resource Fragment embedded within a Resource. The actual hit is again encoded as one Data View of type Generic Hits. The hit is not directly referenceable, but the Resource in which the hit occurred is referenceable by the persistent identifier http://hdl.handle.net/4711/08-15. In contrast to Example 1, the Endpoint decided to provide a "semantically richer" encoding and embedded the hit using a Resource Fragment within the Resource to indicate that the hit is a part of a larger resource, e.g. a sentence in a text document.

Example 3:

<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0"
              pid="http://hdl.handle.net/4711/08-15" ref="http://repos.example.org/file/text_08_15.html">
  <fcs:DataView type="application/x-cmdi+xml"
                pid="http://hdl.handle.net/4711/08-15-1" ref="http://repos.example.org/file/08_15_1.cmdi">
      <!-- data view payload omitted -->
  </fcs:DataView>
  <fcs:ResourceFragment pid="http://hdl.handle.net/4711/08-15-2" ref="http://repos.example.org/file/text_08_15.html#sentence2">
    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
      <!-- data view payload omitted -->
    </fcs:DataView>
  </fcs:ResourceFragment>
</fcs:Resource>

The most complex example, Example 3, is similar to Example 2, i.e. it shows a hit encoded as one Generic Hits Data View in a Resource Fragment, which is embedded in a Resource. In contrast to Example 2, another Data View of type CMDI is embedded directly within the Resource. An Endpoint can use this type of Data View to directly provide CMDI metadata about the Resource to Clients. All entities of the hit can be referenced by a persistent identifier and a URI. The complete Resource is referenceable by either the persistent identifier http://hdl.handle.net/4711/08-15 or the URI http://repos.example.org/file/text_08_15.html, and the CMDI metadata record in the CMDI Data View is referenceable by either the persistent identifier http://hdl.handle.net/4711/08-15-1 or the URI http://repos.example.org/file/08_15_1.cmdi. The actual hit in the Resource Fragment is also directly referenceable by either the persistent identifier http://hdl.handle.net/4711/08-15-2 or the URI http://repos.example.org/file/text_08_15.html#sentence2.

Endpoints MUST serialize one Resource for each hit, i.e. they MUST NOT combine several hits in one Resource. E.g., if a query matches five different sentences within one text (= the resource), the Endpoint must serialize five Resources (= one per hit) and embed each within one SRU result (see below).
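
To make the content model more tangible, the following non-normative Python sketch shows how a Client might walk a serialized Resource and list its Data Views. It only uses the standard library; the function name summarize_resource and the tuple layout are assumptions of this example, and the checks are a rough approximation of the rules stated above, not a replacement for validating against "Resource.xsd".

```python
import xml.etree.ElementTree as ET

FCS = "{http://clarin.eu/fcs/1.0}"
HITS_TYPE = "application/x-clarin-fcs-hits+xml"

# Example 3 from this section, payloads omitted.
EXAMPLE = """
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0"
              pid="http://hdl.handle.net/4711/08-15">
  <fcs:DataView type="application/x-cmdi+xml"
                pid="http://hdl.handle.net/4711/08-15-1"/>
  <fcs:ResourceFragment pid="http://hdl.handle.net/4711/08-15-2">
    <fcs:DataView type="application/x-clarin-fcs-hits+xml"/>
  </fcs:ResourceFragment>
</fcs:Resource>
"""


def summarize_resource(xml_text):
    """Return (container, MIME type, pid) for every Data View of one Resource."""
    root = ET.fromstring(xml_text)
    if root.tag != FCS + "Resource":
        raise ValueError("top-level element must be fcs:Resource")
    views = []
    # Data Views placed directly below the Resource.
    for view in root.findall(FCS + "DataView"):
        views.append(("Resource", view.get("type"), view.get("pid")))
    # Data Views placed within Resource Fragments.
    for fragment in root.findall(FCS + "ResourceFragment"):
        fragment_views = fragment.findall(FCS + "DataView")
        if not fragment_views:
            raise ValueError("fcs:ResourceFragment must contain a fcs:DataView")
        for view in fragment_views:
            views.append(("ResourceFragment", view.get("type"), view.get("pid")))
    # The Generic Hits Data View is mandatory for every hit.
    if not any(mime == HITS_TYPE for _, mime, _ in views):
        raise ValueError("Generic Hits Data View is missing")
    return views
```

Calling summarize_resource(EXAMPLE) yields one entry for the CMDI Data View below the Resource and one for the Generic Hits Data View inside the Resource Fragment.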

Data View

A Data View serves as a container for representing search results within CLARIN-FCS. Data Views are designed to allow for different representations of results, i.e. they are deliberately kept open to allow further extensions with more supported Data View formats. The content of a Data View is called Payload. Each Payload is typed, and the type of the Payload is recorded in the @type attribute of the <fcs:DataView> element. The Payload type is identified by a MIME type (RFC6838, RFC3023). If no existing MIME type can be used, implementors SHOULD define a proper private MIME type.

The Payload of a Data View can be deposited either inline or by reference. In the inline case, it MUST be serialized as an XML fragment below the <fcs:DataView> element. This is the preferred method for payloads that can easily be serialized in XML. The by-reference case applies if the content cannot easily be deposited inline, e.g. because it is binary content. In this case, the Data View MUST include a @ref or @pid attribute that links to a location from which Clients can download the payload. This location SHOULD be openly accessible, i.e. the data can be downloaded freely without any need to perform a login.
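
The distinction between the two dispositions can be sketched as follows. This is a non-normative Python illustration; the function name payload_disposition is hypothetical, and treating "has child elements" as "inline" is an assumption of this sketch rather than a rule of the specification.

```python
import xml.etree.ElementTree as ET


def payload_disposition(dataview_xml):
    """Classify a serialized fcs:DataView as 'inline' or 'reference'."""
    view = ET.fromstring(dataview_xml)
    if len(view) > 0:
        # The payload is an XML fragment below the fcs:DataView element.
        return "inline"
    if view.get("ref") or view.get("pid"):
        # No inline payload; Clients have to download it from @ref or @pid.
        return "reference"
    raise ValueError("Data View has neither an inline payload nor @ref/@pid")
```

A Client could use such a check to decide whether it has to issue an additional request before it can display a Data View.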

For the basic profile, the Data Views Generic Hits, Component Metadata, Image and Geolocation are defined in this specification. Endpoints MAY define custom Data Views, but Clients conforming to the basic profile MAY choose to ignore them. The Generic Hits Data View is mandatory, thus all Endpoints MUST provide hits represented in the Generic Hits Data View.

NOTE: The examples in the following sections show only the payload with the enclosing <fcs:DataView> element of a Data View. Of course, the Data View must be embedded either in a <fcs:Resource> or a <fcs:ResourceFragment> element. The @pid and @ref attributes have been omitted for all inline payload types.

Generic Hits (HITS)
Description: The representation of the hit
MIME type: application/x-clarin-fcs-hits+xml
Payload Disposition: inline

The Generic Hits Data View contains the serialization of a search result hit. It supports multiple markers for supplying highlighting for the hit. Each hit SHOULD be presented within the context of a complete sentence. If that is not possible due to the nature of the type of the resource, the Endpoint SHALL provide an equivalent reasonable unit of context (e.g. within a phrase of an orthographic transcription of an utterance). All Endpoints MUST provide hits represented in this Data View. The XML fragment of the Generic Hits payload MUST be valid according to the XML schema "DataView-Hits.xsd" (download).

  • Example (single hit marker):
    <!-- potential @pid and @ref attributes omitted -->
    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
      <hits:Result xmlns:hits="http://clarin.eu/fcs/1.0/hits">
        The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy dog.
      </hits:Result>
    </fcs:DataView>
    
  • Example (multiple hit markers):
    <!-- potential @pid and @ref attributes omitted -->
    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
      <hits:Result xmlns:hits="http://clarin.eu/fcs/1.0/hits">
        The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy <hits:Hit>dog</hits:Hit>.
      </hits:Result>
    </fcs:DataView>
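
A Client typically needs the plain text of a hit plus the positions of the highlighted parts. The following non-normative Python sketch shows one way to derive both from a Generic Hits payload; the function name extract_hits is hypothetical, and the offsets refer to the text exactly as serialized (including any whitespace used for indentation).

```python
import xml.etree.ElementTree as ET

HITS = "{http://clarin.eu/fcs/1.0/hits}"


def extract_hits(result_xml):
    """Return (text, spans) where spans are (start, end) offsets of each hit."""
    result = ET.fromstring(result_xml)
    text = result.text or ""
    spans = []
    for child in result:
        child_text = "".join(child.itertext())
        if child.tag == HITS + "Hit":
            # Record where the highlighted part starts and ends.
            spans.append((len(text), len(text) + len(child_text)))
        text += child_text
        # Text following the element belongs to the surrounding context.
        text += child.tail or ""
    return text, spans
```

For the payload of the second example above (serialized without indentation), the returned spans select the substrings "fox" and "dog" of the returned text.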
    
Component Metadata (CMDI)
Description: A CMDI metadata record
MIME type: application/x-cmdi+xml
Payload Disposition: inline or reference

The Component Metadata Data View allows embedding a CMDI metadata record that is applicable to the specific context into the Endpoint response, e.g. metadata about the resource in which the hit was produced. If this CMDI record is applicable to the entire Resource, it SHOULD be put in a <fcs:DataView> element below the <fcs:Resource> element. If it is applicable to the Resource Fragment, i.e. it contains more specialized metadata than the metadata for the encompassing resource, it SHOULD be put in a <fcs:DataView> element below the <fcs:ResourceFragment> element. Endpoints SHOULD provide the payload inline, but Endpoints MAY also use the reference method. If an Endpoint uses the reference method, the CMDI metadata record MUST be downloadable without any restrictions.

  • Example (inline):
    <!-- potential @pid and @ref attributes omitted -->
    <fcs:DataView type="application/x-cmdi+xml">
      <CMD xmlns="http://www.clarin.eu/cmd/" CMDVersion="1.1">
        <!-- content omitted -->
      </CMD>
    </fcs:DataView>
    
  • Example (referenced):
    <!-- potential @pid attribute omitted -->
    <fcs:DataView type="application/x-cmdi+xml" ref="http://repos.example.org/resources/4711/0815.cmdi" />
    
Images (IMG)
Description: An image related to the hit
MIME type: image/png, image/jpeg, image/gif, image/svg+xml
Payload Disposition: reference

The Image Data View allows providing an image that is relevant to the hit, e.g. a facsimile of the source of a transcription. Endpoints MUST provide the payload by the reference method, and the image file SHOULD be downloadable without any restrictions.

  • Example:
    <!-- potential @pid attribute omitted -->
    <fcs:DataView type="image/png" ref="http://repos.example.org/resources/4711/0815.png" />
    
Geolocation (GEO)
Description: A geographic location related to the hit
MIME type: application/vnd.google-earth.kml+xml
Payload Disposition: inline

The Geolocation Data View allows geolocating a hit. It MUST be encoded using the XML representation of the Keyhole Markup Language (KML). The KML fragment MUST comply with the specification as defined by KML.

  • Example:
    <!-- potential @pid and @ref attributes omitted -->
    <fcs:DataView type="application/vnd.google-earth.kml+xml">
      <kml xmlns="http://www.opengis.net/kml/2.2">
        <Placemark>
          <name>IDS Mannheim</name>
          <description>Institut für Deutsche Sprache, R5 6-13, 68161 Mannheim, Germany</description>
          <Point>
            <coordinates>8.4719510,49.4883700,0</coordinates>
          </Point>
        </Placemark>
      </kml>
    </fcs:DataView>
    

Endpoint Description

Endpoints need to provide information about their capabilities to support auto-configuration of Clients. These capabilities include, among other information, the Profile that is supported by an Endpoint. The Endpoint Description mechanism provides the necessary facility to convey this information to Clients. Endpoints MUST encode their capabilities using an XML format and embed this information into the SRU/CQL protocol as described in the section Operation explain. The XML fragment generated by the Endpoint for the Endpoint Description MUST be valid according to the XML schema "Endpoint-Description.xsd" (download).

The XML fragment for Endpoint Description is encoded as an <ed:EndpointDescription> element, that contains the following children:

  • one <ed:Profile> element (REQUIRED)
    The value of the <ed:Profile> element indicates the Profile, that is supported by the Endpoint.
    Valid values are:
    • basic: the Endpoint supports the basic Profile
    NOTE: a future CLARIN-FCS specification will introduce more values.
  • one <ed:SupportedDataViews> (REQUIRED)
    A list of Data Views, that are supported by this Endpoint. This list is composed of one or more <ed:SupportedDataView> elements. The content of a <ed:SupportedDataView> MUST be the MIME type of a supported Data View, e.g. application/x-clarin-fcs-hits+xml.
  • one <ed:Collections> element (REQUIRED)
    A list of (top-level) collections that are available at the Endpoint. The <ed:Collections> element contains one or more <ed:Collection> elements (see below). An Endpoint MUST declare at least one (top-level) collection.

The <ed:Collection> element contains a detailed description of a collection that is available at an Endpoint. A collection is a searchable entity, e.g. a single corpus. The <ed:Collection> element has a mandatory @pid attribute that contains the persistent identifier of the collection. This value MUST be the same as the MdSelfLink of the CMDI record describing the collection. The <ed:Collection> element contains the following children:

  • one or more <ed:Title> elements (REQUIRED)
    A human readable title for the collection. A REQUIRED @xml:lang attribute indicates the language of the title. An English title is REQUIRED. The list of titles MUST NOT contain duplicate entries for the same language.
  • zero or more <ed:Description> elements (OPTIONAL)
    An optional human-readable description of the collection. It SHOULD be at most one sentence. A REQUIRED @xml:lang attribute indicates the language of the description. If supplied, an English version is REQUIRED. The list of descriptions MUST NOT contain duplicate entries for the same language.
  • zero or one <ed:LandingPageURI> element (OPTIONAL)
    A link to a website for this collection, e.g. a landing page for a collection, i.e. a web-site that describes a corpus.
  • one <ed:Languages> element (REQUIRED)
    The (relevant) languages available within the collection. The <ed:Languages> element contains one or more <ed:Language> elements. The content of an <ed:Language> element MUST be an ISO 639-3 three-letter language code. This element should be repeated for all relevant languages available within the collection; however, this list MUST NOT contain duplicate entries.
  • zero or one <ed:Collections> element (OPTIONAL)
    If a collection has searchable sub-collections the Endpoint MUST supply additional finer grained collection elements, which are wrapped in a <ed:Collections> element. A sub-collection is a searchable entity within a collection, e.g. a sub-corpus.

Example 1:

<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/1.0/endpoint-description">
  <ed:Profile>basic</ed:Profile>
  <ed:SupportedDataViews>
    <ed:SupportedDataView>application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Collections>
    <!-- just one top-level collection at the Endpoint -->
    <ed:Collection pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe Corpus</ed:Title>
      <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
    </ed:Collection>
  </ed:Collections>
</ed:EndpointDescription>

This example shows a simple Endpoint Description for an Endpoint that supports the basic Profile and only provides the Generic Hits Data View. It only provides one top-level collection, identified by the persistent identifier http://hdl.handle.net/4711/0815. The collection has a title as well as a description in German and English. A landing page is located at http://repos.example.org/corpus1.html. The searchable collection contents are only available in German.

Example 2:

<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/1.0/endpoint-description">
  <ed:Profile>basic</ed:Profile>
  <ed:SupportedDataViews>
    <ed:SupportedDataView>application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
    <ed:SupportedDataView>application/x-cmdi+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Collections>
    <!-- top-level collection 1 -->
    <ed:Collection pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe Corpus</ed:Title>
      <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
    </ed:Collection>
    <!-- top-level collection 2 -->
    <ed:Collection pid="http://hdl.handle.net/4711/0816">
      <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen</ed:Title>
      <ed:Title xml:lang="en">Mannheimer Morgen newspaper Corpus</ed:Title>
      <ed:LandingPageURI>http://repos.example.org/corpus2.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:Collections>
        <!-- sub-collection 1 of top-level collection 2 -->
        <ed:Collection pid="http://hdl.handle.net/4711/0816-1">
          <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen (vor 1990)</ed:Title>
          <ed:Title xml:lang="en">Mannheimer Morgen newspaper Corpus (before 1990)</ed:Title>
          <ed:LandingPageURI>http://repos.example.org/corpus2.html#sub1</ed:LandingPageURI>
          <ed:Languages>
            <ed:Language>deu</ed:Language>
          </ed:Languages>
        </ed:Collection>
        <!-- sub-collection 2 of top-level collection 2 -->
        <ed:Collection pid="http://hdl.handle.net/4711/0816-2">
          <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen (nach 1990)</ed:Title>
          <ed:Title xml:lang="en">Mannheimer Morgen newspaper Corpus (after 1990)</ed:Title>
          <ed:LandingPageURI>http://repos.example.org/corpus2.html#sub2</ed:LandingPageURI>
          <ed:Languages>
            <ed:Language>deu</ed:Language>
          </ed:Languages>
        </ed:Collection>
      </ed:Collections>
    </ed:Collection>
  </ed:Collections>
</ed:EndpointDescription>

This more complex example shows an Endpoint Description for an Endpoint that, similar to Example 1, supports the basic profile. In addition to the Generic Hits Data View, it also supports the CMDI Data View. The Endpoint has two top-level collections (identified by the persistent identifiers http://hdl.handle.net/4711/0815 and http://hdl.handle.net/4711/0816). The second top-level collection has two sub-collections, identified by the persistent identifiers http://hdl.handle.net/4711/0816-1 and http://hdl.handle.net/4711/0816-2. All collections are described using several properties, like title, description, etc.
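
Some of the constraints on Endpoint Descriptions can be checked programmatically. The following non-normative Python sketch walks the collections recursively and reports violations of a few rules stated above (English title required, no duplicate languages, at least one top-level collection). The function names and the wording of the messages are assumptions of this example; expecting the Generic Hits MIME type among the supported Data Views is an inference from the rule that this Data View is mandatory. It does not replace validation against "Endpoint-Description.xsd".

```python
import xml.etree.ElementTree as ET

ED = "{http://clarin.eu/fcs/1.0/endpoint-description}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
HITS_TYPE = "application/x-clarin-fcs-hits+xml"

EXAMPLE = """
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/1.0/endpoint-description">
  <ed:Profile>basic</ed:Profile>
  <ed:SupportedDataViews>
    <ed:SupportedDataView>application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Collections>
    <ed:Collection pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe Corpus</ed:Title>
      <ed:Languages><ed:Language>deu</ed:Language></ed:Languages>
    </ed:Collection>
  </ed:Collections>
</ed:EndpointDescription>
"""


def check_collection(collection, problems):
    """Check one ed:Collection and, recursively, its sub-collections."""
    pid = collection.get("pid") or "(no pid)"
    if collection.get("pid") is None:
        problems.append("collection without @pid")
    title_langs = [t.get(XML_LANG) for t in collection.findall(ED + "Title")]
    if "en" not in title_langs:
        problems.append("%s: English title missing" % pid)
    if len(title_langs) != len(set(title_langs)):
        problems.append("%s: duplicate title language" % pid)
    codes = [l.text for l in collection.findall(ED + "Languages/" + ED + "Language")]
    if not codes:
        problems.append("%s: no language declared" % pid)
    if len(codes) != len(set(codes)):
        problems.append("%s: duplicate language code" % pid)
    for sub in collection.findall(ED + "Collections/" + ED + "Collection"):
        check_collection(sub, problems)


def check_endpoint_description(xml_text):
    """Return a list of human-readable problems (empty if none were found)."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.findtext(ED + "Profile") != "basic":
        problems.append("profile must be 'basic'")
    views = [v.text for v in
             root.findall(ED + "SupportedDataViews/" + ED + "SupportedDataView")]
    if HITS_TYPE not in views:
        problems.append("Generic Hits Data View not declared")
    top_level = root.findall(ED + "Collections/" + ED + "Collection")
    if not top_level:
        problems.append("no top-level collection declared")
    for collection in top_level:
        check_collection(collection, problems)
    return problems
```

For the embedded EXAMPLE the function returns an empty list; removing the English title makes it report the affected collection.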

CLARIN-FCS to SRU/CQL binding

SRU/CQL

SRU (Search/Retrieve via URL) specifies a general communication protocol for searching and retrieving records, and CQL (Contextual Query Language) specifies an extensible query language. CLARIN-FCS is built on SRU 1.2. A subsequent specification may be built on SRU 2.0.

Endpoints and Clients MUST implement the SRU/CQL protocol suite as defined in OASIS-SRU-Overview, OASIS-SRU-APD, OASIS-CQL, SRU-Explain, SRU-Scan, especially with respect to:

  • Data Model,
  • Query Model,
  • Processing Model,
  • Result Set Model, and
  • Diagnostics Model

Endpoints and Clients MUST implement the APD Binding for SRU 1.2, as defined in OASIS-SRU-12. Endpoints and Clients MAY implement the APD Binding for version 1.1 or version 2.0.

Endpoints and Clients MUST use the following namespace URIs for serializing responses:

  • http://www.loc.gov/zing/srw/ for SRU response documents, and
  • http://www.loc.gov/zing/srw/diagnostic/ for diagnostics within SRU response documents.
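As an illustration of how a Client might rely on these two namespace URIs, the following Python sketch reads the version and any diagnostics from a response document. The sample response is made up for this example, and the function name `read_response` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map using the two URIs listed above.
NS = {
    "sru": "http://www.loc.gov/zing/srw/",
    "diag": "http://www.loc.gov/zing/srw/diagnostic/",
}

# Made-up response that carries a single diagnostic.
SAMPLE_RESPONSE = """\
<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:numberOfRecords>0</sru:numberOfRecords>
  <sru:diagnostics>
    <diag:diagnostic xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/">
      <diag:uri>info:srw/diagnostic/1/10</diag:uri>
      <diag:message>Query syntax error</diag:message>
    </diag:diagnostic>
  </sru:diagnostics>
</sru:searchRetrieveResponse>
"""


def read_response(xml_text):
    """Return (version, [(diagnostic uri, message), ...]) of a response."""
    root = ET.fromstring(xml_text)
    version = root.findtext("sru:version", namespaces=NS)
    diagnostics = [
        (
            diagnostic.findtext("diag:uri", namespaces=NS),
            diagnostic.findtext("diag:message", namespaces=NS),
        )
        for diagnostic in root.findall("sru:diagnostics/diag:diagnostic", NS)
    ]
    return version, diagnostics


version, diagnostics = read_response(SAMPLE_RESPONSE)
print(version, diagnostics)
```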

CLARIN-FCS deviates from the OASIS specifications OASIS-SRU-Overview and OASIS-SRU-12 to ensure backwards compatibility with SRU 1.2 services as they were defined by LOC-SRU12.

Endpoints and Clients MUST support CQL conformance Level 2 (as defined in OASIS-CQL, section 6), i.e. be able to parse (Endpoints) or serialize (Clients) all of CQL and respond with appropriate error messages to the search/retrieve protocol interface.

NOTE: this does not imply that Endpoints are required to support all of CQL, but rather that they are able to parse all of CQL and generate the appropriate error message if a query includes a feature they do not support.

Endpoints MUST generate diagnostics according to OASIS-SRU-12, Appendix C for error conditions or to indicate unsupported features. Unfortunately, the OASIS specification does not provide a comprehensive list of diagnostics for CQL-related errors. Therefore, Endpoints MUST use diagnostics from LOC-DIAG, section "Diagnostics Relating to CQL" for CQL-related errors.
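A minimal sketch of how an Endpoint might build such a diagnostic is given below in Python. The helper `make_diagnostic` is hypothetical, and the table is only a small excerpt of the CQL-related codes, written from memory; the codes and messages should be verified against LOC-DIAG before use.

```python
import xml.etree.ElementTree as ET

DIAG_NS = "http://www.loc.gov/zing/srw/diagnostic/"
ET.register_namespace("diag", DIAG_NS)

# Excerpt of CQL-related diagnostics (assumption: verify against LOC-DIAG).
CQL_DIAGNOSTICS = {
    10: "Query syntax error",
    16: "Unsupported index",
    19: "Unsupported relation",
    37: "Unsupported boolean operator",
    48: "Query feature unsupported",
}


def qname(name):
    """Return name qualified with the diagnostics namespace."""
    return "{%s}%s" % (DIAG_NS, name)


def make_diagnostic(code, details=None):
    """Build a <diag:diagnostic> element for one of the codes above."""
    diagnostic = ET.Element(qname("diagnostic"))
    ET.SubElement(diagnostic, qname("uri")).text = "info:srw/diagnostic/1/%d" % code
    if details is not None:
        ET.SubElement(diagnostic, qname("details")).text = details
    ET.SubElement(diagnostic, qname("message")).text = CQL_DIAGNOSTICS[code]
    return diagnostic


# Example: the query used a CQL feature this Endpoint does not support.
element = make_diagnostic(48, details="proximity")
serialized = ET.tostring(element, encoding="unicode")
print(serialized)
```

The resulting element would be placed inside the diagnostics section of the SRU response document.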

Operation explain

Yada yada ...

Operation scan

The SRU operation scan is currently not used in the basic profile of CLARIN-FCS. An extended profile may use this operation; therefore, Endpoints are NOT RECOMMENDED to define custom extensions that use this operation.

Operation searchRetrieve

Yada yada ...

Appendix: Non-normative

The following sections are non-normative.

Referring to an Endpoint from a CMDI record

Centers are encouraged to provide links to their CLARIN-FCS Endpoints in the metadata records for their resources. Other services, like the VLO, can use this information for automatically configuring an Aggregator for searching resources at the Endpoint. To refer to an Endpoint, a <cmdi:ResourceProxy> with <cmdi:ResourceType> set to the value SearchService and a @mimetype attribute with a value of application/sru+xml needs to be added to the CMDI record. The content of the <cmdi:ResourceRef> element must contain a URI that points to the Endpoint web service.

  • Example:
    <CMD xmlns="http://www.clarin.eu/cmd/" CMDVersion="1.1">
      <!-- ...  -->
      <Resources>
        <ResourceProxyList>
          <!-- ... -->
          <ResourceProxy id="r4711">
            <ResourceType mimetype="application/sru+xml">SearchService</ResourceType>
            <ResourceRef>http://repos.example.org/fcs-endpoint</ResourceRef>
          </ResourceProxy>
          <!-- ... -->
        </ResourceProxyList>
      </Resources>
      <!-- ... -->
    </CMD>
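A service that harvests CMDI records could pick out such Endpoint references roughly as follows. This Python sketch is non-normative; the function name `endpoint_uris` and the first `ResourceProxy` in the sample (a landing page) are made up for illustration.

```python
import xml.etree.ElementTree as ET

CMD_NS = {"cmd": "http://www.clarin.eu/cmd/"}

SAMPLE = """\
<CMD xmlns="http://www.clarin.eu/cmd/" CMDVersion="1.1">
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="r0001">
        <ResourceType>LandingPage</ResourceType>
        <ResourceRef>http://repos.example.org/corpus1.html</ResourceRef>
      </ResourceProxy>
      <ResourceProxy id="r4711">
        <ResourceType mimetype="application/sru+xml">SearchService</ResourceType>
        <ResourceRef>http://repos.example.org/fcs-endpoint</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
</CMD>
"""


def endpoint_uris(cmdi_xml):
    """Return the URIs of all SearchService proxies with the SRU MIME type."""
    root = ET.fromstring(cmdi_xml)
    uris = []
    path = "cmd:Resources/cmd:ResourceProxyList/cmd:ResourceProxy"
    for proxy in root.iterfind(path, CMD_NS):
        resource_type = proxy.find("cmd:ResourceType", CMD_NS)
        resource_ref = proxy.findtext("cmd:ResourceRef", namespaces=CMD_NS)
        if resource_type is None or resource_ref is None:
            continue
        is_search = (resource_type.text or "").strip() == "SearchService"
        is_sru = resource_type.get("mimetype") == "application/sru+xml"
        if is_search and is_sru:
            uris.append(resource_ref.strip())
    return uris


found = endpoint_uris(SAMPLE)
print(found)
```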
    

Endpoint custom extensions

The CLARIN-FCS protocol specification allows Endpoints to add custom data to their responses. This extension mechanism can for example be used to provide hints to an (XSLT/XQuery) application that works directly on CLARIN-FCS, e.g. to allow it to generate back and forward links to navigate in a result set.

*WIP*

An Endpoint MAY add arbitrary XML fragments to a <fcs:Resource> element. Clients MUST ignore any custom extensions they do not understand.

  • Example:
    <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="hdl:4711/0815">
        <fcs:DataView type="application/x-clarin-fcs-hits+xml">
          <hits:Result xmlns:hits="http://clarin.eu/fcs/1.0/hits">
            The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy <hits:Hit>dog</hits:Hit>.
          </hits:Result>
        </fcs:DataView>
        
        <!--
            NOTE: this is purely fictional and only serves to demonstrate how
                  to add custom extensions to the result representation
                  within CLARIN-FCS.
    
            A hypothetical Endpoint extension for navigation in a result set:
            it basically provides a set of hrefs that a GUI can convert into
            navigation buttons.
        -->
        <nav:navigation xmlns:nav="http://repos.example.org/navigation">
            <nav:curr href="http://repos.example.org/resultset/4711/4611" />
            <nav:prev href="http://repos.example.org/resultset/4711/4610" />
            <nav:next href="http://repos.example.org/resultset/4711/4612" />
        </nav:navigation>
    </fcs:Resource>
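On the Client side, ignoring unknown extensions can be as simple as filtering the children of <fcs:Resource> by namespace. The Python sketch below assumes, for illustration only, that a Client treats every child outside the http://clarin.eu/fcs/1.0 namespace as a custom extension; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

FCS_NS = "http://clarin.eu/fcs/1.0"
HITS_NS = "http://clarin.eu/fcs/1.0/hits"

# Shortened copy of the example above, including the fictional extension.
SAMPLE = """\
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="hdl:4711/0815">
  <fcs:DataView type="application/x-clarin-fcs-hits+xml">
    <hits:Result xmlns:hits="http://clarin.eu/fcs/1.0/hits">The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy <hits:Hit>dog</hits:Hit>.</hits:Result>
  </fcs:DataView>
  <nav:navigation xmlns:nav="http://repos.example.org/navigation">
    <nav:next href="http://repos.example.org/resultset/4711/4612" />
  </nav:navigation>
</fcs:Resource>
"""


def split_children(resource):
    """Split children into understood FCS elements and ignored extension tags."""
    understood, ignored = [], []
    for child in resource:
        if child.tag.startswith("{" + FCS_NS + "}"):
            understood.append(child)
        else:
            # Unknown extension: remember the tag for logging, then skip it.
            ignored.append(child.tag)
    return understood, ignored


def hit_texts(elements):
    """Collect the text of all <hits:Hit> elements below the given elements."""
    hit_tag = "{%s}Hit" % HITS_NS
    return [hit.text for element in elements for hit in element.iter(hit_tag)]


resource = ET.fromstring(SAMPLE)
understood, ignored = split_children(resource)
hits = hit_texts(understood)
print(hits, ignored)
```

The Client still processes both hits of the Data View, while the navigation extension is skipped without raising an error.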