= FCS Specification Scrapbook =

== Issues with current document ==
 1. Incomprehensible and not well structured :(
 2. Resource enumeration (aka scan on fcs.resource) is rather complex and unintuitive
 3. Basic KWIC records have no provision for multiple "highlight" hits
 4. No (clear) recommendation for using Resource and !ResourceFragment
 5. What about recursiveness in Resource (see current schema)? What is the use case?

== General ideas / design goals towards better specification ==
 1. Define FCS conformance levels ''independent'' of what SRU/CQL do. Don't call them "level", but maybe something like ''profile'' to avoid confusion.
   1. Do a ''basic profile'' first
   2. Do an ''advanced/extended profile'' later in a separate specification or specification amendment (which must, of course, be compatible with the basic profile)
   3. Add provisions, e.g. to the explain output, to allow endpoints to indicate the profile they support
 2. Better structure of document (and don't include aggregation stuff; that's a different specification; implementors of endpoints should not need to worry about aggregator implementation)
 3. Keep XML sanity always in mind (so there are no namespace issues as in CMDI)
 4. Drop resource enumeration in favor of endpoint resource description
 5. Drop the recursiveness of Resource, content models should be: `Resource (DataView*, ResourceFragment*)` and `ResourceFragment (DataView*)`
 6. Drop the KWIC data view in favor of HITS data view; the latter will allow for multiple hit highlights
 7. Honor and use extension hooks provided by SRU/CQL
 8. Non-normative stuff
   1. Endpoint specific extension hooks, e.g. to avoid tag abuse of !DataView. Resource.xsd could provide an extension hook, so arbitrary XML could also be embedded.
   2. Clients can put query parameters at @ref to allow hit highlighting on their systems

== Proposal for new specification ==
The following is a proposal for a revised federated content search specification.
When done, cut and paste to the appropriate section of the Wiki and publish on the CLARIN web page.

----

= CLARIN Federated Content Search (CLARIN-FCS) =
[[PageOutline(1-5)]]

== Introduction ==
The main goal of CLARIN federated content search (CLARIN-FCS) is to introduce an ''interface specification'' that decouples the ''search engine'' functionality from its ''exploitation'', i.e. user interfaces and third-party applications, and allows services to access search engines in a uniform way.

=== Terminology ===
The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `SHOULD NOT`, `RECOMMENDED`, `MAY`, and `OPTIONAL` in this document are to be interpreted as described in [#REF_RFC_2119 RFC2119].

=== Glossary ===
 Aggregator::
   A module or service to dispatch queries to repositories and collect results.
 CLARIN-FCS, FCS::
   CLARIN federated content search, an interface specification to allow searching within resource content of repositories.
 Client::
   A software component that implements the interface specification to query Endpoints, e.g. an aggregator or a user interface.
 CQL::
   Contextual Query Language, previously known as Common Query Language, is a formal language for representing queries to information retrieval systems such as search engines, bibliographic catalogs and museum collection information.
 Endpoint::
   A software component that implements the CLARIN-FCS interface specification and translates between CLARIN-FCS and a search engine.
 Interface Specification::
   Common harmonized interface and suite of protocols that repositories need to implement.
 Search Engine::
   A software component within a repository that allows for searching within the repository contents.
 SRU::
   Search and Retrieve via URL, a protocol for Internet search queries.
 Data View::
   A Data View is a mechanism to support different representations of search results, e.g. a "hits with highlights" view, an image or a geolocation.
Data View Payload, Payload:: The actual content encoded within a Data View, i.e. a CMDI metadata record or a KML encoded geolocation. PID:: A Persistent identifier is a long-lasting reference to a digital object. Repository:: A software component at a CLARIN center that stores resources (= data) and information about these resources (= metadata). Repository Registry:: A separate service that allows registering Endpoints and provides information about these to other components, e.g. an aggegator. The [http://centres.clarin.eu/ CLARIN Center Registry] is an implementation of such a repository registry. === Normative References === RFC2119[=#REF_RFC_2119]:: Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119, March 1997, \\ [http://www.ietf.org/rfc/rfc2119.txt] XML-Namespaces[=#REF_XML_Namespaces]:: Namespaces in XML 1.1 (Second Edition), W3C, August 2006, \\ [http://www.w3.org/TR/2006/REC-xml-names11-20060816] OASIS-SRU-Overview[=#REF_SRU_Overview]:: searchRetrieve: Part 0. Overview Version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.html (HTML)], [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.pdf (PDF)] OASIS-SRU-APD[=#REF_SRU_APD]:: searchRetrieve: Part 1. Abstract Protocol Definition Version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.pdf (PDF)] OASIS-SRU12[=#REF_SRU_12]:: searchRetrieve: Part 2. 
SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.pdf (PDF)] OASIS-CQL[=#REF_CQL]:: searchRetrieve: Part 5. CQL: The Contextual Query Language version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.pdf (PDF)] SRU-Explain[=#REF_Explain]:: searchRetrieve: Part 7. SRU Explain Operation version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.pdf (PDF)] SRU-Scan[=#REF_Scan]:: searchRetrieve: Part 6. 
SRU Scan Operation version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.PDF (PDF)] LOC-SRU12[=#REF_LOC_SRU_12]:: SRU Version 1.2: SRU !Search/Retrieve Operation, Library of Congress, \\ [http://www.loc.gov/standards/sru/sru-1-2.html] LOC-DIAG[=#REF_LOC_DIAG]:: SRU Version 1.2: SRU Diagnostics List, Library of Congress,\\ [http://www.loc.gov/standards/sru/diagnostics/diagnosticsList.html] === Non-Normative References === RFC6838[=#REF_RFC_6838]:: Media Type Specifications and Registration Procedures, IETF RFC 6838, January 2013, \\ [http://www.ietf.org/rfc/rfc6838.txt] RFC3023[=#REF_RFC_3023]:: XML Media Types, IETF RFC 3023, January 2001, \\ [http://www.ietf.org/rfc/rfc3023.txt] KML[=#REF_KML_Spec]:: Keyhole Markup Language (KML), Open Geospatial Consortium, 2008, \\ [http://www.opengeospatial.org/standards/kml] === Typographic and XML Namespace conventions === The following typographic conventions for XML fragments will be used throughout this specification: * `` \\ An XML element with the Generic Identifier ''Element'' that is bound an XML namespace denoted by the prefix ''prefix''. * `@attr` \\ An XML attribute with the name ''attr'' {{{#!comment * `@prefix:attr` \\ An XML attribute with the name ''attr'' that is bound to an XML namespaces denoted by the prefix ''prefix''. }}} * `string` \\ The literal ''string'' must be used either as element content or attribute value. Endpoints and Clients `MUST` adhere the [#REF_XML_Namespaces XML-Namespaces] specification. 
The CLARIN-FCS interface specification generally does not dictate whether XML elements should be serialized in their prefixed or non-prefixed syntax, but Endpoints `MUST` ensure that the correct XML namespace is used for elements and that XML namespaces are declared correctly. Clients `MUST` be agnostic regarding the syntax used for serializing the XML elements, i.e. whether the prefixed or un-prefixed variant was used, and `SHOULD` operate solely on ''expanded names'', i.e. pairs of ''namespace name'' and ''local name''.

The following XML namespace names and prefixes are used throughout this specification. The column ''Recommended Syntax'' indicates which syntax variant `SHOULD` be used by Endpoints when serializing the XML response.
||=Prefix =||=Namespace Name =||=Comment =||= Recommended Syntax =||
|| `fcs` || `http://clarin.eu/fcs/1.0` || CLARIN-FCS Resources || prefixed ||
|| `ed` || `http://clarin.eu/fcs/1.0/endpoint-description` || CLARIN-FCS Endpoint Description || prefixed ||
|| `hits` || `http://clarin.eu/fcs/1.0/hits` || CLARIN-FCS Generic Hits || prefixed ||
|| `sru` || `http://www.loc.gov/zing/srw/` || SRU || prefixed ||
|| `diag` || `http://www.loc.gov/zing/srw/diagnostic/` || SRU Diagnostics || prefixed ||
|| `zr` || `http://explain.z3950.org/dtd/2.0/` || SRU/ZeeRex Explain || prefixed ||
|| `cmdi` || `http://www.clarin.eu/cmd/` || Component Metadata || un-prefixed ||
|| `kml` || `http://www.opengis.net/kml/2.2` || Keyhole Markup Language || un-prefixed ||

== CLARIN-FCS Interface Specification ==
The CLARIN-FCS interface specification defines two profiles, an extensible result format and a set of required operations. CLARIN-FCS is built on the SRU/CQL standard, and additional functionality required for CLARIN-FCS is added through SRU/CQL's extension mechanisms.

Generally, the CLARIN-FCS Interface Specification consists of two components, a set of ''formats'' and a ''transport protocol''.
The ''Endpoint'' component is a software component that acts as a bridge between the formats, which are sent by a ''Client'' using the ''transport protocol'', and a ''Search Engine''. The ''Search Engine'' is a custom software component that allows searching in the language resources of a CLARIN center. The ''Endpoint'' basically implements the ''transport protocol'' and acts as a mediator between the CLARIN-FCS specific formats and the idiosyncrasies of ''Search Engines''. The following figure illustrates the overall architecture.
{{{
                +---------+
                | Client  |
                +---------+
                    /|\
                     |
         -------------------------
        |        SRU / CQL        |
        | w/CLARIN-FCS extensions |
         -------------------------
                     |
                    \|/
+-----------------------------------------+
|         |     Endpoint      /|\         |
|         |                    |          |
|  ---------------    ------------------  |
| | Translate CQL |  | Translate Result | |
|  ---------------    ------------------  |
|         |                    |          |
|        \|/                   |          |
+-----------------------------------------+
                    /|\
                     |
                    \|/
       +---------------------------+
       |       Search Engine       |
       +---------------------------+
}}}

The following sections describe the CLARIN-FCS profiles and the query and result formats, how SRU/CQL is used as a transport protocol in the context of CLARIN-FCS, and the required CLARIN-FCS specific extensions to SRU.

=== Profiles ===
CLARIN-FCS defines two profiles:
 ''Basic profile''::
   Endpoints `MUST` support ''term-only'' queries. \\
   Endpoints `SHOULD` support ''terms'' combined with boolean operator queries (''AND'' and ''OR''), including subqueries. Endpoints `MAY` also support ''NOT'' or ''PROX'' operator queries. If an Endpoint does not support a query, i.e. the used operators are not supported by the Endpoint, it `MUST` return an appropriate error message using the appropriate SRU diagnostic.
   \\ Examples for valid CQL queries for the ''basic profile'':
   {{{
   cat
   "cat"
   cat AND dog
   "grumpy cat"
   "grumpy cat" AND dog
   "grumpy cat" OR "lazy dog"
   cat AND (mouse OR "lazy dog")
   }}}
   The Endpoint `MUST` perform the query on the annotation tier that makes the most sense for the user, e.g. the textual content for a text corpus resource or the orthographic transcription of a spoken language corpus. Endpoints `SHOULD` perform the query case-sensitive. \\
   Endpoints `MUST NOT` silently accept queries that include CQL features besides ''term-only'' and ''terms'' combined with boolean operator queries, e.g. queries involving context sets.
 ''Extended profile''::
   This profile will support more sophisticated queries such as selecting annotation tiers, expanding of tags, or mapping of data categories. \\
   '''NOTE''': the extended profile is not yet defined and will be part of a future CLARIN-FCS specification.

Endpoints and Clients `MUST` support the ''basic profile''. For now, Endpoints and Clients `MUST NOT` claim to support the ''extended profile''.

=== Result Format ===#resultFormat
CLARIN-FCS uses a customized format for returning results. ''Resources'' and ''Resource Fragments'' serve as containers for hit results, which are presented in one or more ''Data Views''. The following section describes the Resource format and Data View format, and section [#searchRetrieve Operation ''searchRetrieve''] describes how hits are embedded within SRU responses.

==== Resource and !ResourceFragment ====
To encode search results, CLARIN-FCS supports two building blocks:
 Resources::
   A ''Resource'' is a searchable entity at an Endpoint, such as a text corpus or a multi-modal corpus. A Resource `SHOULD` be a self-contained unit, i.e. not a sentence in a text corpus or a time interval in an audio transcription.
 Resource Fragments::
   A ''Resource Fragment'' is a smaller unit in a ''Resource'', e.g. a sentence in a text corpus or a time interval in an audio transcription.
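
The two building blocks can be pictured as a small data model. The following is a minimal Python sketch and not part of the specification: the class, field and function names (`DataView`, `ResourceFragment`, `Resource`, `all_data_views`) are hypothetical, and the nesting assumes the content model noted in the scrapbook above, i.e. `Resource (DataView*, ResourceFragment*)` and `ResourceFragment (DataView*)`. Data Views themselves are described in the sections that follow.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# MIME type of the Generic Hits Data View, as used in this document.
HITS_MIME = "application/x-clarin-fcs-hits+xml"


@dataclass
class DataView:
    """One typed representation of a hit (hypothetical model class)."""
    mime_type: str
    pid: Optional[str] = None   # persistent identifier, if any
    ref: Optional[str] = None   # plain URI, if any


@dataclass
class ResourceFragment:
    """A smaller unit within a Resource, e.g. a sentence."""
    data_views: List[DataView] = field(default_factory=list)
    pid: Optional[str] = None
    ref: Optional[str] = None


@dataclass
class Resource:
    """A self-contained searchable entity, e.g. a text corpus."""
    data_views: List[DataView] = field(default_factory=list)
    fragments: List[ResourceFragment] = field(default_factory=list)
    pid: Optional[str] = None
    ref: Optional[str] = None


def all_data_views(resource: Resource) -> List[DataView]:
    """Collect the Data Views of a Resource and of all its Resource Fragments."""
    views = list(resource.data_views)
    for fragment in resource.fragments:
        views.extend(fragment.data_views)
    return views
```

Note that the model is deliberately not recursive: a Resource Fragment cannot contain further Resources or Resource Fragments.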
A Resource `SHOULD` be the most precise unit of data that is directly addressable as a "whole". A Resource `SHOULD` contain a Resource Fragment if the hit consists of just a part of the Resource unit, e.g. if the hit is a sentence within a large text. A Resource Fragment `SHOULD` be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using Resource Fragments is `OPTIONAL`, but Endpoints are encouraged to use them. If an Endpoint encodes a hit with a Resource Fragment, the actual hit `SHOULD` be encoded as a Data View that is encoded in a Resource Fragment.

Endpoints `SHOULD` always provide a link to the resource itself, i.e. each Resource or Resource Fragment `SHOULD` be identified by a persistent identifier or an Endpoint-unique URI. Even if direct linking is not possible, e.g. due to licensing issues, the Endpoint `SHOULD` provide a URI that links to a web page describing the corpus or collection, including instructions on how to obtain it. Endpoints `SHOULD` provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed directly, the Resource Fragment `SHOULD NOT` contain a persistent identifier or a URI.

If an Endpoint can provide both a persistent identifier and a URI for either Resource or Resource Fragment, it `SHOULD` provide both. When working with results, Clients `SHOULD` prefer persistent identifiers over regular URIs.

Resource and Resource Fragment are serialized in XML and Endpoints `MUST` generate responses that are valid according to the XML schema "[source:FederatedSearch/schema/Resource.xsd Resource.xsd]" ([source:FederatedSearch/schema/Resource.xsd?format=txt download]). A Resource is encoded in the form of a `` element, a ''Resource Fragment'' in the form of a `` element. The content of a Data View is wrapped in a `` element.
`` is the top-level element and `MAY` contain zero or more `` elements and `MAY` contain zero or more `` elements. A `` element `MUST` contain one or more `` elements. The elements ``, `` and `` `MAY` carry a `@pid` and/or a `@ref` attribute, which allows linking to the original data represented by the Resource, Resource Fragment, or Data View. A `@pid` attribute `MUST` contain a valid persistent identifier, a `@ref` `MUST` contain valid URI, i.e. a "plain" URI without the additional semantics of being a persistent reference. Endpoints `MUST` use the identifier `http://clarin.eu/fcs/1.0` for the ''responseItemType'' (= content for the `` element) in SRU responses. Endpoints `MAY` serialize hits as multiple Data Views, however they `MUST` provide the Generic Hits (HITS) Data View either encoded as a Resource Fragment (if applicable), or otherwise within the Resource (if there is no reasonable Resource Fragment). Other Data Views `SHOULD` be put in a place that is logical for their content (as is to be determined by the Endpoint), e.g. a metadata Data View would most likely be put directly below Resource and a Data View representing some annotation layers directly around the hit is more likely to belong within a Resource Fragment. [=#REF_Example_1]Example 1: {{{#!xml }}} This [#REF_Example_1 example] shows a simple hit, which is encoded in one Data View of type ''Generic Hits'' embedded within a Resource. The type of the Data View is identified by the MIME type `application/x-clarin-fcs-hits+xml`. The Resource is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`. [=#REF_Example_2]Example 2: {{{#!xml }}} This [#REF_Example_2 example] shows a hit encoded as a Resource Fragment embedded within a Resource. The actual hit is again encoded as one Data View of type ''Generic Hits''. The hit is not directly referenceable, but the Resource, in which hit occurred, is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`. 
In contrast to [#REF_Example_1 Example 1], the Endpoint decided to provide a "semantically richer" encoding and embedded the hit using a Resource Fragment within the Resource to indicate that the hit is a part of a larger resource, e.g. a sentence in a text document.

[=#REF_Example_3]Example 3:
{{{#!xml
}}}
The most complex [#REF_Example_3 example] is similar to [#REF_Example_2 Example 2], i.e. it shows a hit encoded as one ''Generic Hits'' Data View in a Resource Fragment, which is embedded in a Resource. In contrast to Example 2, another Data View of type ''CMDI'' is embedded directly within the Resource. An Endpoint can use this type of Data View to directly provide CMDI metadata about the Resource to Clients. All entities of the hit can be referenced by a persistent identifier and a URI. The complete Resource is referenceable by either the persistent identifier `http://hdl.handle.net/4711/08-15` or the URI `http://repos.example.org/file/text_08_15.html`, and the CMDI metadata record in the CMDI Data View is referenceable either by the persistent identifier `http://hdl.handle.net/4711/08-15-1` or the URI `http://repos.example.org/file/08_15_1.cmdi`. The actual hit in the Resource Fragment is also directly referenceable by either the persistent identifier `http://hdl.handle.net/4711/00-15-2` or the URI `http://repos.example.org/file/text_08_15.html#sentence2`.

Endpoints `MUST` serialize one Resource for each hit, i.e. they `MUST NOT` combine several hits in one Resource. For example, if a query matches five different sentences within one text (= the resource), the Endpoint `MUST` serialize five Resources (= one per hit) and embed each within one SRU result (see [#searchRetrieve below]).

==== Data View ====
A ''Data View'' serves as a container for representing search results within CLARIN-FCS. Data Views are designed to allow for different representations of results, i.e. they are deliberately kept open to allow further extensions with more supported Data View formats.
The content of a Data View is called ''Payload''. Each Payload is typed and the type of the Payload is recorded in the `@type` attribute of the `` element. The Payload type is identified by a MIME type ([#REF_RFC_6838 RFC6838], [#REF_RFC_3023 RFC3023]). If no existing MIME type can be used, implementors `SHOULD` define a proper private MIME type.

The Payload of a Data View can either be deposited ''inline'' or by ''reference''. In the case of ''inline'', it `MUST` be serialized as an XML fragment below the `` element. This is the preferred method for payloads that can easily be serialized in XML. The ''reference'' method is meant for content that cannot easily be deposited inline, e.g. binary content. In this case, the Data View `MUST` include a `@ref` or `@pid` attribute that links to a location from which Clients can download the payload. This location `SHOULD` be ''openly accessible'', i.e. data can be downloaded freely without any need to perform a login.

For the ''basic'' profile, the Data Views ''Generic Hits'', ''Component Metadata'', ''Image'' and ''Geolocation'' are defined in this specification. Endpoints `MAY` define custom Data Views, but Clients conforming to the ''basic'' profile `MAY` choose to ignore them. The ''Generic Hits'' Data View is mandatory, thus all Endpoints `MUST` provide hits represented in the ''Generic Hits'' Data View.

'''NOTE''': The examples in the following sections ''show only'' the payload with the enclosing `` element of a Data View. Of course, the Data View must be embedded either in a `` or a `` element. The `@pid` and `@ref` attributes have been omitted for all ''inline'' payload types.

===== Generic Hits (HITS) =====
||=Description =|| The representation of the hit ||
||=MIME type =|| `application/x-clarin-fcs-hits+xml` ||
||=Payload Disposition =|| ''inline'' ||

The ''Generic Hits'' Data View contains the serialization of a search result hit. It supports multiple markers for supplying highlighting for the hit.
Each hit `SHOULD` be presented within the context of a complete sentence. If that is not possible due to the nature of the type of the resource, the Endpoint `SHALL` provide an equivalent reasonable unit of context (e.g. within a phrase of an orthographic transcription of an utterance). All Endpoints `MUST` provide hits represented in this Data View. The XML fragment of the Generic Hits payload `MUST` be valid according to the XML schema "[source:FederatedSearch/schema/DataView-Hits.xsd DataView-Hits.xsd]" ([source:FederatedSearch/schema/DataView-Hits.xsd?format=txt download]).
 * Example (single hit marker):
{{{#!xml
The quick brown fox jumps over the lazy dog.
}}}
 * Example (multiple hit markers):
{{{#!xml
The quick brown fox jumps over the lazy dog.
}}}

===== Component Metadata (CMDI) =====
||=Description =|| A CMDI metadata record ||
||=MIME type =|| `application/x-cmdi+xml` ||
||=Payload Disposition =|| ''inline'' or ''reference'' ||

The ''Component Metadata'' Data View allows embedding a CMDI metadata record that is ''applicable'' to the specific context into the Endpoint response, e.g. metadata about the resource in which the hit was produced. If this CMDI record is applicable to the entire Resource, it `SHOULD` be put in a `` element below the `` element. If it is applicable to the Resource Fragment, i.e. it contains more specialized metadata than the metadata for the encompassing resource, it `SHOULD` be put in a `` element below the `` element. Endpoints `SHOULD` provide the payload ''inline'', but Endpoints `MAY` also use the ''reference'' method. If an Endpoint uses the ''reference'' method, the CMDI metadata record `MUST` be downloadable without any restrictions.
 * Example (inline):
{{{#!xml
}}}
 * Example (referenced):
{{{#!xml
}}}

===== Images (IMG) =====
||=Description =|| An image related to the hit ||
||=MIME type =|| `image/png`, `image/jpeg`, `image/gif`, `image/svg+xml` ||
||=Payload Disposition =|| ''reference'' ||

The ''Image'' Data View allows providing an image that is relevant to the hit, e.g. a facsimile of the source of a transcription. Endpoints `MUST` provide the payload by the ''reference'' method and the image file `SHOULD` be downloadable without any restrictions.
 * Example:
{{{#!xml
}}}

===== Geolocation (GEO) =====
||=Description =|| A geographic location related to the hit ||
||=MIME type =|| `application/vnd.google-earth.kml+xml` ||
||=Payload Disposition =|| ''inline'' ||

The ''Geolocation'' Data View allows geolocating a hit. It `MUST` be encoded using the XML representation of the Keyhole Markup Language (KML). The KML fragment `MUST` comply with the specification as defined by [#REF_KML_Spec KML].
 * Example:
{{{#!xml
IDS Mannheim Institut für Deutsche Sprache, R5 6-13, 68161 Mannheim, Germany 8.4719510,49.4883700,0
}}}

=== Endpoint Description ===#endpointDescription
Endpoints need to provide information about their capabilities to support auto-configuration of Clients. These capabilities include, among other information, the Profile that is supported by an Endpoint. The ''Endpoint Description'' mechanism provides the necessary facility to provide this information to the Clients. Endpoints `MUST` encode their capabilities using an XML format and embed this information into the SRU/CQL protocol as described in section [#explain Operation ''explain'']. The XML fragment generated by the Endpoint for the Endpoint Description `MUST` be valid according to the XML schema "[source:FederatedSearch/schema/Endpoint-Description.xsd Endpoint-Description.xsd]" ([source:FederatedSearch/schema/Endpoint-Description.xsd?format=txt download]).
The XML fragment for ''Endpoint Description'' is encoded as an `` element that contains the following children:
 * one `` element (`REQUIRED`) \\ The content of the `` element indicates the Profile that is supported by the Endpoint. \\ Valid values are:
   * `basic`: the Endpoint supports the ''basic'' Profile \\ '''NOTE''': a future CLARIN-FCS specification will introduce more values.
 * one `` (`REQUIRED`) \\ A list of Data Views that are supported by this Endpoint. This list is composed of one or more `` elements. The content of a `` `MUST` be the MIME type of a supported Data View, e.g. `application/x-clarin-fcs-hits+xml`.
 * one `` element (`REQUIRED`) \\ A list of (top-level) collections that are available at the Endpoint. The `` element contains one or more `` elements (see below). An Endpoint `MUST` declare at least one (top-level) collection.

The `` element contains a detailed description of a collection that is available at an Endpoint. A collection is a searchable entity, e.g. a single corpus. The `` has a mandatory `@pid` attribute that contains the persistent identifier of the collection. This value `MUST` be the same as the ''!MdSelfLink'' of the CMDI record describing the collection. The `` element contains the following children:
 * one or more `` elements (`REQUIRED`) \\ A human-readable title for the collection. A `REQUIRED` `@xml:lang` attribute indicates the language of the title. An English version of the title is `REQUIRED`. The list of titles `MUST NOT` contain duplicate entries for the same language.
 * zero or more `` elements (`OPTIONAL`) \\ An optional human-readable description of the collection. It `SHOULD` be at most one sentence. A `REQUIRED` `@xml:lang` attribute indicates the language of the description. If supplied, an English version of the description is `REQUIRED`. The list of descriptions `MUST NOT` contain duplicate entries for the same language.
 * zero or one `` element (`OPTIONAL`) \\ A link to a website for this collection, e.g.
a landing page for a collection, i.e. a web-site that describes a corpus. * one `` element (`REQUIRED`) \\ The (relevant) languages available within the collection. The `` element contains one or more `` elements. The content of a `` element `MUST` be a ISO 639-3 three letter language code. This element should be repeated for all languages (relevant) available ''within'' the collection, however this list `MUST NOT` contain duplicate entries. * zero or one `` element (`OPTIONAL`) \\ If a collection has searchable sub-collections the Endpoint `MUST` supply additional finer grained collection elements, which are wrapped in a `` element. A sub-collection is a searchable entity within a collection, e.g. a sub-corpus. [=#REF_Example_4]Example 4: {{{#!xml basic application/x-clarin-fcs-hits+xml Goethe Corpus Goethe Korpus Der Goethe Korpus des IDS Mannheim. The Goethe corpus of IDS Mannheim. http://repos.example.org/corpus1.html deu }}} This [#REF_Example_4 example] shows a simple Endpoint Description for an Endpoint that supports the ''basic'' Profile and only provides the Generic Hits Data View. It only provides one top-level collection identified by the persistent identifier `http://hdl.handle.net/4711/0815`. The collection a title as well as a description in German and English. A landing page is located at `http://repos.example.org/corpus1.html`. The searchable collection contents are only available in German. [=#REF_Example_5]Example 5: {{{#!xml basic application/x-clarin-fcs-hits+xml application/x-cmdi+xml Goethe Corpus Goethe Korpus Der Goethe Korpus des IDS Mannheim. The Goethe corpus of IDS Mannheim. 
http://repos.example.org/corpus1.html deu Mannheimer Morgen newspaper Corpus Zeitungskorpus des Mannheimer Morgen http://repos.example.org/corpus2.html deu Mannheimer Morgen newspaper Corpus (before 1990) Zeitungskorpus des Mannheimer Morgen (vor 1990) http://repos.example.org/corpus2.html#sub1 deu Mannheimer Morgen newspaper Corpus (after 1990) Zeitungskorpus des Mannheimer Morgen (nach 1990) http://repos.example.org/corpus2.html#sub2 deu
}}}
This more complex [#REF_Example_5 example] shows an Endpoint Description for an Endpoint that, similar to [#REF_Example_4 Example 4], supports the ''basic'' profile. In addition to the Generic Hits Data View, it also supports the CMDI Data View. The Endpoint has two top-level collections (identified by the persistent identifiers `http://hdl.handle.net/4711/0815` and `http://hdl.handle.net/4711/0816`). The second top-level collection has two sub-collections, identified by the persistent identifiers `http://hdl.handle.net/4711/0816-1` and `http://hdl.handle.net/4711/0816-2`. All collections are described using several properties, like title, description, etc.

== CLARIN-FCS to SRU/CQL binding ==

=== SRU/CQL ===
SRU (!Search/Retrieve via URL) specifies a general communication protocol for searching and retrieving records, and CQL (Contextual Query Language) specifies an extensible query language. CLARIN-FCS is built on SRU 1.2. A subsequent specification may be built on SRU 2.0.

Endpoints and Clients `MUST` implement the SRU/CQL protocol suite as defined in [#REF_SRU_Overview OASIS-SRU-Overview], [#REF_SRU_APD OASIS-SRU-APD], [#REF_CQL OASIS-CQL], [#REF_Explain SRU-Explain], [#REF_Scan SRU-Scan], especially with respect to:
 * Data Model,
 * Query Model,
 * Processing Model,
 * Result Set Model, and
 * Diagnostics Model

Endpoints and Clients `MUST` implement the APD Binding for SRU 1.2, as defined in [#REF_SRU_12 OASIS-SRU-12]. Endpoints and Clients `MAY` implement the APD binding for version 1.1 or version 2.0.
Endpoints and Clients `MUST` use the following XML namespace names (namespace URIs) for serializing responses:
 * `http://www.loc.gov/zing/srw/` for SRU response documents, and
 * `http://www.loc.gov/zing/srw/diagnostic/` for diagnostics within SRU response documents.
Here CLARIN-FCS deviates from the OASIS specifications [#REF_SRU_Overview OASIS-SRU-Overview] and [#REF_SRU_12 OASIS-SRU-12] to ensure backwards compatibility with SRU 1.2 services as they were defined by [#REF_LOC_SRU_12 LOC-SRU12].

Endpoints and Clients `MUST` support CQL conformance ''Level 2'' (as defined in [#REF_CQL OASIS-CQL, section 6]), i.e. be able to ''parse'' (Endpoints) or ''serialize'' (Clients) all of CQL and respond with appropriate error messages to the search/retrieve protocol interface.

'''NOTE''': this does ''not imply'' that Endpoints are ''required'' to support all of CQL, but rather that they are able to ''parse'' all of CQL and generate the appropriate error message if a query includes a feature they do not support.

Endpoints `MUST` generate diagnostics according to [#REF_SRU_12 OASIS-SRU-12, Appendix C] for error conditions or to indicate unsupported features. Unfortunately, the OASIS specification does not provide a comprehensive list of diagnostics for CQL related errors. Therefore, Endpoints `MUST` use diagnostics from [#REF_LOC_DIAG LOC-DIAG, section "Diagnostics Relating to CQL"] for CQL related errors.

=== Operation ''explain'' ===#explain
The ''explain'' operation of the SRU protocol serves to announce server capabilities and to allow clients to configure themselves automatically. CLARIN-FCS uses this operation in the same way.

An Endpoint `MUST` respond to an ''explain'' request with a proper ''explain'' response. As per [#REF_Explain SRU-Explain], the response `MUST` contain one `` element that contains an ''SRU Explain'' record. The `` element `MUST` contain the literal `http://explain.z3950.org/dtd/2.0/`, i.e. the official ''identifier'' for Explain records.
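
A Client might verify the record schema of an ''explain'' response as in the following minimal Python sketch. It is an illustration under stated assumptions, not a normative part of this specification: the function name `explain_record_schema` and both sample documents are made up for this example, and the sample markup assumes the usual SRU 1.2 response layout (`explainResponse`, `record`, `recordSchema`). The lookup works on expanded names (namespace name plus local name), so it does not matter whether the Endpoint serialized the response in prefixed or un-prefixed syntax.

```python
import xml.etree.ElementTree as ET

# Namespace name and schema identifier as given in this specification.
SRU_NS = "http://www.loc.gov/zing/srw/"
EXPLAIN_SCHEMA = "http://explain.z3950.org/dtd/2.0/"


def explain_record_schema(response_xml):
    """Return the text of the first SRU recordSchema element, or None."""
    root = ET.fromstring(response_xml)
    # ElementTree addresses elements by expanded name: "{namespace}local".
    element = root.find(".//{" + SRU_NS + "}recordSchema")
    if element is None or element.text is None:
        return None
    return element.text.strip()


# Hypothetical sample response in prefixed syntax.
PREFIXED = """<sru:explainResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:record>
    <sru:recordSchema>http://explain.z3950.org/dtd/2.0/</sru:recordSchema>
    <sru:recordPacking>xml</sru:recordPacking>
    <sru:recordData/>
  </sru:record>
</sru:explainResponse>"""

# The same hypothetical response in un-prefixed syntax (default namespace).
UNPREFIXED = """<explainResponse xmlns="http://www.loc.gov/zing/srw/">
  <version>1.2</version>
  <record>
    <recordSchema>http://explain.z3950.org/dtd/2.0/</recordSchema>
    <recordPacking>xml</recordPacking>
    <recordData/>
  </record>
</explainResponse>"""

for sample in (PREFIXED, UNPREFIXED):
    print(explain_record_schema(sample) == EXPLAIN_SCHEMA)
```

Both samples yield the same result, because the parser resolves either syntax to the same expanded name.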

According to the Profile supported by the Endpoint, the Explain record `MUST` contain the following elements:
 ''Basic'' Profile::
   `` as defined in [#REF_Explain SRU-Explain] (`REQUIRED`) \\
   `` as defined in [#REF_Explain SRU-Explain] (`REQUIRED`) \\
   `` as defined in [#REF_Explain SRU-Explain] (`REQUIRED`). This element `MUST` contain an element `` with an `@identifier` attribute with a value of `http://clarin.eu/fcs/1.0` and an `@name` attribute with a value of `fcs`. \\
   `` is `OPTIONAL` \\
   An ''extended'' profile may define how the `` element is to be used, therefore it is `NOT RECOMMENDED` for Endpoints to define custom extensions.
 ''Extended'' Profile::
   '''NOTE''': the extended profile is not yet defined and will be part of a future CLARIN-FCS specification.

To support auto-configuration in CLARIN-FCS, an Endpoint provides an ''Endpoint Description''. The Endpoint Description is included in the ''explain'' response using SRU's extension mechanism, i.e. by embedding an XML fragment into the `` element. Endpoints `MUST` include the Endpoint Description ''only'' if a Client performs an ''explain'' request with the ''extra request parameter'' `x-clarin-fcs-endpoint-description` with a value of `true`. If a Client performs an ''explain'' request ''without'' supplying this extra request parameter, the Endpoint `MUST NOT` include the Endpoint Description. The format of the Endpoint Description XML fragment is defined in [#REF_endpointDescription Endpoint Description].

The following example shows a request and response to an ''explain'' request with the extra request parameter `x-clarin-fcs-endpoint-description` added:
 * HTTP GET request: Client -> Endpoint:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=explain&version=1.2&x-clarin-fcs-endpoint-description=true
}}}
 * HTTP Response: Endpoint -> Client:
{{{#!xml
1.2 http://explain.z3950.org/dtd/2.0/ xml repos.example.org 80 sru Goethe Corpus Goethe Korpus Der Goethe Korpus des IDS Mannheim. The Goethe corpus of IDS Mannheim.
CLARIN Federated Content Search 250 1000 1.2 https://clarin.ids-mannheim.de/digibibsru basic application/x-clarin-fcs-hits+xml Goethe Corpus Goethe Korpus Der Goethe Korpus des IDS Mannheim. The Goethe corpus of IDS Mannheim. http://repos.example.org/corpus1.html deu
}}}

=== Operation ''scan'' ===#scan
The ''scan'' operation of the SRU protocol is currently not used in the ''basic'' profile of CLARIN-FCS. An ''extended'' profile may use this operation, therefore it is `NOT RECOMMENDED` for Endpoints to define custom extensions that use this operation.

=== Operation ''searchRetrieve'' ===#searchRetrieve
The ''searchRetrieve'' operation of the SRU protocol is used for searching in the Resources that are provided by an Endpoint. The SRU protocol defines request and response formats in [#REF_SRU_12 OASIS-SRU-12]. Search result hits are encoded down to the record level, i.e. the `` element, and SRU allows records to be serialized in various formats, so-called ''record schemas''. Endpoints `MUST` support the CLARIN-FCS record schema (see [#resultFormat above]). The ''responseItemType'' ("record schema identifier") that `MUST` be used for that schema is `http://clarin.eu/fcs/1.0`.

In CLARIN-FCS, each record, i.e. `` element, `MUST` represent exactly ''one hit'' within the Resource. to belong in within the resource fragment.

The following example shows a request and response to a ''searchRetrieve'' request with a ''term-only'' query for "cat":
 * HTTP GET request: Client -> Endpoint:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=searchRetrieve&version=1.2&query=cat
}}}
 * HTTP Response: Endpoint -> Client:
{{{#!xml
1.2 6 http://clarin.eu/fcs/1.0 xml The quick brown cat jumps over the lazy dog. 1 1.2 cat cql.serverChoice = cat 1 http://repos.example.org/fcs-endpoint
}}}

In general, the Endpoint is `REQUIRED` to accept an unrestricted search and `MUST` then perform the search operation on all Resources available at the Endpoint.
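
A Client might assemble the ''searchRetrieve'' request shown in the example above along the following lines. This is only a sketch under stated assumptions: the helper name `build_search_url` is hypothetical and the endpoint URL is the example value used in this document. Only the parameters `operation`, `version` and `query` come from the example request.

```python
from urllib.parse import urlencode


def build_search_url(endpoint: str, query: str, version: str = "1.2") -> str:
    """Build an unrestricted searchRetrieve request URL (hypothetical helper).

    urlencode percent-encodes reserved characters, so a CQL query with
    spaces or quotes is transmitted safely as a single parameter value.
    """
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": query,
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # Reproduces the example request for the term-only query "cat".
    print(build_search_url("http://repos.example.org/fcs-endpoint", "cat"))
```

Building the query string with a library routine instead of string concatenation avoids malformed requests when the CQL query contains characters such as `&` or `=`.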

The Client can request the Endpoint to ''restrict the search'' to a sub-collection of the Resources available. In this case, the Client `MUST` pass a comma-separated list of persistent identifiers in the {{{x-clarin-fcs-context}}} extra request parameter of the ''searchRetrieve'' request. The Endpoint `MUST` then restrict the search to those Resources that are identified by the persistent identifiers passed by the Client. A Client can extract all valid persistent identifiers from the `@pid` attribute of the `` element, obtained by the ''explain'' request (see section [#explain Operation ''explain''] and section [#endpointDescription Endpoint Description]). Because the list of persistent identifiers can get extensive, a Client `MAY` use the HTTP POST method instead of the GET method for submitting the request.

For example, to restrict the search to the Resource with the persistent identifier `http://hdl.handle.net/4711/0815`, the Client must issue the following request:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=searchRetrieve&version=1.2&query=cat&x-clarin-fcs-context=http://hdl.handle.net/4711/0815
}}}

To restrict the search to the Resources with the persistent identifiers `http://hdl.handle.net/4711/0815` and `http://hdl.handle.net/4711/0816-2`, the Client must issue the following request:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=searchRetrieve&version=1.2&query=cat&x-clarin-fcs-context=http://hdl.handle.net/4711/0815,http://hdl.handle.net/4711/0816-2
}}}

If an invalid persistent identifier is passed by a Client, the Endpoint `MUST` issue a `http://clarin.eu/fcs/1.0/diagnostic/1` diagnostic, i.e. add the appropriate XML fragment to the `` element of the response. The Endpoint `MAY` treat this condition as fatal, i.e. just issue the diagnostic and perform no search, or it `MAY` treat it as non-fatal and perform the search.
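
On the Endpoint side, handling of the `x-clarin-fcs-context` parameter could be sketched as follows. This is an illustration under assumptions, not a prescribed implementation: the function `resolve_context`, its `fatal` switch and the dictionary shape of a diagnostic are hypothetical. Only the diagnostic URI, the comma-separated format and the fatal/non-fatal choice come from the text above. In the non-fatal case, this sketch assumes the search proceeds on the remaining valid identifiers.

```python
INVALID_PID_DIAGNOSTIC = "http://clarin.eu/fcs/1.0/diagnostic/1"


def resolve_context(raw_context, known_pids, fatal=False):
    """Map the x-clarin-fcs-context value to (pids_to_search, diagnostics).

    raw_context: the raw parameter value, or None if it was not supplied.
    known_pids:  set of persistent identifiers this Endpoint provides.
    fatal:       hypothetical switch; if True, any invalid identifier
                 suppresses the search entirely.
    """
    if not raw_context:
        # Unrestricted search: search all Resources at the Endpoint.
        return sorted(known_pids), []

    requested = [pid.strip() for pid in raw_context.split(",") if pid.strip()]
    valid = [pid for pid in requested if pid in known_pids]
    # One diagnostic per offending identifier; the identifier itself is
    # reported as the diagnostic details.
    diagnostics = [
        {"uri": INVALID_PID_DIAGNOSTIC, "details": pid}
        for pid in requested
        if pid not in known_pids
    ]
    if diagnostics and fatal:
        return [], diagnostics
    return valid, diagnostics
```

The returned diagnostics still have to be serialized into the response by the Endpoint. That step is omitted here.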

== Normative Appendix ==

=== List of extra request parameters ===
The following extra request parameters are used in CLARIN-FCS:
||=Parameter Name =||=SRU operations =||=Allowed values =||= Description =||
|| `x-clarin-fcs-endpoint-description` || explain || `true` \\ All other values are reserved and `MUST NOT` be used by Clients || If present, the Endpoint `MUST` include an Endpoint Description in the\\`` element of an ''explain'' response. ||
|| `x-clarin-fcs-context` || searchRetrieve || A comma-separated list of persistent identifiers || The Endpoint `MUST` restrict the search to the collections identified by\\the persistent identifiers ||

=== List of diagnostics ===
Apart from the SRU diagnostics defined in [#REF_SRU_12 OASIS-SRU-12, Appendix C] and [#REF_LOC_DIAG LOC-DIAG], the following diagnostics are used in CLARIN-FCS. The "Details Format" column specifies what `SHOULD` be returned in the details field. If this column is blank, the format is "undefined" and the Endpoint `MAY` return whatever it deems appropriate, including nothing.
||=Identifier URI =||=Description =||= Details Format =||
|| `http://clarin.eu/fcs/1.0/diagnostic/1` || Persistent identifier passed in for restricting the search is invalid || The offending persistent identifier ||

== Non-normative Appendix ==
The following sections are non-normative.

=== Referring to an Endpoint from a CMDI record ===
Centers are encouraged to provide links to their CLARIN-FCS Endpoints in the metadata records for their resources. Other services, like the VLO, can use this information for automatically configuring an Aggregator for searching resources at the Endpoint. To refer to an Endpoint, a `` with `` set to the value `SearchService` and a `@mimetype` attribute with a value of `application/sru+xml` needs to be added to the CMDI record. The content of the `` element must contain a URI that points to the Endpoint web service.

Example:
{{{#!xml
http://hdl.handle.net/4711/0815 SearchService http://repos.example.org/fcs-endpoint
}}}

=== Endpoint custom extensions ===
The CLARIN-FCS protocol specification allows Endpoints to add custom data to their responses. This extension mechanism can, for example, be used to provide hints to an (XSLT/XQuery) application that works directly on CLARIN-FCS, e.g. to allow it to generate back and forward links to navigate in a result set.

*WIP*

An Endpoint `MAY` add arbitrary XML fragments to a `` element. Clients `MUST` ignore any custom extensions they do not understand. Endpoints `MUST` use a custom XML namespace name for their extensions. Endpoints `MUST NOT` use XML namespace names that start with the prefixes `http://clarin.eu`, `http://www.clarin.eu/`, `https://clarin.eu` or `https://www.clarin.eu/`.
 * Example:
{{{#!xml
The quick brown fox jumps over the lazy dog.
}}}
----