{{{
#!div class="system-message"
'''NOTE''': This page is work-in-progress. Final draft is scheduled to be delivered by 2015-10-31.
}}}

[[PageOutline(1-6)]]

= CLARIN Federated Content Search (CLARIN-FCS) - Core 2.0

= Introduction
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
TODO: Proof-read/Check sub-sections.
}}}

The goal of the ''CLARIN Federated Content Search (CLARIN-FCS) - Core'' specification is to introduce an ''interface specification'' that decouples the ''search engine'' functionality from its ''exploitation'', i.e. by user interfaces and third-party applications, and that allows services to access heterogeneous search engines in a uniform way.

== Terminology
The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `SHOULD NOT`, `RECOMMENDED`, `MAY`, and `OPTIONAL` in this document are to be interpreted as described in [#REF_RFC_2119 RFC2119].

== Glossary
 Aggregator:: A module or service that dispatches queries to repositories and collects the results.
 CLARIN-FCS, FCS:: CLARIN federated content search, an interface specification that allows searching within the resource content of repositories.
 Client:: A software component that implements the interface specification to query Endpoints, e.g. an Aggregator or a user interface.
 CQL:: Contextual Query Language, previously known as Common Query Language, is a domain-specific language for representing queries to information retrieval systems such as search engines, bibliographic catalogs and museum collection information.
 Data View:: A Data View is a mechanism to support different representations of search results, e.g. a "hits with highlights" view, an image or a geolocation.
 Data View Payload, Payload:: The actual content encoded within a Data View, e.g. a CMDI metadata record or a KML-encoded geolocation.
 Endpoint:: A software component that implements the CLARIN-FCS interface specification and translates between CLARIN-FCS and a search engine.
 FCS-QL:: Federated Content Search Query Language is the query language used in the advanced CLARIN-FCS profile. It is derived from the CQP query language of the Corpus Workbench [#REF_CQP_Tutorial CQP-TUTORIAL].
 Hit:: A piece of data returned by a Search Engine that matches the search criterion. What is considered a Hit depends heavily on the Search Engine.
 Interface Specification:: Common harmonized interface and suite of protocols that repositories need to implement.
 PID:: A Persistent Identifier is a long-lasting reference to a digital object.
 Repository:: A software component at a CLARIN center that stores resources (= data) and information about these resources (= metadata).
 Repository Registry:: A separate service that allows registering Repositories and their Endpoints and provides information about these to other components, e.g. an Aggregator. The [http://centres.clarin.eu/ CLARIN Center Registry] is an implementation of such a repository registry.
 Resource:: A searchable and addressable entity at an Endpoint, such as a text corpus or a multi-modal corpus.
 Resource Fragment:: A smaller unit in a Resource, e.g. a sentence in a text corpus or a time interval in an audio transcription.
 Result Set:: An (ordered) set of hits that match a search criterion, produced by a search engine as the result of processing a query.
 Search Engine:: A software component within a repository that allows for searching within the repository contents.
 SRU:: Search and Retrieve via URL is a protocol for Internet search queries. It was originally introduced by the Library of Congress [#REF_LOC_SRU_12 LOC-SRU12]; the standardization process later moved to OASIS [#REF_SRU_12 OASIS-SRU12].

== Normative References
 RFC2119[=#REF_RFC_2119]::
   Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119, March 1997, \\
   [http://www.ietf.org/rfc/rfc2119.txt]
 XML-Namespaces[=#REF_XML_Namespaces]::
   Namespaces in XML 1.0 (Third Edition), W3C, 8 December 2009, \\
   [http://www.w3.org/TR/2009/REC-xml-names-20091208/]
 OASIS-SRU-Overview[=#REF_SRU_Overview]::
   searchRetrieve: Part 0. Overview Version 1.0, OASIS, January 2013, \\
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.doc]
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.html (HTML)],
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.pdf (PDF)]
 OASIS-SRU-APD[=#REF_SRU_APD]::
   searchRetrieve: Part 1. Abstract Protocol Definition Version 1.0, OASIS, January 2013, \\
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.doc]
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.html (HTML)],
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.pdf (PDF)]
 OASIS-SRU12[=#REF_SRU_12]::
   searchRetrieve: Part 2. SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0, OASIS, January 2013, \\
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.doc]
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.html (HTML)],
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.pdf (PDF)]
 OASIS-CQL[=#REF_CQL]::
   searchRetrieve: Part 5.
   CQL: The Contextual Query Language version 1.0, OASIS, January 2013, \\
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.doc]
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.html (HTML)],
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.pdf (PDF)]
 SRU-Explain[=#REF_Explain]::
   searchRetrieve: Part 7. SRU Explain Operation version 1.0, OASIS, January 2013, \\
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.doc]
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.html (HTML)],
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.pdf (PDF)]
 SRU-Scan[=#REF_Scan]::
   searchRetrieve: Part 6. SRU Scan Operation version 1.0, OASIS, January 2013, \\
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.doc]
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.html (HTML)],
   [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.PDF (PDF)]
 LOC-SRU12[=#REF_LOC_SRU_12]::
   SRU Version 1.2: SRU !Search/Retrieve Operation, Library of Congress, \\
   [http://www.loc.gov/standards/sru/sru-1-2.html]
 LOC-DIAG[=#REF_LOC_DIAG]::
   SRU Version 1.2: SRU Diagnostics List, Library of Congress, \\
   [http://www.loc.gov/standards/sru/diagnostics/diagnosticsList.html]
 UD-POS[=#REF_UD_POS]::
   Universal Dependencies, Universal POS tags, \\
   [https://universaldependencies.github.io/docs/u/pos/index.html]
 SAMPA[=#REF_SAMPA]::
   Dafydd Gibbon, Inge Mertins, Roger Moore (Eds.): Handbook of Multimodal and Spoken Language Systems.
   Resources, Terminology and Product Evaluation, Kluwer Academic Publishers, Boston MA, 2000, ISBN 0-7923-7904-7
 CLARIN-FCS-!DataViews[=#REF_FCS_DataViews]::
   CLARIN Federated Content Search (CLARIN-FCS) - Data Views, SCCTC FCS Task-Force, April 2014, \\
   [https://trac.clarin.eu/wiki/FCS/Dataviews]

== Non-Normative References
 CQP-TUTORIAL[=#REF_CQP_Tutorial]::
   Evert et al.: The IMS Open Corpus Workbench (CWB) CQP Query Language Tutorial, CWB Version 3.0, February 2010, \\
   [http://cwb.sourceforge.net/files/CQP_Tutorial/]
 RFC6838[=#REF_RFC_6838]::
   Media Type Specifications and Registration Procedures, IETF RFC 6838, January 2013, \\
   [http://www.ietf.org/rfc/rfc6838.txt]
 RFC3023[=#REF_RFC_3023]::
   XML Media Types, IETF RFC 3023, January 2001, \\
   [http://www.ietf.org/rfc/rfc3023.txt]

== Typographic and XML Namespace conventions
The following typographic conventions for XML fragments will be used throughout this specification:
 * `<prefix:Element>` \\
   An XML element with the Generic Identifier ''Element'' that is bound to an XML namespace denoted by the prefix ''prefix''.
 * `@attr` \\
   An XML attribute with the name ''attr''.
{{{#!comment
 * `@prefix:attr` \\
   An XML attribute with the name ''attr'' that is bound to an XML namespace denoted by the prefix ''prefix''.
}}}
 * `string` \\
   The literal ''string'' must be used either as element content or attribute value.

Endpoints and Clients `MUST` adhere to the [#REF_XML_Namespaces XML-Namespaces] specification. The CLARIN-FCS interface specification generally does not dictate whether XML elements should be serialized in their prefixed or non-prefixed syntax, but Endpoints `MUST` ensure that the correct XML namespace is used for elements and that XML namespaces are declared correctly. Clients `MUST` be agnostic regarding the syntax used for serializing the XML elements, i.e. whether the prefixed or un-prefixed variant was used, and `SHOULD` operate solely on ''expanded names'', i.e. pairs of ''namespace name'' and ''local name''.
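
The requirement to operate on ''expanded names'' can be illustrated with a minimal, non-normative sketch in Python. It assumes a Client that parses responses with the standard library's `xml.etree.ElementTree`, which reports element names as `{namespace name}local name`. The namespace name is the one this specification assigns to the `fcs` prefix; the element name `Resource`, the prefix `x`, and the helper functions `expanded_name` and `is_fcs_resource` are illustrative assumptions and not part of the specification.

```python
import xml.etree.ElementTree as ET

# Namespace name for CLARIN-FCS Resources, as listed in this specification.
FCS_NS = "http://clarin.eu/fcs/resource"

# Three serializations of the same element. The element name "Resource"
# and the prefix "x" are illustrative assumptions.
PREFIXED = '<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource"/>'
UNPREFIXED = '<Resource xmlns="http://clarin.eu/fcs/resource"/>'
OTHER_PREFIX = '<x:Resource xmlns:x="http://clarin.eu/fcs/resource"/>'


def expanded_name(element):
    """Split ElementTree's "{namespace}local" tag into (namespace name, local name)."""
    tag = element.tag
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    # Element is not bound to any namespace.
    return None, tag


def is_fcs_resource(xml_text):
    """Decide on the expanded name only; the prefix chosen by the Endpoint is never inspected."""
    return expanded_name(ET.fromstring(xml_text)) == (FCS_NS, "Resource")


if __name__ == "__main__":
    for document in (PREFIXED, UNPREFIXED, OTHER_PREFIX):
        print(is_fcs_resource(document))
```

All three serializations yield the same expanded name, so a Client written this way behaves identically regardless of which syntax variant an Endpoint chooses. An element named `Resource` without a namespace binding is rejected, because its namespace name differs.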

The following XML namespace names and prefixes are used throughout this specification. The column "Recommended Syntax" indicates which syntax variant `SHOULD` be used by the Endpoint to serialize the XML response.
||=Prefix =||=Namespace Name =||=Comment =||=Recommended Syntax =||
|| `fcs` || `http://clarin.eu/fcs/resource` || CLARIN-FCS Resources || prefixed ||
|| `ed` || `http://clarin.eu/fcs/endpoint-description` || CLARIN-FCS Endpoint Description || prefixed ||
|| `hits` || `http://clarin.eu/fcs/dataview/hits` || CLARIN-FCS Generic Hits Data View || prefixed ||
|| `adv` || `http://clarin.eu/fcs/dataview/advanced` || CLARIN-FCS Advanced Data View || prefixed ||
|| `sru` || `http://www.loc.gov/zing/srw/` || SRU || prefixed ||
|| `diag` || `http://www.loc.gov/zing/srw/diagnostic/` || SRU Diagnostics || prefixed ||
|| `zr` || `http://explain.z3950.org/dtd/2.0/` || SRU/ZeeRex Explain || prefixed ||

{{{
#!div style="border: 1px solid #000000; font-size: 75%"
Careful with the SRU Namespaces; they probably need to be adjusted for SRU 2.0 (=> OASIS).
}}}

= CLARIN-FCS Interface Specification
The CLARIN-FCS Interface Specification defines a set of capabilities, an extensible result format and a set of required operations. CLARIN-FCS is built on the SRU/CQL standard, and additional functionality required for CLARIN-FCS is added through SRU/CQL's extension mechanisms.

Specifically, the CLARIN-FCS Interface Specification consists of two parts: a set of formats and a transport protocol. The ''Endpoint'' component is a software component that acts as a bridge between a ''Client'' and a ''Search Engine'' and passes the requests sent by the ''Client'' to the ''Search Engine''. The ''Search Engine'' is a custom software component that allows searching the language resources in a Repository. The ''Endpoint'' implements the ''Transport Protocol'' and acts as a mediator between the CLARIN-FCS specific formats and the idiosyncrasies of the ''Search Engines'' of the individual Repositories.
The following figure illustrates the overall architecture:
{{{
                   +---------+
                   | Client  |
                   +---------+
                       /|\
                        |
                        | SRU / CQL
                        | w/CLARIN-FCS extensions
                        |
                       \|/
+----------------------------------------------+
|           |      Endpoint       /|\          |
|           |                      |           |
|  -------------------    -------------------  |
| | translate request |  | translate result  | |
|  -------------------    -------------------  |
|           |                      |           |
|          \|/                     |           |
+----------------------------------------------+
                       /|\
                        |
                        | Search Engine specific protocols/formats
                        |
                       \|/
          +---------------------------+
          |       Search Engine       |
          +---------------------------+
}}}

In general, the work flow in CLARIN-FCS is as follows: a Client submits a query to an Endpoint. The Endpoint translates the query from CQL or FCS-QL to the query dialect used by the Search Engine and submits the translated query to the Search Engine. The Search Engine processes the query and generates a result set, i.e. it compiles a set of hits that match the search criterion. The Endpoint then translates the results from the Search Engine-specific result set format to the CLARIN-FCS result format and sends them to the Client.

== Discovery #Discovery
The ''Discovery'' step allows a Client to gather information about an Endpoint, in particular which capabilities are supported or which resources are available for searching.

=== Capabilities
A ''Capability'' defines a certain feature set that is part of CLARIN-FCS, e.g. what kind of queries are supported. Each Endpoint implements some (or all) of these Capabilities. The Endpoint will announce the capabilities it provides to allow a Client to auto-tune itself (see section [#endpointDescription Endpoint Description]). Each Capability is identified by a ''Capability Identifier'', which uses the URI syntax.
The following Capabilities are defined in CLARIN-FCS:
||=Name =||=Capability Identifier =||=Summary =||
|| ''Basic Search'' || `http://clarin.eu/fcs/capability/basic-search` || Simple full-text searching ||
|| ''Advanced Search'' || `http://clarin.eu/fcs/capability/advanced-search` || Searching in structured and/or annotated data ||

Endpoints `MUST` implement the ''Basic Search'' Capability. Endpoints `MUST NOT` invent custom Capability Identifiers and `MUST` only use the values defined above.

=== Endpoint Description #endpointDescription
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
Add stuff required for advanced capability.
}}}

== Searching
In the ''Searching'' step, the Client sends the actual search request to a previously [#Discovery discovered] Endpoint.

=== Basic Search #basicSearch
The ''Basic Search'' capability provides simple full-text search. Queries in Basic Search `MUST` be expressed in the ''Contextual Query Language'' ([#REF_CQL OASIS-CQL]). The Endpoint `MUST` support ''term-only'' queries. The Endpoint `SHOULD` support queries that combine ''terms'' with the boolean operators ''AND'' and ''OR'', including sub-queries. An Endpoint `MAY` also support queries using the ''NOT'' or ''PROX'' operators. If an Endpoint does not support a query, i.e. the operators used are not supported by the Endpoint, it `MUST` report the error using the appropriate SRU diagnostic ([#REF_LOC_DIAG LOC-DIAG]).

The Endpoint `MUST` perform the query on the annotation tier that makes the most sense for the user, e.g. the textual content of a text corpus resource or the orthographic transcription of a spoken language corpus. Endpoints `SHOULD` perform the query case-sensitively.

Examples of valid CQL queries for Basic Search are:
{{{
cat
"cat"
cat AND dog
"grumpy cat"
"grumpy cat" AND dog
"grumpy cat" OR "lazy dog"
cat AND (mouse OR "lazy dog")
}}}

'''NOTE''': In CQL, a ''term'' can be a single token or a phrase, i.e. tokens separated by spaces.
If a single ''term'' contains spaces, it needs to be quoted. \\
'''NOTE''': Endpoints `MUST` be able to parse all of CQL. If they do not support a certain CQL feature, they `MUST` generate an appropriate error message (see section [#sruCQL SRU/CQL]). In particular, if an Endpoint ''only'' supports ''Basic Search'', it `MUST NOT` silently accept queries that include CQL features other than ''term-only'' queries and ''terms'' combined with boolean operators, e.g. queries involving context sets.

=== Advanced Search
The ''Advanced Search'' capability allows searching in annotated data. Queries can span several annotation layers, e.g. the token and part-of-speech layers. CLARIN-FCS defines a set of searchable annotation layers with defined semantics and syntax. Endpoints `SHOULD` support as many annotation layers as possible, depending on the resource type.

==== Layers
||=Identifier =||=Annotation Tier Description =||=Syntax =||=Examples (without quotes) =||
|| `token` || Appropriate tokenisation of resource, i.e. words || ''String'' || "Dog", "cat", "walked" ||
|| `lemma` || Lemmatisation of tokens || ''String'' || "good", "walking", "dog" ||
|| `pos` || Part-of-Speech annotations || [#REF_UD_POS Universal POS tags] || "NOUN", "VERB", "ADJ" ||
|| `orth` || Orthographic transcription of (mostly) spoken resources || ''String'' || "dug", "cat", "wolking" ||
|| `norm` || Orthographic normalization of (mostly) spoken resources || ''String'' || "dog", "cat", "walking" ||
|| `phonetic` || Phonetic transcription || [#REF_SAMPA SAMPA] || "'du:", "'vi:-d6 'ha:-b@n" ||
|| `names` || Named entities || ''String'' || "Utrecht", "Poland", "Felix the Cat" ||
|| `text` || Annotation tier that is used in [#basicSearch Basic Search] || ''String'' || "Dog", "cat", "walked" ||

The column "Syntax" describes the inventory of symbols that a Client `MUST` use with the corresponding annotation layer; the value ''String'' denotes that symbols are arbitrary Unicode strings, i.e.
no fixed inventory of symbols is defined.

==== FCS-QL
About available layers

=== Result Format
==== Resource and !ResourceFragment
==== Data View
===== Generic Hits (HITS)
===== Advanced (ADV)
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
New section.
}}}

=== "Versioning and Extensions"
==== "Backwards compatibility statements"
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
Say something about backwards compatibility with "basic-search". \\
Clients should also be compatible with FCS 1.0 (= SRU 1.2) and use a heuristic to determine whether an endpoint is still using FCS 1.0.
}}}

==== Endpoint Custom Extensions
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
Talk about extensions in general; this section needs to stay in the normative part due to the namespace stuff.
}}}

= CLARIN-FCS to SRU/CQL binding
== SRU/CQL
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
SRU 2.0 requirement
}}}

== Operation ''explain''
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
Basically stays the same, but adjust for advanced stuff.
}}}

== Operation ''scan''
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
Basically stays the same, but adjust for advanced stuff (if required).
}}}

== Operation ''searchRetrieve''
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
Align with newly introduced section "Search Phase". \\
Define the string for the SRU query-language parameter ("fcs"? "clarin-fcs"?).
}}}

= Normative Appendix
== List of extra request parameters
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
Revisit and update as required; don't forget to add the new request parameter ("allow rewriting" => allow endpoint to trade precision in favor of recall).
}}}

== List of diagnostics
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
Revisit and update as required; don't forget to add the 4 new diagnostics.
}}}

= Non-normative Appendix
== Syntax variant for Handle system Persistent Identifier URIs
== Referring to an Endpoint from a CMDI record
== Endpoint custom extensions
== Endpoint highlight hints for repositories
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
All sections to be updated as required / maybe check if something should be deleted.
}}}