Changes between Version 50 and Version 51 of Taskforces/FCS/FCS-Specification-Draft
- Timestamp: 06/09/17 09:49:15
Taskforces/FCS/FCS-Specification-Draft
v50 v51 1 1 {{{ 2 2 #!div class="system-message" 3 '''NOTE''': This page is in final editing. Final draft is scheduled to be delivered by 2017-04-28.3 '''NOTE''': This page is final draft scheduled for approval by SCCTC May 2017. 4 4 }}} 5 5 [[PageOutline(1-6)]] 6 = CLARIN Federated Content Search (CLARIN-FCS) - Core 2.0 7 8 = Introduction 6 7 = CLARIN Federated Content Search (CLARIN-FCS) - Core 2.0 = 8 = Introduction = 9 9 {{{ 10 10 #!div style="border: 1px solid #000000; font-size: 75%" … … 13 13 The goal of the ''CLARIN Federated Content Search (CLARIN-FCS) - Core'' specification is to introduce an ''interface specification'' that decouples the ''search engine'' functionality from its ''exploitation'', i.e. user-interfaces, third-party applications, and to allow services to access heterogeneous search engines in a uniform way. 14 14 15 == Terminology 15 == Terminology == 16 16 The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `SHOULD NOT`, `RECOMMENDED`, `MAY`, and `OPTIONAL` in this document are to be interpreted as described in [#REF_RFC_2119 RFC2119]. 17 17 18 == Glossary 19 Aggregator:: 20 A module or service to dispatch queries to repositories and collect results. 21 22 [=#REF_Annotation_Layer Annotation Layer]:: 23 An annotation layer is the sum of possible annotations for a language resource, such as part of speech or orthographic transcription. Usually it is related to a given annotation task or topic. For the scope of the specification it is used as synonym for annotation tier. 24 25 CLARIN-FCS, FCS:: 26 CLARIN federated content search, an interface specification to allow searching within resource content of repositories. 27 28 Client:: 29 A software component, which implements the interface specification to query Endpoints, i.e. an aggregator or a user-interface. 
30 31 CQL:: 32 Contextual Query Language, previously known as Common Query Language, is a domain specific language for representing queries to information retrieval systems such as search engines, bibliographic catalogs and museum collection information. 33 34 Data View:: 35 A Data View is a mechanism to support different representations of search results, e.g. a "hits with highlights" view, an image or a geolocation. 36 37 Data View Payload, Payload:: 38 The actual content encoded within a Data View, i.e. a CMDI metadata record or a KML encoded geolocation. 39 40 Endpoint:: 41 A software component, which implements the CLARIN-FCS interface specification and translates between CLARIN-FCS and a search engine. 42 43 FCS-QL:: 44 Federated Content Search Query Language is the query language used in the advanced CLARIN-FCS profile. It is derived from Corpus Workbench's [#REF_CQP_Tutorial CQP-TUTORIAL] 45 46 Hit:: 47 A piece of data returned by a Search Engine that matches the search criterion. What is considered a Hit highly depends on Search Engine. 48 49 Interface Specification:: 50 Common harmonized interface and suite of protocols that repositories need to implement. 51 52 Layer:: 53 See [#REF_Annotation_Layer ''Annotation Layer''] 54 55 PID:: 56 A Persistent identifier is a long-lasting reference to a digital object. 57 58 Repository:: 59 A software component at a CLARIN center that stores resources (= data) and information about these resources (= metadata). 60 61 Repository Registry:: 62 A separate service that allows registering Repositories and their Endpoints and provides information about these to other components, e.g. an Aggregator. The [http://centres.clarin.eu/ CLARIN Center Registry] is an implementation of such a repository registry. 63 64 Resource:: 65 A searchable and addressable entity at an Endpoint, such as a text corpus or a multi-modal corpus. 66 67 Resource Fragment:: 68 A smaller unit in a Resource, i.e. 
a sentence in a text corpus or a time interval in an audio transcription. 69 70 Result Set:: 71 An (ordered) set of hits that match a search criterion produced by a search engine as the result of processing a query. 72 73 Search Engine:: 74 A software component within a repository, that allows for searching within the repository contents. 75 76 SRU:: 77 Search and Retrieve via URL, is a protocol for Internet search queries. Originally introduced by Library of Congress [#REF_LOC_SRU_12 LOC-SRU12], later standardization process moved to OASIS [#REF_SRU_12 OASIS-SRU12]. 78 79 == Normative References 80 RFC2119[=#REF_RFC_2119]:: 81 Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119, March 1997, \\ 82 [http://www.ietf.org/rfc/rfc2119.txt] 83 84 XML-Namespaces[=#REF_XML_Namespaces]:: 85 Namespaces in XML 1.0 (Third Edition), W3C, 8 December 2009, \\ 86 [http://www.w3.org/TR/2009/REC-xml-names-20091208/] 87 88 OASIS-SRU-Overview[=#REF_SRU_Overview]:: 89 searchRetrieve: Part 0. Overview Version 1.0, OASIS, January 2013, \\ 90 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.doc] 91 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.html (HTML)], 92 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.pdf (PDF)] 93 94 OASIS-SRU-APD[=#REF_SRU_APD]:: 95 searchRetrieve: Part 1. Abstract Protocol Definition Version 1.0, OASIS, January 2013, \\ 96 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.doc] 97 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.html (HTML)] 98 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.pdf (PDF)] 99 100 OASIS-SRU12[=#REF_SRU_12]:: 101 searchRetrieve: Part 2. 
SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0, OASIS, January 2013, \\ 102 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.doc] 103 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.html (HTML)] 104 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.pdf (PDF)] 105 106 OASIS-SRU20[=#REF_SRU_20]:: 107 searchRetrieve: Part 3. SRU searchRetrieve Operation: APD Binding for SRU 2.0 Version 1.0, OASIS, January 2013, \\ 108 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.doc] 109 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.html (HTML)] 110 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.pdf (PDF)] 111 112 OASIS-CQL[=#REF_CQL]:: 113 searchRetrieve: Part 5. CQL: The Contextual Query Language version 1.0, OASIS, January 2013, \\ 114 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.doc] 115 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.html (HTML)] 116 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.pdf (PDF)] 117 118 SRU-Explain[=#REF_Explain]:: 119 searchRetrieve: Part 7. 
SRU Explain Operation version 1.0, OASIS, January 2013, \\ 120 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.doc] 121 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.html (HTML)] 122 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.pdf (PDF)] 123 124 SRU-Scan[=#REF_Scan]:: 125 searchRetrieve: Part 6. SRU Scan Operation version 1.0, OASIS, January 2013, \\ 126 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.doc] 127 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.html (HTML)] 128 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.PDF (PDF)] 129 130 LOC-SRU12[=#REF_LOC_SRU_12]:: 131 SRU Version 1.2: SRU !Search/Retrieve Operation, Library of Congress, \\ 132 [http://www.loc.gov/standards/sru/sru-1-2.html] 133 134 LOC-DIAG[=#REF_LOC_DIAG]:: 135 SRU Version 1.2: SRU Diagnostics List, Library of Congress,\\ 136 [http://www.loc.gov/standards/sru/diagnostics/diagnosticsList.html] 137 138 UD-POS[=#REF_UD_POS]:: 139 Universal Dependencies, Universal POS tags, \\ 140 [https://universaldependencies.github.io/docs/u/pos/index.html] 141 142 SAMPA[=#REF_SAMPA]:: 143 Dafydd Gibbon, Inge Mertins, Roger Moore (Eds.): Handbook of Multimodal and Spoken Language Systems. 
Resources, Terminology and Product Evaluation, Kluwer Academic Publishers, Boston MA, 2000, ISBN 0-7923-7904-7 144 145 CLARIN-FCS-!DataViews[=#REF_FCS_DataViews]:: 146 CLARIN Federated Content Search (CLARIN-FCS) - Data Views, SCCTC FCS Task-Force, April 2014, \\ 147 [https://trac.clarin.eu/wiki/FCS/Dataviews] 148 149 == Non-Normative References 150 CQP-TUTORIAL[=#REF_CQP_Tutorial]:: 151 Evert et al.: The IMS Open Corpus Workbench (CWB) CQP Query Language Tutorial, CWB Version 3.0, February 2010, \\ 152 [http://cwb.sourceforge.net/files/CQP_Tutorial/] 153 154 RFC6838[=#REF_RFC_6838]:: 155 Media Type Specifications and Registration Procedures, IETF RFC 6838, January 2013, \\ 156 [http://www.ietf.org/rfc/rfc6838.txt] 157 158 RFC3023[=#REF_RFC_3023]:: 159 XML Media Types, IETF RFC 3023, January 2001, \\ 160 [http://www.ietf.org/rfc/rfc3023.txt] 161 162 == Typographic and XML Namespace conventions 18 == Glossary == 19 Aggregator:: A module or service to dispatch queries to repositories and collect results. 20 21 [=#REF_Annotation_Layer Annotation Layer]:: An annotation layer is the sum of possible annotations for a language resource, such as part of speech or orthographic transcription. Usually it is related to a given annotation task or topic. For the scope of the specification it is used as synonym for annotation tier. 22 23 CLARIN-FCS, FCS:: CLARIN federated content search, an interface specification to allow searching within resource content of repositories. 24 25 Client:: A software component, which implements the interface specification to query Endpoints, i.e. an aggregator or a user-interface. 26 27 CQL:: Contextual Query Language, previously known as Common Query Language, is a domain specific language for representing queries to information retrieval systems such as search engines, bibliographic catalogs and museum collection information. 28 29 Data View:: A Data View is a mechanism to support different representations of search results, e.g. 
a "hits with highlights" view, an image or a geolocation. 30 31 Data View Payload, Payload:: The actual content encoded within a Data View, i.e. a CMDI metadata record or a KML encoded geolocation. 32 33 Endpoint:: A software component, which implements the CLARIN-FCS interface specification and translates between CLARIN-FCS and a search engine. 34 35 FCS-QL:: Federated Content Search Query Language is the query language used in the advanced CLARIN-FCS profile. It is derived from Corpus Workbench's [#REF_CQP_Tutorial CQP-TUTORIAL] 36 37 Hit:: A piece of data returned by a Search Engine that matches the search criterion. What is considered a Hit highly depends on Search Engine. 38 39 Interface Specification:: Common harmonized interface and suite of protocols that repositories need to implement. 40 41 Layer:: See [#REF_Annotation_Layer "'Annotation Layer'"] 42 43 PID:: A Persistent identifier is a long-lasting reference to a digital object. 44 45 Repository:: A software component at a CLARIN center that stores resources (= data) and information about these resources (= metadata). 46 47 Repository Registry:: A separate service that allows registering Repositories and their Endpoints and provides information about these to other components, e.g. an Aggregator. The [http://centres.clarin.eu/ CLARIN Center Registry] is an implementation of such a repository registry. 48 49 Resource:: A searchable and addressable entity at an Endpoint, such as a text corpus or a multi-modal corpus. 50 51 Resource Fragment:: A smaller unit in a Resource, i.e. a sentence in a text corpus or a time interval in an audio transcription. 52 53 Result Set:: An (ordered) set of hits that match a search criterion produced by a search engine as the result of processing a query. 54 55 Search Engine:: A software component within a repository, that allows for searching within the repository contents. 56 57 SRU:: Search and Retrieve via URL, is a protocol for Internet search queries. 
Originally introduced by Library of Congress [#REF_LOC_SRU_12 LOC-SRU12], later standardization process moved to OASIS [#REF_SRU_12 OASIS-SRU12]. 58 59 == Normative References == 60 RFC2119[=#REF_RFC_2119]:: Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119, March 1997, \\ [http://www.ietf.org/rfc/rfc2119.txt] 61 62 XML-Namespaces[=#REF_XML_Namespaces]:: Namespaces in XML 1.0 (Third Edition), W3C, 8 December 2009, \\ [http://www.w3.org/TR/2009/REC-xml-names-20091208/] 63 64 OASIS-SRU-Overview[=#REF_SRU_Overview]:: searchRetrieve: Part 0. Overview Version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.html (HTML)], [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.pdf (PDF)] 65 66 OASIS-SRU-APD[=#REF_SRU_APD]:: searchRetrieve: Part 1. Abstract Protocol Definition Version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.pdf (PDF)] 67 68 OASIS-SRU12[=#REF_SRU_12]:: searchRetrieve: Part 2. 
SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.pdf (PDF)] 69 70 OASIS-SRU20[=#REF_SRU_20]:: searchRetrieve: Part 3. SRU searchRetrieve Operation: APD Binding for SRU 2.0 Version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.pdf (PDF)] 71 72 OASIS-CQL[=#REF_CQL]:: searchRetrieve: Part 5. CQL: The Contextual Query Language version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.pdf (PDF)] 73 74 SRU-Explain[=#REF_Explain]:: searchRetrieve: Part 7. SRU Explain Operation version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.pdf (PDF)] 75 76 SRU-Scan[=#REF_Scan]:: searchRetrieve: Part 6. 
SRU Scan Operation version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.PDF (PDF)] 77 78 LOC-SRU12[=#REF_LOC_SRU_12]:: SRU Version 1.2: SRU !Search/Retrieve Operation, Library of Congress, \\ [http://www.loc.gov/standards/sru/sru-1-2.html] 79 80 LOC-DIAG[=#REF_LOC_DIAG]:: SRU Version 1.2: SRU Diagnostics List, Library of Congress,\\ [http://www.loc.gov/standards/sru/diagnostics/diagnosticsList.html] 81 82 UD-POS[=#REF_UD_POS]:: Universal Dependencies, Universal POS tags v2.0, \\ [https://universaldependencies.org/u/pos/index.html] 83 84 SAMPA[=#REF_SAMPA]:: Dafydd Gibbon, Inge Mertins, Roger Moore (Eds.): Handbook of Multimodal and Spoken Language Systems. Resources, Terminology and Product Evaluation, Kluwer Academic Publishers, Boston MA, 2000, ISBN 0-7923-7904-7 85 86 CLARIN-FCS-!DataViews[=#REF_FCS_DataViews]:: CLARIN Federated Content Search (CLARIN-FCS) - Data Views, SCCTC FCS Task-Force, April 2014, \\ [https://trac.clarin.eu/wiki/FCS/Dataviews] 87 88 == Non-Normative References == 89 CQP-TUTORIAL[=#REF_CQP_Tutorial]:: Evert et al.: The IMS Open Corpus Workbench (CWB) CQP Query Language Tutorial, CWB Version 3.0, February 2010, \\ [http://cwb.sourceforge.net/files/CQP_Tutorial/] 90 91 RFC6838[=#REF_RFC_6838]:: Media Type Specifications and Registration Procedures, IETF RFC 6838, January 2013, \\ [http://www.ietf.org/rfc/rfc6838.txt] 92 93 RFC3023[=#REF_RFC_3023]:: XML Media Types, IETF RFC 3023, January 2001, \\ [http://www.ietf.org/rfc/rfc3023.txt] 94 95 == Typographic and XML Namespace conventions == 163 96 The following typographic conventions for XML fragments will be used throughout this specification: 97 164 98 * `<prefix:Element>` \\ An 
XML element with the Generic Identifier ''Element'' that is bound to an XML namespace denoted by the prefix ''prefix''. 165 99 * `@attr` \\ An XML attribute with the name ''attr'' 100 166 101 {{{#!comment 102 167 103 * `@prefix:attr` \\ An XML attribute with the name ''attr'' that is bound to an XML namespaces denoted by the prefix ''prefix''. 168 }}} 104 105 }}} 106 169 107 * `string` \\ The literal ''string'' must be used either as element content or attribute value. 108 170 109 Endpoints and Clients `MUST` adhere to the [#REF_XML_Namespaces XML-Namespaces] specification. The CLARIN-FCS interface specification generally does not dictate whether XML elements should be serialized in their prefixed or non-prefixed syntax, but Endpoints `MUST` ensure that the correct XML namespace is used for elements and that XML namespaces are declared correctly. Clients `MUST` be agnostic regarding syntax for serializing the XML elements, i.e. if the prefixed or un-prefixed variant was used, and `SHOULD` operate solely on ''expanded names'', i.e. pairs of ''namespace name'' and ''local name''. 171 110 172 111 The following XML namespace names and prefixes are used throughout this specification. The column "Recommended Syntax" indicates which syntax variant `SHOULD` be used by the Endpoint to serialize the XML response. 
173 ||=Prefix =||=Namespace Name =||=Comment =||=Recommended Syntax =|| 174 || `fcs` || `http://clarin.eu/fcs/resource` || CLARIN-FCS Resources || prefixed || 175 || `ed` || `http://clarin.eu/fcs/endpoint-description` || CLARIN-FCS Endpoint Description || prefixed || 176 || `hits` || `http://clarin.eu/fcs/dataview/hits` || CLARIN-FCS Generic Hits Data View || prefixed || 177 || `adv` || `http://clarin.eu/fcs/dataview/advanced` || CLARIN-FCS Advanced Data View || prefixed || 178 || `sru` || `http://docs.oasis-open.org/ns/search-ws/sruResponse` || SRU Version 2.0 || prefixed || 179 || `diag` || `http://docs.oasis-open.org/ns/search-ws/diagnostic` || SRU Version 2.0 Diagnostics || prefixed || 180 || `zr` || `http://explain.z3950.org/dtd/2.0/` || SRU/ZeeRex Explain || prefixed || 181 || `sru` || `http://www.loc.gov/zing/srw/` || SRU Version 1.2, ''only compatibility mode'' || prefixed || 182 || `diag` || `http://www.loc.gov/zing/srw/diagnostic/` || SRU Version 1.2 Diagnostics, ''only compatibility mode'' || prefixed || 183 184 = CLARIN-FCS Interface Specification 112 113 ||=Prefix =||=Namespace Name =||=Comment =||=Recommended Syntax =|| 114 || `fcs` || `http://clarin.eu/fcs/resource` || CLARIN-FCS Resources || prefixed || 115 || `ed` || `http://clarin.eu/fcs/endpoint-description` || CLARIN-FCS Endpoint Description || prefixed || 116 || `hits` || `http://clarin.eu/fcs/dataview/hits` || CLARIN-FCS Generic Hits Data View || prefixed || 117 || `adv` || `http://clarin.eu/fcs/dataview/advanced` || CLARIN-FCS Advanced Data View || prefixed || 118 || `sru` || `http://docs.oasis-open.org/ns/search-ws/sruResponse` || SRU Version 2.0 || prefixed || 119 || `diag` || `http://docs.oasis-open.org/ns/search-ws/diagnostic` || SRU Version 2.0 Diagnostics || prefixed || 120 || `zr` || `http://explain.z3950.org/dtd/2.0/` || SRU/ZeeRex Explain || prefixed || 121 || `sru` || `http://www.loc.gov/zing/srw/` || SRU Version 1.2, ''only compatibility mode'' || prefixed || 122 || `diag` || 
`http://www.loc.gov/zing/srw/diagnostic/` || SRU Version 1.2 Diagnostics, ''only compatibility mode'' || prefixed || 123 124 = CLARIN-FCS Interface Specification = 185 125 The CLARIN-FCS Interface Specification defines a set of capabilities, an extensible result format and a set of required operations. CLARIN-FCS is built on the SRU/CQL standard and additional functionality required for CLARIN-FCS is added through SRU/CQL's extension mechanisms. 186 126 187 127 Specifically, the CLARIN-FCS Interface Specification consists of two parts, a set of formats, and a transport protocol. The ''Endpoint'' component is a software component that acts as a bridge between a ''Client'' and a ''Search Engine'' and passes the requests sent by the ''Client'' to the ''Search Engine''. The ''Search Engine'' is a custom software component that allows the search of language resources in a Repository. The ''Endpoint'' implements the ''Transport Protocol'' and acts as a mediator between the CLARIN-FCS specific formats and the idiosyncrasies of ''Search Engines'' of the individual Repositories. The following figure illustrates the overall architecture: 128 188 129 {{{ 189 130 +---------+ … … 216 157 In general, the work flow in CLARIN-FCS is as follows: a Client submits a query to an Endpoint. The Endpoint translates the query from CQL or FCS-QL to the query dialect used by the Search Engine and submits the translated query to the Search Engine. The Search Engine processes the query and generates a result set, i.e. it compiles a set of hits that match the search criterion. The Endpoint then translates the results from the Search Engine-specific result set format to the CLARIN-FCS result format and sends them to the Client. 217 158 218 == Discovery #Discovery159 == Discovery == #Discovery 219 160 The ''Discovery'' step allows a Client to gather information about an Endpoint, in particular which capabilities are supported or which resources are available for searching. 
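The work flow described above (Client, Endpoint, Search Engine and back) can be illustrated with a minimal sketch. Everything in it is hypothetical and not part of CLARIN-FCS: the class names, the toy in-memory search engine, its lowercase-substring "native dialect" and the example PID are assumptions made only to show the Endpoint's mediator role. In particular, real CQL or FCS-QL parsing is far richer than the single-term handling shown here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Hit:
    """One match in a simplified, CLARIN-FCS-facing result format (hypothetical)."""
    resource_pid: str
    text: str


class ToySearchEngine:
    """Hypothetical stand-in for a repository's native search engine."""

    def __init__(self, documents: Dict[str, List[str]]) -> None:
        # Maps a resource PID to the sentences of that resource.
        self._documents = documents

    def search(self, native_query: str) -> List[Tuple[str, str]]:
        # Assumed native dialect: a plain lowercase substring match.
        return [
            (pid, sentence)
            for pid, sentences in self._documents.items()
            for sentence in sentences
            if native_query in sentence.lower()
        ]


class Endpoint:
    """Mediates between the client-facing query and the native search engine."""

    def __init__(self, engine: ToySearchEngine) -> None:
        self._engine = engine

    def translate_query(self, cql_query: str) -> str:
        # Assumption: the query is a single bare or double-quoted term.
        return cql_query.strip().strip('"').lower()

    def search_retrieve(self, cql_query: str) -> List[Hit]:
        native_query = self.translate_query(cql_query)   # CQL -> native dialect
        raw_results = self._engine.search(native_query)  # engine builds result set
        # Native result set -> client-facing result format.
        return [Hit(pid, sentence) for pid, sentence in raw_results]


endpoint = Endpoint(ToySearchEngine({
    "hdl:example/1": ["Faust is a tragedy.", "Another sentence."],
}))
hits = endpoint.search_retrieve('"Faust"')
```

The Client only ever sees `Hit` objects; the native query dialect and the native result format stay behind the Endpoint, which is the decoupling the specification aims for.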
220 161 221 === Capabilities 162 === Capabilities === 222 163 A ''Capability'' defines a certain feature set that is part of CLARIN-FCS, e.g. what kind of queries are supported. Each Endpoint implements some (or all) of these Capabilities. The Endpoint will announce the capabilities it provides to allow a Client to auto-tune itself (see section [#endpointDescription Endpoint Description]). Each Capability is identified by a ''Capability Identifier'', which uses the URI syntax. The following Capabilities are defined in CLARIN-FCS: 223 ||=Name =||=Capability Identifier =||=Summary =|| 224 || ''Basic Search'' || `http://clarin.eu/fcs/capability/basic-search` || Simple full-text searching || 164 165 ||=Name =||=Capability Identifier =||=Summary =|| 166 || ''Basic Search'' || `http://clarin.eu/fcs/capability/basic-search` || Simple full-text searching || 225 167 || ''Advanced Search'' || `http://clarin.eu/fcs/capability/advanced-search` || Searching in structured and/or annotated data || 226 168 227 169 Endpoints `MUST` implement the ''Basic Search'' Capability. Endpoints `MUST NOT` invent custom Capability Identifiers and `MUST` only use the values defined above. 228 170 229 230 === Endpoint Description #endpointDescription 171 === Endpoint Description === #endpointDescription 231 172 {{{ 232 173 #!div style="border: 1px solid #000000; font-size: 75%" … … 236 177 237 178 The XML fragment for ''Endpoint Description'' is encoded as an `<ed:EndpointDescription>` element, that contains the following attributes and children: 179 238 180 * one `@version` attribute (`REQUIRED`) on the `<ed:EndpointDescription>` element. The value of the `@version` attribute `MUST` be `2`. 239 * one `<ed:Capabilities>` element (`REQUIRED`) that contains one or more `<ed:Capability>` elements \\ 240 The content of the `<ed:Capability>` element is a Capability Identifier, that indicates the capabilities, that are supported by the Endpoint. 
For valid values for the Capability Identifier, see section [#capabilities Capabilities]. This list `MUST NOT` include duplicate values. 241 * one `<ed:SupportedDataViews>` element (`REQUIRED`) \\ 242 A list of Data Views that are supported by this Endpoint. This list is composed of one or more `<ed:SupportedDataView>` elements. The content of a `<ed:SupportedDataView>` `MUST` be the MIME type of a supported Data View, e.g. `application/x-clarin-fcs-hits+xml`. Each `<ed:SupportedDataView>` element `MUST` carry a `@id` and a `@delivery-policy` attribute. The value of the `@id` attribute is later used in the `<ed:Resource>` element to indicate which Data View is supported by a resource (see below). Endpoints `SHOULD` use the recommended short identifier for the Data View. The `@delivery-policy` attribute indicates the Endpoint's delivery policy for that Data View. Valid values are `send-by-default` for the ''send-by-default'' and `need-to-request` for the ''need-to-request'' delivery policy. \\ 243 This list `MUST NOT` include duplicate entries, i.e. no MIME type must appear more than once. \\ 244 The value of the `@id` attribute `MUST NOT` contain the characters `,` (comma) or `;` (semicolon). 245 * one `<ed:SupportedLayers>` element (`REQUIRED` if Endpoint supports ''Advanced Search'' capability) \\ 246 A list of Layers that are generally supported by this Endpoint. This list is composed of one or more `<ed:SupportedLayer>` elements. The content of a `<ed:SupportedLayer>` `MUST` be the identifier of a Layer (see [#layers section "Layers"]), e.g. `orth`. Each `<ed:SupportedLayer>` element `MUST` carry an `@id` and a `@delivery-policy` attribute. The value of the `@id` attribute is later used in the `<ed:Resource>` element to indicate which Data View is supported by a resource (see below). The `@result-id` attribute is used in the Advanced Data View (see [#advancedDataView section "Advanced Data View"]).
Each `<ed:SupportedLayer>` element `MAY` carry an optional `@qualifier` attribute. It is used as a qualifier in an FCS-QL search term to address this specific layer. \\ 247 This list `MUST NOT` include duplicate entries, i.e. no Layer with the same `@result-id` MIME type must appear more than once. \\ 248 The value of the `@id` or `@result-id` attribute `MUST NOT` contain the characters `,` (comma) or `;` (semicolon). 249 The value of the `@qualifier` attribute `MUST NOT` contain characters other than `a`-`z`,`A`-`Z`,`0`-`9` and `-` (hyphen). 250 The `<ed:SupportedLayer>` element `MAY` carry an `@alt-value-info` and `@alt-value-info-uri` attribute; `@alt-value-info` `SHOULD` contain a short description about the layer, e.g. the original tag set used; `@alt-value-info-uri` `MUST` contain a well-formed URI and `SHOULD` point to a web site with further information, e.g. about the original tag set and how the translation to FCS is done. Clients, e.g. the Aggregator, can display this information together with the search result. 251 * one `<ed:Resources>` element (`REQUIRED`) \\ 252 A list of (top-level) resources that are available, i.e. searchable, at the Endpoint. The `<ed:Resources>` element contains one or more `<ed:Resource>` elements (see below). The Endpoint `MUST` declare at least one (top-level) resource. 181 * one `<ed:Capabilities>` element (`REQUIRED`) that contains one or more `<ed:Capability>` elements \\ The content of the `<ed:Capability>` element is a Capability Identifier that indicates a capability that is supported by the Endpoint. For valid values for the Capability Identifier, see section [#capabilities Capabilities]. This list `MUST NOT` include duplicate values. 182 * one `<ed:SupportedDataViews>` element (`REQUIRED`) \\ A list of Data Views that are supported by this Endpoint. This list is composed of one or more `<ed:SupportedDataView>` elements.
The content of a `<ed:SupportedDataView>` `MUST` be the MIME type of a supported Data View, e.g. `application/x-clarin-fcs-hits+xml`. Each `<ed:SupportedDataView>` element `MUST` carry a `@id` and a `@delivery-policy` attribute. The value of the `@id` attribute is later used in the `<ed:Resource>` element to indicate which Data View is supported by a resource (see below). Endpoints `SHOULD` use the recommended short identifier for the Data View. The `@delivery-policy` attribute indicates the Endpoint's delivery policy for that Data View. Valid values are `send-by-default` for the ''send-by-default'' and `need-to-request` for the ''need-to-request'' delivery policy. \\ This list `MUST NOT` include duplicate entries, i.e. no MIME type must appear more than once. \\ The value of the `@id` attribute `MUST NOT` contain the characters `,` (comma) or `;` (semicolon). 183 * one `<ed:SupportedLayers>` element (`REQUIRED` if Endpoint supports ''Advanced Search'' capability) \\ A list of Layers that are generally supported by this Endpoint. This list is composed of one or more `<ed:SupportedLayer>` elements. The content of a `<ed:SupportedLayer>` `MUST` be the identifier of a Layer (see [#layers section "Layers"]), e.g. `orth`. Each `<ed:SupportedLayer>` element `MUST` carry an `@id` and a `@delivery-policy` attribute. The value of the `@id` attribute is later used in the `<ed:Resource>` element to indicate which Data View is supported by a resource (see below). The `@result-id` attribute is used in the Advanced Data View (see [#advancedDataView section "Advanced Data View"]). Each `<ed:SupportedLayer>` element `MAY` carry an optional `@qualifier` attribute. It is used as a qualifier in an FCS-QL search term to address this specific layer. \\ This list `MUST NOT` include duplicate entries, i.e. no Layer with the same `@result-id` MIME type must appear more than once.
\\ The value of the `@id` or `@result-id` attribute `MUST NOT` contain the characters `,` (comma) or `;` (semicolon). The value of the `@qualifier` attribute `MUST NOT` contain characters other than `a`-`z`,`A`-`Z`,`0`-`9` and `-` (hyphen). The `<ed:SupportedLayer>` element `MAY` carry an `@alt-value-info` and `@alt-value-info-uri` attribute; `@alt-value-info` `SHOULD` contain a short description about the layer, e.g. the original tag set used; `@alt-value-info-uri` `MUST` contain a well-formed URI and `SHOULD` point to a web site with further information, e.g. about the original tag set and how the translation to FCS is done. Clients, e.g. the Aggregator, can display this information together with the search result. 184 * one `<ed:Resources>` element (`REQUIRED`) \\ A list of (top-level) resources that are available, i.e. searchable, at the Endpoint. The `<ed:Resources>` element contains one or more `<ed:Resource>` elements (see below). The Endpoint `MUST` declare at least one (top-level) resource. 253 185 254 186 The `<ed:Resource>` element contains a basic description of a resource that is available at the Endpoint. A resource is a searchable entity, e.g. a single corpus. The `<ed:Resource>` element has a mandatory `@pid` attribute that contains the persistent identifier of the resource. This value `MUST` be the same as the ''!MdSelfLink'' of the CMDI record describing the resource. The `<ed:Resource>` element contains the following children:
If supplied, an English version of the description is `REQUIRED`. The list of descriptions `MUST NOT` contain duplicate entries for the same language. 259 * zero or one `<ed:LandingPageURI>` element (`OPTIONAL`) \\ 260 A link to a website for the resource, e.g. a landing page for a resource, i.e. a web-site that describes a corpus. 261 * one `<ed:Languages>` element (`REQUIRED`) \\ 262 The (relevant) languages available within the resource. The `<ed:Languages>` element contains one or more `<ed:Language>` elements. The content of a `<ed:Language>` element `MUST` be a ISO 639-3 three letter language code. This element should be repeated for all languages (relevant) available ''within'' the resource, however this list `MUST NOT` contain duplicate entries. 263 * one `<ed:AvailableDataViews>` element (`REQUIRED`) \\ 264 The Data Views that are available for the resource. The `<ed:AvailableDataViews>` element `MUST` carry a `@ref` attribute, that contains a whitespace separated list of id values, that correspond to value of the appropriate `@id` attribute for the `<ed:SupportedDataView>` elements that are referenced. \\ 265 In case of sub-resources, each Resource `SHOULD` support all Data Views that are supported by the parent resource. However, every resource `MUST` declare all available Data Views independently, i.e. there is no implicit inheritance semantic. 266 * one `<ed:AvailableLayers>` element (`REQUIRED` if Endpoint supports ''Advanced Search'' capability). The `<ed:AvailableLayers>` element `MUST` carry a `@ref` attribute, that contains a whitespace separated list of id values, that correspond to the value of the appropriate `@id` attribute for the `<ed:SupportedLayer>` elements that are referenced. \\ 267 In case of sub-resources, each Resource `SHOULD` support all Layers that are supported by the parent resource. However, every resource `MUST` declare all available Layers independently, i.e. there is no implicit inheritance semantic. 
268 * zero or one `<ed:Resources>` element (`OPTIONAL`) \\ 269 If a resource has searchable sub-resources, the Endpoint `MUST` supply additional finer grained resource elements, which are wrapped in a `<ed:Resources>` element. A sub-resource is a searchable entity within a resource, e.g. a sub-corpus. 270 271 [=#REF_Example_4]Example 4: 272 {{{#!xml 273 <ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2"> 274 <ed:Capabilities> 275 <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability> 276 </ed:Capabilities> 277 <ed:SupportedDataViews> 278 <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView> 279 </ed:SupportedDataViews> 280 <ed:Resources> 281 <!-- just one top-level resource at the Endpoint --> 282 <ed:Resource pid="http://hdl.handle.net/4711/0815"> 283 <ed:Title xml:lang="de">Goethe Korpus</ed:Title> 284 <ed:Title xml:lang="en">Goethe corpus</ed:Title> 285 <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description> 286 <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description> 287 <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI> 288 <ed:Languages> 289 <ed:Language>deu</ed:Language> 290 </ed:Languages> 291 <ed:AvailableDataViews ref="hits" /> 292 </ed:Resource> 293 </ed:Resources> 294 </ed:EndpointDescription> 295 }}} 296 [#REF_Example_4 Example 4] shows a simple Endpoint Description for an Endpoint that only supports the ''Basic Search'' Capability and only provides the Generic Hits Data View, which is indicated by a `<ed:SupportedDataView>` element. This element carries a `@id` attribute with a value of `hits`, the recommended value for the short identifier, and indicates a delivery policy of ''send-by-default'' by the `@delivery-policy` attribute. It only provides one top-level resource identified by the persistent identifier `http://hdl.handle.net/4711/0815`. 
The resource has a title as well as a description in German and English. A landing page is located at `http://repos.example.org/corpus1.html`. The predominant language in the resource contents is German. Only the Generic Hits Data View is supported for this resource, because the `<ed:AvailableDataViews>` element only references the `<ed:SupporedDataView>` element with the `@id` with a value of `hits`. 297 298 [=#REF_Example_5]Example 5: 299 {{{#!xml 300 <ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2"> 301 <ed:Capabilities> 302 <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability> 303 </ed:Capabilities> 304 <ed:SupportedDataViews> 305 <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView> 306 <ed:SupportedDataView id="cmdi" delivery-policy="need-to-request">application/x-cmdi+xml</ed:SupportedDataView> 307 </ed:SupportedDataViews> 308 <ed:Resources> 309 <!-- top-level resource 1 --> 310 <ed:Resource pid="http://hdl.handle.net/4711/0815"> 311 <ed:Title xml:lang="de">Goethe Korpus</ed:Title> 312 <ed:Title xml:lang="en">Goethe corpus</ed:Title> 313 <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description> 314 <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description> 315 <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI> 316 <ed:Languages> 317 <ed:Language>deu</ed:Language> 318 </ed:Languages> 319 <ed:AvailableDataViews ref="hits" /> 320 </ed:Resource> 321 <!-- top-level resource 2 --> 322 <ed:Resource pid="http://hdl.handle.net/4711/0816"> 323 <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen</ed:Title> 324 <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus</ed:Title> 325 <ed:LandingPageURI>http://repos.example.org/corpus2.html</ed:LandingPageURI> 326 <ed:Languages> 327 <ed:Language>deu</ed:Language> 328 </ed:Languages> 329 <ed:AvailableDataViews 
ref="hits cmdi" /> 330 <ed:Resources> 331 <!-- sub-resource 1 of top-level resource 2 --> 332 <ed:Resource pid="http://hdl.handle.net/4711/0816-1"> 333 <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen (vor 1990)</ed:Title> 334 <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus (before 1990)</ed:Title> 335 <ed:LandingPageURI>http://repos.example.org/corpus2.html#sub1</ed:LandingPageURI> 336 <ed:Languages> 337 <ed:Language>deu</ed:Language> 338 </ed:Languages> 339 <ed:AvailableDataViews ref="hits cmdi" /> 340 </ed:Resource> 341 <!-- sub-resource 2 of top-level resource 2 --> 342 <ed:Resource pid="http://hdl.handle.net/4711/0816-2"> 343 <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen (nach 1990)</ed:Title> 344 <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus (after 1990)</ed:Title> 345 <ed:LandingPageURI>http://repos.example.org/corpus2.html#sub2</ed:LandingPageURI> 346 <ed:Languages> 347 <ed:Language>deu</ed:Language> 348 </ed:Languages> 349 <ed:AvailableDataViews ref="hits cmdi" /> 350 </ed:Resource> 351 </ed:Resources> 352 </ed:Resource> 353 </ed:Resources> 354 </ed:EndpointDescription> 355 }}} 356 The more complex [#REF_Example_5 Example 5] show an Endpoint Description for an Endpoint that, similar to [#REF_Example_4 Example 4], supports the ''Basic Search'' capability. In addition to the Generic Hits Data View, it also supports the CMDI Data View. The delivery polices are ''send-by-default'' for the Generic Hits Data View and ''need-to-request'' for the CMDI Data View. The Endpoint has two top-level resources (identified by the persistent identifiers `http://hdl.handle.net/4711/0815` and `http://hdl.handle.net/4711/0816`. The second top-level resource has two independently searchable sub-resources, identified by the persistent identifier `http://hdl.handle.net/4711/0816-1` and `http://hdl.handle.net/4711/0816-2`. All resources are described using several properties, like title, description, etc. 
The first top-level resource provides only the Generic Hits Data View, while the other top-level resource including its children provide the Generic Hits and the CMDI Data Views. 357 358 [=#REF_Example_6]Example 6: 359 {{{#!xml 360 <ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2"> 361 <ed:Capabilities> 362 <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability> 363 <ed:Capability>http://clarin.eu/fcs/capability/advanced-search</ed:Capability> 364 </ed:Capabilities> 365 <ed:SupportedDataViews> 366 <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView> 367 </ed:SupportedDataViews> 368 <!-- ADV-FCS --> 369 <SupportedLayers> 370 <SupportedLayer id="l1" result-id="http://endpoint.example.org/Layers/orth1">orth</SupportedLayer> 371 <SupportedLayer id="l2" result-id="http://endpoint.example.org/Layers/pos1" qualifier="x">pos</SupportedLayer> 372 <SupportedLayer id="l3" result-id="http://endpoint.example.org/Layers/pos2" qualifier="y" 373 alt-value-info="STTS tagset" 374 alt-value-info-uri="http://repos.example.org/tagset_doc.html">pos</SupportedLayer> 375 <SupportedLayer id="l4" result-id="http://endpoint.example.org/Layers/word" type="empty">word</SupportedLayer> 376 <SupportedLayer id="l5" result-id="http://endpoint.example.org/Layers/lemma1">lemma</SupportedLayer> 377 </SupportedLayers> 378 379 <ed:Resources> 380 <!-- just one top-level resource at the Endpoint --> 381 <ed:Resource pid="http://hdl.handle.net/4711/0815"> 382 <ed:Title xml:lang="de">Goethe Korpus</ed:Title> 383 <ed:Title xml:lang="en">Goethe corpus</ed:Title> 384 <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description> 385 <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description> 386 <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI> 387 <ed:Languages> 388 <ed:Language>deu</ed:Language> 389 
</ed:Languages> 390 <ed:AvailableDataViews ref="hits" /> 391 <AvailableLayers ref="l1 l2 l3 l4 l5" /> 392 </ed:Resource> 393 </ed:Resources> 394 </ed:EndpointDescription> 395 }}} 187 188 * one or more `<ed:Title>` elements (`REQUIRED`) \\ A human readable title for the resource. A `REQUIRED` `@xml:lang` attribute indicates the language of the title. An English version of the title is `REQUIRED`. The list of titles `MUST NOT` contain duplicate entries for the same language. 189 * zero or more `<ed:Description>` elements (`OPTIONAL`) \\ An optional human-readable description of the resource. It `SHOULD` be at most one sentence. A `REQUIRED` `@xml:lang` attribute indicates the language of the description. If supplied, an English version of the description is `REQUIRED`. The list of descriptions `MUST NOT` contain duplicate entries for the same language. 190 * zero or one `<ed:LandingPageURI>` element (`OPTIONAL`) \\ A link to a website for the resource, e.g. a landing page for a resource, i.e. a web-site that describes a corpus. 191 * one `<ed:Languages>` element (`REQUIRED`) \\ The (relevant) languages available within the resource. The `<ed:Languages>` element contains one or more `<ed:Language>` elements. The content of a `<ed:Language>` element `MUST` be a ISO 639-3 three letter language code. This element should be repeated for all languages (relevant) available ''within'' the resource, however this list `MUST NOT` contain duplicate entries. 192 * one `<ed:AvailableDataViews>` element (`REQUIRED`) \\ The Data Views that are available for the resource. The `<ed:AvailableDataViews>` element `MUST` carry a `@ref` attribute, that contains a whitespace separated list of id values, that correspond to value of the appropriate `@id` attribute for the `<ed:SupportedDataView>` elements that are referenced. \\ In case of sub-resources, each Resource `SHOULD` support all Data Views that are supported by the parent resource. 
However, every resource `MUST` declare all available Data Views independently, i.e. there is no implicit inheritance semantic. 193 * one `<ed:AvailableLayers>` element (`REQUIRED` if Endpoint supports ''Advanced Search'' capability). The `<ed:AvailableLayers>` element `MUST` carry a `@ref` attribute, that contains a whitespace separated list of id values, that correspond to the value of the appropriate `@id` attribute for the `<ed:SupportedLayer>` elements that are referenced. \\ In case of sub-resources, each Resource `SHOULD` support all Layers that are supported by the parent resource. However, every resource `MUST` declare all available Layers independently, i.e. there is no implicit inheritance semantic. 194 * zero or one `<ed:Resources>` element (`OPTIONAL`) \\ If a resource has searchable sub-resources, the Endpoint `MUST` supply additional finer grained resource elements, which are wrapped in a `<ed:Resources>` element. A sub-resource is a searchable entity within a resource, e.g. a sub-corpus. 
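
The attribute and duplicate-entry rules stated for the Endpoint Description elements above lend themselves to mechanical checks. The following is a minimal, non-normative sketch in Python; the function names `is_valid_id`, `is_valid_qualifier` and `find_duplicates` are hypothetical and not defined by this specification. Rejecting an empty `@qualifier` is an assumption of the sketch.

```python
import re

# Characters that the text above forbids in @id and @result-id values.
FORBIDDEN_ID_CHARS = {",", ";"}

# @qualifier may only contain a-z, A-Z, 0-9 and '-' (hyphen).
# Assumption of this sketch: an empty qualifier is rejected as well.
QUALIFIER_RE = re.compile(r"[A-Za-z0-9-]+")


def is_valid_id(value):
    """Return True if the value contains neither ',' nor ';'."""
    return not any(ch in FORBIDDEN_ID_CHARS for ch in value)


def is_valid_qualifier(value):
    """Return True if the value consists only of the permitted characters."""
    return QUALIFIER_RE.fullmatch(value) is not None


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    seen = set()
    duplicates = []
    for value in values:
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates
```

`find_duplicates` can be applied to any of the lists that `MUST NOT` contain duplicates, e.g. the MIME types of the supported Data Views, the `@xml:lang` values of the titles of one resource, or its language codes.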

[=#REF_Example_4]Example 4:
{{{#!xml
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <!-- just one top-level resource at the Endpoint -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
}}}
[#REF_Example_4 Example 4] shows a simple Endpoint Description for an Endpoint that only supports the ''Basic Search'' capability and only provides the Generic Hits Data View, which is indicated by a `<ed:SupportedDataView>` element. This element carries an `@id` attribute with a value of `hits`, the recommended value for the short identifier, and indicates a delivery policy of ''send-by-default'' by the `@delivery-policy` attribute. The Endpoint only provides one top-level resource, identified by the persistent identifier `http://hdl.handle.net/4711/0815`. The resource has a title as well as a description in German and English. A landing page is located at `http://repos.example.org/corpus1.html`. The predominant language in the resource contents is German. Only the Generic Hits Data View is supported for this resource, because the `<ed:AvailableDataViews>` element only references the `<ed:SupportedDataView>` element whose `@id` has the value `hits`.

[=#REF_Example_5]Example 5:
{{{#!xml
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
    <ed:SupportedDataView id="cmdi" delivery-policy="need-to-request">application/x-cmdi+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <!-- top-level resource 1 -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
    <!-- top-level resource 2 -->
    <ed:Resource pid="http://hdl.handle.net/4711/0816">
      <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen</ed:Title>
      <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus</ed:Title>
      <ed:LandingPageURI>http://repos.example.org/corpus2.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits cmdi" />
      <ed:Resources>
        <!-- sub-resource 1 of top-level resource 2 -->
        <ed:Resource pid="http://hdl.handle.net/4711/0816-1">
          <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen (vor 1990)</ed:Title>
          <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus (before 1990)</ed:Title>
          <ed:LandingPageURI>http://repos.example.org/corpus2.html#sub1</ed:LandingPageURI>
          <ed:Languages>
            <ed:Language>deu</ed:Language>
          </ed:Languages>
          <ed:AvailableDataViews ref="hits cmdi" />
        </ed:Resource>
        <!-- sub-resource 2 of top-level resource 2 -->
        <ed:Resource pid="http://hdl.handle.net/4711/0816-2">
          <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen (nach 1990)</ed:Title>
          <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus (after 1990)</ed:Title>
          <ed:LandingPageURI>http://repos.example.org/corpus2.html#sub2</ed:LandingPageURI>
          <ed:Languages>
            <ed:Language>deu</ed:Language>
          </ed:Languages>
          <ed:AvailableDataViews ref="hits cmdi" />
        </ed:Resource>
      </ed:Resources>
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
}}}
The more complex [#REF_Example_5 Example 5] shows an Endpoint Description for an Endpoint that, similar to [#REF_Example_4 Example 4], supports the ''Basic Search'' capability. In addition to the Generic Hits Data View, it also supports the CMDI Data View. The delivery policies are ''send-by-default'' for the Generic Hits Data View and ''need-to-request'' for the CMDI Data View. The Endpoint has two top-level resources (identified by the persistent identifiers `http://hdl.handle.net/4711/0815` and `http://hdl.handle.net/4711/0816`). The second top-level resource has two independently searchable sub-resources, identified by the persistent identifiers `http://hdl.handle.net/4711/0816-1` and `http://hdl.handle.net/4711/0816-2`. All resources are described using several properties, like title, description, etc. The first top-level resource provides only the Generic Hits Data View, while the other top-level resource, including its children, provides the Generic Hits and the CMDI Data Views.
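
A Client that has retrieved an Endpoint Description such as the ones above may want to confirm that every `@ref` value resolves to a declared `<ed:SupportedDataView>` and that every resource, including sub-resources, carries an English title and its own `<ed:AvailableDataViews>` element. The following Python sketch illustrates one way to do this with the standard library; it is not part of the specification, the function name `check_endpoint_description` is hypothetical, and it covers only a few of the rules stated above.

```python
import xml.etree.ElementTree as ET

# Namespace of the Endpoint Description, as used in the examples above.
ED_NS = {"ed": "http://clarin.eu/fcs/endpoint-description"}
# ElementTree exposes xml:lang under the XML namespace in Clark notation.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def check_endpoint_description(xml_text):
    """Return a list of problem descriptions; an empty list means none were found."""
    root = ET.fromstring(xml_text)
    declared = {
        dv.get("id")
        for dv in root.findall("ed:SupportedDataViews/ed:SupportedDataView", ED_NS)
    }
    problems = []

    def visit(resources_elem):
        for res in resources_elem.findall("ed:Resource", ED_NS):
            pid = res.get("pid")
            title_langs = [t.get(XML_LANG) for t in res.findall("ed:Title", ED_NS)]
            if "en" not in title_langs:
                problems.append(f"{pid}: no English title")
            available = res.find("ed:AvailableDataViews", ED_NS)
            if available is None:
                # No implicit inheritance: every resource declares its own list.
                problems.append(f"{pid}: missing AvailableDataViews")
            else:
                for ref in available.get("ref", "").split():
                    if ref not in declared:
                        problems.append(f"{pid}: undeclared Data View id '{ref}'")
            sub_resources = res.find("ed:Resources", ED_NS)
            if sub_resources is not None:
                visit(sub_resources)  # sub-resources are checked independently

    top = root.find("ed:Resources", ED_NS)
    if top is None or top.find("ed:Resource", ED_NS) is None:
        problems.append("no top-level resource declared")
    else:
        visit(top)
    return problems
```

Because there is no implicit inheritance, the sketch checks each sub-resource on its own rather than consulting the parent's `@ref` list.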

[=#REF_Example_6]Example 6:
{{{#!xml
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
    <ed:Capability>http://clarin.eu/fcs/capability/advanced-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <!-- ADV-FCS -->
  <ed:SupportedLayers>
    <ed:SupportedLayer id="l1" result-id="http://endpoint.example.org/Layers/orth1">orth</ed:SupportedLayer>
    <ed:SupportedLayer id="l2" result-id="http://endpoint.example.org/Layers/pos1" qualifier="x">pos</ed:SupportedLayer>
    <ed:SupportedLayer id="l3" result-id="http://endpoint.example.org/Layers/pos2" qualifier="y"
                       alt-value-info="STTS tagset"
                       alt-value-info-uri="http://repos.example.org/tagset_doc.html">pos</ed:SupportedLayer>
    <ed:SupportedLayer id="l4" result-id="http://endpoint.example.org/Layers/word" type="empty">word</ed:SupportedLayer>
    <ed:SupportedLayer id="l5" result-id="http://endpoint.example.org/Layers/lemma1">lemma</ed:SupportedLayer>
  </ed:SupportedLayers>

  <ed:Resources>
    <!-- just one top-level resource at the Endpoint -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
      <ed:AvailableLayers ref="l1 l2 l3 l4 l5" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
}}}

{{{
#!div style="border: 1px solid #000000; font-size: 75%"
TODO: describe the above example
}}}
== Searching ==
In the ''Searching'' step the Client performs the actual search request to a previously [#Discovery discovered] Endpoint.

=== Basic Search === #basicSearch
The ''Basic Search'' capability provides simple full-text search. Queries in Basic Search `MUST` be performed in the ''Contextual Query Language'' ([#REF_CQL OASIS-CQL]). The Endpoint `MUST` support ''term-only'' queries. The Endpoint `SHOULD` support ''terms'' combined with boolean operator queries (''AND'' and ''OR''), including sub-queries. An Endpoint `MAY` also support ''NOT'' or ''PROX'' operator queries. If an Endpoint does not support a query, i.e. the used operators are not supported by the Endpoint, it `MUST` return an appropriate error message using the appropriate SRU diagnostic ([#REF_LOC_DIAG LOC-DIAG]).

…

Examples of valid CQL queries for Basic Search are:

{{{
cat
…
cat AND (mouse OR "lazy dog")
}}}
'''NOTE''': In CQL, a ''term'' can be a single token or a phrase, i.e. tokens separated by spaces. If a single ''term'' contains spaces, it needs to be quoted.
\\ '''NOTE''': Endpoints `MUST` be able to parse all of CQL. If they don't support a certain CQL feature, they `MUST` generate an appropriate error message (see section [#sruCQL SRU/CQL]). In particular, if an Endpoint ''only'' supports ''Basic Search'', it `MUST NOT` silently accept queries that include CQL features besides ''term-only'' and ''terms'' combined with boolean operator queries, e.g. queries involving context sets.

=== Advanced Search ===
The ''Advanced Search'' capability allows searching in annotated data that is represented in annotation layers. An annotation ''layer'' contains annotations of a specific type, e.g. a lemma or part-of-speech layer. Queries can span several annotation layers.

CLARIN-FCS defines a set of searchable annotation layers with certain semantics and syntax. Endpoints `SHOULD` support as many different annotation layers as possible, depending on the resource type.

==== Layers ==== #layers
Each Layer is assumed to be ''segmented'', e.g. to allow for searching for a single lemma. However, CLARIN-FCS does not endorse a specific segmentation, i.e. the segmentation of Layers is in the domain of the Endpoint and ''opaque'' to CLARIN-FCS. CLARIN-FCS '''does not''' endorse or assume a ''formal linguistic relation'' or ''formal linguistic hierarchy'' between two items on two different layers.

||=Layer Type Identifier =||=Annotation Layer Description =||=Syntax =||=Examples (without quotes) =||
|| `text` || Textual representation of resource, also the layer that is used in [#basicSearch Basic Search] || ''String'' || "Dog", "cat", "walking", "better" ||
|| `lemma` || Lemmatisation || ''String'' || "good", "walk", "dog" ||
|| `pos` || Part-of-Speech annotations || [#REF_UD_POS Universal POS tags] || "NOUN", "VERB", "ADJ" ||
|| `orth` || Orthographic transcription of (mostly) spoken resources || ''String'' || "dug", "cat", "wolking" ||
|| `norm` || Orthographic normalization of (mostly) spoken resources || ''String'' || "dog", "cat", "walking", "best" ||
|| `phonetic` || Phonetic transcription || [#REF_SAMPA SAMPA] || "'du:", "'vi:-d6 'ha:-b@n" ||

The column ''Layer Type Identifier'' denotes the identifier for a layer. It is used in [#fcsQL FCS-QL] queries and the XML serialization for the [#advancedDataView Advanced Data View]. All valid identifiers are defined in the table above; all other identifiers are reserved and `MUST NOT` be used. Clients and Endpoints `MAY` create custom Layer Type Identifiers, e.g. for testing purposes. If they do so, the custom Layer Type Identifiers `MUST` start with the String `x-`, e.g. `x-customLayer`. The column ''Syntax'' describes the inventory of symbols that a Client `MUST` use with the corresponding annotation layer; the value ''String'' denotes that symbols are arbitrary Unicode Strings, i.e. no fixed inventory of symbols is defined. An Endpoint `SHOULD` provide an appropriate error if a Client uses an invalid value.
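
The rule for Layer Type Identifiers (identifiers from the table, or custom identifiers starting with `x-`) can be sketched as a small check. The Python code below is illustrative only; the names `DEFINED_LAYERS`, `is_allowed_layer_identifier` and `to_custom_identifier` are hypothetical, and treating a bare `x-` with nothing after the prefix as invalid is an assumption of this sketch.

```python
# Layer Type Identifiers defined in the table above.
DEFINED_LAYERS = frozenset({"text", "lemma", "pos", "orth", "norm", "phonetic"})

# Prefix that custom Layer Type Identifiers must start with.
CUSTOM_PREFIX = "x-"


def is_allowed_layer_identifier(identifier):
    """Return True for a defined identifier or a custom one starting with 'x-'."""
    if identifier in DEFINED_LAYERS:
        return True
    # Assumption: the prefix alone, with no name after it, is not accepted.
    return identifier.startswith(CUSTOM_PREFIX) and len(identifier) > len(CUSTOM_PREFIX)


def to_custom_identifier(local_name):
    """Prefix a local layer name with 'x-' unless it is already allowed."""
    if is_allowed_layer_identifier(local_name):
        return local_name
    return CUSTOM_PREFIX + local_name
```

For instance, `to_custom_identifier("customLayer")` yields `x-customLayer`, while `to_custom_identifier("lemma")` leaves the defined identifier unchanged.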

==== FCS-QL ==== #fcsQL
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
…

Examples of valid FCS-QL queries for ''Advanced Search'' are:

{{{
"walking"
…
[z:pos = "ADJ" & q:pos = "ADJ"]
}}}
The qualifiers ''z'' in ''z:pos'' and ''q'' in ''q:pos'' `SHOULD` match an available `@qualifier` attribute value of a ''pos'' `SupportedLayer` in a discovered ''EndpointDescription''.

'''NOTE''': Endpoints supporting ''Advanced Search'' `MUST` be able to parse all of FCS-QL. If they don't support a certain FCS-QL feature, they `MUST` generate an appropriate error message (see section [#sruCQL SRU/CQL]). If an Endpoint ''only'' supports ''Basic Search'', it `MUST NOT` silently accept queries that include FCS-QL features. \\ '''NOTE''': FCS-QL layer identifiers are reserved. The Endpoint `MUST` prepend the local prefix `x-` to any identifier used outside of the reserved set, e.g. `x-customLayer` for a local identifier `customLayer`.
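
A Client could check, before sending a query, whether the qualifiers it uses were actually declared by the Endpoint. The Python sketch below is a rough approximation and not an FCS-QL parser: it removes double-quoted strings with a regular expression and then looks for `qualifier:layer` pairs. The names `qualified_layers` and `unknown_qualifiers` are hypothetical, and handling only double-quoted strings is an assumption of this sketch.

```python
import re

# A double-quoted string, allowing backslash-escaped characters inside.
_QUOTED = re.compile(r'"(?:\\.|[^"\\])*"')
# "qualifier:layer"; the qualifier uses only a-z, A-Z, 0-9 and '-'.
_QUALIFIED = re.compile(r"\b([A-Za-z0-9-]+):([A-Za-z][A-Za-z0-9-]*)")


def qualified_layers(query):
    """Return the set of (qualifier, layer) pairs found outside quoted strings."""
    without_strings = _QUOTED.sub('""', query)
    return set(_QUALIFIED.findall(without_strings))


def unknown_qualifiers(query, declared):
    """Return sorted (qualifier, layer) pairs not declared by the Endpoint.

    `declared` maps a layer identifier to the set of qualifiers that the
    Endpoint Description lists for that layer.
    """
    return sorted(
        (qualifier, layer)
        for qualifier, layer in qualified_layers(query)
        if qualifier not in declared.get(layer, set())
    )
```

With the qualifiers `x` and `y` declared for `pos` in [#REF_Example_6 Example 6], the query `[z:pos = "ADJ" & q:pos = "ADJ"]` would be reported as using two undeclared qualifiers.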

=== Result Format ===
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
…

CLARIN-FCS uses a customized format for returning results. ''Resource'' and ''Resource Fragments'' serve as containers for hit results, which are presented in one or more ''Data Views''. The following section describes the Resource format and Data View format, and section [#searchRetrieve Operation ''searchRetrieve''] describes how hits are embedded within SRU responses.

==== Resource and !ResourceFragment ====
To encode search results, CLARIN-FCS supports two building blocks:

 Resources:: A ''Resource'' is a ''searchable'' and ''addressable'' entity at the Endpoint, such as a text corpus or a multi-modal corpus. A resource `SHOULD` be a self-contained unit, i.e. not a single sentence in a text corpus or a time interval in an audio transcription, but rather a complete document from a text corpus or a complete audio transcription.
 Resource Fragments:: A ''Resource Fragment'' is a smaller unit in a ''Resource'', e.g. a sentence in a text corpus or a time interval in an audio transcription.

A Resource `SHOULD` be the most precise unit of data that is directly addressable as a "whole".
A Resource `SHOULD` contain a Resource Fragment, if the hit consists of just a part of the Resource unit (for example if the hit is a sentence within a large text). A Resource Fragment `SHOULD` be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using Resource Fragments is `OPTIONAL`, but Endpoints are encouraged to use them. If the Endpoint encodes a hit with a Resource Fragment, the actual hit `SHOULD` be encoded as a Data View that within the Resource Fragment. … … 501 362 Endpoints `MAY` serialize hits as multiple Data Views, however they `MUST` provide the Generic Hits (HITS) Data View either encoded as a Resource Fragment (if applicable), or otherwise within the Resource (if there is no reasonable Resource Fragment). Other Data Views `SHOULD` be put in a place that is logical for their content (as is to be determined by the Endpoint), e.g. a metadata Data View would most likely be put directly below Resource and a Data View representing some annotation layers directly around the hit is more likely to belong within a Resource Fragment. 502 363 503 [=#REF_Example_1]Example 1: 504 {{{#!xml 505 <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/00-15"> 364 [=#REF_Example_1]Example 1: {{{#!xml <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/00-15"> 365 506 366 <fcs:DataView type="application/x-clarin-fcs-hits+xml"> 507 <!-- data view payload omitted --> 508 </fcs:DataView> 509 </fcs:Resource> 510 }}} 511 [#REF_Example_1 Example 1] shows a simple hit, which is encoded in one Data View of type ''Generic Hits'' embedded within a Resource. The type of the Data View is identified by the MIME type `application/x-clarin-fcs-hits+xml`. The Resource is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`. 
512 513 [=#REF_Example_2]Example 2: 514 {{{#!xml 515 <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/08-15"> 367 <!-- data view payload omitted --> 368 </fcs:DataView > 369 370 </fcs:Resource> }}} [#REF_Example_1 Example 1] shows a simple hit, which is encoded in one Data View of type ''Generic Hits'' embedded within a Resource. The type of the Data View is identified by the MIME type `application/x-clarin-fcs-hits+xml`. The Resource is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`. 371 372 [=#REF_Example_2]Example 2: {{{#!xml <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/08-15"> 373 516 374 <fcs:ResourceFragment> 517 375 <fcs:DataView type="application/x-clarin-fcs-hits+xml"> 518 376 <!-- data view payload omitted --> 519 </fcs:DataView> 520 </fcs:ResourceFragment> 521 </fcs:Resource> 522 }}} 523 [#REF_Example_2 Example 2] shows a hit encoded as a Resource Fragment embedded within a Resource. The actual hit is again encoded as one Data View of type ''Generic Hits''. The hit is not directly referenceable, but the Resource in which the hit occurred is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`. In contrast to [#REF_Example_1 Example 1], the Endpoint decided to provide a "semantically richer" encoding and embedded the hit using a Resource Fragment within the Resource to indicate that the hit is a part of a larger resource, e.g. a sentence in a text document.
524 525 [=#REF_Example_3]Example 3: 526 {{{#!xml 527 <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" 528 pid="http://hdl.handle.net/4711/08-15" ref="http://repos.example.org/file/text_08_15.html"> 529 <fcs:DataView type="application/x-cmdi+xml" 530 pid="http://hdl.handle.net/4711/08-15-1" ref="http://repos.example.org/file/08_15_1.cmdi"> 531 <!-- data view payload omitted --> 532 </fcs:DataView> 533 <fcs:ResourceFragment pid="http://hdl.handle.net/4711/08-15-2" ref="http://repos.example.org/file/text_08_15.html#sentence2"> 377 </fcs:DataView > 378 </fcs:ResourceFragment > 379 380 </fcs:Resource> }}} [#REF_Example_2 Example 2] shows a hit encoded as a Resource Fragment embedded within a Resource. The actual hit is again encoded as one Data View of type ''Generic Hits''. The hit is not directly referenceable, but the Resource in which the hit occurred is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`. In contrast to [#REF_Example_1 Example 1], the Endpoint decided to provide a "semantically richer" encoding and embedded the hit using a Resource Fragment within the Resource to indicate that the hit is a part of a larger resource, e.g. a sentence in a text document.
381 382 [=#REF_Example_3]Example 3: {{{#!xml <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" 383 384 pid="http://hdl.handle.net/4711/08-15 " ref="http://repos.example.org/file/text_08_15.html "> 385 386 <fcs:DataView type="application/x-cmdi+xml" 387 388 pid="http://hdl.handle.net/4711/08-15-1 " ref="http://repos.example.org/file/08_15_1.cmdi "> 389 390 <!-- data view payload omitted --> 391 392 </fcs:DataView > <fcs:ResourceFragment pid="http://hdl.handle.net/4711/08-15-2" ref="http://repos.example.org/file/text_08_15.html#sentence2"> 534 393 <fcs:DataView type="application/x-clarin-fcs-hits+xml"> 535 394 <!-- data view payload omitted --> 536 </fcs:DataView> 537 </fcs:ResourceFragment> 538 </fcs:Resource> 539 }}} 540 The more complex [#REF_Example_3 Example 3] is similar to [#REF_Example_2 Example 2], i.e. it shows a hit encoded as one ''Generic Hits'' Data View in a Resource Fragment, which is embedded in a Resource. In contrast to Example 2, another Data View of type ''CMDI'' is embedded directly within the Resource. The Endpoint can use this type of Data View to directly provide CMDI metadata about the Resource to Clients. 541 All entities of the hit can be referenced by a persistent identifier and a URI. The complete Resource is referenceable by either the persistent identifier `http://hdl.handle.net/4711/08-15` or the URI `http://repos.example.org/file/text_08_15.html` and the CMDI metadata record in the CMDI Data View is referenceable either by the persistent identifier `http://hdl.handle.net/4711/08-15-1` or the URI `http://repos.example.org/file/08_15_1.cmdi`. The actual hit in the Resource Fragment is also directly referenceable by either the persistent identifier `http://hdl.handle.net/4711/08-15-2` or the URI `http://repos.example.org/file/text_08_15.html#sentence2`.
542 543 ==== Data View 395 </fcs:DataView > 396 </fcs:ResourceFragment > 397 398 </fcs:Resource> }}} The more complex [#REF_Example_3 Example 3] is similar to [#REF_Example_2 Example 2], i.e. it shows a hit encoded as one ''Generic Hits'' Data View in a Resource Fragment, which is embedded in a Resource. In contrast to Example 2, another Data View of type ''CMDI'' is embedded directly within the Resource. The Endpoint can use this type of Data View to directly provide CMDI metadata about the Resource to Clients. All entities of the hit can be referenced by a persistent identifier and a URI. The complete Resource is referenceable by either the persistent identifier `http://hdl.handle.net/4711/08-15` or the URI `http://repos.example.org/file/text_08_15.html` and the CMDI metadata record in the CMDI Data View is referenceable either by the persistent identifier `http://hdl.handle.net/4711/08-15-1` or the URI `http://repos.example.org/file/08_15_1.cmdi`. The actual hit in the Resource Fragment is also directly referenceable by either the persistent identifier `http://hdl.handle.net/4711/08-15-2` or the URI `http://repos.example.org/file/text_08_15.html#sentence2`. 399 400 ==== Data View ==== 544 401 A ''Data View'' serves as a container for encoding the actual search results (the data fragments relevant to the search) within CLARIN-FCS. Data Views are designed to allow for different representations of results, i.e. they are deliberately kept open to allow further extensions with more supported Data View formats. This specification defines only the ''most basic'' Data View for representing search results, called ''Generic Hits'' (see below). More Data Views are defined in the supplementary specification [#REF_FCS_DataViews CLARIN-FCS-DataViews]. 545 402 … … 556 413 '''NOTE''': The examples in the following sections ''show only'' the payload with the enclosing `<fcs:DataView>` element of a Data View.
Of course, the Data View must be embedded either in a `<fcs:Resource>` or a `<fcs:ResourceFragment>` element. The `@pid` and `@ref` attributes have been omitted for all ''inline'' payload types. 557 414 558 ===== Generic Hits (HITS) 559 ||=Description 560 ||=MIME type 561 ||=Payload Disposition 562 ||=Payload Delivery 415 ===== Generic Hits (HITS) ===== 416 ||=Description =|| The representation of the hit || 417 ||=MIME type =|| `application/x-clarin-fcs-hits+xml` || 418 ||=Payload Disposition =|| ''inline'' || 419 ||=Payload Delivery =|| ''send-by-default'' (`REQUIRED`) || 563 420 ||=Recommended Short Identifier =|| `hits` (`RECOMMENDED`) || 564 ||=XML Schema =|| [source:FederatedSearch/schema/Core_2/DataView-Hits.xsd DataView-Hits.xsd] ([source:FederatedSearch/schema/Core_2/DataView-Hits.xsd?format=txt download]) || 421 ||=XML Schema =|| [source:FederatedSearch/schema/Core_2/DataView-Hits.xsd DataView-Hits.xsd] ([source:FederatedSearch/schema/Core_2/DataView-Hits.xsd?format=txt download]) || 422 565 423 The ''Generic Hits'' Data View serves as the ''most basic'' agreement in CLARIN-FCS for serialization of search results and `MUST` be implemented by all Endpoints. In many cases, this Data View can only serve as a (lossy) approximation, because resources at Endpoints are very heterogeneous. For instance, the Generic Hits Data View is probably not the best representation for a hit result in a corpus of spoken language, but an architecture like CLARIN-FCS requires one common representation to be implemented by all Endpoints; therefore, this Data View was defined. The Generic Hits Data View supports multiple markers for supplying highlighting for an individual hit, e.g. if a query contains a (boolean) conjunction, the Endpoint can use multiple markers to provide individual highlights for the matching terms. An Endpoint `MUST NOT` use this Data View to aggregate several hits within one resource. Each hit `SHOULD` be presented within the context of a complete sentence.
If that is not possible due to the nature of the type of the resource, the Endpoint `MUST` provide an equivalent reasonable unit of context (e.g. within a phrase of an orthographic transcription of an utterance). The `<hits:Hit>` element within the `<hits:Result>` element is not enforced by the XML schema, but Endpoints are `RECOMMENDED` to use it. The XML fragment of the Generic Hits payload `MUST` be valid according to the XML schema "[source:FederatedSearch/schema/Core_2/DataView-Hits.xsd DataView-Hits.xsd]" ([source:FederatedSearch/schema/Core_2/DataView-Hits.xsd?format=txt download]). 424 566 425 * Example (single hit marker): 567 {{{#!xml 568 <!-- potential @pid and @ref attributes omitted --> 569 <fcs:DataView type="application/x-clarin-fcs-hits+xml"> 426 427 {{{#!xml <!-- potential @pid and @ref attributes omitted --> <fcs:DataView type="application/x-clarin-fcs-hits+xml"> 428 570 429 <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits"> 571 The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy dog. 572 </hits:Result> 573 </fcs:DataView> 574 }}} 430 The quick brown <hits:Hit> fox</hits:Hit > jumps over the lazy dog. 431 </hits:Result > 432 433 </fcs:DataView> }}} 434 575 435 * Example (multiple hit markers): 576 {{{#!xml 577 <!-- potential @pid and @ref attributes omitted --> 578 <fcs:DataView type="application/x-clarin-fcs-hits+xml"> 436 437 {{{#!xml <!-- potential @pid and @ref attributes omitted --> <fcs:DataView type="application/x-clarin-fcs-hits+xml"> 438 579 439 <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits"> 580 The quick brown <hits:Hit> fox</hits:Hit> jumps over the lazy <hits:Hit>dog</hits:Hit>. 581 </hits:Result > 582 </fcs:DataView> 583 }}} 584 585 ===== Advanced (ADV) #advancedDataView 586 ||=Description 587 ||=MIME type 588 ||=Payload Disposition 589 ||=Payload Delivery 440 The quick brown <hits:Hit> fox</hits:Hit > jumps over the lazy <hits:Hit> dog</hits:Hit >.
441 </hits:Result > 442 443 </fcs:DataView> }}} 444 445 ===== Advanced (ADV) ===== #advancedDataView 446 ||=Description =|| The representation of the hit for Advanced Search || 447 ||=MIME type =|| `application/x-clarin-fcs-adv+xml` || 448 ||=Payload Disposition =|| ''inline'' || 449 ||=Payload Delivery =|| ''send-by-default'' (`REQUIRED`) || 590 450 ||=Recommended Short Identifier =|| `adv` (`RECOMMENDED`) || 591 ||=XML Schema 451 ||=XML Schema =|| [source:FederatedSearch/schema/Core_2/DataView-Advanced.xsd DataView-Advanced.xsd] ([source:FederatedSearch/schema/Core_2/DataView-Advanced.xsd?format=txt download]) || 592 452 593 453 {{{ … … 595 455 TODO: describe! 596 456 }}} 597 598 - ADV Data View allows returning structured information for Advanced Search queries 599 - organized in one or more annotation layers 600 - annotation layer := annotations of a specific type, e.g. part-of-speech or orthographic transcription 601 - annotations of two different annotation layers may freely overlap; no self-overlap in an annotation layer 602 - Data View serialization in a stand-off-like format, i.e. annotations are ranges over the signal (= language resource as character, token or audio stream) and are denoted by start and end offsets 603 - layers alignable (through their offsets) and referable (through their layer identifier) 604 - ADV Data View serialization: 605 - a list of segments (= "inventory" of all ranges used to describe annotations) 606 - units can be "items" (= offsets in character or token-stream) or "timestamp" (timestamps in audio-stream); timestamps may have a resolution of up to 1/1000 second. 607 - endpoints are responsible for choosing proper offsets for segments. They must do so in a consistent manner, i.e. in a single result (= ADV Data View instance) the chosen offsets must allow for aligning the segments of different layers.
A recommendation for character streams: character := Unicode codepoint, normalized to Unicode Normalization Form KC (NFKC; Compatibility Decomposition, followed by Canonical Composition) 608 - segments may also have an endpoint-specific reference (= URI); it can be shown in the Aggregator and, if the user clicks the link, can open a viewer (e.g. audio player) at the endpoint 609 - a list of layers, each has a type (e.g. "pos", "lemma", see Layer Type identifier in section Layers above) and a layer identifier (= URI) 610 - a layer consists of one or more Spans. A span references a segment (and thus inherits the start and end offsets) and contains the actual annotation (e.g. the part-of-speech label) in its content; MAY also contain alt-value (e.g. original annotation value) 611 - document order of layer elements defines the view order in the Aggregator 612 - endpoints should return at least all layers that were referenced in the query; they may return more 613 - Hit Markers are added by marking Spans as hits (add `@highlight` attribute); multiple hit markers are supported and the Aggregator may display them visually distinct 614 - where to add hit markers is up to the endpoint; generally "things" that were referenced in the query should be marked. 615 457 * ADV Data View allows returning structured information for Advanced Search queries 458 * organized in one or more annotation layers 459 * annotation layer := annotations of a specific type, e.g.
part-of-speech or orthographic transcription 460 * annotations of two different annotation layers may freely overlap; no self-overlap in an annotation layer 461 * Data View serialization in a stand-off-like format, i.e. annotations are ranges over the signal (= language resource as character, token or audio stream) and are denoted by start and end offsets 462 * layers alignable (through their offsets) and referable (through their layer identifier) 463 * ADV Data View serialization: 464 * a list of segments (= "inventory" of all ranges used to describe annotations) 465 * units can be "items" (= offsets in character or token-stream) or "timestamp" (timestamps in audio-stream); timestamps may have a resolution of up to 1/1000 second. 466 * endpoints are responsible for choosing proper offsets for segments. They must do so in a consistent manner, i.e. in a single result (= ADV Data View instance) the chosen offsets must allow for aligning the segments of different layers. A recommendation for character streams: character := Unicode codepoint, normalized to Unicode Normalization Form KC (NFKC; Compatibility Decomposition, followed by Canonical Composition) 467 * segments may also have an endpoint-specific reference (= URI); it can be shown in the Aggregator and, if the user clicks the link, can open a viewer (e.g. audio player) at the endpoint 468 * a list of layers, each has a type (e.g. "pos", "lemma", see Layer Type identifier in section Layers above) and a layer identifier (= URI) 469 * a layer consists of one or more Spans. A span references a segment (and thus inherits the start and end offsets) and contains the actual annotation (e.g. the part-of-speech label) in its content; MAY also contain alt-value (e.g.
original annotation value) 470 * document order of layer elements defines the view order in the Aggregator 471 * endpoints should return at least all layers that were referenced in the query; they may return more 472 * Hit Markers are added by marking Spans as hits (add `@highlight` attribute); multiple hit markers are supported and the Aggregator may display them visually distinct 473 * where to add hit markers is up to the endpoint; generally "things" that were referenced in the query should be marked. 616 474 617 475 Example: a sentence interpreted as a character stream 618 ||=Data =|| t || || d || a || || ' || s || || d || e || || e || n || i || g || e || || e || c || h || t || e || || h || o || o || p || || v || o || o || r || || o || n || s || || m || e || n || s || e || n || 476 477 ||=Data =|| t || || d || a || || ' || s || || d || e || || e || n || i || g || e || || e || c || h || t || e || || h || o || o || p || || v || o || o || r || || o || n || s || || m || e || n || s || e || n || 619 478 ||=Offset =|| 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11 || 12 || 13 || 14 || 15 || 16 || 17 || 18 || 19 || 20 || 21 || 22 || 23 || 24 || 25 || 26 || 27 || 28 || 29 || 30 || 31 || 32 || 33 || 34 || 35 || 36 || 37 || 38 || 39 || 40 || 41 || 42 || 43 || 620 479 621 480 Example: several annotation layers for the sentence 622 ||=Offset (Start, End) =|| 1,1 || 3,4 || 6,7 || 9,10 || 12,16 || 18,22 || 24,27 || 29,32 || 34,36 || 38,43 || 623 ||=Layer ''orth'' =|| t || da || 's || de || enige || echte || hoop || voor || ons || mensen || 624 ||=Layer ''pos'' =|| X || PRON || VERB || DET || DET || ADJ || NOUN || ADP || PRON || NOUN || 625 ||=Layer ''lemma'' =|| _ || dat || zijn || de || enig || echt || hoop || voor || ons || mens || 626 ||=Layer ''phonetic'' =|| t@ || dAz || dAz || d@ || en@G@ || Ext@ || hop || for || Ons || mEns@ || 627 628 Example: XML serialization 629 {{{#!xml 630 <Advanced> 631 <Segments unit="items"> 632 <Segment id="s1" start="1" end="1" 633
ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=0:173"/> 634 <Segment id="s2" start="3" end="4" 635 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=173:304"/> 636 <Segment id="s3" start="6" end="7" 637 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=173:304"/> 638 <Segment id="s4" start="9" end="10" 639 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=304:480"/> 640 <Segment id="s5" start="12" end="16" 641 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=480:1119"/> 642 <Segment id="s6" start="18" end="22" 643 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=1339:1901"/> 644 <Segment id="s7" start="24" end="27" 645 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=1901:2427"/> 646 <Segment id="s8" start="29" end="32" 647 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=3084:3493"/> 648 <Segment id="s9" start="34" end="36" 649 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=3493:3754"/> 650 <Segment id="s10" start="38" end="43" 651 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=3754:4274"/> 652 </Segments> 653 654 <Layers> 655 <Layer id="http://endpoint.example.org/Layers/orth1"> 656 <Span ref="s1">t</Span> 657 <Span ref="s2">da</Span> 658 <Span ref="s3">'s</Span> 659 <Span ref="s4">de</Span> 660 <Span ref="s5">enige</Span> 661 <Span ref="s6">echte</Span> 662 <Span ref="s7">hoop</Span> 663 <Span ref="s8">voor</Span> 664 <Span ref="s9">ons</Span> 665 <Span ref="s10">mensen</Span> 666 </Layer> 667 668 <Layer id="http://endpoint.example.org/Layers/pos1"> 669 <Span ref="s1" alt-value="SPEC(afgebr)">X</Span> 670 <Span ref="s2" alt-value="VNW(aanw,pron,stan,vol,3o,ev)">PRON</Span> 671 <Span ref="s3" alt-value="WW(pv,tgw,ev)">VERB</Span> 672 <Span ref="s4" alt-value="LID(bep,stan,rest)">DET</Span> 673 <Span ref="s5" alt-value="VNW(onbep,det,stan,prenom,met-e,rest)">DET</Span> 674 <Span ref="s6" alt-value="ADJ(prenom,basis,met-e,stan)">ADJ</Span> 675 <Span 
ref="s7" alt-value="N(soort,ev,basis,zijd,stan)">NOUN</Span> 676 <Span ref="s8" alt-value="VZ(init)">ADP</Span> 677 <Span ref="s9" alt-value="VNW(pr,pron,obl,vol,1,mv)">PRON</Span> 678 <Span ref="s10" alt-value="N(soort,mv,basis)">NOUN</Span> 679 </Layer> 680 681 <Layer id="http://endpoint.example.org/Layers/lemma1"> 682 <Span ref="s1">_</Span> 683 <Span ref="s2">dat</Span> 684 <Span ref="s3">zijn</Span> 685 <Span ref="s4" >de</Span> 686 <Span ref="s5">enig</Span> 687 <Span ref="s6" highlight="h1">echt</Span> 688 <Span ref="s7" highlight="h1">hoop</Span> 689 <Span ref="s8">voor</Span> 690 <Span ref="s9">ons</Span> 691 <Span ref="s10">mens</Span> 692 </Layer> 693 694 <Layer id="http://endpoint.example.org/Layers/phon"> 695 <Span ref="s1">t@</Span> 696 <Span ref="s2" highlight="h2">dAz</Span> 697 <Span ref="s3">dAz</Span> 698 <Span ref="s4">d@</Span> 699 <Span ref="s5">en@G@</Span> 700 <Span ref="s6">Ext@</Span> 701 <Span ref="s7">hop</Span> 702 <Span ref="s8">for</Span> 703 <Span ref="s9">Ons</Span> 704 <Span ref="s10">mEns@</Span> 705 </Layer> 706 </Layers> 707 </Advanced> 708 }}} 709 710 === Versioning and Extensions 711 ==== Backwards Compatibility #backwardsCompatibility 481 482 ||=Offset (Start, End) =|| 1,1 || 3,4 || 6,7 || 9,10 || 12,16 || 18,22 || 24,27 || 29,32 || 34,36 || 38,43 || 483 ||=Layer ''orth'' =|| t || da || 's || de || enige || echte || hoop || voor || ons || mensen || 484 ||=Layer ''pos'' =|| X || PRON || VERB || DET || DET || ADJ || NOUN || ADP || PRON || NOUN || 485 ||=Layer ''lemma'' =|| _ || dat || zijn || de || enig || echt || hoop || voor || ons || mens || 486 ||=Layer ''phonetic'' =|| t@ || dAz || dAz || d@ || en@G@ || Ext@ || hop || for || Ons || mEns@ || 487 488 Example: XML serialization {{{#!xml <Advanced> 489 490 <Segments unit="items"> 491 <Segment id="s1" start="1" end="1" 492 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=0:173"/ > 493 <Segment id="s2" start="3" end="4" 494 
ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=173:304"/ > 495 <Segment id="s3" start="6" end="7" 496 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=173:304"/ > 497 <Segment id="s4" start="9" end="10" 498 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=304:480"/ > 499 <Segment id="s5" start="12" end="16" 500 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=480:1119"/ > 501 <Segment id="s6" start="18" end="22" 502 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=1339:1901"/ > 503 <Segment id="s7" start="24" end="27" 504 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=1901:2427"/ > 505 <Segment id="s8" start="29" end="32" 506 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=3084:3493"/ > 507 <Segment id="s9" start="34" end="36" 508 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=3493:3754"/ > 509 <Segment id="s10" start="38" end="43" 510 ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=3754:4274"/ > 511 </Segments> 512 513 <Layers> 514 <Layer id="http://endpoint.example.org/Layers/orth1 "> 515 <Span ref="s1">t</Span> <Span ref="s2">da</Span> <Span ref="s3">'s</Span> <Span ref="s4">de</Span> <Span ref="s5">enige</Span> <Span ref="s6">echte</Span> <Span ref="s7">hoop</Span> <Span ref="s8">voor</Span> <Span ref="s9">ons</Span> <Span ref="s10">mensen</Span> 516 </Layer> 517 518 <Layer id="http://endpoint.example.org/Layers/pos1 "> 519 <Span ref="s1" alt-value="SPEC(afgebr)">X</Span> <Span ref="s2" alt-value="VNW(aanw,pron,stan,vol,3o,ev)">PRON</Span> <Span ref="s3" alt-value="WW(pv,tgw,ev)">VERB</Span> <Span ref="s4" alt-value="LID(bep,stan,rest)">DET</Span> <Span ref="s5" alt-value="VNW(onbep,det,stan,prenom,met-e,rest)">DET</Span> <Span ref="s6" alt-value="ADJ(prenom,basis,met-e,stan)">ADJ</Span> <Span ref="s7" alt-value="N(soort,ev,basis,zijd,stan)">NOUN</Span> <Span ref="s8" alt-value="VZ(init)">ADP</Span> <Span ref="s9" 
alt-value="VNW(pr,pron,obl,vol,1,mv)">PRON</Span> <Span ref="s10" alt-value="N(soort,mv,basis)">NOUN</Span> 520 </Layer> 521 522 <Layer id="http://endpoint.example.org/Layers/lemma1 "> 523 <Span ref="s1">_</Span> <Span ref="s2">dat</Span> <Span ref="s3">zijn</Span> <Span ref="s4" >de</Span> <Span ref="s5">enig</Span> <Span ref="s6" highlight="h1">echt</Span> <Span ref="s7" highlight="h1">hoop</Span> <Span ref="s8">voor</Span> <Span ref="s9">ons</Span> <Span ref="s10">mens</Span> 524 </Layer> 525 526 <Layer id="http://endpoint.example.org/Layers/phon "> 527 <Span ref="s1">t@</Span> <Span ref="s2" highlight="h2">dAz</Span> <Span ref="s3">dAz</Span> <Span ref="s4">d@</Span> <Span ref="s5">en@G@</Span> <Span ref="s6">Ext@</Span> <Span ref="s7">hop</Span> <Span ref="s8">for</Span> <Span ref="s9">Ons</Span> <Span ref="s10">mEns@</Span> 528 </Layer> 529 530 </Layers> </Advanced> }}} 531 532 === Versioning and Extensions === 533 ==== Backwards Compatibility ==== #backwardsCompatibility 712 534 {{{ 713 535 #!div style="border: 1px solid #000000; font-size: 75%" 714 536 TODO: check and proof-read 715 537 }}} 716 717 538 Clients `MUST` be compatible with CLARIN-FCS 1.0, and thus must implement SRU 1.2. If a Client uses CLARIN-FCS 1.0 to talk to an Endpoint, it `MUST NOT` use features beyond the Basic Search capability. Clients `MUST` implement a heuristic to automatically determine which CLARIN-FCS protocol version, i.e. which version of the SRU protocol, can be used to talk to an Endpoint. 718 539 719 540 Clients `MUST` be able to process the legacy XML namespaces: 541 720 542 * `http://www.loc.gov/zing/srw/` for SRU response documents, and 721 543 * `http://www.loc.gov/zing/srw/diagnostic/` for diagnostics within SRU response documents. 544 722 545 which SRU 1.2 Endpoints use for serializing responses as well as the OASIS XML namespaces.
CLARIN-FCS deviates from the OASIS specification [#REF_SRU_Overview OASIS-SRU-Overview] and [#REF_SRU_12 OASIS-SRU-12] to ensure backwards compatibility with SRU 1.2 services as they were defined by [#REF_LOC_SRU_12 LOC-SRU12]. 723 546 724 547 Pseudo-algorithm for the version detection heuristic: 548 725 549 * Send ''explain'' request without `version` and `operation` parameters 726 550 * Check SRU response for content of the element `<sru:explainResponse>/<sru:version>` 727 551 728 ==== Endpoint Custom Extensions 729 Endpoints can add custom extensions, i.e. custom data, to the Result Format. This extension mechanism can for example be used to provide hints for an (XSLT/XQuery) application that works directly on CLARIN-FCS, e.g. to allow it to generate back and forward links to navigate in a result set. 552 ==== Endpoint Custom Extensions ==== 553 Endpoints can add custom extensions, i.e. custom data, to the Result Format. This extension mechanism can for example be used to provide hints for an (XSLT/XQuery) application that works directly on CLARIN-FCS, e.g. to allow it to generate back and forward links to navigate in a result set. 730 554 731 555 An Endpoint `MAY` add arbitrary XML fragments to the extension hooks provided in the `<fcs:Resource>` element (see the XML schema for "Resource.xsd"). The XML fragment for the extension `MUST` use a custom XML namespace name for the extension. Endpoints `MUST NOT` use XML namespace names that start with the prefixes `http://clarin.eu`, `http://www.clarin.eu/`, `https://clarin.eu` or `https://www.clarin.eu/`. … … 735 559 The non-normative appendix contains an [#extensionExample example] of how an extension could be implemented.
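
The version detection heuristic described under Backwards Compatibility could be sketched as follows. This is a minimal, non-normative illustration and not part of the specification: the function names `detect_sru_version` and `fcs_version_for` are hypothetical, the mapping from SRU version to CLARIN-FCS version is an assumption drawn from the statement that CLARIN-FCS 1.0 implies SRU 1.2, and the HTTP request itself (an ''explain'' request sent without the `version` and `operation` parameters) is left to the Client. The sketch only inspects the response, accepting the legacy namespace as well as any other namespace on the root element.

```python
import xml.etree.ElementTree as ET

# Legacy namespace used by SRU 1.2 Endpoints for response documents.
LEGACY_SRU_NS = "http://www.loc.gov/zing/srw/"


def _split_tag(tag):
    """Split an ElementTree tag of the form '{namespace}local' into its parts."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return "", tag


def detect_sru_version(explain_xml):
    """Return (sru_version, uses_legacy_namespace) for an explain response.

    The version is read from <explainResponse>/<version>; the namespace is
    taken from the root element rather than hard-coded, so both legacy and
    OASIS-style responses are handled.
    """
    root = ET.fromstring(explain_xml)
    root_ns, root_local = _split_tag(root.tag)
    if root_local != "explainResponse":
        raise ValueError("not an explain response: %r" % root_local)
    for child in root:
        child_ns, child_local = _split_tag(child.tag)
        if child_local == "version" and child_ns == root_ns:
            return (child.text or "").strip(), root_ns == LEGACY_SRU_NS
    raise ValueError("explain response carries no version element")


def fcs_version_for(sru_version):
    """Assumed mapping: SRU 2.x is treated as CLARIN-FCS 2.0, anything else as 1.0."""
    return "2.0" if sru_version.startswith("2.") else "1.0"
```

A Client could call `detect_sru_version` on the body of the probe response and then restrict itself to the Basic Search capability whenever `fcs_version_for` yields `1.0`.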
736 560 737 = CLARIN-FCS to SRU/CQL binding 738 == SRU/CQL #sruCQL 561 = CLARIN-FCS to SRU/CQL binding = 562 == SRU/CQL == #sruCQL 739 563 {{{ 740 564 #!div style="border: 1px solid #000000; font-size: 75%" … … 744 568 745 569 Endpoints and Clients `MUST` implement the SRU/CQL protocol suite as defined in [#REF_SRU_Overview OASIS-SRU-Overview], [#REF_SRU_APD OASIS-SRU-APD], [#REF_CQL OASIS-CQL], [#REF_Explain SRU-Explain], [#REF_Scan SRU-Scan], especially with respect to: 570 746 571 * Data Model, 747 572 * Query Model, 748 573 * Processing Model, 749 574 * Result Set Model, and 750 * Diagnostics Model 751 752 Endpoints and Clients `MUST` implement the APD Binding for SRU 2.0, as defined in [#REF_SRU_20 OASIS-SRU-20]. \\ 753 Clients `MUST` implement APD Binding for SRU 1.2, as defined in [#REF_SRU_12 OASIS-SRU-12]. \\ 754 Clients `MAY` also implement APD binding for version 1.1. \\ 755 '''NOTE''': when implementing SRU 1.2, Endpoints and Clients `MUST` behave as described in the section [#backwardsCompatibility Backwards Compatibility]. 575 * Diagnostics Model 576 577 Endpoints and Clients `MUST` implement the APD Binding for SRU 2.0, as defined in [#REF_SRU_20 OASIS-SRU-20]. \\ Clients `MUST` implement APD Binding for SRU 1.2, as defined in [#REF_SRU_12 OASIS-SRU-12]. \\ Clients `MAY` also implement APD binding for version 1.1. \\ '''NOTE''': when implementing SRU 1.2, Endpoints and Clients `MUST` behave as described in the section [#backwardsCompatibility Backwards Compatibility]. 756 578 757 579 Endpoints and Clients `MUST` support CQL conformance ''Level 2'' (as defined in [#REF_OASIS_CQL OASIS-CQL, section 6]), i.e. be able to ''parse'' (Endpoints) or ''serialize'' (Clients) all of CQL and respond with appropriate error messages to the search/retrieve protocol interface. … … 763 585 Endpoints `MUST` support the HTTP GET [#REF_SRU_20 OASIS-SRU-20, Appendix B.1] and HTTP POST [#REF_SRU_20 OASIS-SRU-20, Appendix B.2] lower-level protocol bindings.
Endpoints `MAY` also support the SOAP [#REF_SRU_20 OASIS-SRU-20, Appendix B.3] binding. 764 586 765 766 == Operation ''explain'' #explain 587 == Operation ''explain'' == #explain 767 588 {{{ 768 589 #!div style="border: 1px solid #000000; font-size: 75%" … … 774 595 775 596 According to the Capabilities supported by the Endpoint, the Explain record `MUST` contain the following elements: 776 ''Basic-Search'' Capability:: 777 `<zr:serverInfo>` as defined in [#REF_Explain SRU-Explain] (`REQUIRED`) \\ 778 `<zr:databaseInfo>` as defined in [#REF_Explain SRU-Explain] (`REQUIRED`) \\ 779 `<zr:schemaInfo>` as defined in [#REF_Explain SRU-Explain] (`REQUIRED`). This element `MUST` contain an element `<zr:schema>` with an `@identifier` attribute with a value of `http://clarin.eu/fcs/resource` and an `@name` attribute with a value of `fcs`. \\ 780 `<zr:configInfo>` is `OPTIONAL` \\ 781 Other capabilities may define how the `<zr:indexInfo>` element is to be used; therefore, it is `NOT RECOMMENDED` for Endpoints to use it in custom extensions. 597 598 ''Basic-Search'' Capability:: `<zr:serverInfo>` as defined in [#REF_Explain SRU-Explain] (`REQUIRED`) \\ `<zr:databaseInfo>` as defined in [#REF_Explain SRU-Explain] (`REQUIRED`) \\ `<zr:schemaInfo>` as defined in [#REF_Explain SRU-Explain] (`REQUIRED`). This element `MUST` contain an element `<zr:schema>` with an `@identifier` attribute with a value of `http://clarin.eu/fcs/resource` and an `@name` attribute with a value of `fcs`. \\ `<zr:configInfo>` is `OPTIONAL` \\ Other capabilities may define how the `<zr:indexInfo>` element is to be used; therefore, it is `NOT RECOMMENDED` for Endpoints to use it in custom extensions. 782 599 783 600 To support auto-configuration in CLARIN-FCS, the Endpoint `MUST` support the ''Endpoint Description''. The Endpoint Description is included in the explain response using SRU's extension mechanism, i.e. by embedding an XML fragment into the `<sru:extraResponseData>` element.
The Endpoint `MUST` include the Endpoint Description ''only'' if the Client performs an explain request with the ''extra request parameter'' `x-fcs-endpoint-description` set to `true`. If the Client performs an explain request ''without'' supplying this extra request parameter, the Endpoint `MUST NOT` include the Endpoint Description. The format of the Endpoint Description XML fragment is defined in [#endpointDescription Endpoint Description].

The following example shows a request and response to an ''explain'' request with added extra request parameter `x-fcs-endpoint-description`:
 * HTTP GET request: Client → Endpoint:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=explain&version=1.2&x-fcs-endpoint-description=true
}}}
 * HTTP Response: Endpoint → Client:
{{{#!xml
<?xml version='1.0' encoding='utf-8'?>
<sru:explainResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:record>
    <sru:recordSchema>http://explain.z3950.org/dtd/2.0/</sru:recordSchema>
    <sru:recordPacking>xml</sru:recordPacking>
    <sru:recordData>
      <zr:explain xmlns:zr="http://explain.z3950.org/dtd/2.0/">
        <!-- <zr:serverInfo> is REQUIRED -->
        <zr:serverInfo protocol="SRU" version="1.2" transport="http">
          <zr:host>repos.example.org</zr:host>
          <zr:port>80</zr:port>
          <zr:database>fcs-endpoint</zr:database>
        </zr:serverInfo>
        <!-- <zr:databaseInfo> is REQUIRED -->
        <zr:databaseInfo>
          <zr:title lang="de">Goethe Korpus</zr:title>
          <zr:title lang="en" primary="true">Goethe Corpus</zr:title>
          <zr:description lang="de">Der Goethe Korpus des IDS Mannheim.</zr:description>
          <zr:description lang="en" primary="true">The Goethe corpus of IDS Mannheim.</zr:description>
        </zr:databaseInfo>
        <!-- <zr:schemaInfo> is REQUIRED -->
        <zr:schemaInfo>
          <zr:schema identifier="http://clarin.eu/fcs/resource" name="fcs">
            <zr:title lang="en" primary="true">CLARIN Federated Content Search</zr:title>
          </zr:schema>
        </zr:schemaInfo>
        <!-- <zr:configInfo> is OPTIONAL -->
        <zr:configInfo>
          <zr:default type="numberOfRecords">250</zr:default>
          <zr:setting type="maximumRecords">1000</zr:setting>
        </zr:configInfo>
      </zr:explain>
    </sru:recordData>
  </sru:record>
  <!-- <sru:echoedExplainRequest> is OPTIONAL -->
  <sru:echoedExplainRequest>
    <sru:version>1.2</sru:version>
    <sru:baseUrl>http://repos.example.org/fcs-endpoint</sru:baseUrl>
  </sru:echoedExplainRequest>
  <sru:extraResponseData>
    <ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="1">
      <ed:Capabilities>
        <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
      </ed:Capabilities>
      <ed:SupportedDataViews>
        <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
      </ed:SupportedDataViews>
      <ed:Resources>
        <!-- just one top-level resource at the Endpoint -->
        <ed:Resource pid="http://hdl.handle.net/4711/0815">
          <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
          <ed:Title xml:lang="en">Goethe Corpus</ed:Title>
          <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
          <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
          <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
          <ed:Languages>
            <ed:Language>deu</ed:Language>
          </ed:Languages>
          <ed:AvailableDataViews ref="hits"/>
        </ed:Resource>
      </ed:Resources>
    </ed:EndpointDescription>
  </sru:extraResponseData>
</sru:explainResponse>
}}}

== Operation ''scan'' == #scan
The ''scan'' operation of the SRU protocol is currently not used in the ''Basic Search'' or ''Advanced Search'' capability of CLARIN-FCS. Future capabilities may use this operation; it is therefore `NOT RECOMMENDED` for Endpoints to define custom extensions that use this operation.

== Operation ''searchRetrieve'' == #searchRetrieve
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
… …
The ''searchRetrieve'' operation of the SRU protocol is used for searching in the Resources that are provided by the Endpoint. The SRU protocol defines the serialization of request and response formats in [#REF_SRU_20 OASIS-SRU-20] for SRU version 2.0 and [#REF_SRU_12 OASIS-SRU-12] for SRU version 1.2. An Endpoint `MUST` respond in the correct format, i.e. when the Endpoint also supports SRU 1.2 and the request is issued in SRU version 1.2, the response must be encoded accordingly.

In SRU, search result hits are encoded down to a record level, i.e.
the `<sru:record>` element, and SRU allows records to be serialized in various formats, so-called ''record schemas''. Endpoints `MUST` support the CLARIN-FCS record schema (see section [#resultFormat Result Format]) and `MUST` use the value `http://clarin.eu/fcs/resource` for the ''responseItemType'' ("record schema identifier"). Endpoints `MUST` represent exactly ''one hit'' within the Resource as one SRU record, i.e. one `<sru:record>` element.

The following example shows a request and response to a ''searchRetrieve'' request with a ''term-only'' query for "cat":
 * HTTP GET request: Client → Endpoint:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=searchRetrieve&version=1.2&query=cat
}}}
 * HTTP Response: Endpoint → Client:
{{{#!xml
<?xml version='1.0' encoding='utf-8'?>
<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:numberOfRecords>6</sru:numberOfRecords>
  <sru:records>
    <sru:record>
      <sru:recordSchema>http://clarin.eu/fcs/resource</sru:recordSchema>
      <sru:recordPacking>xml</sru:recordPacking>
      <sru:recordData>
        <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/08-15">
          <fcs:ResourceFragment>
            <fcs:DataView type="application/x-clarin-fcs-hits+xml">
              <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
                The quick brown <hits:Hit>cat</hits:Hit> jumps over the lazy dog.
              </hits:Result>
            </fcs:DataView>
          </fcs:ResourceFragment>
        </fcs:Resource>
      </sru:recordData>
      <sru:recordPosition>1</sru:recordPosition>
    </sru:record>
    <!-- more <sru:record> elements omitted for brevity -->
  </sru:records>
  <!-- <sru:echoedSearchRetrieveRequest> is OPTIONAL -->
  <sru:echoedSearchRetrieveRequest>
    <sru:version>1.2</sru:version>
    <sru:query>cat</sru:query>
    <sru:xQuery xmlns="http://www.loc.gov/zing/cql/xcql/">
      <searchClause>
        <index>cql.serverChoice</index>
        <relation>
          <value>=</value>
        </relation>
        <term>cat</term>
      </searchClause>
    </sru:xQuery>
    <sru:startRecord>1</sru:startRecord>
    <sru:baseUrl>http://repos.example.org/fcs-endpoint</sru:baseUrl>
  </sru:echoedSearchRetrieveRequest>
</sru:searchRetrieveResponse>
}}}

In general, the Endpoint is `REQUIRED` to accept an ''unrestricted search'' and `SHOULD` perform the search operation on ''all'' Resources that are available at the Endpoint. If that is not feasible for some reason, e.g. because an unrestricted search would allocate too many resources, the Endpoint `MAY` independently restrict the search to a scope that it can handle. If it does so, it `MUST` issue a non-fatal diagnostic `http://clarin.eu/fcs/diagnostic/2` ("Resource set too large. Query context automatically adjusted."). The details field of the diagnostic `MUST` contain the persistent identifier of the resource to which the query scope was limited. If the Endpoint limits the query scope to more than one resource, it `MUST` generate a ''separate'' non-fatal diagnostic `http://clarin.eu/fcs/diagnostic/2` for each of the resources.
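
The ''searchRetrieve'' exchange shown in this section can be sketched from the Client side as follows. This is a minimal, non-normative illustration in Python using only the standard library; the function names `build_search_url` and `extract_hits` and the constant `SAMPLE` are hypothetical and not part of the specification. The namespaces and the request parameters are taken from the example above.
{{{#!python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace URIs as used in the searchRetrieve example of this section.
NS = {
    "sru": "http://www.loc.gov/zing/srw/",
    "fcs": "http://clarin.eu/fcs/resource",
    "hits": "http://clarin.eu/fcs/dataview/hits",
}


def build_search_url(base_url, query, version="1.2"):
    """Build a searchRetrieve GET URL (hypothetical helper)."""
    params = {"operation": "searchRetrieve", "version": version, "query": query}
    return base_url + "?" + urlencode(params)


def extract_hits(response_xml):
    """Return (pid, context text, [hit strings]) for every Generic Hits result."""
    root = ET.fromstring(response_xml)
    results = []
    for record in root.iterfind("sru:records/sru:record", NS):
        resource = record.find("sru:recordData/fcs:Resource", NS)
        if resource is None:
            continue  # record uses a schema other than the CLARIN-FCS one
        for result in resource.iterfind(".//hits:Result", NS):
            # Collapse whitespace so the context reads as one sentence.
            text = " ".join("".join(result.itertext()).split())
            hits = [h.text.strip() for h in result.iterfind("hits:Hit", NS) if h.text]
            results.append((resource.get("pid"), text, hits))
    return results


# Shortened copy of the example response above (one record only).
SAMPLE = """
<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:numberOfRecords>6</sru:numberOfRecords>
  <sru:records>
    <sru:record>
      <sru:recordSchema>http://clarin.eu/fcs/resource</sru:recordSchema>
      <sru:recordPacking>xml</sru:recordPacking>
      <sru:recordData>
        <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/08-15">
          <fcs:ResourceFragment>
            <fcs:DataView type="application/x-clarin-fcs-hits+xml">
              <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
                The quick brown <hits:Hit>cat</hits:Hit> jumps over the lazy dog.
              </hits:Result>
            </fcs:DataView>
          </fcs:ResourceFragment>
        </fcs:Resource>
      </sru:recordData>
      <sru:recordPosition>1</sru:recordPosition>
    </sru:record>
  </sru:records>
</sru:searchRetrieveResponse>
"""

if __name__ == "__main__":
    print(build_search_url("http://repos.example.org/fcs-endpoint", "cat"))
    for pid, text, hits in extract_hits(SAMPLE):
        print(pid, text, hits)
}}}
The sketch deliberately ignores `<sru:diagnostics>`; a real Client would also have to inspect that element, for example to detect that the Endpoint adjusted the query context.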
… …
The Client can extract all valid persistent identifiers from the `@pid` attribute of the `<ed:Resource>` element, obtained by the ''explain'' request (see section [#explain Operation ''explain''] and section [#endpointDescription Endpoint Description]). The list of persistent identifiers can get long; in that case a Client can submit the request with the HTTP POST method instead of the HTTP GET method.

For example, to restrict the search to the Resource with the persistent identifier `http://hdl.handle.net/4711/0815` the Client must issue the following request:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=searchRetrieve&version=1.2&query=cat&x-fcs-context=http://hdl.handle.net/4711/0815
}}}
To restrict the search to the Resources with the persistent identifiers `http://hdl.handle.net/4711/0815` and `http://hdl.handle.net/4711/0816-2` the Client must issue the following request:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=searchRetrieve&version=1.2&query=cat&x-fcs-context=http://hdl.handle.net/4711/0815,http://hdl.handle.net/4711/0816-2
}}}
If an invalid persistent identifier is passed by the Client, the Endpoint `MUST` issue a `http://clarin.eu/fcs/diagnostic/1` diagnostic, i.e. add the appropriate XML fragment to the `<sru:diagnostics>` element of the response. The Endpoint `MAY` treat this condition as fatal, i.e. just issue the diagnostic and perform no search, or it `MAY` treat it as non-fatal and perform the search.

If a Client wants to request one or more Data Views that are handled by the Endpoint with the ''need-to-request'' delivery policy, it `MUST` pass a comma-separated list of ''Data View identifiers'' in the `x-fcs-dataviews` extra request parameter of the ''searchRetrieve'' request.
A Client can extract valid values for the ''Data View identifiers'' from the `@id` attribute of the `<ed:SupportedDataView>` elements in the Endpoint Description of the Endpoint (see section [#explain Operation ''explain''] and section [#endpointDescription Endpoint Description]).

For example, to request the CMDI Data View from an Endpoint that has an Endpoint Description, as described in [#REF_Example_5 Example 5], a Client would need to use the ''Data View identifier'' `cmdi` and submit the following request:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=searchRetrieve&version=1.2&query=cat&x-fcs-dataviews=cmdi
}}}
If an invalid ''Data View identifier'' is passed by the Client, the Endpoint `MUST` issue a `http://clarin.eu/fcs/diagnostic/4` diagnostic, i.e. add the appropriate XML fragment to the `<sru:diagnostics>` element of the response. The Endpoint `MAY` treat this condition as fatal, i.e. simply issue the diagnostic and perform no search, or it `MAY` treat it as non-fatal and perform the search.

= Normative Appendix =
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
TODO: check and proof-read all sub-sections.
}}}
== List of extra request parameters ==
The following extra request parameters are used in CLARIN-FCS. The column ''SRU operations'' lists the SRU operations for which this extra request parameter is to be used.
Clients `MUST NOT` use the parameter for an operation that is not listed in this column. However, if a Client sends an invalid parameter, an Endpoint `SHOULD` issue a fatal diagnostic "Unsupported Parameter" (`info:srw/diagnostic/1/8`) and stop processing the request. Alternatively, an Endpoint `MAY` silently ignore the invalid parameter.

||=Parameter Name =||=SRU operations =||=Allowed values =||=Description =||
|| `x-fcs-endpoint-description` || explain || `true`; all other values are reserved and `MUST NOT` be used by Clients || If the parameter is given (with the value `true`), the Endpoint `MUST` include an Endpoint Description in the `<sru:extraResponseData>` element of the ''explain'' response. ||
|| `x-fcs-context` || searchRetrieve || A comma-separated list of persistent identifiers || The Endpoint `MUST` restrict the search to the resources identified by the persistent identifiers. ||
… …
|| `x-fcs-rewrites-allowed` || searchRetrieve || `true`; all other values are reserved and `MUST NOT` be used by Clients. \\ Clients `MUST` only use this parameter when performing an Advanced Search request. || If the parameter is given (with the value `true`), the Endpoint `MAY` rewrite the query to a simpler query to allow for more recall. ||

== List of diagnostics ==
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
… …
}}}
Apart from the SRU diagnostics defined in [#REF_SRU_12 OASIS-SRU-12, Appendix C] and [#REF_LOC_DIAG LOC-DIAG], the following diagnostics are used in CLARIN-FCS. The column "Details Format" specifies what `SHOULD` be returned in the details field. If this column is blank, the format is "undefined" and the Endpoint `MAY` return whatever it considers appropriate, including nothing.
The column "Impact" specifies whether the Endpoint should continue ("non-fatal") or stop ("fatal") processing.

||=Identifier URI =||=Description =||=Details Format =||=Impact =||=Note =||
|| `http://clarin.eu/fcs/diagnostic/1` || Persistent identifier passed by the Client for restricting the search is invalid. || The offending persistent identifier. || non-fatal || If more than one invalid persistent identifier was submitted by the Client, the Endpoint `MUST` generate a separate diagnostic for each invalid persistent identifier. ||
|| `http://clarin.eu/fcs/diagnostic/2` || Resource set too large. Query context automatically adjusted. || The persistent identifier of the resource to which the query context was adjusted. || non-fatal || If an Endpoint limited the query context to more than one resource, it `MUST` generate a separate diagnostic for each resource to which the query context was adjusted. ||
|| `http://clarin.eu/fcs/diagnostic/3` || Resource set too large. Cannot perform Query. || || fatal || ||
|| `http://clarin.eu/fcs/diagnostic/4` || Requested Data View not valid for this resource. || The Data View MIME type. || non-fatal || If more than one invalid Data View was requested, the Endpoint `MUST` generate a separate diagnostic for each invalid Data View. ||
|| `http://clarin.eu/fcs/diagnostic/10` || General query syntax error. || Detailed error message why the query could not be parsed. || fatal || Endpoints `MUST` use this diagnostic only if the Client performed an Advanced Search request. ||
|| `http://clarin.eu/fcs/diagnostic/11` || Query too complex. Cannot perform Query. || Details why the query could not be performed, e.g. unsupported layer or unsupported combination of operators. || fatal || Endpoints `MUST` use this diagnostic only if the Client performed an Advanced Search request. ||
|| `http://clarin.eu/fcs/diagnostic/12` || Query was rewritten. || Details how the query was rewritten. || non-fatal || Endpoints `MUST` use this diagnostic only if the Client performed an Advanced Search request with the `x-fcs-rewrites-allowed` request parameter. ||
|| `http://clarin.eu/fcs/diagnostic/14` || General processing hint. || E.g. "No matches, because layer 'XY' is not available in your selection of resources" || non-fatal || Endpoints `MUST` use this diagnostic only if the Client performed an Advanced Search request. ||

== CLARIN FCS-QL Grammar Specification == #fcsQLEBNF
The version of the CLARIN FCS-QL is tied to the FCS Core version starting with version 2.0.

The grammar specification for the FCS-QL is heavily based on Poliqarp, with inspiration from the grammars of other query languages. An unqualified or qualified "attribute" denotes the annotation layer to be used, e.g. unqualified "word", "lemma", "pos" or qualified "pos:stts". The default is "text", for compatibility with FCS 1.0, where simple word forms in a pair of single or double quotes can be matched.

=== FCS-QL EBNF ===
{{{#!comment
Please keep the EBNF nicely formatted. Thanks!
}}}
{{{
[1] query ::= main-query within-part?
… …
=== Notes ===
 * "simple-within-scope": possible values for scope
   * "sentence" | "s" | "utterance" | "u": denote a matching scope of something like a sentence or utterance. This provides compatibility with FCS 1.0 ("Generic Hits", "Each hit SHOULD be presented within the context of a complete sentence.")
   * "paragraph" | "p" | "turn" | "t": denote the next larger unit, e.g. something like a paragraph
   * "article" | "session": something like a whole document
 * `[25]` and `[26]` "any $SOMETHING codepoint" are difficult to implement in at least ANTLR and JavaCC, especially in combination with `[27]`
 * regex are not defined/guarded by this grammar

= Non-normative Appendix =
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
TODO: check and proof-read all sub-sections.
}}}
== Syntax variant for Handle system Persistent Identifier URIs ==
Persistent Identifiers from the Handle system are defined in two syntax variants: a regular URI format for the Handle protocol, i.e. with a `hdl:` prefix, or ''actionable'' URIs with a `http://hdl.handle.net/` prefix. Generally, CLARIN software should support both syntax variants, therefore the CLARIN-FCS Interface Specification does not endorse a specific syntax variant. However, Endpoints are recommended to use the ''actionable'' syntax variant.

== Referring to an Endpoint from a CMDI record ==
Centers are encouraged to provide links to their CLARIN-FCS Endpoints in the metadata records for their resources.
Other services, like the VLO, can use this information for automatically configuring an Aggregator for searching resources at the Endpoint.

To refer to an Endpoint, a `<cmdi:ResourceProxy>` element needs to be added to the CMDI record. Its `<cmdi:ResourceType>` child element must have the value `SearchService` and a `@mimetype` attribute with the value `application/sru+xml`. The content of the `<cmdi:ResourceRef>` element must be a URI that points to the Endpoint web service.

Example:
{{{#!xml
<cmdi:CMD xmlns:cmdi="http://www.clarin.eu/cmd/" CMDVersion="1.1">
  <cmdi:Header>
    <!-- ... -->
    <cmdi:MdSelfLink>http://hdl.handle.net/4711/0815</cmdi:MdSelfLink>
    <!-- ... -->
  </cmdi:Header>
  <cmdi:Resources>
    <cmdi:ResourceProxyList>
      <!-- ... -->
      <cmdi:ResourceProxy id="r4711">
        <cmdi:ResourceType mimetype="application/sru+xml">SearchService</cmdi:ResourceType>
        <cmdi:ResourceRef>http://repos.example.org/fcs-endpoint</cmdi:ResourceRef>
      </cmdi:ResourceProxy>
      <!-- ... -->
    </cmdi:ResourceProxyList>
  </cmdi:Resources>
  <!-- ... -->
</cmdi:CMD>
}}}

== Endpoint custom extensions == #extensionExample
The CLARIN-FCS protocol specification allows Endpoints to add custom data to their responses, e.g. to provide hints to an (XSLT/XQuery) application that works directly on CLARIN-FCS. Such an application could use the custom data to generate back and forward links for a GUI to navigate in a result set.

The following example illustrates how extensions can be embedded into the Result Format:
{{{#!xml
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/0815">
  <fcs:DataView type="application/x-clarin-fcs-hits+xml">
    <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
      The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy <hits:Hit>dog</hits:Hit>.
    </hits:Result>
  </fcs:DataView>

  <!--
    NOTE: this is purely fictional and only serves to demonstrate how
          to add custom extensions to the result representation
          within CLARIN-FCS.
  -->

  <!--
    Example 1: a hypothetical Endpoint extension for navigation in a result
    set: it basically provides a set of hrefs that a GUI can convert into
    navigation buttons.
  -->
  <nav:navigation xmlns:nav="http://repos.example.org/navigation">
    <nav:curr href="http://repos.example.org/resultset/4711/4611" />
    <nav:prev href="http://repos.example.org/resultset/4711/4610" />
    <nav:next href="http://repos.example.org/resultset/4711/4612" />
  </nav:navigation>

  <!--
    Example 2: a hypothetical Endpoint extension for directly referencing parent
    resources: it basically provides a link to the parent resource that can be
    exploited by a GUI (e.g. built on XSLT/XQuery).
  -->
  <parent:Parent xmlns:parent="http://repos.example.org/parent"
      ref="http://repos.example.org/path/to/parent/1235.cmdi" />
</fcs:Resource>
}}}

== Endpoint highlight hints for repositories ==
An Aggregator can use the `@ref` attributes of the `<fcs:Resource>`, `<fcs:ResourceFragment>` or `<fcs:DataView>` elements to provide a link for the user to directly jump to the resource at a Repository. To support hit highlighting in the Repository, an Endpoint can augment the URI in the `@ref` attribute with query parameters.

In the following example, the URI `http://repos.example.org/resource.cgi/<pid>` is a CGI script that displays a given resource at the Repository in HTML format and uses the `highlight` query parameter to add highlights to the resource. Whether and how such a feature is implemented is up to the Endpoint and the Repository.
{{{#!xml
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/0815">
  <fcs:DataView type="application/x-clarin-fcs-hits+xml" ref="http://repos.example.org/resource.cgi/4711/0815?highlight=fox">
    <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
      The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy dog.
    </hits:Result>
  </fcs:DataView>
</fcs:Resource>
}}}
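
How an Endpoint might build such an augmented `@ref` URI can be sketched as follows. This is a non-normative Python illustration; the helper name `add_highlight` is hypothetical, and repeating the `highlight` parameter once per term is an assumption of this sketch, since the specification leaves the parameter name and semantics to the Endpoint and the Repository.
{{{#!python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_highlight(ref_uri, terms, param="highlight"):
    """Append one query parameter per hit term to a @ref URI (hypothetical helper).

    Existing query parameters of ref_uri are preserved and the terms are
    percent-encoded. The parameter name "highlight" follows the example
    above; a Repository may expect a different name.
    """
    parts = urlsplit(ref_uri)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend((param, term) for term in terms)
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(add_highlight("http://repos.example.org/resource.cgi/4711/0815", ["fox"]))
}}}
Going through `urlsplit` and `urlencode` rather than plain string concatenation keeps the URI valid when the base URI already carries a query string or when a hit contains characters that need escaping.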