Changes between Version 52 and Version 53 of Taskforces/FCS/FCS-Specification-Draft
- Timestamp: 06/09/17 10:00:42
Taskforces/FCS/FCS-Specification-Draft
v52 v53 4 4 }}} 5 5 [[PageOutline(1-6)]] 6 7 = CLARIN Federated Content Search (CLARIN-FCS) - Core 2.0 = 8 = Introduction = 9 {{{ 10 #!div style="border: 1px solid #000000; font-size: 75%" 11 TODO: Proof-read/Check sub-sections. 12 }}} 6 = CLARIN Federated Content Search (CLARIN-FCS) - Core 2.0 7 8 = Introduction 13 9 The goal of the ''CLARIN Federated Content Search (CLARIN-FCS) - Core'' specification is to introduce an ''interface specification'' that decouples the ''search engine'' functionality from its ''exploitation'', i.e. user-interfaces, third-party applications, and to allow services to access heterogeneous search engines in a uniform way. 14 10 15 == Terminology ==11 == Terminology 16 12 The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `SHOULD NOT`, `RECOMMENDED`, `MAY`, and `OPTIONAL` in this document are to be interpreted as described in [#REF_RFC_2119 RFC2119]. 17 13 18 == Glossary == 19 Aggregator:: A module or service to dispatch queries to repositories and collect results. 20 21 [=#REF_Annotation_Layer Annotation Layer]:: An annotation layer is the sum of possible annotations for a language resource, such as part of speech or orthographic transcription. Usually it is related to a given annotation task or topic. For the scope of the specification it is used as synonym for annotation tier. 22 23 CLARIN-FCS, FCS:: CLARIN federated content search, an interface specification to allow searching within resource content of repositories. 24 25 Client:: A software component, which implements the interface specification to query Endpoints, i.e. an aggregator or a user-interface. 26 27 CQL:: Contextual Query Language, previously known as Common Query Language, is a domain specific language for representing queries to information retrieval systems such as search engines, bibliographic catalogs and museum collection information. 28 29 Data View:: A Data View is a mechanism to support different representations of search results, e.g. 
a "hits with highlights" view, an image or a geolocation. 30 31 Data View Payload, Payload:: The actual content encoded within a Data View, i.e. a CMDI metadata record or a KML encoded geolocation. 32 33 Endpoint:: A software component, which implements the CLARIN-FCS interface specification and translates between CLARIN-FCS and a search engine. 34 35 FCS-QL:: Federated Content Search Query Language is the query language used in the advanced CLARIN-FCS profile. It is derived from Corpus Workbench's [#REF_CQP_Tutorial CQP-TUTORIAL] 36 37 Hit:: A piece of data returned by a Search Engine that matches the search criterion. What is considered a Hit highly depends on Search Engine. 38 39 Interface Specification:: Common harmonized interface and suite of protocols that repositories need to implement. 40 41 Layer:: See [#REF_Annotation_Layer "'Annotation Layer'"] 42 43 PID:: A Persistent identifier is a long-lasting reference to a digital object. 44 45 Repository:: A software component at a CLARIN center that stores resources (= data) and information about these resources (= metadata). 46 47 Repository Registry:: A separate service that allows registering Repositories and their Endpoints and provides information about these to other components, e.g. an Aggregator. The [http://centres.clarin.eu/ CLARIN Center Registry] is an implementation of such a repository registry. 48 49 Resource:: A searchable and addressable entity at an Endpoint, such as a text corpus or a multi-modal corpus. 50 51 Resource Fragment:: A smaller unit in a Resource, i.e. a sentence in a text corpus or a time interval in an audio transcription. 52 53 Result Set:: An (ordered) set of hits that match a search criterion produced by a search engine as the result of processing a query. 54 55 Search Engine:: A software component within a repository, that allows for searching within the repository contents. 56 57 SRU:: Search and Retrieve via URL, is a protocol for Internet search queries. 
Originally introduced by Library of Congress [#REF_LOC_SRU_12 LOC-SRU12], later standardization process moved to OASIS [#REF_SRU_12 OASIS-SRU12]. 58 59 == Normative References == 60 RFC2119[=#REF_RFC_2119]:: Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119, March 1997, \\ [http://www.ietf.org/rfc/rfc2119.txt] 61 62 XML-Namespaces[=#REF_XML_Namespaces]:: Namespaces in XML 1.0 (Third Edition), W3C, 8 December 2009, \\ [http://www.w3.org/TR/2009/REC-xml-names-20091208/] 63 64 OASIS-SRU-Overview[=#REF_SRU_Overview]:: searchRetrieve: Part 0. Overview Version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.html (HTML)], [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.pdf (PDF)] 65 66 OASIS-SRU-APD[=#REF_SRU_APD]:: searchRetrieve: Part 1. Abstract Protocol Definition Version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.pdf (PDF)] 67 68 OASIS-SRU12[=#REF_SRU_12]:: searchRetrieve: Part 2. 
SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.pdf (PDF)] 69 70 OASIS-SRU20[=#REF_SRU_20]:: searchRetrieve: Part 3. SRU searchRetrieve Operation: APD Binding for SRU 2.0 Version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.pdf (PDF)] 71 72 OASIS-CQL[=#REF_CQL]:: searchRetrieve: Part 5. CQL: The Contextual Query Language version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.pdf (PDF)] 73 74 SRU-Explain[=#REF_Explain]:: searchRetrieve: Part 7. SRU Explain Operation version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.pdf (PDF)] 75 76 SRU-Scan[=#REF_Scan]:: searchRetrieve: Part 6. 
SRU Scan Operation version 1.0, OASIS, January 2013, \\ [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.doc] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.html (HTML)] [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.PDF (PDF)] 77 78 LOC-SRU12[=#REF_LOC_SRU_12]:: SRU Version 1.2: SRU !Search/Retrieve Operation, Library of Congress, \\ [http://www.loc.gov/standards/sru/sru-1-2.html] 79 80 LOC-DIAG[=#REF_LOC_DIAG]:: SRU Version 1.2: SRU Diagnostics List, Library of Congress,\\ [http://www.loc.gov/standards/sru/diagnostics/diagnosticsList.html] 81 82 UD-POS[=#REF_UD_POS]:: Universal Dependencies, Universal POS tags v2.0, \\ [https://universaldependencies.github.io/u/pos/index.html] 83 84 SAMPA[=#REF_SAMPA]:: Dafydd Gibbon, Inge Mertins, Roger Moore (Eds.): Handbook of Multimodal and Spoken Language Systems. Resources, Terminology and Product Evaluation, Kluwer Academic Publishers, Boston MA, 2000, ISBN 0-7923-7904-7 85 86 CLARIN-FCS-!DataViews[=#REF_FCS_DataViews]:: CLARIN Federated Content Search (CLARIN-FCS) - Data Views, SCCTC FCS Task-Force, April 2014, \\ [https://trac.clarin.eu/wiki/FCS/Dataviews] 87 88 == Non-Normative References == 89 CQP-TUTORIAL[=#REF_CQP_Tutorial]:: Evert et al.: The IMS Open Corpus Workbench (CWB) CQP Query Language Tutorial, CWB Version 3.0, February 2010, \\ [http://cwb.sourceforge.net/files/CQP_Tutorial/] 90 91 RFC6838[=#REF_RFC_6838]:: Media Type Specifications and Registration Procedures, IETF RFC 6838, January 2013, \\ [http://www.ietf.org/rfc/rfc6838.txt] 92 93 RFC3023[=#REF_RFC_3023]:: XML Media Types, IETF RFC 3023, January 2001, \\ [http://www.ietf.org/rfc/rfc3023.txt] 94 95 == Typographic and XML Namespace conventions == 14 == Glossary 15 Aggregator:: 16 A module or service to dispatch queries to repositories and collect results. 
17 18 [=#REF_Annotation_Layer Annotation Layer]:: 19 An annotation layer is the sum of possible annotations for a language resource, such as part of speech or orthographic transcription. Usually it is related to a given annotation task or topic. For the scope of the specification it is used as synonym for annotation tier. 20 21 CLARIN-FCS, FCS:: 22 CLARIN federated content search, an interface specification to allow searching within resource content of repositories. 23 24 Client:: 25 A software component, which implements the interface specification to query Endpoints, i.e. an aggregator or a user-interface. 26 27 CQL:: 28 Contextual Query Language, previously known as Common Query Language, is a domain specific language for representing queries to information retrieval systems such as search engines, bibliographic catalogs and museum collection information. 29 30 Data View:: 31 A Data View is a mechanism to support different representations of search results, e.g. a "hits with highlights" view, an image or a geolocation. 32 33 Data View Payload, Payload:: 34 The actual content encoded within a Data View, i.e. a CMDI metadata record or a KML encoded geolocation. 35 36 Endpoint:: 37 A software component, which implements the CLARIN-FCS interface specification and translates between CLARIN-FCS and a search engine. 38 39 FCS-QL:: 40 Federated Content Search Query Language is the query language used in the advanced CLARIN-FCS profile. It is derived from Corpus Workbench's [#REF_CQP_Tutorial CQP-TUTORIAL] 41 42 Hit:: 43 A piece of data returned by a Search Engine that matches the search criterion. What is considered a Hit highly depends on Search Engine. 44 45 Interface Specification:: 46 Common harmonized interface and suite of protocols that repositories need to implement. 47 48 Layer:: 49 See [#REF_Annotation_Layer ''Annotation Layer''] 50 51 PID:: 52 A Persistent identifier is a long-lasting reference to a digital object. 
53 54 Repository:: 55 A software component at a CLARIN center that stores resources (= data) and information about these resources (= metadata). 56 57 Repository Registry:: 58 A separate service that allows registering Repositories and their Endpoints and provides information about these to other components, e.g. an Aggregator. The [http://centres.clarin.eu/ CLARIN Center Registry] is an implementation of such a repository registry. 59 60 Resource:: 61 A searchable and addressable entity at an Endpoint, such as a text corpus or a multi-modal corpus. 62 63 Resource Fragment:: 64 A smaller unit in a Resource, i.e. a sentence in a text corpus or a time interval in an audio transcription. 65 66 Result Set:: 67 An (ordered) set of hits that match a search criterion produced by a search engine as the result of processing a query. 68 69 Search Engine:: 70 A software component within a repository, that allows for searching within the repository contents. 71 72 SRU:: 73 Search and Retrieve via URL, is a protocol for Internet search queries. Originally introduced by Library of Congress [#REF_LOC_SRU_12 LOC-SRU12], later standardization process moved to OASIS [#REF_SRU_12 OASIS-SRU12]. 74 75 == Normative References 76 RFC2119[=#REF_RFC_2119]:: 77 Key words for use in RFCs to Indicate Requirement Levels, IETF RFC 2119, March 1997, \\ 78 [http://www.ietf.org/rfc/rfc2119.txt] 79 80 XML-Namespaces[=#REF_XML_Namespaces]:: 81 Namespaces in XML 1.0 (Third Edition), W3C, 8 December 2009, \\ 82 [http://www.w3.org/TR/2009/REC-xml-names-20091208/] 83 84 OASIS-SRU-Overview[=#REF_SRU_Overview]:: 85 searchRetrieve: Part 0. 
Overview Version 1.0, OASIS, January 2013, \\ 86 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.doc] 87 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.html (HTML)], 88 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part0-overview/searchRetrieve-v1.0-os-part0-overview.pdf (PDF)] 89 90 OASIS-SRU-APD[=#REF_SRU_APD]:: 91 searchRetrieve: Part 1. Abstract Protocol Definition Version 1.0, OASIS, January 2013, \\ 92 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.doc] 93 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.html (HTML)] 94 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part1-apd/searchRetrieve-v1.0-os-part1-apd.pdf (PDF)] 95 96 OASIS-SRU12[=#REF_SRU_12]:: 97 searchRetrieve: Part 2. SRU searchRetrieve Operation: APD Binding for SRU 1.2 Version 1.0, OASIS, January 2013, \\ 98 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.doc] 99 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.html (HTML)] 100 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part2-sru1.2/searchRetrieve-v1.0-os-part2-sru1.2.pdf (PDF)] 101 102 OASIS-SRU20[=#REF_SRU_20]:: 103 searchRetrieve: Part 3. 
SRU searchRetrieve Operation: APD Binding for SRU 2.0 Version 1.0, OASIS, January 2013, \\ 104 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.doc] 105 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.html (HTML)] 106 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.pdf (PDF)] 107 108 OASIS-CQL[=#REF_CQL]:: 109 searchRetrieve: Part 5. CQL: The Contextual Query Language version 1.0, OASIS, January 2013, \\ 110 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.doc] 111 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.html (HTML)] 112 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.pdf (PDF)] 113 114 SRU-Explain[=#REF_Explain]:: 115 searchRetrieve: Part 7. SRU Explain Operation version 1.0, OASIS, January 2013, \\ 116 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.doc] 117 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.html (HTML)] 118 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part7-explain/searchRetrieve-v1.0-os-part7-explain.pdf (PDF)] 119 120 SRU-Scan[=#REF_Scan]:: 121 searchRetrieve: Part 6. 
SRU Scan Operation version 1.0, OASIS, January 2013, \\ 122 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.doc] 123 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.html (HTML)] 124 [http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part6-scan/searchRetrieve-v1.0-os-part6-scan.PDF (PDF)] 125 126 LOC-SRU12[=#REF_LOC_SRU_12]:: 127 SRU Version 1.2: SRU !Search/Retrieve Operation, Library of Congress, \\ 128 [http://www.loc.gov/standards/sru/sru-1-2.html] 129 130 LOC-DIAG[=#REF_LOC_DIAG]:: 131 SRU Version 1.2: SRU Diagnostics List, Library of Congress,\\ 132 [http://www.loc.gov/standards/sru/diagnostics/diagnosticsList.html] 133 134 UD-POS[=#REF_UD_POS]:: 135 Universal Dependencies, Universal POS tags v2.0, \\ 136 [https://universaldependencies.github.io/u/pos/index.html] 137 138 SAMPA[=#REF_SAMPA]:: 139 Dafydd Gibbon, Inge Mertins, Roger Moore (Eds.): Handbook of Multimodal and Spoken Language Systems. 
Resources, Terminology and Product Evaluation, Kluwer Academic Publishers, Boston MA, 2000, ISBN 0-7923-7904-7 140 141 CLARIN-FCS-!DataViews[=#REF_FCS_DataViews]:: 142 CLARIN Federated Content Search (CLARIN-FCS) - Data Views, SCCTC FCS Task-Force, April 2014, \\ 143 [https://trac.clarin.eu/wiki/FCS/Dataviews] 144 145 == Non-Normative References 146 CQP-TUTORIAL[=#REF_CQP_Tutorial]:: 147 Evert et al.: The IMS Open Corpus Workbench (CWB) CQP Query Language Tutorial, CWB Version 3.0, February 2010, \\ 148 [http://cwb.sourceforge.net/files/CQP_Tutorial/] 149 150 RFC6838[=#REF_RFC_6838]:: 151 Media Type Specifications and Registration Procedures, IETF RFC 6838, January 2013, \\ 152 [http://www.ietf.org/rfc/rfc6838.txt] 153 154 RFC3023[=#REF_RFC_3023]:: 155 XML Media Types, IETF RFC 3023, January 2001, \\ 156 [http://www.ietf.org/rfc/rfc3023.txt] 157 158 == Typographic and XML Namespace conventions 96 159 The following typographic conventions for XML fragments will be used throughout this specification: 97 98 160 * `<prefix:Element>` \\ An XML element with the Generic Identifier ''Element'' that is bound to an XML namespace denoted by the prefix ''prefix''. 99 161 * `@attr` \\ An XML attribute with the name ''attr'' 100 101 162 {{{#!comment 102 103 163 * `@prefix:attr` \\ An XML attribute with the name ''attr'' that is bound to an XML namespaces denoted by the prefix ''prefix''. 104 105 }}} 106 164 }}} 107 165 * `string` \\ The literal ''string'' must be used either as element content or attribute value. 108 109 166 Endpoints and Clients `MUST` adhere to the [#REF_XML_Namespaces XML-Namespaces] specification. The CLARIN-FCS interface specification generally does not dictate whether XML elements should be serialized in their prefixed or non-prefixed syntax, but Endpoints `MUST` ensure that the correct XML namespace is used for elements and that XML namespaces are declared correctly. Clients `MUST` be agnostic regarding syntax for serializing the XML elements, i.e. 
if the prefixed or un-prefixed variant was used, and `SHOULD` operate solely on ''expanded names'', i.e. pairs of ''namespace name'' and ''local name''. 110 167 111 168 The following XML namespace names and prefixes are used throughout this specification. The column "Recommended Syntax" indicates which syntax variant `SHOULD` be used by the Endpoint to serialize the XML response. 112 113 ||=Prefix =||=Namespace Name =||=Comment =||=Recommended Syntax =|| 114 || `fcs` || `http://clarin.eu/fcs/resource` || CLARIN-FCS Resources || prefixed || 115 || `ed` || `http://clarin.eu/fcs/endpoint-description` || CLARIN-FCS Endpoint Description || prefixed || 116 || `hits` || `http://clarin.eu/fcs/dataview/hits` || CLARIN-FCS Generic Hits Data View || prefixed || 117 || `adv` || `http://clarin.eu/fcs/dataview/advanced` || CLARIN-FCS Advanced Data View || prefixed || 118 || `sru` || `http://docs.oasis-open.org/ns/search-ws/sruResponse` || SRU Version 2.0 || prefixed || 119 || `diag` || `http://docs.oasis-open.org/ns/search-ws/diagnostic` || SRU Version 2.0 Diagnostics || prefixed || 120 || `zr` || `http://explain.z3950.org/dtd/2.0/` || SRU/ZeeRex Explain || prefixed || 121 || `sru` || `http://www.loc.gov/zing/srw/` || SRU Version 1.2, ''only compatibility mode'' || prefixed || 122 || `diag` || `http://www.loc.gov/zing/srw/diagnostic/` || SRU Version 1.2 Diagnostics, ''only compatibility mode'' || prefixed || 123 124 = CLARIN-FCS Interface Specification = 169 ||=Prefix =||=Namespace Name =||=Comment =||=Recommended Syntax =|| 170 || `fcs` || `http://clarin.eu/fcs/resource` || CLARIN-FCS Resources || prefixed || 171 || `ed` || `http://clarin.eu/fcs/endpoint-description` || CLARIN-FCS Endpoint Description || prefixed || 172 || `hits` || `http://clarin.eu/fcs/dataview/hits` || CLARIN-FCS Generic Hits Data View || prefixed || 173 || `adv` || `http://clarin.eu/fcs/dataview/advanced` || CLARIN-FCS Advanced Data View || prefixed || 174 || `sru` || 
`http://docs.oasis-open.org/ns/search-ws/sruResponse` || SRU Version 2.0 || prefixed || 175 || `diag` || `http://docs.oasis-open.org/ns/search-ws/diagnostic` || SRU Version 2.0 Diagnostics || prefixed || 176 || `zr` || `http://explain.z3950.org/dtd/2.0/` || SRU/ZeeRex Explain || prefixed || 177 || `sru` || `http://www.loc.gov/zing/srw/` || SRU Version 1.2, ''only compatibility mode'' || prefixed || 178 || `diag` || `http://www.loc.gov/zing/srw/diagnostic/` || SRU Version 1.2 Diagnostics, ''only compatibility mode'' || prefixed || 179 180 = CLARIN-FCS Interface Specification 125 181 The CLARIN-FCS Interface Specification defines a set of capabilities, an extensible result format and a set of required operations. CLARIN-FCS is built on the SRU/CQL standard and additional functionality required for CLARIN-FCS is added through SRU/CQL's extension mechanisms. 126 182 127 183 Specifically, the CLARIN-FCS Interface Specification consists of two parts, a set of formats, and a transport protocol. The ''Endpoint'' component is a software component that acts as a bridge between a ''Client'' and a ''Search Engine'' and passes the requests sent by the ''Client'' to the ''Search Engine''. The ''Search Engine'' is a custom software component that allows the search of language resources in a Repository. The ''Endpoint'' implements the ''Transport Protocol'' and acts as a mediator between the CLARIN-FCS specific formats and the idiosyncrasies of ''Search Engines'' of the individual Repositories. The following figure illustrates the overall architecture: 128 129 184 {{{ 130 185 +---------+ … … 157 212 In general, the work flow in CLARIN-FCS is as follows: a Client submits a query to an Endpoint. The Endpoint translates the query from CQL or FCS-QL to the query dialect used by the Search Engine and submits the translated query to the Search Engine. The Search Engine processes the query and generates a result set, i.e. it compiles a set of hits that match the search criterion. 
The Endpoint then translates the results from the Search Engine-specific result set format to the CLARIN-FCS result format and sends them to the Client. 158 213 159 == Discovery ==#Discovery214 == Discovery #Discovery 160 215 The ''Discovery'' step allows a Client to gather information about an Endpoint, in particular which capabilities are supported or which resources are available for searching. 161 216 162 === Capabilities ===217 === Capabilities 163 218 A ''Capability'' defines a certain feature set that is part of CLARIN-FCS, e.g. what kind of queries are supported. Each Endpoint implements some (or all) of these Capabilities. The Endpoint will announce the capabilities it provides to allow a Client to auto-tune itself (see section [#endpointDescription Endpoint Description]). Each Capability is identified by a ''Capability Identifier'', which uses the URI syntax. The following Capabilities are defined in CLARIN-FCS: 164 165 ||=Name =||=Capability Identifier =||=Summary =|| 166 || ''Basic Search'' || `http://clarin.eu/fcs/capability/basic-search` || Simple full-text searching || 219 ||=Name =||=Capability Identifier =||=Summary =|| 220 || ''Basic Search'' || `http://clarin.eu/fcs/capability/basic-search` || Simple full-text searching || 167 221 || ''Advanced Search'' || `http://clarin.eu/fcs/capability/advanced-search` || Searching in structured and/or annotated data || 168 222 169 223 Endpoints `MUST` implement the ''Basic Search'' Capability. Endpoints `MUST NOT` invent custom Capability Identifiers and `MUST` only use the values defined above. 
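As an illustration of the Capability rules above, the following is a minimal sketch of how a Client might check the Capability Identifiers announced by an Endpoint. The function name `check_capabilities` and the wording of the messages are assumptions made for this example; only the two identifiers are taken from the table above.

```python
# Capability Identifiers as listed in the Capabilities table.
BASIC_SEARCH = "http://clarin.eu/fcs/capability/basic-search"
ADVANCED_SEARCH = "http://clarin.eu/fcs/capability/advanced-search"
KNOWN_CAPABILITIES = frozenset({BASIC_SEARCH, ADVANCED_SEARCH})


def check_capabilities(announced):
    """Return a list of problems for the announced Capability Identifiers.

    Hypothetical helper: an empty list means no rule violation was found.
    """
    problems = []
    for cap in announced:
        # Endpoints MUST NOT invent custom Capability Identifiers.
        if cap not in KNOWN_CAPABILITIES:
            problems.append(f"unknown capability identifier: {cap}")
    # Endpoints MUST implement the Basic Search Capability.
    if BASIC_SEARCH not in announced:
        problems.append("Basic Search capability is missing")
    return problems


print(check_capabilities([BASIC_SEARCH, ADVANCED_SEARCH]))
```

A Client could run such a check after the Discovery step and skip Endpoints that report any problem, or restrict itself to ''Basic Search'' queries when the ''Advanced Search'' identifier is absent.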
170 224 171 === Endpoint Description === #endpointDescription 225 226 === Endpoint Description #endpointDescription 172 227 {{{ 173 228 #!div style="border: 1px solid #000000; font-size: 75%" … … 177 232 178 233 The XML fragment for ''Endpoint Description'' is encoded as an `<ed:EndpointDescription>` element, that contains the following attributes and children: 179 180 234 * one `@version` attribute (`REQUIRED`) on the `<ed:EndpointDescription>` element. The value of the `@version` attribute `MUST` be `2`. 181 * one `<ed:Capabilities>` element (`REQUIRED`) that contains one or more `<ed:Capability>` elements \\ The content of the `<ed:Capability>` element is a Capability Identifier, that indicates the capabilities, that are supported by the Endpoint. For valid values for the Capability Identifier, see section [#capabilities Capabilities]. This list `MUST NOT` include duplicate values. 182 * one `<ed:SupportedDataViews>` element (`REQUIRED`) \\ A list of Data Views that are supported by this Endpoint. This list is composed of one or more `<ed:SupportedDataView>` elements. The content of a `<ed:SupportedDataView>` `MUST` be the MIME type of a supported Data View, e.g. `application/x-clarin-fcs-hits+xml`. Each `<ed:SupportedDataView>` element `MUST` carry a `@id` and a `@delivery-policy` attribute. The value of the `@id` attribute is later used in the `<ed:Resource>` element to indicate, which Data View is supported by a resource (see below). Endpoints `SHOULD` use the recommended short identifier for the Data View. The `@delivery-policy` indicates, the Endpoint's delivery policy, for that Data View. Valid values are `send-by-default` for the ''send-by-default'' and `need-to-request` for the ''need-to-request'' delivery policy. \\ This list `MUST NOT` include duplicate entries, i.e. no MIME type must appear more than once. 
\\ The value of the `@id` attribute `MUST NOT` contain the characters `,` (comma) or `;` (semicolon) 183 * one `<ed:SupportedLayers>` element (`REQUIRED` if Endpoint supports ''Advanced Search'' capability) \\ A list of Layers that are generally supported by this Endpoint. This list is composed of one or more `<ed:SupportedLayer>` elements. The content of a `<ed:SupportedLayer>` `MUST` be the identifier of a Layer (see [#layers section "Layers"]), e.g. `orth`. Each `<ed:SupportedLayer>` element `MUST` carry an `@id` and a `@delivery-policy` attribute. The value of the `@id` attribute is later used in the `<ed:Resource>` element to indicate, which Data View is supported by a resource (see below). The `@result-id` attribute is used in the Advanced Data View (see [#advancedDataView section "Advanced Data View"]). Each `<ed:SupportedLayer>` element `MAY` carry an optional `@qualifier` attribute. It is used as a qualifier in a FCS-QL search term to address this specific layer. \\ This list `MUST NOT` include duplicate entries, i.e. no Layer with the same `@result-id` must appear more than once. \\ The value of the `@id` or `@result-id` attribute `MUST NOT` contain the characters `,` (comma) or `;` (semicolon). The value of the `@qualifier` attribute `MUST NOT` contain characters other than `a`-`z`,`A`-`Z`,`0`-`9` and `-` (hyphen). The `<ed:SupportedLayer>` element `MAY` carry an `@alt-value-info` and `@alt-value-info-uri` attribute; `@alt-value-info` `SHOULD` contain a short description of the layer, e.g. the original tag set used; `@alt-value-info-uri` `MUST` contain a well-formed URI and `SHOULD` point to a web site with further information, e.g. about the original tag set and how the translation to FCS is done. A Client, e.g. the Aggregator, can display this information together with the search result. 184 * one `<ed:Resources>` element (`REQUIRED`) \\ A list of (top-level) resources that are available, i.e. searchable, at the Endpoint. 
The `<ed:Resources>` element contains one or more `<ed:Resource>` elements (see below). The Endpoint `MUST` declare at least one (top-level) resource. 235 * one `<ed:Capabilities>` element (`REQUIRED`) that contains one or more `<ed:Capability>` elements \\ 236 The content of the `<ed:Capability>` element is a Capability Identifier, that indicates the capabilities, that are supported by the Endpoint. For valid values for the Capability Identifier, see section [#capabilities Capabilities]. This list `MUST NOT` include duplicate values. 237 * one `<ed:SupportedDataViews>` element (`REQUIRED`) \\ 238 A list of Data Views that are supported by this Endpoint. This list is composed of one or more `<ed:SupportedDataView>` elements. The content of a `<ed:SupportedDataView>` `MUST` be the MIME type of a supported Data View, e.g. `application/x-clarin-fcs-hits+xml`. Each `<ed:SupportedDataView>` element `MUST` carry a `@id` and a `@delivery-policy` attribute. The value of the `@id` attribute is later used in the `<ed:Resource>` element to indicate, which Data View is supported by a resource (see below). Endpoints `SHOULD` use the recommended short identifier for the Data View. The `@delivery-policy` indicates, the Endpoint's delivery policy, for that Data View. Valid values are `send-by-default` for the ''send-by-default'' and `need-to-request` for the ''need-to-request'' delivery policy. \\ 239 This list `MUST NOT` include duplicate entries, i.e. no MIME type must appear more than once. \\ 240 The value of the `@id` attribute `MUST NOT` contain the characters `,` (comma) or `;` (semicolon) 241 * one `<ed:SupportedLayers>` element (`REQUIRED` if Endpoint supports ''Advanced Search'' capability) \\ 242 A list of Layers that are generally supported by this Endpoint. This list is composed of one or more `<ed:SupportedLayer>` elements. The content of a `<ed:SupportedLayer>` `MUST` be the identifier of a Layer (see [#layers section "Layers"]), e.g. `orth`. 
Each `<ed:SupportedLayer>` element `MUST` carry an `@id` and a `@delivery-policy` attribute. The value of the `@id` attribute is later used in the `<ed:Resource>` element to indicate, which Data View is supported by a resource (see below). The `@result-id` attribute is used in the Advanced Data View (see [#advancedDataView section "Advanced Data View"]). Each `<ed:SupportedLayer>` element `MAY` carry an optional `@qualifier` attribute. It is used as a qualifier in a FCS-QL search term to address this specific layer. \\ 243 This list `MUST NOT` include duplicate entries, i.e. no Layer with the same `@result-id` must appear more than once. \\ 244 The value of the `@id` or `@result-id` attribute `MUST NOT` contain the characters `,` (comma) or `;` (semicolon). 245 The value of the `@qualifier` attribute `MUST NOT` contain characters other than `a`-`z`,`A`-`Z`,`0`-`9` and `-` (hyphen). 246 The `<ed:SupportedLayer>` element `MAY` carry an `@alt-value-info` and `@alt-value-info-uri` attribute; `@alt-value-info` `SHOULD` contain a short description of the layer, e.g. the original tag set used; `@alt-value-info-uri` `MUST` contain a well-formed URI and `SHOULD` point to a web site with further information, e.g. about the original tag set and how the translation to FCS is done. A Client, e.g. the Aggregator, can display this information together with the search result. 247 * one `<ed:Resources>` element (`REQUIRED`) \\ 248 A list of (top-level) resources that are available, i.e. searchable, at the Endpoint. The `<ed:Resources>` element contains one or more `<ed:Resource>` elements (see below). The Endpoint `MUST` declare at least one (top-level) resource. 185 249 186 250 The `<ed:Resource>` element contains a basic description of a resource that is available at the Endpoint. A resource is a searchable entity, e.g. a single corpus. The `<ed:Resource>` element has a mandatory `@pid` attribute that contains the persistent identifier of the resource. 
This value `MUST` be the same as the ''!MdSelfLink'' of the CMDI record describing the resource. The `<ed:Resource>` element contains the following children:
 * one or more `<ed:Title>` elements (`REQUIRED`) \\
   A human-readable title for the resource. A `REQUIRED` `@xml:lang` attribute indicates the language of the title. An English version of the title is `REQUIRED`. The list of titles `MUST NOT` contain duplicate entries for the same language.
 * zero or more `<ed:Description>` elements (`OPTIONAL`) \\
   An optional human-readable description of the resource. It `SHOULD` be at most one sentence. A `REQUIRED` `@xml:lang` attribute indicates the language of the description. If supplied, an English version of the description is `REQUIRED`. The list of descriptions `MUST NOT` contain duplicate entries for the same language.
 * zero or one `<ed:LandingPageURI>` element (`OPTIONAL`) \\
   A link to a website for the resource, e.g. a landing page that describes a corpus.
 * one `<ed:Languages>` element (`REQUIRED`) \\
   The (relevant) languages available within the resource. The `<ed:Languages>` element contains one or more `<ed:Language>` elements. The content of a `<ed:Language>` element `MUST` be an ISO 639-3 three-letter language code. This element should be repeated for every relevant language available ''within'' the resource; however, the list `MUST NOT` contain duplicate entries.
 * one `<ed:AvailableDataViews>` element (`REQUIRED`) \\
   The Data Views that are available for the resource. The `<ed:AvailableDataViews>` element `MUST` carry a `@ref` attribute that contains a whitespace-separated list of id values, each corresponding to the `@id` attribute value of a referenced `<ed:SupportedDataView>` element. \\
   In case of sub-resources, each Resource `SHOULD` support all Data Views that are supported by the parent resource. However, every resource `MUST` declare all available Data Views independently, i.e. there are no implicit inheritance semantics.
 * one `<ed:AvailableLayers>` element (`REQUIRED` if the Endpoint supports the ''Advanced Search'' capability) \\
   The Layers that are available for the resource. The `<ed:AvailableLayers>` element `MUST` carry a `@ref` attribute that contains a whitespace-separated list of id values, each corresponding to the `@id` attribute value of a referenced `<ed:SupportedLayer>` element. \\
   In case of sub-resources, each Resource `SHOULD` support all Layers that are supported by the parent resource. However, every resource `MUST` declare all available Layers independently, i.e. there are no implicit inheritance semantics.
 * zero or one `<ed:Resources>` element (`OPTIONAL`) \\
   If a resource has searchable sub-resources, the Endpoint `MUST` supply additional finer-grained resource elements, which are wrapped in a `<ed:Resources>` element. A sub-resource is a searchable entity within a resource, e.g. a sub-corpus.

[=#REF_Example_4]Example 4:
{{{#!xml
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <!-- just one top-level resource at the Endpoint -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
}}}
[#REF_Example_4 Example 4] shows a simple Endpoint Description for
an Endpoint that only supports the ''Basic Search'' Capability and only provides the Generic Hits Data View, which is indicated by a `<ed:SupportedDataView>` element. This element carries an `@id` attribute with a value of `hits`, the recommended value for the short identifier, and indicates a delivery policy of ''send-by-default'' by the `@delivery-policy` attribute. The Endpoint provides only one top-level resource, identified by the persistent identifier `http://hdl.handle.net/4711/0815`. The resource has a title as well as a description in German and English. A landing page is located at `http://repos.example.org/corpus1.html`. The predominant language in the resource contents is German. Only the Generic Hits Data View is supported for this resource, because the `<ed:AvailableDataViews>` element only references the `<ed:SupportedDataView>` element whose `@id` has the value `hits`.

[=#REF_Example_5]Example 5:
{{{#!xml
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
    <ed:SupportedDataView id="cmdi" delivery-policy="need-to-request">application/x-cmdi+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <!-- top-level resource 1 -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
    </ed:Resource>
    <!-- top-level resource 2 -->
    <ed:Resource pid="http://hdl.handle.net/4711/0816">
      <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen</ed:Title>
      <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus</ed:Title>
      <ed:LandingPageURI>http://repos.example.org/corpus2.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits cmdi" />
      <ed:Resources>
        <!-- sub-resource 1 of top-level resource 2 -->
        <ed:Resource pid="http://hdl.handle.net/4711/0816-1">
          <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen (vor 1990)</ed:Title>
          <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus (before 1990)</ed:Title>
          <ed:LandingPageURI>http://repos.example.org/corpus2.html#sub1</ed:LandingPageURI>
          <ed:Languages>
            <ed:Language>deu</ed:Language>
          </ed:Languages>
          <ed:AvailableDataViews ref="hits cmdi" />
        </ed:Resource>
        <!-- sub-resource 2 of top-level resource 2 -->
        <ed:Resource pid="http://hdl.handle.net/4711/0816-2">
          <ed:Title xml:lang="de">Zeitungskorpus des Mannheimer Morgen (nach 1990)</ed:Title>
          <ed:Title xml:lang="en">Mannheimer Morgen newspaper corpus (after 1990)</ed:Title>
          <ed:LandingPageURI>http://repos.example.org/corpus2.html#sub2</ed:LandingPageURI>
          <ed:Languages>
            <ed:Language>deu</ed:Language>
          </ed:Languages>
          <ed:AvailableDataViews ref="hits cmdi" />
        </ed:Resource>
      </ed:Resources>
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
}}}
The more complex [#REF_Example_5 Example 5] shows an Endpoint Description for an Endpoint that, similar to [#REF_Example_4 Example 4], supports the ''Basic Search'' capability. In addition to the Generic Hits Data View, it also supports the CMDI Data View.
The delivery policies are ''send-by-default'' for the Generic Hits Data View and ''need-to-request'' for the CMDI Data View. The Endpoint has two top-level resources (identified by the persistent identifiers `http://hdl.handle.net/4711/0815` and `http://hdl.handle.net/4711/0816`). The second top-level resource has two independently searchable sub-resources, identified by the persistent identifiers `http://hdl.handle.net/4711/0816-1` and `http://hdl.handle.net/4711/0816-2`. All resources are described using several properties, like title, description, etc. The first top-level resource provides only the Generic Hits Data View, while the second top-level resource and its children provide both the Generic Hits and the CMDI Data Views.

[=#REF_Example_6]Example 6:
{{{#!xml
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:Capabilities>
    <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
    <ed:Capability>http://clarin.eu/fcs/capability/advanced-search</ed:Capability>
  </ed:Capabilities>
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <!-- ADV-FCS -->
  <ed:SupportedLayers>
    <ed:SupportedLayer id="l1" result-id="http://endpoint.example.org/Layers/orth1">orth</ed:SupportedLayer>
    <ed:SupportedLayer id="l2" result-id="http://endpoint.example.org/Layers/pos1" qualifier="x">pos</ed:SupportedLayer>
    <ed:SupportedLayer id="l3" result-id="http://endpoint.example.org/Layers/pos2" qualifier="y"
                       alt-value-info="STTS tagset"
                       alt-value-info-uri="http://repos.example.org/tagset_doc.html">pos</ed:SupportedLayer>
    <ed:SupportedLayer id="l4" result-id="http://endpoint.example.org/Layers/word" type="empty">word</ed:SupportedLayer>
    <ed:SupportedLayer id="l5" result-id="http://endpoint.example.org/Layers/lemma1">lemma</ed:SupportedLayer>
  </ed:SupportedLayers>

  <ed:Resources>
    <!-- just one top-level resource at the Endpoint -->
    <ed:Resource pid="http://hdl.handle.net/4711/0815">
      <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
      <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
      <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
      <ed:Languages>
        <ed:Language>deu</ed:Language>
      </ed:Languages>
      <ed:AvailableDataViews ref="hits" />
      <ed:AvailableLayers ref="l1 l2 l3 l4 l5" />
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>
}}}
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
TODO: describe the above example
}}}

== Searching
In the ''Searching'' step, the Client performs the actual search request to a previously [#Discovery discovered] Endpoint.

=== Basic Search #basicSearch
The ''Basic Search'' capability provides simple full-text search. Queries in Basic Search `MUST` be expressed in the ''Contextual Query Language'' ([#REF_CQL OASIS-CQL]). The Endpoint `MUST` support ''term-only'' queries. The Endpoint `SHOULD` support queries that combine ''terms'' with boolean operators (''AND'' and ''OR''), including sub-queries. An Endpoint `MAY` also support ''NOT'' or ''PROX'' operator queries. If an Endpoint does not support a query, i.e. the operators used are not supported by the Endpoint, it `MUST` return an appropriate error message using the appropriate SRU diagnostic ([#REF_LOC_DIAG LOC-DIAG]).

… …

Examples of valid CQL queries for Basic Search are:
{{{
cat
… …
cat AND (mouse OR "lazy dog")
}}}

'''NOTE''': In CQL, a ''term'' can be a single token or a phrase, i.e. tokens separated by spaces. If a single ''term'' contains spaces, it needs to be quoted. \\
'''NOTE''': Endpoints `MUST` be able to parse all of CQL. If they don't support a certain CQL feature, they `MUST` generate an appropriate error message (see section [#sruCQL SRU/CQL]). In particular, if an Endpoint ''only'' supports ''Basic Search'', it `MUST NOT` silently accept queries that include CQL features besides ''term-only'' queries and ''terms'' combined with boolean operators, e.g. queries involving context sets.

=== Advanced Search
The ''Advanced Search'' capability allows searching in annotated data that is represented in annotation layers. An annotation ''layer'' contains annotations of a specific type, e.g. a lemma or part-of-speech layer. Queries can span multiple annotation layers.

CLARIN-FCS defines a set of searchable annotation layers with certain semantics and syntax. Endpoints `SHOULD` support as many annotation layers as possible, depending on the resource type.

==== Layers #layers
Each Layer is assumed to be ''segmented'', e.g. to allow for searching for a single lemma. However, CLARIN-FCS does not endorse a specific segmentation, i.e. the segmentation of Layers is in the domain of the Endpoint and ''opaque'' to CLARIN-FCS.
CLARIN-FCS '''does not''' endorse or assume a ''formal linguistic relation'' or ''formal linguistic hierarchy'' between two items on two different layers.

||=Layer Type Identifier =||=Annotation Layer Description =||=Syntax =||=Examples (without quotes) =||
|| `text` || Textual representation of resource, also the layer that is used in [#basicSearch Basic Search] || ''String'' || "Dog", "cat", "walking", "better" ||
|| `lemma` || Lemmatisation || ''String'' || "good", "walk", "dog" ||
|| `pos` || Part-of-Speech annotations || [#REF_UD_POS Universal POS tags] || "NOUN", "VERB", "ADJ" ||
|| `orth` || Orthographic transcription of (mostly) spoken resources || ''String'' || "dug", "cat", "wolking" ||
|| `norm` || Orthographic normalization of (mostly) spoken resources || ''String'' || "dog", "cat", "walking", "best" ||
|| `phonetic` || Phonetic transcription || [#REF_SAMPA SAMPA] || "'du:", "'vi:-d6 'ha:-b@n" ||

The column ''Layer Type Identifier'' denotes the identifier for a layer. It is used in [#fcsQL FCS-QL] queries and the XML serialization for the [#advancedDataView Advanced Data View]. All valid identifiers are defined in the table above; all other identifiers are reserved and `MUST NOT` be used. Clients and Endpoints `MAY` create custom Layer Type Identifiers, e.g. for testing purposes. If they do so, the custom Layer Type Identifiers `MUST` start with the String `x-`, e.g. `x-customLayer`.
The column ''Syntax'' describes the inventory of symbols that a Client `MUST` use with a corresponding annotation layer; the value ''String'' denotes that symbols are arbitrary Unicode Strings, i.e. no fixed inventory of symbols is defined. An Endpoint `SHOULD` provide an appropriate error if a Client uses an invalid value.
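
The identifier rules above lend themselves to a mechanical check. The following Python sketch is only an illustration: the names `DEFINED_LAYER_TYPES` and `is_valid_layer_identifier` are hypothetical and not part of CLARIN-FCS, the set of identifiers is copied from the table above, and the requirement that at least one character follows the `x-` prefix is an assumption of this sketch.

```python
# Layer Type Identifiers copied from the table above.
DEFINED_LAYER_TYPES = frozenset({"text", "lemma", "pos", "orth", "norm", "phonetic"})

# Prefix that custom Layer Type Identifiers must start with.
CUSTOM_PREFIX = "x-"


def is_valid_layer_identifier(identifier: str) -> bool:
    """Return True for a defined identifier or a custom one starting with 'x-'."""
    if identifier in DEFINED_LAYER_TYPES:
        return True
    # Assumption of this sketch: a bare "x-" without a name is rejected.
    return identifier.startswith(CUSTOM_PREFIX) and len(identifier) > len(CUSTOM_PREFIX)
```

An Endpoint could run such a check before evaluating a query and answer with an error for identifiers that are neither defined nor prefixed with `x-`.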

==== FCS-QL #fcsQL
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
… …

Examples of valid FCS-QL queries for ''Advanced Search'' are:
{{{
"walking"
… …
[z:pos = "ADJ" & q:pos = "ADJ"]
}}}

The qualifiers ''z'' in ''z:pos'' and ''q'' in ''q:pos'' `SHOULD` match an available `@qualifier` attribute value of a ''pos'' {{{SupportedLayer}}} in a discovered ''Endpoint Description''.

'''NOTE''': Endpoints supporting ''Advanced Search'' `MUST` be able to parse all of FCS-QL. If they don't support a certain FCS-QL feature, they `MUST` generate an appropriate error message (see section [#sruCQL SRU/CQL]). If an Endpoint ''only'' supports ''Basic Search'', it `MUST NOT` silently accept queries that include FCS-QL features. \\
'''NOTE''': FCS-QL layer identifiers are reserved. The Endpoint `MUST` prepend the local prefix {{{x-}}} to any identifier used outside of the reserved set, e.g. {{{x-customLayer}}} for a local identifier {{{customLayer}}}.

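
A Client can compare the qualifiers used in a query with those declared in a discovered Endpoint Description. The Python sketch below is a rough illustration under several assumptions: `declared_qualifiers`, `undeclared_qualifiers` and `SAMPLE_DESCRIPTION` are hypothetical names, the `<ed:SupportedLayer>` elements are assumed to be in the `ed` namespace as named in the specification text, and the regular expression is a deliberate simplification that only recognizes `qualifier:layer` followed by `=` or `!=`. It is not an FCS-QL parser and would, for instance, also match such a pattern inside a quoted string.

```python
import re
import xml.etree.ElementTree as ET

ED_NS = "http://clarin.eu/fcs/endpoint-description"

# Simplified pattern for "qualifier:layer =" or "qualifier:layer !=".
QUALIFIED_ATTRIBUTE = re.compile(r"([A-Za-z0-9-]+):([A-Za-z0-9-]+)\s*!?=")

# Hypothetical, shortened Endpoint Description used only for this sketch.
SAMPLE_DESCRIPTION = """
<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:SupportedLayers>
    <ed:SupportedLayer id="l2" result-id="http://endpoint.example.org/Layers/pos1" qualifier="x">pos</ed:SupportedLayer>
    <ed:SupportedLayer id="l3" result-id="http://endpoint.example.org/Layers/pos2" qualifier="y">pos</ed:SupportedLayer>
  </ed:SupportedLayers>
</ed:EndpointDescription>
"""


def declared_qualifiers(endpoint_description_xml: str) -> set:
    """Collect (qualifier, layer type) pairs from all SupportedLayer elements."""
    root = ET.fromstring(endpoint_description_xml)
    pairs = set()
    for layer in root.iter(f"{{{ED_NS}}}SupportedLayer"):
        qualifier = layer.get("qualifier")
        if qualifier is not None:
            pairs.add((qualifier, (layer.text or "").strip()))
    return pairs


def undeclared_qualifiers(query: str, declared: set) -> list:
    """Return the (qualifier, layer type) pairs used in the query but not declared."""
    return [pair for pair in QUALIFIED_ATTRIBUTE.findall(query) if pair not in declared]
```

With the sample description, the query `[z:pos = "ADJ" & x:pos = "ADJ"]` uses one declared qualifier (`x`) and one undeclared qualifier (`z`); a Client could warn the user about the latter before sending the query.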
467 468 469 === Result Format 336 470 {{{ 337 471 #!div style="border: 1px solid #000000; font-size: 75%" … … 342 476 CLARIN-FCS uses a customized format for returning results. ''Resource'' and ''Resource Fragments'' serve as containers for hit results, which are presented in one or more ''Data View''. The following section describes the Resource format and Data View format and section [#searchRetrieve Operation ''searchRetrieve''] will describe, how hits are embedded within SRU responses. 343 477 344 ==== Resource and !ResourceFragment ====478 ==== Resource and !ResourceFragment 345 479 To encode search results, CLARIN-FCS supports two building blocks: 346 347 Resources:: A ''Resource'' is a ''searchable'' and ''addressable'' entity at the Endpoint, such as a text corpus or a multi-modal corpus. A resource `SHOULD` be a self-contained unit, i.e. not a single sentence in a text corpus or a time interval in an audio transcription, but rather a complete document from a text corpus or a complete audio transcription. 348 Resource Fragments:: A ''Resource Fragment'' is a smaller unit in a ''Resource'', i.e. a sentence in a text corpus or a time interval in an audio transcription. 480 Resources:: 481 A ''Resource'' is a ''searchable'' and ''addressable'' entity at the Endpoint, such as a text corpus or a multi-modal corpus. A resource `SHOULD` be a self-contained unit, i.e. not a single sentence in a text corpus or a time interval in an audio transcription, but rather a complete document from a text corpus or a complete audio transcription. 482 Resource Fragments:: 483 A ''Resource Fragment'' is a smaller unit in a ''Resource'', i.e. a sentence in a text corpus or a time interval in an audio transcription. 349 484 350 485 A Resource `SHOULD` be the most precise unit of data that is directly addressable as a "whole". 
A Resource `SHOULD` contain a Resource Fragment if the hit consists of just a part of the Resource unit (for example, if the hit is a sentence within a large text). A Resource Fragment `SHOULD` be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using Resource Fragments is `OPTIONAL`, but Endpoints are encouraged to use them. If the Endpoint encodes a hit with a Resource Fragment, the actual hit `SHOULD` be encoded as a Data View within the Resource Fragment.

…
Endpoints `MAY` serialize hits as multiple Data Views; however, they `MUST` provide the Generic Hits (HITS) Data View, encoded either within a Resource Fragment (if applicable) or otherwise within the Resource (if there is no reasonable Resource Fragment). Other Data Views `SHOULD` be put in a place that is logical for their content (as determined by the Endpoint), e.g. a metadata Data View would most likely be put directly below the Resource, whereas a Data View representing some annotation layers directly around the hit is more likely to belong within a Resource Fragment.

[=#REF_Example_1]Example 1:
{{{#!xml
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/08-15">
  <fcs:DataView type="application/x-clarin-fcs-hits+xml">
    <!-- data view payload omitted -->
  </fcs:DataView>
</fcs:Resource>
}}}
[#REF_Example_1 Example 1] shows a simple hit, which is encoded in one Data View of type ''Generic Hits'' embedded within a Resource. The type of the Data View is identified by the MIME type `application/x-clarin-fcs-hits+xml`. The Resource is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`.

[=#REF_Example_2]Example 2:
{{{#!xml
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/08-15">
  <fcs:ResourceFragment>
    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
      <!-- data view payload omitted -->
    </fcs:DataView>
  </fcs:ResourceFragment>
</fcs:Resource>
}}}
[#REF_Example_2 Example 2] shows a hit encoded as a Resource Fragment embedded within a Resource. The actual hit is again encoded as one Data View of type ''Generic Hits''. The hit is not directly referenceable, but the Resource in which the hit occurred is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`. In contrast to [#REF_Example_1 Example 1], the Endpoint decided to provide a "semantically richer" encoding and embedded the hit using a Resource Fragment within the Resource to indicate that the hit is a part of a larger resource, e.g. a sentence in a text document.

[=#REF_Example_3]Example 3:
{{{#!xml
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource"
              pid="http://hdl.handle.net/4711/08-15" ref="http://repos.example.org/file/text_08_15.html">
  <fcs:DataView type="application/x-cmdi+xml"
                pid="http://hdl.handle.net/4711/08-15-1" ref="http://repos.example.org/file/08_15_1.cmdi">
    <!-- data view payload omitted -->
  </fcs:DataView>
  <fcs:ResourceFragment pid="http://hdl.handle.net/4711/08-15-2" ref="http://repos.example.org/file/text_08_15.html#sentence2">
    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
      <!-- data view payload omitted -->
    </fcs:DataView>
  </fcs:ResourceFragment>
</fcs:Resource>
}}}
The more complex [#REF_Example_3 Example 3] is similar to [#REF_Example_2 Example 2], i.e. it shows a hit encoded as one ''Generic Hits'' Data View in a Resource Fragment, which is embedded in a Resource. In contrast to Example 2, another Data View of type ''CMDI'' is embedded directly within the Resource. The Endpoint can use this type of Data View to directly provide CMDI metadata about the Resource to Clients.
All entities of the hit can be referenced by a persistent identifier and a URI. The complete Resource is referenceable by either the persistent identifier `http://hdl.handle.net/4711/08-15` or the URI `http://repos.example.org/file/text_08_15.html`, and the CMDI metadata record in the CMDI Data View is referenceable by either the persistent identifier `http://hdl.handle.net/4711/08-15-1` or the URI `http://repos.example.org/file/08_15_1.cmdi`. The actual hit in the Resource Fragment is also directly referenceable by either the persistent identifier `http://hdl.handle.net/4711/08-15-2` or the URI `http://repos.example.org/file/text_08_15.html#sentence2`.
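
The nesting rules above (Data Views directly below a Resource versus inside a Resource Fragment) can be illustrated from the Client side. The following is a minimal sketch, not part of the specification: the function name `list_data_views` and the helper `q` are hypothetical, and the input is the markup of Example 3 with the payloads omitted.

```python
import xml.etree.ElementTree as ET

# Namespace of the Result Format, as used in the examples of this section.
FCS_NS = "http://clarin.eu/fcs/resource"

# Markup of Example 3 (data view payloads omitted).
EXAMPLE_3 = """<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource"
    pid="http://hdl.handle.net/4711/08-15"
    ref="http://repos.example.org/file/text_08_15.html">
  <fcs:DataView type="application/x-cmdi+xml"
      pid="http://hdl.handle.net/4711/08-15-1"
      ref="http://repos.example.org/file/08_15_1.cmdi"/>
  <fcs:ResourceFragment pid="http://hdl.handle.net/4711/08-15-2"
      ref="http://repos.example.org/file/text_08_15.html#sentence2">
    <fcs:DataView type="application/x-clarin-fcs-hits+xml"/>
  </fcs:ResourceFragment>
</fcs:Resource>"""


def q(name):
    """Hypothetical helper: qualify a local name with the FCS namespace."""
    return "{%s}%s" % (FCS_NS, name)


def list_data_views(resource_xml):
    """Return (container, MIME type, pid) for each Data View of one Resource."""
    resource = ET.fromstring(resource_xml)
    views = []
    # Data Views placed directly below the Resource (e.g. metadata).
    for view in resource.findall(q("DataView")):
        views.append(("Resource", view.get("type"), view.get("pid")))
    # Data Views placed inside Resource Fragments (e.g. the actual hit).
    for fragment in resource.findall(q("ResourceFragment")):
        for view in fragment.findall(q("DataView")):
            views.append(("ResourceFragment", view.get("type"), view.get("pid")))
    return views


for container, mime_type, pid in list_data_views(EXAMPLE_3):
    print(container, mime_type, pid)
```

A Client could use such a listing to locate the mandatory Generic Hits Data View regardless of whether the Endpoint chose to wrap the hit in a Resource Fragment.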

==== Data View
A ''Data View'' serves as a container for encoding the actual search results (the data fragments relevant to search) within CLARIN-FCS. Data Views are designed to allow for different representations of results, i.e. they are deliberately kept open to allow further extensions with more supported Data View formats. This specification only defines a ''most basic'' Data View for representing search results, called ''Generic Hits'' (see below). More Data Views are defined in the supplementary specification [#REF_FCS_DataViews CLARIN-FCS-DataViews].

…
'''NOTE''': The examples in the following sections ''show only'' the payload with the enclosing `<fcs:DataView>` element of a Data View.
Of course, the Data View must be embedded either in a `<fcs:Resource>` or a `<fcs:ResourceFragment>` element. The `@pid` and `@ref` attributes have been omitted for all ''inline'' payload types.

===== Generic Hits (HITS)
||=Description =|| The representation of the hit ||
||=MIME type =|| `application/x-clarin-fcs-hits+xml` ||
||=Payload Disposition =|| ''inline'' ||
||=Payload Delivery =|| ''send-by-default'' (`REQUIRED`) ||
||=Recommended Short Identifier =|| `hits` (`RECOMMENDED`) ||
||=XML Schema =|| [source:FederatedSearch/schema/Core_2/DataView-Hits.xsd DataView-Hits.xsd] ([source:FederatedSearch/schema/Core_2/DataView-Hits.xsd?format=txt download]) ||

The ''Generic Hits'' Data View serves as the ''most basic'' agreement in CLARIN-FCS for serialization of search results and `MUST` be implemented by all Endpoints. In many cases, this Data View can only serve as a (lossy) approximation, because resources at Endpoints are very heterogeneous. For instance, the Generic Hits Data View is probably not the best representation for a hit result in a corpus of spoken language, but an architecture like CLARIN-FCS requires one common representation to be implemented by all Endpoints; therefore this Data View was defined. The Generic Hits Data View supports multiple markers for supplying highlighting for an individual hit, e.g. if a query contains a (boolean) conjunction, the Endpoint can use multiple markers to provide individual highlights for the matching terms.
An Endpoint `MUST NOT` use this Data View to aggregate several hits within one resource. Each hit `SHOULD` be presented within the context of a complete sentence. If that is not possible due to the nature of the type of the resource, the Endpoint `MUST` provide an equivalent reasonable unit of context (e.g. within a phrase of an orthographic transcription of an utterance). The `<hits:Hit>` element within the `<hits:Result>` element is not enforced by the XML schema, but Endpoints are `RECOMMENDED` to use it. The XML fragment of the Generic Hits payload `MUST` be valid according to the XML schema "[source:FederatedSearch/schema/Core_2/DataView-Hits.xsd DataView-Hits.xsd]" ([source:FederatedSearch/schema/Core_2/DataView-Hits.xsd?format=txt download]).

 * Example (single hit marker):
{{{#!xml
<!-- potential @pid and @ref attributes omitted -->
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
  <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
    The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy dog.
  </hits:Result>
</fcs:DataView>
}}}
 * Example (multiple hit markers):
{{{#!xml
<!-- potential @pid and @ref attributes omitted -->
<fcs:DataView type="application/x-clarin-fcs-hits+xml">
  <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
    The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy <hits:Hit>dog</hits:Hit>.
  </hits:Result>
</fcs:DataView>
}}}

===== Advanced (ADV) #advancedDataView
||=Description =|| The representation of the hit for Advanced Search ||
||=MIME type =|| `application/x-clarin-fcs-adv+xml` ||
||=Payload Disposition =|| ''inline'' ||
||=Payload Delivery =|| ''send-by-default'' (`REQUIRED`) ||
||=Recommended Short Identifier =|| `adv` (`RECOMMENDED`) ||
||=XML Schema =|| [source:FederatedSearch/schema/Core_2/DataView-Advanced.xsd DataView-Advanced.xsd] ([source:FederatedSearch/schema/Core_2/DataView-Advanced.xsd?format=txt download]) ||

{{{
…
TODO: describe!
}}}

 - the ADV Data View allows returning structured information for Advanced Search queries
 - organized in one or more annotation layers
 - annotation layer := annotations of a specific type, e.g. part-of-speech or orthographic transcription
 - annotations of two different annotation layers may freely overlap; no self-overlap within an annotation layer
 - Data View serialization in a stand-off like format, i.e. annotations are ranges over the signal (= language resource as character, token or audio stream) denoted by start and end offsets
 - layers are alignable (through their offsets) and referable (through their layer identifier)
 - ADV Data View serialization:
   - a list of segments (= "inventory" of all ranges used to describe annotations)
     - units can be "items" (= offsets in a character or token stream) or "timestamp" (timestamps in an audio stream); timestamps may have a resolution of up to 1/1000 second.
     - endpoints are responsible for choosing proper offsets for segments. They must do so in a consistent manner, i.e. in a single result (= ADV Data View instance) the chosen offsets must allow for aligning the segments of different layers. A recommendation for character streams: character := Unicode codepoint, normalized to Unicode Normalization Form KC (NFKC; Compatibility Decomposition, followed by Canonical Composition)
     - segments may also have an endpoint-specific reference (= URI); it can be shown in the Aggregator and, if the user clicks the link, can open a viewer (e.g. an audio player) at the endpoint
   - a list of layers, each of which has a type (e.g. "pos", "lemma", see Layer Type identifier in section Layers above) and a layer identifier (= URI)
     - a layer consists of one or more Spans. A Span references a segment (and thus inherits its start and end offsets) and contains the actual annotation (e.g. the part-of-speech label) in its content; it MAY also contain an alt-value (e.g. the original annotation value)
     - the document order of layer elements defines the view order in the Aggregator
 - endpoints should return at least all layers that were referenced in the query; they may return more
 - Hit Markers are added by marking Spans as hits (add the `@highlight` attribute); multiple hit markers are supported and the Aggregator may display them in visually distinct ways
 - where to add hit markers is up to the endpoint; generally, "things" that were referenced in the query should be marked.
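
The offset recommendation for character streams can be sketched as follows. This is only an illustration under stated assumptions: the function name `segment_offsets` is hypothetical, tokens are assumed to be separated by whitespace, and offsets are taken to be 1-based and inclusive, as in the example tables of this section. An Endpoint is free to segment differently as long as its offsets stay consistent within one result.

```python
import re
import unicodedata


def segment_offsets(text):
    """Return (token, start, end) tuples with 1-based, inclusive offsets.

    Assumption: one "character" is one Unicode code point of the
    NFKC-normalized text, and tokens are whitespace-delimited.
    """
    normalized = unicodedata.normalize("NFKC", text)
    # Python str indices count code points, so match positions map directly
    # to offsets: start is shifted to 1-based, the exclusive 0-based end
    # equals the inclusive 1-based end.
    return [
        (match.group(), match.start() + 1, match.end())
        for match in re.finditer(r"\S+", normalized)
    ]


segments = segment_offsets("t da 's de enige echte hoop voor ons mensen")
for token, start, end in segments:
    print(token, start, end)
```

Normalizing before counting matters because compatibility characters (for example ligatures) can change the number of code points, which would otherwise shift the offsets of all following segments.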

Example: a sentence interpreted as a character stream
||=Data =|| t || || d || a || || ' || s || || d || e || || e || n || i || g || e || || e || c || h || t || e || || h || o || o || p || || v || o || o || r || || o || n || s || || m || e || n || s || e || n ||
||=Offset =|| 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11 || 12 || 13 || 14 || 15 || 16 || 17 || 18 || 19 || 20 || 21 || 22 || 23 || 24 || 25 || 26 || 27 || 28 || 29 || 30 || 31 || 32 || 33 || 34 || 35 || 36 || 37 || 38 || 39 || 40 || 41 || 42 || 43 ||

Example: several annotation layers for the sentence
||=Offset (Start, End) =|| 1,1 || 3,4 || 6,7 || 9,10 || 12,16 || 18,22 || 24,27 || 29,32 || 34,36 || 38,43 ||
||=Layer ''orth'' =|| t || da || 's || de || enige || echte || hoop || voor || ons || mensen ||
||=Layer ''pos'' =|| X || PRON || VERB || DET || DET || ADJ || NOUN || ADP || PRON || NOUN ||
||=Layer ''lemma'' =|| _ || dat || zijn || de || enig || echt || hoop || voor || ons || mens ||
||=Layer ''phonetic'' =|| t@ || dAz || dAz || d@ || en@G@ || Ext@ || hop || for || Ons || mEns@ ||

Example: XML serialization
{{{#!xml
<Advanced>
  <Segments unit="items">
    <Segment id="s1" start="1" end="1"
             ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=0:173"/>
    <Segment id="s2" start="3" end="4"
             ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=173:304"/>
    <Segment id="s3" start="6" end="7"
             ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=173:304"/>
    <Segment id="s4" start="9" end="10"
             ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=304:480"/>
    <Segment id="s5" start="12" end="16"
             ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=480:1119"/>
    <Segment id="s6" start="18" end="22"
             ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=1339:1901"/>
    <Segment id="s7" start="24" end="27"
             ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=1901:2427"/>
    <Segment id="s8" start="29" end="32"
             ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=3084:3493"/>
    <Segment id="s9" start="34" end="36"
             ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=3493:3754"/>
    <Segment id="s10" start="38" end="43"
             ref="http://hdl.handle.net/4711/123456789?urlappend=%3Fplay=3754:4274"/>
  </Segments>

  <Layers>
    <Layer id="http://endpoint.example.org/Layers/orth1">
      <Span ref="s1">t</Span>
      <Span ref="s2">da</Span>
      <Span ref="s3">'s</Span>
      <Span ref="s4">de</Span>
      <Span ref="s5">enige</Span>
      <Span ref="s6">echte</Span>
      <Span ref="s7">hoop</Span>
      <Span ref="s8">voor</Span>
      <Span ref="s9">ons</Span>
      <Span ref="s10">mensen</Span>
    </Layer>

    <Layer id="http://endpoint.example.org/Layers/pos1">
      <Span ref="s1" alt-value="SPEC(afgebr)">X</Span>
      <Span ref="s2" alt-value="VNW(aanw,pron,stan,vol,3o,ev)">PRON</Span>
      <Span ref="s3" alt-value="WW(pv,tgw,ev)">VERB</Span>
      <Span ref="s4" alt-value="LID(bep,stan,rest)">DET</Span>
      <Span ref="s5" alt-value="VNW(onbep,det,stan,prenom,met-e,rest)">DET</Span>
      <Span ref="s6" alt-value="ADJ(prenom,basis,met-e,stan)">ADJ</Span>
      <Span ref="s7" alt-value="N(soort,ev,basis,zijd,stan)">NOUN</Span>
      <Span ref="s8" alt-value="VZ(init)">ADP</Span>
      <Span ref="s9" alt-value="VNW(pr,pron,obl,vol,1,mv)">PRON</Span>
      <Span ref="s10" alt-value="N(soort,mv,basis)">NOUN</Span>
    </Layer>

    <Layer id="http://endpoint.example.org/Layers/lemma1">
      <Span ref="s1">_</Span>
      <Span ref="s2">dat</Span>
      <Span ref="s3">zijn</Span>
      <Span ref="s4">de</Span>
      <Span ref="s5">enig</Span>
      <Span ref="s6" highlight="h1">echt</Span>
      <Span ref="s7" highlight="h1">hoop</Span>
      <Span ref="s8">voor</Span>
      <Span ref="s9">ons</Span>
      <Span ref="s10">mens</Span>
    </Layer>

    <Layer id="http://endpoint.example.org/Layers/phon">
      <Span ref="s1">t@</Span>
      <Span ref="s2" highlight="h2">dAz</Span>
      <Span ref="s3">dAz</Span>
      <Span ref="s4">d@</Span>
      <Span ref="s5">en@G@</Span>
      <Span ref="s6">Ext@</Span>
      <Span ref="s7">hop</Span>
      <Span ref="s8">for</Span>
      <Span ref="s9">Ons</Span>
      <Span ref="s10">mEns@</Span>
    </Layer>
  </Layers>
</Advanced>
}}}

=== Versioning and Extensions
==== Backwards Compatibility #backwardsCompatibility
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
TODO: check and proof-read
}}}

Clients `MUST` be compatible with CLARIN-FCS 1.0 and therefore must implement SRU 1.2. If a Client uses CLARIN-FCS 1.0 to talk to an Endpoint, it `MUST NOT` use features beyond the Basic Search capability. Clients `MUST` implement a heuristic to automatically determine which CLARIN-FCS protocol version, i.e. which version of the SRU protocol, can be used to talk to an Endpoint.

In addition to the OASIS XML namespaces, Clients `MUST` be able to process the legacy XML namespaces that SRU 1.2 Endpoints use for serializing responses:
 * `http://www.loc.gov/zing/srw/` for SRU response documents, and
 * `http://www.loc.gov/zing/srw/diagnostic/` for diagnostics within SRU response documents.
CLARIN-FCS deviates from the OASIS specifications [#REF_SRU_Overview OASIS-SRU-Overview] and [#REF_SRU_12 OASIS-SRU-12] to ensure backwards compatibility with SRU 1.2 services as they were defined by [#REF_LOC_SRU_12 LOC-SRU12].

Pseudo algorithm for the version detection heuristic:
 * Send an ''explain'' request without the `version` and `operation` parameters
 * Check the SRU response for the content of the element `<sru:explainResponse>/<sru:version>`

==== Endpoint Custom Extensions
Endpoints can add custom extensions, i.e. custom data, to the Result Format.
This extension mechanism can, for example, be used to provide hints for an (XSLT/XQuery) application that works directly on CLARIN-FCS, e.g. to allow it to generate back and forward links to navigate in a result set.

An Endpoint `MAY` add arbitrary XML fragments to the extension hooks provided in the `<fcs:Resource>` element (see the XML schema for "Resource.xsd"). The XML fragment for the extension `MUST` use a custom XML namespace name for the extension. Endpoints `MUST NOT` use XML namespace names that start with the prefixes `http://clarin.eu`, `http://www.clarin.eu/`, `https://clarin.eu` or `https://www.clarin.eu/`.

…
The non-normative appendix contains an [#extensionExample example] of how an extension could be implemented.

= CLARIN-FCS to SRU/CQL binding
== SRU/CQL #sruCQL
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
…

Endpoints and Clients `MUST` implement the SRU/CQL protocol suite as defined in [#REF_SRU_Overview OASIS-SRU-Overview], [#REF_SRU_APD OASIS-SRU-APD], [#REF_CQL OASIS-CQL], [#REF_Explain SRU-Explain], [#REF_Scan SRU-Scan], especially with respect to:
 * Data Model,
 * Query Model,
 * Processing Model,
 * Result Set Model, and
 * Diagnostics Model

Endpoints and Clients `MUST` implement the APD Binding for SRU 2.0, as defined in [#REF_SRU_20 OASIS-SRU-20]. \\
Clients `MUST` implement the APD Binding for SRU 1.2, as defined in [#REF_SRU_12 OASIS-SRU-12]. \\
Clients `MAY` also implement the APD Binding for version 1.1. \\
'''NOTE''': when implementing SRU 1.2, Endpoints and Clients `MUST` behave as described in the section [#backwardsCompatibility Backwards Compatibility].

Endpoints or Clients `MUST` support CQL conformance ''Level 2'' (as defined in [#REF_OASIS_CQL OASIS-CQL, section 6]), i.e. be able to ''parse'' (Endpoints) or ''serialize'' (Clients) all of CQL and respond with appropriate error messages to the search/retrieve protocol interface.

…
Endpoints `MUST` support the HTTP GET [#REF_SRU_20 OASIS-SRU-20, Appendix B.1] and HTTP POST [#REF_SRU_20 OASIS-SRU-20, Appendix B.2] lower level protocol bindings. Endpoints `MAY` also support the SOAP [#REF_SRU_20 OASIS-SRU-20, Appendix B.3] binding.

== Operation ''explain'' #explain
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
…

According to the Capabilities supported by the Endpoint, the Explain record `MUST` contain the following elements:
 ''Basic-Search'' Capability::
   `<zr:serverInfo>` as defined in [#REF_Explain SRU-Explain] (`REQUIRED`) \\
   `<zr:databaseInfo>` as defined in [#REF_Explain SRU-Explain] (`REQUIRED`) \\
   `<zr:schemaInfo>` as defined in [#REF_Explain SRU-Explain] (`REQUIRED`). This element `MUST` contain an element `<zr:schema>` with an `@identifier` attribute with a value of `http://clarin.eu/fcs/resource` and an `@name` attribute with a value of `fcs`. \\
   `<zr:configInfo>` is `OPTIONAL` \\
   Other capabilities may define how the `<zr:indexInfo>` element is to be used, therefore it is `NOT RECOMMENDED` for Endpoints to use it in custom extensions.

To support auto-configuration in CLARIN-FCS, the Endpoint `MUST` support the ''Endpoint Description''. The Endpoint Description is included in the explain response using SRU's extension mechanism, i.e. by embedding an XML fragment into the `<sru:extraResponseData>` element. The Endpoint `MUST` include the Endpoint Description ''only'' if the Client performs an explain request with the ''extra request parameter'' `x-fcs-endpoint-description` with a value of `true`. If the Client performs an explain request ''without'' supplying this extra request parameter, the Endpoint `MUST NOT` include the Endpoint Description. The format of the Endpoint Description XML fragment is defined in [#endpointDescription Endpoint Description].
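
From the Client side, the extra request parameter described above could be handled roughly as in the following sketch. It is not a reference implementation: the function names `explain_url`, `detect_sru_version` and `capabilities` are hypothetical, the sample response is deliberately abbreviated (the `<sru:record>` element is left out), and the version lookup matches the `version` element by local name only, so that it does not depend on which SRU namespace the Endpoint uses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the Endpoint Description fragment.
ED_NS = "http://clarin.eu/fcs/endpoint-description"

# Abbreviated, hypothetical explain response (record omitted for brevity).
SAMPLE_RESPONSE = """<sru:explainResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:extraResponseData>
    <ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="1">
      <ed:Capabilities>
        <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
      </ed:Capabilities>
    </ed:EndpointDescription>
  </sru:extraResponseData>
</sru:explainResponse>"""


def explain_url(base_url, version=None, endpoint_description=True):
    """Build an explain request URL, optionally asking for the Endpoint Description."""
    params = {"operation": "explain"}
    if version is not None:
        params["version"] = version
    if endpoint_description:
        params["x-fcs-endpoint-description"] = "true"
    return base_url + "?" + urlencode(params)


def detect_sru_version(explain_response_xml):
    """Return the text of the <version> child of the response root, or None."""
    root = ET.fromstring(explain_response_xml)
    for child in root:
        # Compare local names only; the namespace differs between SRU versions.
        if child.tag.rsplit("}", 1)[-1] == "version":
            return (child.text or "").strip()
    return None


def capabilities(explain_response_xml):
    """Return the Capability URIs announced in the Endpoint Description, if any."""
    root = ET.fromstring(explain_response_xml)
    return [(c.text or "").strip() for c in root.iter("{%s}Capability" % ED_NS)]


print(explain_url("http://repos.example.org/fcs-endpoint", version="1.2"))
print(detect_sru_version(SAMPLE_RESPONSE), capabilities(SAMPLE_RESPONSE))
```

An empty list from `capabilities` would indicate that the response carries no Endpoint Description, which is the required behaviour when the extra request parameter was not sent.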

The following example shows a request and response to an ''explain'' request with the added extra request parameter `x-fcs-endpoint-description`:
 * HTTP GET request: Client → Endpoint:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=explain&version=1.2&x-fcs-endpoint-description=true
}}}
 * HTTP Response: Endpoint → Client:
{{{#!xml
<?xml version='1.0' encoding='utf-8'?>
<sru:explainResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:record>
    <sru:recordSchema>http://explain.z3950.org/dtd/2.0/</sru:recordSchema>
    <sru:recordPacking>xml</sru:recordPacking>
    <sru:recordData>
      <zr:explain xmlns:zr="http://explain.z3950.org/dtd/2.0/">
        <!-- <zr:serverInfo> is REQUIRED -->
        <zr:serverInfo protocol="SRU" version="1.2" transport="http">
          <zr:host>repos.example.org</zr:host>
          <zr:port>80</zr:port>
          <zr:database>fcs-endpoint</zr:database>
        </zr:serverInfo>
        <!-- <zr:databaseInfo> is REQUIRED -->
        <zr:databaseInfo>
          <zr:title lang="de">Goethe Korpus</zr:title>
          <zr:title lang="en" primary="true">Goethe Corpus</zr:title>
          <zr:description lang="de">Der Goethe Korpus des IDS Mannheim.</zr:description>
          <zr:description lang="en" primary="true">The Goethe corpus of IDS Mannheim.</zr:description>
        </zr:databaseInfo>
        <!-- <zr:schemaInfo> is REQUIRED -->
        <zr:schemaInfo>
          <zr:schema identifier="http://clarin.eu/fcs/resource" name="fcs">
            <zr:title lang="en" primary="true">CLARIN Federated Content Search</zr:title>
          </zr:schema>
        </zr:schemaInfo>
        <!-- <zr:configInfo> is OPTIONAL -->
        <zr:configInfo>
          <zr:default type="numberOfRecords">250</zr:default>
          <zr:setting type="maximumRecords">1000</zr:setting>
        </zr:configInfo>
      </zr:explain>
    </sru:recordData>
  </sru:record>
  <!-- <sru:echoedExplainRequest> is OPTIONAL -->
  <sru:echoedExplainRequest>
    <sru:version>1.2</sru:version>
    <sru:baseUrl>http://repos.example.org/fcs-endpoint</sru:baseUrl>
  </sru:echoedExplainRequest>
  <sru:extraResponseData>
    <ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="1">
      <ed:Capabilities>
        <ed:Capability>http://clarin.eu/fcs/capability/basic-search</ed:Capability>
      </ed:Capabilities>
      <ed:SupportedDataViews>
        <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
      </ed:SupportedDataViews>
      <ed:Resources>
        <!-- just one top-level resource at the Endpoint -->
        <ed:Resource pid="http://hdl.handle.net/4711/0815">
          <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
          <ed:Title xml:lang="en">Goethe Corpus</ed:Title>
          <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
          <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
          <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
          <ed:Languages>
            <ed:Language>deu</ed:Language>
          </ed:Languages>
          <ed:AvailableDataViews ref="hits"/>
        </ed:Resource>
      </ed:Resources>
    </ed:EndpointDescription>
  </sru:extraResponseData>
</sru:explainResponse>
}}}

== Operation ''scan'' #scan
The ''scan'' operation of the SRU protocol is currently not used in the ''Basic Search'' or ''Advanced Search'' capability of CLARIN-FCS. Future capabilities may use this operation, therefore it is `NOT RECOMMENDED` for Endpoints to define custom extensions that use this operation.

== Operation ''searchRetrieve'' #searchRetrieve
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
… …
The ''searchRetrieve'' operation of the SRU protocol is used for searching in the Resources that are provided by the Endpoint. The SRU protocol defines the serialization of request and response formats in [#REF_SRU_20 OASIS-SRU-20] for SRU version 2.0 and [#REF_SRU_12 OASIS-SRU-12] for SRU version 1.2. An Endpoint `MUST` respond in the correct format, i.e. when the Endpoint also supports SRU 1.2 and the request is issued in SRU version 1.2, the response must be encoded accordingly.

In SRU, search result hits are encoded down to a record level, i.e. the `<sru:record>` element, and SRU allows records to be serialized in various formats, so-called ''record schemas''. Endpoints `MUST` support the CLARIN-FCS record schema (see section [#resultFormat Result Format]) and `MUST` use the value `http://clarin.eu/fcs/resource` for the ''responseItemType'' ("record schema identifier").
Endpoints `MUST` represent exactly ''one hit'' within the Resource as one SRU record, i.e. `<sru:record>` element.

The following example shows a request and response to a ''searchRetrieve'' request with a ''term-only'' query for "cat":
 * HTTP GET request: Client → Endpoint:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=searchRetrieve&version=1.2&query=cat
}}}
 * HTTP Response: Endpoint → Client:
{{{#!xml
<?xml version='1.0' encoding='utf-8'?>
<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:numberOfRecords>6</sru:numberOfRecords>
  <sru:records>
    <sru:record>
      <sru:recordSchema>http://clarin.eu/fcs/resource</sru:recordSchema>
      <sru:recordPacking>xml</sru:recordPacking>
      <sru:recordData>
        <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/08-15">
          <fcs:ResourceFragment>
            <fcs:DataView type="application/x-clarin-fcs-hits+xml">
              <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
                The quick brown <hits:Hit>cat</hits:Hit> jumps over the lazy dog.
              </hits:Result>
            </fcs:DataView>
          </fcs:ResourceFragment>
        </fcs:Resource>
      </sru:recordData>
      <sru:recordPosition>1</sru:recordPosition>
    </sru:record>
    <!-- more <sru:record> elements omitted for brevity -->
  </sru:records>
  <!-- <sru:echoedSearchRetrieveRequest> is OPTIONAL -->
  <sru:echoedSearchRetrieveRequest>
    <sru:version>1.2</sru:version>
    <sru:query>cat</sru:query>
    <sru:xQuery xmlns="http://www.loc.gov/zing/cql/xcql/">
      <searchClause>
        <index>cql.serverChoice</index>
        <relation>
          <value>=</value>
        </relation>
        <term>cat</term>
      </searchClause>
    </sru:xQuery>
    <sru:startRecord>1</sru:startRecord>
    <sru:baseUrl>http://repos.example.org/fcs-endpoint</sru:baseUrl>
  </sru:echoedSearchRetrieveRequest>
</sru:searchRetrieveResponse>
}}}

In general, the Endpoint is `REQUIRED` to accept an ''unrestricted search'' and `SHOULD` perform the search operation on ''all'' Resources that are available at the Endpoint. If that is for some reason not feasible, e.g. performing an unrestricted search would allocate too many resources, the Endpoint `MAY` independently restrict the search to a scope that it can handle. If it does so, it `MUST` issue a non-fatal diagnostic `http://clarin.eu/fcs/diagnostic/2` ("Resource set too large. Query context automatically adjusted."). The details field of the diagnostic `MUST` contain the persistent identifier of the resource to which the query scope was limited. If the Endpoint limits the query scope to more than one resource, it `MUST` generate a ''separate'' non-fatal diagnostic `http://clarin.eu/fcs/diagnostic/2` for each of the resources.
… …
The Client can extract all valid persistent identifiers from the `@pid` attribute of the `<ed:Resource>` element, obtained by the ''explain'' request (see section [#explain Operation ''explain''] and section [#endpointDescription Endpoint Description]). The list of persistent identifiers can get extensive, but a Client can use the HTTP POST method instead of the HTTP GET method for submitting the request.

For example, to restrict the search to the Resource with the persistent identifier `http://hdl.handle.net/4711/0815` the Client must issue the following request:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=searchRetrieve&version=1.2&query=cat&x-fcs-context=http://hdl.handle.net/4711/0815
}}}
To restrict the search to the Resources with the persistent identifiers `http://hdl.handle.net/4711/0815` and `http://hdl.handle.net/4711/0816-2` the Client must issue the following request:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=searchRetrieve&version=1.2&query=cat&x-fcs-context=http://hdl.handle.net/4711/0815,http://hdl.handle.net/4711/0816-2
}}}
If an invalid persistent identifier is passed by the Client, the Endpoint `MUST` issue a `http://clarin.eu/fcs/diagnostic/1` diagnostic, i.e. add the appropriate XML fragment to the `<sru:diagnostics>` element of the response. The Endpoint `MAY` treat this condition as fatal, i.e. just issue the diagnostic and perform no search, or it `MAY` treat it as non-fatal and perform the search.

If a Client wants to request one or more Data Views that are handled by the Endpoint with the ''need-to-request'' delivery policy, it `MUST` pass a comma-separated list of ''Data View identifiers'' in the `x-fcs-dataviews` extra request parameter of the ''searchRetrieve'' request. A Client can extract valid values for the ''Data View identifiers'' from the `@id` attribute of the `<ed:SupportedDataView>` elements in the Endpoint Description of the Endpoint (see section [#explain Operation ''explain''] and section [#endpointDescription Endpoint Description]).

For example, to request the CMDI Data View from an Endpoint that has an Endpoint Description, as described in [#REF_Example_5 Example 5], a Client would need to use the ''Data View identifier'' `cmdi` and submit the following request:
{{{#!sh
http://repos.example.org/fcs-endpoint?operation=searchRetrieve&version=1.2&query=cat&x-fcs-dataviews=cmdi
}}}
If an invalid ''Data View identifier'' is passed by the Client, the Endpoint `MUST` issue a `http://clarin.eu/fcs/diagnostic/4` diagnostic, i.e. add the appropriate XML fragment to the `<sru:diagnostics>` element of the response. The Endpoint `MAY` treat this condition as fatal, i.e. simply issue the diagnostic and perform no search, or it `MAY` treat it as non-fatal and perform the search.
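
The `x-fcs-context` and `x-fcs-dataviews` extra request parameters of the ''searchRetrieve'' operation can be assembled on the Client side roughly as follows. This is a sketch in Python under stated assumptions: the function name `build_search_request` and the `MAX_GET_LENGTH` threshold of 2000 characters are hypothetical and not part of CLARIN-FCS; the specification only notes that a Client can switch to HTTP POST when the list of persistent identifiers gets extensive. The request object is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical threshold; CLARIN-FCS does not define a maximum URL length.
MAX_GET_LENGTH = 2000


def build_search_request(base_url, query, pids=(), dataviews=()):
    """Build a searchRetrieve request, optionally restricted to resources.

    pids      -- persistent identifiers for x-fcs-context
    dataviews -- Data View identifiers for x-fcs-dataviews
    """
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
    }
    if pids:
        # Multiple identifiers are passed as one comma-separated list.
        params["x-fcs-context"] = ",".join(pids)
    if dataviews:
        params["x-fcs-dataviews"] = ",".join(dataviews)

    # Keep ":", "/" and "," unescaped, as in the example request URLs.
    encoded = urlencode(params, safe=":/,")
    url = base_url + "?" + encoded
    if len(url) <= MAX_GET_LENGTH:
        return Request(url, method="GET")
    # Long identifier lists: send the same parameters in a POST body instead.
    return Request(
        base_url,
        data=encoded.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

For a single persistent identifier the resulting GET URL has the same shape as the `x-fcs-context` example request; whether an Endpoint accepts a form-encoded POST body is a property of its SRU implementation and should be checked against the Endpoint in question.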

= Normative Appendix
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
TODO: check and proof-read all sub-sections.
}}}
== List of extra request parameters
The following extra request parameters are used in CLARIN-FCS. The column ''SRU operations'' lists the SRU operations for which the extra request parameter is to be used. Clients `MUST NOT` use the parameter for an operation that is not listed in this column. However, if a Client sends an invalid parameter, an Endpoint `SHOULD` issue a fatal diagnostic "Unsupported Parameter" (`info:srw/diagnostic/1/8`) and stop processing the request. Alternatively, an Endpoint `MAY` silently ignore the invalid parameter.
||=Parameter Name =||=SRU operations =||=Allowed values =||=Description =||
|| `x-fcs-endpoint-description` || explain || `true`; all other values are reserved and `MUST NOT` be used by Clients || If the parameter is given (with the value `true`), the Endpoint `MUST` include an Endpoint Description in the `<sru:extraResponseData>` element of the ''explain'' response. ||
|| `x-fcs-context` || searchRetrieve || A comma-separated list of persistent identifiers || The Endpoint `MUST` restrict the search to the resources identified by the persistent identifiers. ||
… …
|| `x-fcs-rewrites-allowed` || searchRetrieve || `true`; all other values are reserved and `MUST NOT` be used by Clients. \\ Clients `MUST` only use this parameter when performing an Advanced Search request. || If the parameter is given (with the value `true`), the Endpoint `MAY` rewrite the query to a simpler query to allow for more recall. ||

== List of diagnostics
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
… …
}}}
Apart from the SRU diagnostics defined in [#REF_SRU_12 OASIS-SRU-12, Appendix C] and [#REF_LOC_DIAG LOC-DIAG], the following diagnostics are used in CLARIN-FCS. The column "Details Format" specifies what `SHOULD` be returned in the details field. If this column is blank, the format is "undefined" and the Endpoint `MAY` return whatever it feels appropriate, including nothing. The column "Impact" specifies whether the Endpoint should continue ("non-fatal") or should stop ("fatal") processing.
||=Identifier URI =||=Description =||=Details Format =||=Impact =||=Note =||
|| `http://clarin.eu/fcs/diagnostic/1` || Persistent identifier passed by the Client for restricting the search is invalid. || The offending persistent identifier. || non-fatal || If more than one invalid persistent identifier was submitted by the Client, the Endpoint `MUST` generate a separate diagnostic for each invalid persistent identifier. ||
|| `http://clarin.eu/fcs/diagnostic/2` || Resource set too large. Query context automatically adjusted. || The persistent identifier of the resource to which the query context was adjusted. || non-fatal || If an Endpoint limited the query context to more than one resource, it `MUST` generate a separate diagnostic for each resource to which the query context was adjusted. ||
|| `http://clarin.eu/fcs/diagnostic/3` || Resource set too large. Cannot perform Query. || || fatal || ||
|| `http://clarin.eu/fcs/diagnostic/4` || Requested Data View not valid for this resource. || The Data View MIME type. || non-fatal || If more than one invalid Data View was requested, the Endpoint `MUST` generate a separate diagnostic for each invalid Data View. ||
|| `http://clarin.eu/fcs/diagnostic/10` || General query syntax error. || Detailed error message why the query could not be parsed. || fatal || Endpoints `MUST` use this diagnostic only if the Client performed an Advanced Search request. ||
|| `http://clarin.eu/fcs/diagnostic/11` || Query too complex. Cannot perform Query. || Details why the query could not be performed, e.g. unsupported layer or unsupported combination of operators. || fatal || Endpoints `MUST` use this diagnostic only if the Client performed an Advanced Search request. ||
|| `http://clarin.eu/fcs/diagnostic/12` || Query was rewritten. || Details how the query was rewritten. || non-fatal || Endpoints `MUST` use this diagnostic only if the Client performed an Advanced Search request with the `x-fcs-rewrites-allowed` request parameter. ||
|| `http://clarin.eu/fcs/diagnostic/14` || General processing hint. || E.g. "No matches, because layer 'XY' is not available in your selection of resources" || non-fatal || Endpoints `MUST` use this diagnostic only if the Client performed an Advanced Search request. ||

== CLARIN FCS-QL Grammar Specification #fcsQLEBNF
The version of the CLARIN FCS-QL is tied to the FCS Core version starting with version 2.0.

The grammar specification for the FCS-QL is heavily based on Poliqarp, with additional inspiration from other query languages' grammars.
An unqualified or qualified "attribute" denotes the annotation layer to be used, e.g. unqualified "word", "lemma", "pos" or qualified "pos:stts". The default is "text" for compatibility with FCS 1.0, where simple wordforms in a pair of single or double quotes can be matched.

=== FCS-QL EBNF ===
{{{#!comment
Please keep the EBNF nicely formatted. Thanks!
}}}
{{{
[1] query ::= main-query within-part?
… …
=== Notes ===
 * "simple-within-scope": possible values for scope
   * "sentence", "s", "utterance", "u": denote a matching scope of something like a sentence or utterance. This provides compatibility with FCS 1.0 ("Generic Hits", "Each hit SHOULD be presented within the context of a complete sentence.")
   * "paragraph" | "p" | "turn" | "t": denote the next larger unit, e.g. something like a paragraph
   * "article" | "session": something like a whole document
 * {{{[25]}}} and {{{[26]}}} "any $SOMETHING codepoint" are hard to implement in at least ANTLR and JavaCC, especially in combination with {{{[27]}}}
 * regex are not defined/guarded by this grammar

= Non-normative Appendix
{{{
#!div style="border: 1px solid #000000; font-size: 75%"
TODO: check and proof-read all sub-sections.
}}}
== Syntax variant for Handle system Persistent Identifier URIs
Persistent Identifiers from the Handle system are defined in two syntax variants: a regular URI format for the Handle protocol, i.e. with a `hdl:` prefix, or ''actionable'' URIs with a `http://hdl.handle.net/` prefix. Generally, CLARIN software should support both syntax variants, therefore the CLARIN-FCS Interface Specification does not endorse a specific syntax variant. However, Endpoints are recommended to use the ''actionable'' syntax variant.

== Referring to an Endpoint from a CMDI record
Centers are encouraged to provide links to their CLARIN-FCS Endpoints in the metadata records for their resources.
Other services, like the VLO, can use this information for automatically configuring an Aggregator for searching resources at the Endpoint.

To refer to an Endpoint, a `<cmdi:ResourceProxy>` element with child-element `<cmdi:ResourceType>` set to the value `SearchService` and a `@mimetype` attribute with a value of `application/sru+xml` needs to be added to the CMDI record. The content of the `<cmdi:ResourceRef>` element must contain a URI that points to the Endpoint web service.

Example:
{{{#!xml
<cmdi:CMD xmlns:cmdi="http://www.clarin.eu/cmd/" CMDVersion="1.1">
  <cmdi:Header>
    <!-- ... -->
    <cmdi:MdSelfLink>http://hdl.handle.net/4711/0815</cmdi:MdSelfLink>
    <!-- ... -->
  </cmdi:Header>
  <cmdi:Resources>
    <cmdi:ResourceProxyList>
      <!-- ... -->
      <cmdi:ResourceProxy id="r4711">
        <cmdi:ResourceType mimetype="application/sru+xml">SearchService</cmdi:ResourceType>
        <cmdi:ResourceRef>http://repos.example.org/fcs-endpoint</cmdi:ResourceRef>
      </cmdi:ResourceProxy>
      <!-- ... -->
    </cmdi:ResourceProxyList>
  </cmdi:Resources>
  <!-- ... -->
</cmdi:CMD>
}}}

== Endpoint custom extensions #extensionExample
The CLARIN-FCS protocol specification allows Endpoints to add custom data to their responses, e.g. to provide hints to an (XSLT/XQuery) application that works directly on CLARIN-FCS. It could use the custom data to generate back and forward links for a GUI to navigate in a result set.

The following example illustrates how extensions can be embedded into the Result Format:
{{{#!xml
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/0815">
  <fcs:DataView type="application/x-clarin-fcs-hits+xml">
    <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
      The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy <hits:Hit>dog</hits:Hit>.
    </hits:Result>
  </fcs:DataView>

  <!--
    NOTE: this is purely fictional and only serves to demonstrate how
    to add custom extensions to the result representation
    within CLARIN-FCS.
  -->

  <!--
    Example 1: a hypothetical Endpoint extension for navigation in a result
    set: it basically provides a set of hrefs that a GUI can convert into
    navigation buttons.
  -->
  <nav:navigation xmlns:nav="http://repos.example.org/navigation">
    <nav:curr href="http://repos.example.org/resultset/4711/4611" />
    <nav:prev href="http://repos.example.org/resultset/4711/4610" />
    <nav:next href="http://repos.example.org/resultset/4711/4612" />
  </nav:navigation>

  <!--
    Example 2: a hypothetical Endpoint extension for directly referencing parent
    resources: it basically provides a link to the parent resource that can be
    exploited by a GUI (e.g. built on XSLT/XQuery).
  -->
  <parent:Parent xmlns:parent="http://repos.example.org/parent"
                 ref="http://repos.example.org/path/to/parent/1235.cmdi" />
</fcs:Resource>
}}}

== Endpoint highlight hints for repositories
An Aggregator can use the `@ref` attributes of the `<fcs:Resource>`, `<fcs:ResourceFragment>` or `<fcs:DataView>` elements to provide a link for the user to directly jump to the resource at a Repository. To support hit highlighting, an Endpoint can augment the URI in the `@ref` attribute with query parameters to implement hit highlighting in the Repository.

In the following example, the URI `http://repos.example.org/resource.cgi/<pid>` is a CGI script that displays a given resource at the Repository in HTML format and uses the `highlight` query parameter to add highlights to the resource. Of course, it is up to the Endpoint and the Repository whether and how they implement such a feature.
{{{#!xml
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/resource" pid="http://hdl.handle.net/4711/0815">
  <fcs:DataView type="application/x-clarin-fcs-hits+xml" ref="http://repos.example.org/resource.cgi/4711/0815?highlight=fox">
    <hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">
      The quick brown <hits:Hit>fox</hits:Hit> jumps over the lazy dog.
    </hits:Result>
  </fcs:DataView>
</fcs:Resource>
}}}
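
How an Endpoint might derive such a `@ref` value can be sketched as follows. This is an illustration in Python only: the helper name `add_highlight` is hypothetical, and the `highlight` query parameter is the one assumed by the example Repository above, not something CLARIN-FCS prescribes.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_highlight(ref, term, param="highlight"):
    """Append a highlight query parameter to a resource URI.

    Query parameters already present in the URI are preserved.
    """
    parts = urlsplit(ref)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, term))
    # urlencode escapes the term, so multi-word hits stay a valid URI.
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(add_highlight("http://repos.example.org/resource.cgi/4711/0815", "fox"))
```

Parsing and re-encoding the query string, rather than concatenating `?highlight=` onto the URI, keeps the result well-formed when the Repository link already carries parameters of its own.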