Changes between Version 80 and Version 81 of Taskforces/FCS/FCS-Specification-Draft
- Timestamp:
- 09/06/18 14:31:30 (6 years ago)
Taskforces/FCS/FCS-Specification-Draft
v80 v81 228 228 Endpoints need to provide information about their capabilities to support auto-configuration of Clients. The ''Endpoint Description'' mechanism provides the necessary facility to provide this information to the Clients. Endpoints `MUST` encode their capabilities using an XML format and embed this information into the SRU/CQL protocol as described in section [#Operationexplain Operation ''explain'']. The XML fragment generated by the Endpoint for the Endpoint Description `MUST` be valid according to the XML schema "[source:FederatedSearch/schema/Core_2/Endpoint-Description.xsd Endpoint-Description.xsd]" ([source:FederatedSearch/schema/Core_2/Endpoint-Description.xsd?format=txt download]). 229 229 230 The XML fragment for ''Endpoint Description'' is encoded as an `<ed:EndpointDescription>` element ,that contains the following attributes and children:230 The XML fragment for ''Endpoint Description'' is encoded as an `<ed:EndpointDescription>` element that contains the following attributes and children: 231 231 * one `@version` attribute (`REQUIRED`) on the `<ed:EndpointDescription>` element. The value of the `@version` attribute `MUST` be `2`. 232 232 * one `<ed:Capabilities>` element (`REQUIRED`) that contains one or more `<ed:Capability>` elements \\ 233 The content of the `<ed:Capability>` element is a Capability Identifier , that indicates the capabilities,that are supported by the Endpoint. For valid values for the Capability Identifier, see section [#capabilities Capabilities]. This list `MUST NOT` include duplicate values.233 The content of the `<ed:Capability>` element is a Capability Identifier that indicates the capabilities that are supported by the Endpoint. For valid values for the Capability Identifier, see section [#capabilities Capabilities]. This list `MUST NOT` include duplicate values. 234 234 * one `<ed:SupportedDataViews>` element (`REQUIRED`) \\ 235 235 A list of Data Views that are supported by this Endpoint. 
This list is composed of one or more `<ed:SupportedDataView>` elements. The content of a `<ed:SupportedDataView>` element `MUST` be the MIME type of a supported Data View, e.g. `application/x-clarin-fcs-hits+xml`. Each `<ed:SupportedDataView>` element `MUST` carry an `@id` and a `@delivery-policy` attribute. The value of the `@id` attribute is later used in the `<ed:Resource>` element to indicate which Data View is supported by a resource (see below). Endpoints `SHOULD` use the recommended short identifier for the Data View. The `@delivery-policy` attribute indicates the Endpoint's delivery policy for that Data View. Valid values are `send-by-default` for the ''send-by-default'' and `need-to-request` for the ''need-to-request'' delivery policy. \\ … … 245 245 A list of (top-level) resources that are available, i.e. searchable, at the Endpoint. The `<ed:Resources>` element contains one or more `<ed:Resource>` elements (see below). The Endpoint `MUST` declare at least one (top-level) resource. 246 246 247 The `<ed:Resource>` element contains a basic description of a resource that is available at the Endpoint. A resource is a searchable entity, e.g. a single corpus. The `<ed:Resource s>` has a mandatory `@pid` attribute that contains the persistent identifier of the resource. This value `MUST` be the same as the ''!MdSelfLink'' of the CMDI record describing the resource. The `<ed:Resources>` element contains the following children:247 The `<ed:Resource>` element contains a basic description of a resource that is available at the Endpoint. A resource is a searchable entity, e.g. a single corpus. The `<ed:Resource>` has a mandatory `@pid` attribute that contains the persistent identifier of the resource. This value `MUST` be the same as the ''!MdSelfLink'' of the CMDI record describing the resource. The `<ed:Resource>` element contains the following children: 248 248 * one or more `<ed:Title>` elements (`REQUIRED`) \\ 249 249 A human-readable title for the resource.
A `REQUIRED` `@xml:lang` attribute indicates the language of the title. An English version of the title is `REQUIRED`. The list of titles `MUST NOT` contain duplicate entries for the same language. … … 255 255 The (relevant) languages available within the resource. The `<ed:Languages>` element contains one or more `<ed:Language>` elements. The content of a `<ed:Language>` element `MUST` be an ISO 639-3 three-letter language code. This element should be repeated for all (relevant) languages available ''within'' the resource; however, this list `MUST NOT` contain duplicate entries. 256 256 * one `<ed:AvailableDataViews>` element (`REQUIRED`) \\ 257 The Data Views that are available for the resource. The `<ed:AvailableDataViews>` element `MUST` carry a `@ref` attribute , that contains a whitespace-separated list of id values,that correspond to the value of the appropriate `@id` attribute for the `<ed:SupportedDataView>` elements that are referenced. \\257 The Data Views that are available for the resource. The `<ed:AvailableDataViews>` element `MUST` carry a `@ref` attribute that contains a whitespace-separated list of id values that correspond to the value of the appropriate `@id` attribute for the `<ed:SupportedDataView>` elements that are referenced. \\ 258 258 In case of sub-resources, each Resource `SHOULD` support all Data Views that are supported by the parent resource. However, every resource `MUST` declare all available Data Views independently, i.e. there is no implicit inheritance semantic. 259 * one `<ed:AvailableLayers>` element (`REQUIRED` if the Endpoint supports the ''Advanced Search'' capability). The `<ed:AvailableLayers>` element `MUST` carry a `@ref` attribute , that contains a whitespace-separated list of id values,that correspond to the value of the appropriate `@id` attribute for the `<ed:SupportedLayer>` elements that are referenced. \\259 * one `<ed:AvailableLayers>` element (`REQUIRED` if the Endpoint supports the ''Advanced Search'' capability).
The `<ed:AvailableLayers>` element `MUST` carry a `@ref` attribute that contains a whitespace-separated list of id values that correspond to the value of the appropriate `@id` attribute for the `<ed:SupportedLayer>` elements that are referenced. \\ 260 260 In case of sub-resources, each Resource `SHOULD` support all Layers that are supported by the parent resource. However, every resource `MUST` declare all available Layers independently, i.e. there is no implicit inheritance semantic. 261 261 * zero or one `<ed:Resources>` element (`OPTIONAL`) \\ … … 276 276 <ed:Title xml:lang="de">Goethe Korpus</ed:Title> 277 277 <ed:Title xml:lang="en">Goethe corpus</ed:Title> 278 <ed:Description xml:lang="de">D er GoetheKorpus des IDS Mannheim.</ed:Description>278 <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description> 279 279 <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description> 280 280 <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI> … … 304 304 <ed:Title xml:lang="de">Goethe Korpus</ed:Title> 305 305 <ed:Title xml:lang="en">Goethe corpus</ed:Title> 306 <ed:Description xml:lang="de">D er GoetheKorpus des IDS Mannheim.</ed:Description>306 <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description> 307 307 <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description> 308 308 <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI> … … 415 415 416 416 ### Advanced Search ### 417 The ''Advanced Search'' capability allows searching in annotated data ,that is represented in annotation layers. An annotation ''layer'' contains annotations of a specific type, e.g. lemma or part-of-speech layer. Queries can be performed across annotation layers.417 The ''Advanced Search'' capability allows searching in annotated data that is represented in annotation layers. An annotation ''layer'' contains annotations of a specific type, e.g.
lemma or part-of-speech layer. Queries can be performed across annotation layers. 418 418 419 419 CLARIN-FCS defines a set of searchable annotation layers with certain semantics and syntax. Endpoints `SHOULD` support as many different annotation layers as possible, depending on the resource type. … … 434 434 435 435 #### FCS-QL #### 436 Queries in ''Advanced Search'' `MUST` be performed using ''FCS-QL'' ([#FCS-QLEBNF FCS-QL]). The Endpoint `MUST` support parsing all of FCS-QL. If an Endpoint does not support a query, i.e. the used operators or layers are not supported by the Endpoint, it `MUST` return an appropriate error message using the appropriate SRU diagnostic ([#REF_LOC_DIAG LOC-DIAG]). Though if the parameter `x-fcs-rewrites-allowed` is set to `true`the Endpoint `MAY` rewrite the query with changed recall as a result.436 Queries in ''Advanced Search'' `MUST` be performed using ''FCS-QL'' ([#FCS-QLEBNF FCS-QL]). The Endpoint `MUST` support parsing all of FCS-QL. If an Endpoint does not support a query, i.e. the used operators or layers are not supported by the Endpoint, it `MUST` return an appropriate error message using the appropriate SRU diagnostic ([#REF_LOC_DIAG LOC-DIAG]). However, if the parameter `x-fcs-rewrites-allowed` is set to `true`, the Endpoint `MAY` rewrite the query with changed recall as a result. 437 437 438 438 The Endpoint `MUST` perform the query on the annotation layers that make the most sense for the user, e.g. if no specific part-of-speech layer is given and several layers are available from the Discovery phase, it should use the most generic one. Endpoints `SHOULD` perform the query with case sensitivity as specified in the query, which by default is case-sensitive. … … 461 461 462 462 ### Result Format ### 463 The Search Engine will produce a result set containing several hits as the outcome of processing a query. The Endpoint `MUST` serialize these hits in the CLARIN-FCS result format.
Endpoints are `REQUIRED` to adhere to the principle ,that ''one'' hit `MUST` be serialized as ''one'' CLARIN-FCS result record and `MUST NOT` combine several hits in one CLARIN-FCS result record. E.g., if a query matches five different sentences within one text (= the resource), the Endpoint must serialize them as five SRU records, each with one Hit and each referencing the same containing Resource (see section [#searchRetrieve Operation ''searchRetrieve'']).463 The Search Engine will produce a result set containing several hits as the outcome of processing a query. The Endpoint `MUST` serialize these hits in the CLARIN-FCS result format. Endpoints are `REQUIRED` to adhere to the principle that ''one'' hit `MUST` be serialized as ''one'' CLARIN-FCS result record and `MUST NOT` combine several hits in one CLARIN-FCS result record. E.g., if a query matches five different sentences within one text (= the resource), the Endpoint must serialize them as five SRU records, each with one Hit and each referencing the same containing Resource (see section [#searchRetrieve Operation ''searchRetrieve'']). 464 464 465 465 CLARIN-FCS uses a customized format for returning results. ''Resource'' and ''Resource Fragments'' serve as containers for hit results, which are presented in one or more ''Data Views''. The following section describes the Resource format and Data View format, and section [#searchRetrieve Operation ''searchRetrieve''] describes how hits are embedded within SRU responses. … … 474 474 A Resource `SHOULD` be the most precise unit of data that is directly addressable as a "whole". A Resource `SHOULD` contain a Resource Fragment if the hit consists of just a part of the Resource unit (for example if the hit is a sentence within a large text). A Resource Fragment `SHOULD` be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using Resource Fragments is `OPTIONAL`, but Endpoints are encouraged to use them.
If the Endpoint encodes a hit with a Resource Fragment, the actual hit `SHOULD` be encoded as a Data View within the Resource Fragment. 475 475 476 Endpoints `SHOULD` always provide a link to the resource itself, i.e. each Resource or Resource Fragment `SHOULD` be identified by a persistent identifier or a URI ,that is unique for the Endpoint. Even if direct linking is not possible, e.g. due to licensing issues, Endpoints `SHOULD` provide a URI to link to a web page describing the corpus or collection, including instructions on how to obtain it. Endpoints `SHOULD` provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed directly, the Resource Fragment `SHOULD NOT` contain a persistent identifier or a URI.476 Endpoints `SHOULD` always provide a link to the resource itself, i.e. each Resource or Resource Fragment `SHOULD` be identified by a persistent identifier or a URI that is unique for the Endpoint. Even if direct linking is not possible, e.g. due to licensing issues, Endpoints `SHOULD` provide a URI to link to a web page describing the corpus or collection, including instructions on how to obtain it. Endpoints `SHOULD` provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed directly, the Resource Fragment `SHOULD NOT` contain a persistent identifier or a URI. 477 477 478 478 If the Endpoint can provide both a persistent identifier and a URI for either a Resource or a Resource Fragment, then it `SHOULD` provide both. When working with results, Clients `SHOULD` prefer persistent identifiers over regular URIs. … … 535 535 Each Data View is classified under either a ''send-by-default'' or a ''need-to-request'' delivery policy. In the case of the ''send-by-default'' delivery policy, the Endpoint `MUST` send the Data View automatically, i.e.
Endpoints `MUST` unconditionally include the Data View when they serialize a response to a search request. In the case of ''need-to-request'', the Client must explicitly request the Endpoint to include this Data View in the response. This allows the Endpoint to avoid generating and serializing, for every response, Data Views that are "expensive" in terms of computational power or bandwidth. To request such a Data View, a Client `MUST` submit a comma-separated list of Data View identifiers (see section [#endpointDescription Endpoint Description]) in the `x-fcs-dataviews` extra request parameter with the ''searchRetrieve'' request. If a Client requests a Data View that is not valid for the search context, the Endpoint `MUST` generate a non-fatal diagnostic `http://clarin.eu/fcs/diagnostic/4` ("Requested Data View not valid for this resource"). The details field of the diagnostic `MUST` contain the MIME type of the Data View that was not valid. If more than one requested Data View is invalid, the Endpoint `MUST` generate a ''separate'' non-fatal diagnostic `http://clarin.eu/fcs/diagnostic/4` for each invalid Data View. 536 536 537 The description of every Data View contains a recommendation as to how the Endpoint should handle the payload delivery, i.e. whether a Data View is by default considered ''send-by-default'' or ''need-to-request''. Endpoints `MAY` choose to implement a different policy. The information about which policy is implemented by an Endpoint for a specific Data View is part of the ''Endpoint Description'' (see section [#endpointDescription Endpoint Description]). For each Data View, a ''Recommended Short Identifier'' is defined ,that Endpoints `SHOULD` use as the identifier of the Data View in the list of supported Data Views in the ''Endpoint Description''.537 The description of every Data View contains a recommendation as to how the Endpoint should handle the payload delivery, i.e.
whether a Data View is by default considered ''send-by-default'' or ''need-to-request''. Endpoints `MAY` choose to implement a different policy. The information about which policy is implemented by an Endpoint for a specific Data View is part of the ''Endpoint Description'' (see section [#endpointDescription Endpoint Description]). For each Data View, a ''Recommended Short Identifier'' is defined that Endpoints `SHOULD` use as the identifier of the Data View in the list of supported Data Views in the ''Endpoint Description''. 538 538 539 539 The ''Generic Hits'' Data View is mandatory, thus all Endpoints `MUST` implement it and provide search results represented in the ''Generic Hits'' Data View. Endpoints `MUST` implement the ''Generic Hits'' Data View with the ''send-by-default'' delivery policy. … … 576 576 ||=XML Schema =|| [source:FederatedSearch/schema/Core_2/DataView-Advanced.xsd DataView-Advanced.xsd] ([source:FederatedSearch/schema/Core_2/DataView-Advanced.xsd?format=txt download]) || 577 577 578 The ''Advanced (ADV)'' Data View serves as the natu al serialization of search results for ''Advanced Search'' queries. The ADV Data View supports structured information in one or more annotation layers. The annotations are streams (ranges) over the signal in a stand-off like format with start and end offsets. The list of `Segment` elements building a stream can be of type `item` for character-based streams or `timestamp` for audio streams (granularity up to 0.001s). The Endpoint is responsible for choosing the proper offsets for the segments. It `MUST` be possible to align the segments across all annotation layers. For character streams the recommendation is Unicode Normalization Form ''KC''. Segments `MAY` also have an Endpoint-specific reference indicated by a URI that could be shown in the Aggregator, e.g. to open an audio player or other viewer with contents from the Search Engine. The list of `Layer` elements contains `Span` elements making references to the segments.
A `Span` inherits the start and end offsets from its segments and contains the actual annotation as its content. It `MAY` also carry information about the original annotation value in an `@alt-value` attribute. The document order of the `Layer` elements defines the view order in the Aggregator. Each Layer has a ''Layer type identifier'' and a ''Layer identifier''. The Endpoint `SHOULD` at least return all layers that were referenced in the Advanced Search query. It `MAY` return more layers. The attribute `@highlight` is used to mark Spans as hits. Multiple hit markers are supported and the Aggregator `MAY` display them in visually distinct ways. It is up to the Endpoint to decide what should be marked as a hit, but the recommendation is to mark everything referenced in the Advanced Search query.578 The ''Advanced (ADV)'' Data View serves as the natural serialization of search results for ''Advanced Search'' queries. The ADV Data View supports structured information in one or more annotation layers. The annotations are streams (ranges) over the signal in a stand-off like format with start and end offsets. The list of `Segment` elements building a stream can be of type `item` for character-based streams or `timestamp` for audio streams (granularity up to 0.001s). The Endpoint is responsible for choosing the proper offsets for the segments. It `MUST` be possible to align the segments across all annotation layers. For character streams the recommendation is Unicode Normalization Form ''KC''. Segments `MAY` also have an Endpoint-specific reference indicated by a URI that could be shown in the Aggregator, e.g. to open an audio player or other viewer with contents from the Search Engine. The list of `Layer` elements contains `Span` elements making references to the segments. A `Span` inherits the start and end offsets from its segments and contains the actual annotation as its content. It `MAY` also carry information about the original annotation value in an `@alt-value` attribute.
The document order of the `Layer` elements defines the view order in the Aggregator. Each Layer has a ''Layer type identifier'' and a ''Layer identifier''. The Endpoint `SHOULD` at least return all layers that were referenced in the Advanced Search query. It `MAY` return more layers. The attribute `@highlight` is used to mark Spans as hits. Multiple hit markers are supported and the Aggregator `MAY` display them in visually distinct ways. It is up to the Endpoint to decide what should be marked as a hit, but the recommendation is to mark everything referenced in the Advanced Search query. 579 579 580 580 {{{#!comment … … 731 731 Endpoints and Clients `MUST` support CQL conformance ''Level 2'' (as defined in [#REF_OASIS_CQL OASIS-CQL, section 6]), i.e. be able to ''parse'' (Endpoints) or ''serialize'' (Clients) all of CQL and respond with appropriate error messages to the search/retrieve protocol interface. 732 732 733 '''NOTE''': this does ''not imply'' ,that Endpoints are ''required'' to support all of CQL, but rather that they are able to ''parse'' all of CQL and generate the appropriate error message if a query includes a feature they do not support.733 '''NOTE''': this does ''not imply'' that Endpoints are ''required'' to support all of CQL, but rather that they are able to ''parse'' all of CQL and generate the appropriate error message if a query includes a feature they do not support. 734 734 735 735 Endpoints `MUST` generate diagnostics according to [#REF_SRU_20 OASIS-SRU-20, Appendix D] for error conditions or to indicate unsupported features. Unfortunately, the OASIS specification does not provide a comprehensive list of diagnostics for CQL-related errors. Therefore, Endpoints `MUST` use diagnostics from [#REF_LOC_DIAG LOC-DIAG, section "Diagnostics Relating to CQL"] for CQL-related errors.
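As a rough illustration of the diagnostics requirement above, the following Python sketch shows how an Endpoint might report query features it can parse but does not support. It is not part of either version of the specification text. The namespace URI, the two diagnostic URIs and the helper names `build_diagnostic` and `check_features` are assumptions made for this example and should be verified against OASIS-SRU-20 and LOC-DIAG before use.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of SRU 2.0 diagnostics; verify against OASIS-SRU-20.
DIAG_NS = "http://docs.oasis-open.org/ns/search-ws/diagnostic"

# Assumption: two entries taken from the LOC list of CQL-related diagnostics.
QUERY_SYNTAX_ERROR = "info:srw/diagnostic/1/10"
QUERY_FEATURE_UNSUPPORTED = "info:srw/diagnostic/1/48"


def build_diagnostic(uri, details=None, message=None):
    """Return a diagnostic element with uri, details and message children."""
    diag = ET.Element(f"{{{DIAG_NS}}}diagnostic")
    ET.SubElement(diag, f"{{{DIAG_NS}}}uri").text = uri
    if details is not None:
        ET.SubElement(diag, f"{{{DIAG_NS}}}details").text = details
    if message is not None:
        ET.SubElement(diag, f"{{{DIAG_NS}}}message").text = message
    return diag


def check_features(used_features, supported_features):
    """Hypothetical helper: one diagnostic per parsed but unsupported feature."""
    return [
        build_diagnostic(
            QUERY_FEATURE_UNSUPPORTED,
            details=feature,
            message="Query feature unsupported",
        )
        for feature in used_features
        if feature not in supported_features
    ]


if __name__ == "__main__":
    # The query parsed fine, but this Endpoint does not implement "prox".
    for diag in check_features(["and", "prox"], {"and", "or"}):
        print(ET.tostring(diag, encoding="unicode"))
```

The sketch keeps parsing and feature support separate, which mirrors the NOTE above: an Endpoint parses all of CQL and only afterwards decides which parsed features to reject with a diagnostic.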
… … 778 778 <zr:title lang="de">Goethe Korpus</zr:title> 779 779 <zr:title lang="en" primary="true">Goethe corpus</zr:title> 780 <zr:description lang="de">D er GoetheKorpus des IDS Mannheim.</zr:description>780 <zr:description lang="de">Das Goethe-Korpus des IDS Mannheim.</zr:description> 781 781 <zr:description lang="en" primary="true">The Goethe corpus of IDS Mannheim.</zr:description> 782 782 </zr:databaseInfo> … … 813 813 <ed:Title xml:lang="de">Goethe Korpus</ed:Title> 814 814 <ed:Title xml:lang="en">Goethe corpus</ed:Title> 815 <ed:Description xml:lang="de">D er GoetheKorpus des IDS Mannheim.</ed:Description>815 <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description> 816 816 <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description> 817 817 <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI> … … 1125 1125 If an invalid persistent identifier is passed by the Client, the Endpoint `MUST` issue a `http://clarin.eu/fcs/diagnostic/1` diagnostic, i.e. add the appropriate XML fragment to the `<sru:diagnostics>` element of the response. The Endpoint `MAY` treat this condition as fatal, i.e. just issue the diagnostic and perform no search, or it `MAY` treat it as non-fatal and perform the search. 1126 1126 1127 If a Client wants to request one or more Data Views ,that are handled by the Endpoint with the ''need-to-request'' delivery policy, it `MUST` pass a comma-separated list of ''Data View identifiers'' in the `x-fcs-dataviews` extra request parameter of the ''searchRetrieve'' request.
A Client can extract valid values for the ''Data View identifiers'' from the `@id` attribute of the `<ed:SupportedDataView>` elements in the Endpoint Description of the Endpoint (see section [#Operationexplain ''explain''] and section [#EndpointDescription Endpoint Description]).1127 If a Client wants to request one or more Data Views that are handled by the Endpoint with the ''need-to-request'' delivery policy, it `MUST` pass a comma-separated list of ''Data View identifiers'' in the `x-fcs-dataviews` extra request parameter of the ''searchRetrieve'' request. A Client can extract valid values for the ''Data View identifiers'' from the `@id` attribute of the `<ed:SupportedDataView>` elements in the Endpoint Description of the Endpoint (see section [#Operationexplain ''explain''] and section [#EndpointDescription Endpoint Description]). 1128 1128 1129 1129 For example, to request the CMDI Data View from an Endpoint that has an Endpoint Description as described in [#REF_Example_5 Example 5], a Client would need to use the ''Data View identifier'' `cmdi` and submit the following request: … … 1343 1343 <!-- 1344 1344 Example 1: a hypothetical Endpoint extension for navigation in a result 1345 set: it basically provides a set of hrefs ,that a GUI can convert into1345 set: it basically provides a set of hrefs that a GUI can convert into 1346 1346 navigation buttons. 1347 1347 --> … … 1354 1354 <!-- 1355 1355 Example 2: a hypothetical Endpoint extension for directly referencing parent 1356 resources: it basically provides a link to the parent resource ,that can be1356 resources: it basically provides a link to the parent resource that can be 1357 1357 exploited by a GUI (e.g. built on XSLT/XQuery). 1358 1358 -->