Changes between Version 80 and Version 81 of Taskforces/FCS/FCS-Specification-Draft


Timestamp:
09/06/18 14:31:30 (6 years ago)
Author:
fisseni@ids-mannheim.de
Comment:

some Typos

  • Taskforces/FCS/FCS-Specification-Draft

    v80 v81  
    228228Endpoints need to provide information about their capabilities to support auto-configuration of Clients. The ''Endpoint Description'' mechanism provides the necessary facility to provide this information to the Clients. Endpoints `MUST` encode their capabilities using an XML format and embed this information into the SRU/CQL protocol as described in section [#Operationexplain Operation ''explain'']. The XML fragment generated by the Endpoint for the Endpoint Description `MUST` be valid according to the XML schema "[source:FederatedSearch/schema/Core_2/Endpoint-Description.xsd Endpoint-Description.xsd]" ([source:FederatedSearch/schema/Core_2/Endpoint-Description.xsd?format=txt download]).
    229229
    230 The XML fragment for ''Endpoint Description'' is encoded as an `<ed:EndpointDescription>` element, that contains the following attributes and children:
     230The XML fragment for ''Endpoint Description'' is encoded as an `<ed:EndpointDescription>` element that contains the following attributes and children:
    231231 * one `@version` attribute (`REQUIRED`) on the `<ed:EndpointDescription>` element. The value of the `@version` attribute `MUST` be `2`.
    232232 * one `<ed:Capabilities>` element (`REQUIRED`) that contains one or more `<ed:Capability>` elements \\
    233    The content of the `<ed:Capability>` element is a Capability Identifier, that indicates the capabilities, that are supported by the Endpoint. For valid values for the Capability Identifier, see section [#capabilities Capabilities]. This list `MUST NOT` include duplicate values.
     233   The content of the `<ed:Capability>` element is a Capability Identifier that indicates the capabilities that are supported by the Endpoint. For valid values for the Capability Identifier, see section [#capabilities Capabilities]. This list `MUST NOT` include duplicate values.
    234234 * one `<ed:SupportedDataViews>` element (`REQUIRED`) \\
    235235   A list of Data Views that are supported by this Endpoint. This list is composed of one or more `<ed:SupportedDataView>` elements. The content of a `<ed:SupportedDataView>` `MUST` be the MIME type of a supported Data View, e.g. `application/x-clarin-fcs-hits+xml`. Each `<ed:SupportedDataView>` element `MUST` carry an `@id` and a `@delivery-policy` attribute. The value of the `@id` attribute is later used in the `<ed:Resource>` element to indicate which Data View is supported by a resource (see below). Endpoints `SHOULD` use the recommended short identifier for the Data View. The `@delivery-policy` attribute indicates the Endpoint's delivery policy for that Data View. Valid values are `send-by-default` for the ''send-by-default'' and `need-to-request` for the ''need-to-request'' delivery policy. \\
     
    245245   A list of (top-level) resources that are available, i.e. searchable, at the Endpoint. The `<ed:Resources>` element contains one or more `<ed:Resource>` elements (see below). The Endpoint `MUST` declare at least one (top-level) resource.
    246246
    247 The `<ed:Resource>` element contains a basic description of a resource that is available at the Endpoint. A resource is a searchable entity, e.g. a single corpus. The `<ed:Resources>` has a mandatory `@pid` attribute that contains persistent identifier of the resource. This value `MUST` be the same as the ''!MdSelfLink'' of the CMDI record describing the resource. The `<ed:Resources>` element contains the following children:
     247The `<ed:Resource>` element contains a basic description of a resource that is available at the Endpoint. A resource is a searchable entity, e.g. a single corpus. The `<ed:Resource>` has a mandatory `@pid` attribute that contains persistent identifier of the resource. This value `MUST` be the same as the ''!MdSelfLink'' of the CMDI record describing the resource. The `<ed:Resource>` element contains the following children:
    248248 * one or more `<ed:Title>` elements (`REQUIRED`) \\
    249249   A human readable title for the resource. A `REQUIRED` `@xml:lang` attribute indicates the language of the title. An English version of  the title is `REQUIRED`. The list of titles `MUST NOT` contain duplicate entries for the same language.
     
    255255   The (relevant) languages available within the resource. The `<ed:Languages>` element contains one or more `<ed:Language>` elements. The content of a `<ed:Language>` element `MUST` be an ISO 639-3 three-letter language code. This element should be repeated for all (relevant) languages available ''within'' the resource; however, this list `MUST NOT` contain duplicate entries.
    256256 * one `<ed:AvailableDataViews>` element (`REQUIRED`) \\
    257    The Data Views that are available for the resource. The `<ed:AvailableDataViews>` element `MUST` carry a `@ref` attribute, that contains a whitespace-separated list of id values, that correspond to value of the appropriate `@id` attribute for the `<ed:SupportedDataView>` elements that are referenced. \\
     257   The Data Views that are available for the resource. The `<ed:AvailableDataViews>` element `MUST` carry a `@ref` attribute that contains a whitespace-separated list of id values that correspond to value of the appropriate `@id` attribute for the `<ed:SupportedDataView>` elements that are referenced. \\
    258258   In case of sub-resources, each Resource `SHOULD` support all Data Views that are supported by the parent resource. However, every resource `MUST` declare all available Data Views independently, i.e. there is no implicit inheritance semantic.
    259  * one `<ed:AvailableLayers>` element (`REQUIRED` if Endpoint supports ''Advanced Search'' capability). The `<ed:AvailableLayers>` element `MUST` carry a `@ref` attribute, that contains a whitespace-separated list of id values, that correspond to the value of the appropriate `@id` attribute for the `<ed:SupportedLayer>` elements that are referenced. \\
     259 * one `<ed:AvailableLayers>` element (`REQUIRED` if Endpoint supports ''Advanced Search'' capability). The `<ed:AvailableLayers>` element `MUST` carry a `@ref` attribute that contains a whitespace-separated list of id values that correspond to the value of the appropriate `@id` attribute for the `<ed:SupportedLayer>` elements that are referenced. \\
    260260   In case of sub-resources, each Resource `SHOULD` support all Layers that are supported by the parent resource. However, every resource `MUST` declare all available Layers independently, i.e. there is no implicit inheritance semantic.
    261261* zero or one `<ed:Resources>` element (`OPTIONAL`) \\
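
The `@ref` rule for `<ed:AvailableDataViews>` described above can be checked mechanically. The following is a minimal sketch, not part of the specification: the namespace URI bound to `ed`, the function name, the sample `@pid` value, and the trimmed sample document (which is not schema-valid) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "ed" prefix; verify against Endpoint-Description.xsd.
ED_NS = "http://clarin.eu/fcs/endpoint-description"
NS = {"ed": ED_NS}


def undeclared_data_view_refs(xml_text):
    """Return (pid, ref) pairs where a resource references an undeclared Data View id."""
    root = ET.fromstring(xml_text)
    # Collect the @id values declared under <ed:SupportedDataViews>.
    declared = {
        dv.get("id")
        for dv in root.iterfind("ed:SupportedDataViews/ed:SupportedDataView", NS)
    }
    problems = []
    # iter() also visits nested sub-resources, which must declare their views independently.
    for res in root.iter(f"{{{ED_NS}}}Resource"):
        avail = res.find("ed:AvailableDataViews", NS)
        refs = avail.get("ref", "").split() if avail is not None else []
        for ref in refs:
            if ref not in declared:
                problems.append((res.get("pid"), ref))
    return problems


# Hypothetical, heavily trimmed fragment: "cmdi" is referenced but never declared.
SAMPLE = """<ed:EndpointDescription xmlns:ed="http://clarin.eu/fcs/endpoint-description" version="2">
  <ed:SupportedDataViews>
    <ed:SupportedDataView id="hits" delivery-policy="send-by-default">application/x-clarin-fcs-hits+xml</ed:SupportedDataView>
  </ed:SupportedDataViews>
  <ed:Resources>
    <ed:Resource pid="hdl:example/corpus1">
      <ed:Title xml:lang="en">Goethe corpus</ed:Title>
      <ed:AvailableDataViews ref="hits cmdi"/>
    </ed:Resource>
  </ed:Resources>
</ed:EndpointDescription>"""

print(undeclared_data_view_refs(SAMPLE))
```

A check like this complements, rather than replaces, validation against the XML schema, since the schema alone may not enforce that every referenced id is declared.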
     
    276276            <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
    277277            <ed:Title xml:lang="en">Goethe corpus</ed:Title>
    278             <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
     278            <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description>
    279279            <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
    280280            <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
     
    304304            <ed:Title xml:lang="de">Goethe Korpus</ed:Title>
    305305            <ed:Title xml:lang="en">Goethe corpus</ed:Title>
    306             <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
     306            <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description>
    307307            <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
    308308            <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
     
    415415
    416416### Advanced Search ###
    417 The ''Advanced Search'' capability allows searching in annotated data, that is represented in annotation layers. An annotation ''layer'' contains annotations of a specific type, e.g. lemma or part-of-speech layer. Queries can be performed across annotation layer.
     417The ''Advanced Search'' capability allows searching in annotated data that is represented in annotation layers. An annotation ''layer'' contains annotations of a specific type, e.g. lemma or part-of-speech layer. Queries can be performed across annotation layer.
    418418
    419419CLARIN-FCS defines a set of searchable annotation layers with certain semantics and syntax. Endpoints `SHOULD` support as many different, of course depending on the resource type, annotation layers as possible.
     
    434434
    435435#### FCS-QL ####
    436 Queries in ''Advanced Search'' `MUST` be performed using ''FCS-QL'' ([#FCS-QLEBNF FCS-QL]). The Endpoint `MUST` support parsing all of FCS-QL. If an Endpoint does not support a query, i.e. the used operators or layers are not supported by the Endpoint, it `MUST` return an appropriate error message using the appropriate SRU diagnostic ([#REF_LOC_DIAG LOC-DIAG]). Though if the parameter `x-fcs-rewrites-allowed` is set to `true` the Endpoint `MAY` rewrite the query with changed recall as a result.
     436Queries in ''Advanced Search'' `MUST` be performed using ''FCS-QL'' ([#FCS-QLEBNF FCS-QL]). The Endpoint `MUST` support parsing all of FCS-QL. If an Endpoint does not support a query, i.e. the used operators or layers are not supported by the Endpoint, it `MUST` return an appropriate error message using the appropriate SRU diagnostic ([#REF_LOC_DIAG LOC-DIAG]). However, if the parameter `x-fcs-rewrites-allowed` is set to `true`, the Endpoint `MAY` rewrite the query with changed recall as a result.
    437437
    438438The Endpoint `MUST` perform the query on the annotation layers that makes the most sense for the user, e.g. if no specific PartofSpeech layer is given with several layers available from the Discovery phase it should use the most generic one. Endpoints `SHOULD` perform the query with case sensitivity as specified in the query which by default is case sensitive.
     
    461461
    462462### Result Format ###
    463 The Search Engine will produce a result set containing several hits as the outcome of processing a query. The Endpoint `MUST` serialize these hits in the CLARIN-FCS result format. Endpoints are `REQUIRED` to adhere to the principle, that ''one'' hit `MUST` be serialized as ''one'' CLARIN-FCS result record and `MUST NOT` combine several hits in one CLARIN-FCS result record. E.g., if a query matches five different sentences within one text (= the resource), the Endpoint must serialize them as five SRU records each with one Hit each referencing the same containing Resource (see section [#searchRetrieve Operation ''searchRetrieve'']).
     463The Search Engine will produce a result set containing several hits as the outcome of processing a query. The Endpoint `MUST` serialize these hits in the CLARIN-FCS result format. Endpoints are `REQUIRED` to adhere to the principle that ''one'' hit `MUST` be serialized as ''one'' CLARIN-FCS result record and `MUST NOT` combine several hits in one CLARIN-FCS result record. E.g., if a query matches five different sentences within one text (= the resource), the Endpoint must serialize them as five SRU records each with one Hit each referencing the same containing Resource (see section [#searchRetrieve Operation ''searchRetrieve'']).
    464464
    465465CLARIN-FCS uses a customized format for returning results. ''Resource'' and ''Resource Fragments'' serve as containers for hit results, which are presented in one or more ''Data View''. The following section describes the Resource format and Data View format and section [#searchRetrieve Operation ''searchRetrieve''] will describe how hits are embedded within SRU responses.
     
    474474A Resource `SHOULD` be the most precise unit of data that is directly addressable as a "whole". A Resource `SHOULD` contain a Resource Fragment, if the hit consists of just a part of the Resource unit (for example if the hit is a sentence within a large text). A Resource Fragment `SHOULD` be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using Resource Fragments is `OPTIONAL`, but Endpoints are encouraged to use them. If the Endpoint encodes a hit with a Resource Fragment, the actual hit `SHOULD` be encoded as a Data View within the Resource Fragment.
    475475
    476 Endpoints `SHOULD` always provide a link to the resource itself, i.e. each Resource or Resource Fragment `SHOULD` be identified by a persistent identifier or providing a URI, that is unique for the Endpoint. Even if direct linking is not possible, i.e. due to licensing issues, the Endpoints `SHOULD` provide a URI to link to a web-page describing the corpus or collection, including instruction on how to obtain it. Endpoints `SHOULD` provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed directly, the Resource Fragment `SHOULD NOT` contain a persistent identifier or an URI.
     476Endpoints `SHOULD` always provide a link to the resource itself, i.e. each Resource or Resource Fragment `SHOULD` be identified by a persistent identifier or providing a URI that is unique for the Endpoint. Even if direct linking is not possible, i.e. due to licensing issues, the Endpoints `SHOULD` provide a URI to link to a web-page describing the corpus or collection, including instruction on how to obtain it. Endpoints `SHOULD` provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed directly, the Resource Fragment `SHOULD NOT` contain a persistent identifier or an URI.
    477477
    478478If the Endpoint can provide both, a persistent identifier as well as a URI, for either Resource or Resource Fragment, then they `SHOULD` provide both. When working with results, Clients `SHOULD` prefer persistent identifiers over regular URIs.
     
    535535Data Views are classified into a ''send-by-default'' and a ''need-to-request'' delivery policy. In case of the ''send-by-default'' delivery policy, the Endpoint `MUST` send the Data View automatically, i.e. Endpoints `MUST` unconditionally include the Data View when they serialize a response to a search request. In the case of ''need-to-request'', the Client must explicitly request the Endpoint to include this Data View in the response. This enables the Endpoint to not generate and serialize Data Views that are "expensive" in terms of computational power or bandwidth for every response. To request such a Data View, a Client `MUST` submit a comma separated list of Data View identifiers (see section [#endpointDescription Endpoint Description]) in the `x-fcs-dataviews` extra request parameter with the ''searchRetrieve'' request. If a Client requests a Data View that is not valid for the search context, the Endpoint `MUST` generate a non-fatal diagnostic `http://clarin.eu/fcs/diagnostic/4` ("Requested Data View not valid for this resource"). The details field of the diagnostic `MUST` contain the MIME type of the Data View that was not valid. If more than one requested Data View is invalid, the Endpoint `MUST` generate a ''separate'' non-fatal diagnostic `http://clarin.eu/fcs/diagnostic/4` for each of the requested Data Views.
    536536
    537 The description of every Data View contains a recommendation as to how the Endpoint should handle the payload delivery, i.e. if a Data View is by default considered ''send-by-default'' or ''need-to-request''. Endpoint `MAY` choose to implement different policy. The relevant information which policy is implemented by an Endpoint for a specific Data View is part of the ''Endpoint Description'' (see section [#endpointDescription Endpoint Description]). For each Data View, a ''Recommended Short Identifier'' is defined, that Endpoint `SHOULD` use for an identifier of the Data View in the list of supported Data Views in the ''Endpoint Description''
     537The description of every Data View contains a recommendation as to how the Endpoint should handle the payload delivery, i.e. if a Data View is by default considered ''send-by-default'' or ''need-to-request''. Endpoint `MAY` choose to implement different policy. The relevant information which policy is implemented by an Endpoint for a specific Data View is part of the ''Endpoint Description'' (see section [#endpointDescription Endpoint Description]). For each Data View, a ''Recommended Short Identifier'' is defined that Endpoint `SHOULD` use for an identifier of the Data View in the list of supported Data Views in the ''Endpoint Description''
    538538
    539539The ''Generic Hits'' Data View is mandatory, thus all Endpoints `MUST` implement it and provide search results represented in the ''Generic Hits'' Data View. Endpoints `MUST` implement the ''Generic Hits'' Data View with the ''send-by-default'' delivery policy.
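
The delivery-policy rules above can be illustrated from the Endpoint's side. This is a rough sketch under stated assumptions: the function name, the dictionary layout, and the diagnostic representation are hypothetical, and using the raw identifier as the details value for an entirely unknown id is a guess, since the text only defines the details field in terms of a MIME type.

```python
DIAG_INVALID_DATA_VIEW = "http://clarin.eu/fcs/diagnostic/4"


def resolve_data_views(supported, available_ids, x_fcs_dataviews):
    """Decide which Data Views to serialize for one resource.

    supported:       dict mapping Data View id -> (mime_type, delivery_policy)
    available_ids:   ids available for the resource (all assumed to be keys of supported)
    x_fcs_dataviews: raw value of the extra request parameter, or None
    Returns (selected_ids, diagnostics).
    """
    requested = [t.strip() for t in (x_fcs_dataviews or "").split(",") if t.strip()]
    # send-by-default views are always included.
    selected = [i for i in available_ids if supported[i][1] == "send-by-default"]
    diagnostics = []
    for view_id in requested:
        if view_id in supported and view_id in available_ids:
            if view_id not in selected:
                selected.append(view_id)
        else:
            # One separate non-fatal diagnostic per invalid requested Data View.
            # Assumption: fall back to the raw id when no MIME type is known.
            details = supported[view_id][0] if view_id in supported else view_id
            diagnostics.append({
                "uri": DIAG_INVALID_DATA_VIEW,
                "details": details,
                "message": "Requested Data View not valid for this resource",
            })
    return selected, diagnostics


supported = {
    "hits": ("application/x-clarin-fcs-hits+xml", "send-by-default"),
    "cmdi": ("application/x-cmdi+xml", "need-to-request"),
}
selected, diagnostics = resolve_data_views(supported, ["hits"], "cmdi,foo")
print(selected, [d["details"] for d in diagnostics])
```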
     
    576576||=XML Schema                   =|| [source:FederatedSearch/schema/Core_2/DataView-Advanced.xsd DataView-Advanced.xsd] ([source:FederatedSearch/schema/Core_2/DataView-Advanced.xsd?format=txt download]) ||
    577577
    578 The ''Advanced (ADV)'' Data View serves as the natual serialization of search results for ''Advanced Search'' queries. The ADV Data View supports structured information in one or more annotation layers. The annotations are streams (ranges) over the signal in a stand-off like format with start and end offsets. The list of `Segment` elements building a stream can be of type `item` for character-based streams or `timestamp` for audio streams (granularity up to 0.001s). The Endpoint is responsible for choosing the proper offsets for the segments. The segments `MUST` be possible to align over all annotation layers. For character streams the recommendation is Unicode Normalization Form ''KC''. Segments `MAY` also have an endpoint specific reference indicated by an URI that could be shown in the Aggregator, e.g. to open an audio player or other viewer with contents from the Search Engine. The list of `Layer` elements contains `Span` elements making references to the segments. A `Span` inherits the start and end offsets from its segments and contains the actual annotation as its content. It `MAY` also carry information about the original annotation value in an `@alt-value` attribute. The document order of the `Layer` elements define the view order in the Aggregator. Each Layer has a ''Layer type identifier'' and a ''Layer identifier''.  The Endpoint `SHOULD` at least return all layers that were referenced in the  Advanced Search query. It `MAY` return more layers. The attribute `@highlight` is used to mark Spans as hits. Multiple hit markers are supported and the Aggregator `MAY` display them visually distinct. It is up to the Endpoint to decide what should be marked as a hit, but the recommendation is to mark everything referenced in the Advanced Search query.
     578The ''Advanced (ADV)'' Data View serves as the natural serialization of search results for ''Advanced Search'' queries. The ADV Data View supports structured information in one or more annotation layers. The annotations are streams (ranges) over the signal in a stand-off like format with start and end offsets. The list of `Segment` elements building a stream can be of type `item` for character-based streams or `timestamp` for audio streams (granularity up to 0.001s). The Endpoint is responsible for choosing the proper offsets for the segments. The segments `MUST` be possible to align over all annotation layers. For character streams the recommendation is Unicode Normalization Form ''KC''. Segments `MAY` also have an endpoint specific reference indicated by an URI that could be shown in the Aggregator, e.g. to open an audio player or other viewer with contents from the Search Engine. The list of `Layer` elements contains `Span` elements making references to the segments. A `Span` inherits the start and end offsets from its segments and contains the actual annotation as its content. It `MAY` also carry information about the original annotation value in an `@alt-value` attribute. The document order of the `Layer` elements define the view order in the Aggregator. Each Layer has a ''Layer type identifier'' and a ''Layer identifier''.  The Endpoint `SHOULD` at least return all layers that were referenced in the  Advanced Search query. It `MAY` return more layers. The attribute `@highlight` is used to mark Spans as hits. Multiple hit markers are supported and the Aggregator `MAY` display them visually distinct. It is up to the Endpoint to decide what should be marked as a hit, but the recommendation is to mark everything referenced in the Advanced Search query.
    579579
    580580{{{#!comment
     
    731731Endpoints or Clients `MUST` support CQL conformance ''Level 2'' (as defined in [#REF_OASIS_CQL OASIS-CQL, section 6]), i.e. be able to ''parse'' (Endpoints) or ''serialize'' (Clients) all of CQL and respond with appropriate error messages to the search/retrieve protocol interface.
    732732
    733 '''NOTE''': this does ''not imply'', that Endpoints are ''required'' to support all of CQL, but rather that they are able to ''parse'' all of CQL and generate the appropriate error message, if a query includes a feature they do not support.
     733'''NOTE''': this does ''not imply'' that Endpoints are ''required'' to support all of CQL, but rather that they are able to ''parse'' all of CQL and generate the appropriate error message, if a query includes a feature they do not support.
    734734
    735735Endpoints `MUST` generate diagnostics according to [#REF_SRU_20 OASIS-SRU-20, Appendix D] for error conditions or to indicate unsupported features. Unfortunately, the OASIS specification does not provide a comprehensive list of diagnostics for CQL-related errors. Therefore, Endpoints `MUST` use diagnostics from [#REF_LOC_DIAG LOC-DIAG, section "Diagnostics Relating to CQL"] for CQL-related errors.
     
    778778          <zr:title lang="de">Goethe Corpus</zr:title>
    779779          <zr:title lang="en" primary="true">Goethe Korpus</zr:title>
    780           <zr:description lang="de">Der Goethe Korpus des IDS Mannheim.</zr:description>
     780          <zr:description lang="de">Das Goethe-Korpus des IDS Mannheim.</zr:description>
    781781          <zr:description lang="en" primary="true">The Goethe corpus of IDS Mannheim.</zr:description>
    782782        </zr:databaseInfo>
     
    813813              <ed:Title xml:lang="de">Goethe Corpus</ed:Title>
    814814              <ed:Title xml:lang="en">Goethe Korpus</ed:Title>
    815               <ed:Description xml:lang="de">Der Goethe Korpus des IDS Mannheim.</ed:Description>
     815              <ed:Description xml:lang="de">Das Goethe-Korpus des IDS Mannheim.</ed:Description>
    816816              <ed:Description xml:lang="en">The Goethe corpus of IDS Mannheim.</ed:Description>
    817817              <ed:LandingPageURI>http://repos.example.org/corpus1.html</ed:LandingPageURI>
     
    11251125If an invalid persistent identifier is passed by the Client, the Endpoint `MUST` issue a `http://clarin.eu/fcs/diagnostic/1` diagnostic, i.e. add the appropriate XML fragment to the `<sru:diagnostics>` element of the response. The Endpoint `MAY` treat this condition as fatal, i.e. just issue the diagnostic and perform no search, or it `MAY` treat it as non-fatal and perform the search.
    11261126
    1127 If a Client wants to request one or more Data Views, that are handled by Endpoint with the ''need-to-request'' delivery policy, it `MUST` pass a comma-separated list of ''Data View identifier'' in the `x-fcs-dataviews` extra request parameter of the 'searchRetrieve' request. A Client can extract valid values for the ''Data View identifiers'' from the `@id` attribute of the `<ed:SupportedDataView>` elements in the Endpoint Description of the Endpoint (see section [#Operationexplain ''explain''] and section [#EndpointDescription Endpoint Description]).
     1127If a Client wants to request one or more Data Views that are handled by Endpoint with the ''need-to-request'' delivery policy, it `MUST` pass a comma-separated list of ''Data View identifier'' in the `x-fcs-dataviews` extra request parameter of the 'searchRetrieve' request. A Client can extract valid values for the ''Data View identifiers'' from the `@id` attribute of the `<ed:SupportedDataView>` elements in the Endpoint Description of the Endpoint (see section [#Operationexplain ''explain''] and section [#EndpointDescription Endpoint Description]).
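
From the Client's side, passing the comma-separated identifier list might look like the sketch below. The base URL and the function name are hypothetical, and other SRU parameters (such as the protocol version) are deliberately left out.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, data_view_ids=()):
    """Build a searchRetrieve URL, optionally requesting need-to-request Data Views."""
    params = {"operation": "searchRetrieve", "query": query}
    if data_view_ids:
        # Identifiers come from the @id attributes of <ed:SupportedDataView>.
        params["x-fcs-dataviews"] = ",".join(data_view_ids)
    # urlencode percent-encodes the comma as %2C; servers decode it back to ",".
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint URL, used for illustration only.
url = build_search_url("http://repos.example.org/fcs-endpoint", "cat", ["cmdi", "hits"])
print(url)
```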
    11281128
    11291129For example, to request the CMDI Data View from an Endpoint that has an Endpoint Description, as described in [#REF_Example_5 Example 5], a Client would need to use the ''Data View identifier'' `cmdi` and submit the following request:
     
    13431343    <!--
    13441344        Example 1: a hypothetical Endpoint extension for navigation in a result
    1345         set: it basically provides a set of hrefs, that a GUI can convert into
     1345        set: it basically provides a set of hrefs that a GUI can convert into
    13461346        navigation buttons.
    13471347    -->
     
    13541354    <!--
    13551355       Example 2: a hypothetical Endpoint extension for directly referencing parent
    1356        resources: it basically provides a link to the parent resource, that can be
     1356       resources: it basically provides a link to the parent resource that can be
    13571357       exploited by a GUI (e.g. build on XSLT/XQuery).
    13581358    -->