Changes between Version 21 and Version 22 of FCS-Specification-ScrapBook


Timestamp:
02/06/14 11:28:07 (10 years ago)
Author:
oschonef
Comment:

--

  • FCS-Specification-ScrapBook

    v21 v22  
    1717 4. Drop the restrictiveness of Resource; content models should be: `Resource (DataView*, ResourceFragment*)` and `ResourceFragment (DataView*)`
    1818 5. Honor and use extension hooks provided by SRU/CQL
     19 6. Endpoint-specific extension hooks, e.g. to avoid tag abuse of DataView. Resource.xsd could provide an extension hook, so arbitrary XML could also be embedded.
     20
    1921
    2022== Proposal for new specification ==
     
    124126The CLARIN-FCS interface specification defines two profiles, an extensible result format and a set of required operations. CLARIN-FCS is built on the SRU/CQL standard, and additional functionality required for CLARIN-FCS is added through SRU/CQL's extension mechanisms.
    125127
    126 Generally, CLARIN-FCS Interface Specification consists of two components, a set of ''formats'' and a ''transport protocol''. The ''Endpoint'' component is a software component that acts as a bridge between the Formats, that are send by a ''Client'' using the ''Transport Protocol'', and a ''Search Engine''. The ''Search Engine'' is a custom software component, that allows searching in the language resources of a CLARIN center. The ''Endpoint'' basically implements the ''transport protocol''  and acts as an mediator between the CLRAIN-FCS speceific formats and the idiosyncrasies of ''Search Engines''. The following figure illustrates the overall architecture.
     128Generally, the CLARIN-FCS Interface Specification consists of two components, a set of ''formats'' and a ''transport protocol''. The ''Endpoint'' is a software component that acts as a bridge between the formats, which are sent by a ''Client'' using the ''transport protocol'', and a ''Search Engine''. The ''Search Engine'' is a custom software component that allows searching in the language resources of a CLARIN center. The ''Endpoint'' basically implements the ''transport protocol'' and acts as a mediator between the CLARIN-FCS specific formats and the idiosyncrasies of ''Search Engines''. The following figure illustrates the overall architecture.
    127129{{{
    128130                 +---------+
     
    192194   A ''Resource Fragment'' is a smaller unit in a ''Resource'', e.g. a sentence in a text corpus or a time interval in an audio transcription.
    193195
    194 Each ''Resource'' `SHOULD` be identified by a persistent identifier. A ''Resource'' `MAY` be identified by an endpoint unique URI. A ''Resource'' `SHOULD` be the most precise unit of data that is directly addressable as a "whole". A ''Resource'' `SHOULD` contain a ''Resource Fragment'', if the hit consists of just a part of the ''Resource'' unit, if the hit is a sentence within a large text. A ''Resource Fragment'' `SHOULD` be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using ''Resource Fragments'' is `OPTIONAL`, but Endpoints are encouraged to use them. If an Endpoint encodes a hit with a ''Resource Fragment'', the actual hit `SHOULD` be encoded as a ''Data View'' that is encoded in a ''Resource Fragment''.
    195 
    196 Endpoints `SHOULD` always provide a link to the resource itself, i.e. by supplying the persistent identifier o the ''Resource'' or providing a URI to reference the ''Resource''. If direct linking is not possible, i.e. due to licensing issues, the Endpoints `SHOULD` use a URI to link to a web-page describing a corpus or collection (including instruction on how to obtain it). Endpoints `SHOULD` provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed, the ''Resource Fragment'' `SHOULD NOT` contain a persistent identifier or an URI.
    197 
    198 ''Resource'' and ''Resource Fragment'' are serialized in XML and Endpoints `MUST` generate responses, that are valid according to the XML schema "[source:FederatedSearch/Resource.xsd Resource.xsd]" ([source:FederatedSearch/Resource.xsd?format=txt download]). A ''Resource'' is encoded in the form of a `<fcs:Resource>` element, a ''Resource Fragment'' in the form of a `<fcs:ResourceFragment>` element. The content of a ''Data View'' is wrapped in a `<fcs:DataView>` element. `<fcs:Resource>` is the top-level element and `MAY` contain zero or more `<fcs:DataView>` elements and `MAY` contain zero or more `<fcs:ResourceFragment>` elements. A `<fcs:ResourceFragment>` element `MUST` contain one or more `<fcs:DataView>` elements. The elements `<fcs:Resource>`, `<fcs:ResourceFragment>` and `<fcs:DataView>` `MAY` carry a `@pid` and/or a `@ref` attribute, which allows linking to the original data represented by the resource, resource fragment, or data view. A `@pid` attribute `MUST` contain a valid persistent identifier, a `@ref` `MUST` contain valid URI (without the additional semantics of being persistent reference).
     196Each Resource `SHOULD` be identified by a persistent identifier. A Resource `MAY` be identified by an endpoint-unique URI. A Resource `SHOULD` be the most precise unit of data that is directly addressable as a "whole". A Resource `SHOULD` contain a Resource Fragment if the hit consists of just a part of the Resource unit, e.g. if the hit is a sentence within a large text. A Resource Fragment `SHOULD` be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using Resource Fragments is `OPTIONAL`, but Endpoints are encouraged to use them. If an Endpoint encodes a hit with a Resource Fragment, the actual hit `SHOULD` be encoded as a Data View that is embedded in that Resource Fragment.
     197
     198Endpoints `SHOULD` always provide a link to the resource itself, e.g. by supplying the persistent identifier of the Resource or providing a URI. If direct linking is not possible, e.g. due to licensing issues, the Endpoints `SHOULD` use a URI to link to a web page describing the corpus or collection, including instructions on how to obtain it. Endpoints `SHOULD` provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed, the Resource Fragment `SHOULD NOT` contain a persistent identifier or a URI.
     199
     200If an Endpoint can provide both a persistent identifier and a URI for either a Resource or a Resource Fragment, it `SHOULD` provide both. When working with results, Clients `SHOULD` prefer persistent identifiers over regular URIs.
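
A minimal client-side sketch of the preference rule above, assuming the link attributes are the `pid` and `ref` attributes used in the examples of this section. The helper name `preferred_link` is hypothetical and not part of the specification.

```python
import xml.etree.ElementTree as ET


def preferred_link(element):
    """Return the persistent identifier if present, else the plain URI, else None.

    Hypothetical helper: the specification only says Clients SHOULD prefer
    persistent identifiers over regular URIs.
    """
    return element.get("pid") or element.get("ref")


# A Resource Fragment that carries both kinds of link (values taken from Example 3).
fragment = ET.fromstring(
    '<fcs:ResourceFragment xmlns:fcs="http://clarin.eu/fcs/1.0" '
    'pid="http://hdl.handle.net/4711/08-15-2" '
    'ref="http://repos.example.org/file/text_08_15.html#sentence2"/>'
)

# A Resource Fragment that only carries a plain URI.
fragment_ref_only = ET.fromstring(
    '<fcs:ResourceFragment xmlns:fcs="http://clarin.eu/fcs/1.0" '
    'ref="http://repos.example.org/file/text_08_15.html#sentence2"/>'
)

print(preferred_link(fragment))
print(preferred_link(fragment_ref_only))
```

The fallback to `ref` reflects that a plain URI is still a usable link when no persistent identifier is supplied.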
     201
     202Resource and Resource Fragment are serialized in XML and Endpoints `MUST` generate responses that are valid according to the XML schema "[source:FederatedSearch/Resource.xsd Resource.xsd]" ([source:FederatedSearch/Resource.xsd?format=txt download]). A Resource is encoded in the form of a `<fcs:Resource>` element, a Resource Fragment in the form of a `<fcs:ResourceFragment>` element. The content of a Data View is wrapped in a `<fcs:DataView>` element. `<fcs:Resource>` is the top-level element and `MAY` contain zero or more `<fcs:DataView>` elements and `MAY` contain zero or more `<fcs:ResourceFragment>` elements. A `<fcs:ResourceFragment>` element `MUST` contain one or more `<fcs:DataView>` elements.
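
The content model in the preceding paragraph could be checked roughly as follows. This is only a sketch under stated assumptions: it is not a replacement for validating against Resource.xsd, it ignores element order and any extension hooks, and the function name `check_content_model` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the examples in this section, in ElementTree's {uri} notation.
FCS = "{http://clarin.eu/fcs/1.0}"


def check_content_model(resource):
    """Return a list of violations of the content model described in the text."""
    if resource.tag != FCS + "Resource":
        return ["top-level element is not fcs:Resource"]
    problems = []
    for child in resource:
        if child.tag == FCS + "DataView":
            continue  # zero or more Data Views are allowed directly in a Resource
        if child.tag == FCS + "ResourceFragment":
            views = [c for c in child if c.tag == FCS + "DataView"]
            if not views:
                problems.append("fcs:ResourceFragment without fcs:DataView")
            if len(views) != len(child):
                problems.append("unexpected child in fcs:ResourceFragment")
        else:
            problems.append("unexpected child in fcs:Resource: " + child.tag)
    return problems


valid = ET.fromstring(
    '<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0">'
    '<fcs:ResourceFragment>'
    '<fcs:DataView type="application/x-clarin-fcs-hits+xml"/>'
    '</fcs:ResourceFragment>'
    '</fcs:Resource>'
)
invalid = ET.fromstring(
    '<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0">'
    '<fcs:ResourceFragment/>'
    '</fcs:Resource>'
)

print(check_content_model(valid))
print(check_content_model(invalid))
```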
     203
     204The elements `<fcs:Resource>`, `<fcs:ResourceFragment>` and `<fcs:DataView>` `MAY` carry a `@pid` and/or a `@ref` attribute, which allows linking to the original data represented by the Resource, Resource Fragment, or Data View. A `@pid` attribute `MUST` contain a valid persistent identifier; a `@ref` attribute `MUST` contain a valid URI, i.e. a "plain" URI without the additional semantics of being a persistent reference.
    199205
    200206Endpoints `MUST` use the identifier `http://clarin.eu/fcs/1.0` for the ''responseItemType'' (= content for the `<sru:recordSchema>` element) in SRU responses.
    201207
    202 Endpoints `MAY` serialize hits as multiple ''Data Views'', however they `MUST` provide the Generic Hits (HITS) ''Data View'' either encoded as a  ''Resource Fragment'' (if applicable), or otherwise within the ''Resource'' (if there is no reasonable resource fragment). Other ''Data Views'' `SHOULD` be put in a place that is logical for their content (as is to be determined by the Endpoint), e.g. a metadata data view would most likely be put directly under a ''Resource'' and a ''Data View'' representing some annotation layers directly around the hit is more likely to belong in within a ''Resource Fragment''.
     208Endpoints `MAY` serialize hits as multiple Data Views; however, they `MUST` provide the Generic Hits (HITS) Data View, either encoded within a Resource Fragment (if applicable) or otherwise directly within the Resource (if there is no reasonable Resource Fragment). Other Data Views `SHOULD` be put in a place that is logical for their content (as is to be determined by the Endpoint), e.g. a metadata Data View would most likely be put directly below the Resource, and a Data View representing some annotation layers directly around the hit is more likely to belong within a Resource Fragment.
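
As an illustration of the requirement above, a Client or a test harness might locate the Generic Hits Data View like this. This is a sketch only: the MIME type is the one used in the examples of this section, and the helper name `find_hits_views` is hypothetical.

```python
import xml.etree.ElementTree as ET

FCS = "{http://clarin.eu/fcs/1.0}"
HITS_TYPE = "application/x-clarin-fcs-hits+xml"


def find_hits_views(resource):
    """Collect all Generic Hits Data Views, whether they sit directly in the
    Resource or inside one of its Resource Fragments."""
    return [dv for dv in resource.iter(FCS + "DataView")
            if dv.get("type") == HITS_TYPE]


record = ET.fromstring(
    '<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0">'
    '<fcs:DataView type="application/x-cmdi+xml"/>'
    '<fcs:ResourceFragment>'
    '<fcs:DataView type="application/x-clarin-fcs-hits+xml"/>'
    '</fcs:ResourceFragment>'
    '</fcs:Resource>'
)

hits = find_hits_views(record)
print(len(hits))
```

An empty result would indicate a response that lacks the mandatory HITS Data View.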
    203209
    204210Some examples:
     
    213219 * [=#XREF_Example_2]Example 2: a ''Resource'' with a ''Resource Fragment'', that has a ''Data View''
    214220{{{#!xml
    215 <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="http://hdl.handle.net/4711/00-15">
     221<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="http://hdl.handle.net/4711/08-15">
    216222  <fcs:ResourceFragment>
    217223    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
     
    221227</fcs:Resource>
    222228}}}
    223  * [=#XREF_Example_2]Example 3: a ''Resource'' with a ''Data View''  and a ''Resource Fragment'', that has a ''Data View''
     229 * [=#XREF_Example_3]Example 3: a ''Resource'' with a ''Data View'' and a ''Resource Fragment'', that has a ''Data View''
    224230{{{#!xml
    225231<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0"
    226               pid="http://hdl.handle.net/4711/00-15" ref="http://repos.example.org/file/text_00_15.html">
     232              pid="http://hdl.handle.net/4711/08-15" ref="http://repos.example.org/file/text_08_15.html">
    227233  <fcs:DataView type="application/x-cmdi+xml"
    228                 pid="http://hdl.handle.net/4711/00-15-1" ref="http://repos.example.org/file/00_15_1.cmdi">
     234                pid="http://hdl.handle.net/4711/08-15-1" ref="http://repos.example.org/file/08_15_1.cmdi">
    229235      <!-- data view content omitted -->
    230236  </fcs:DataView>
    231   <fcs:ResourceFragment pid="http://hdl.handle.net/4711/00-15-2" ref="http://repos.example.org/file/text_00_15.html#sentence2">
     237  <fcs:ResourceFragment pid="http://hdl.handle.net/4711/08-15-2" ref="http://repos.example.org/file/text_08_15.html#sentence2">
    232238    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
    233239      <!-- data view content omitted -->
     
    237243}}}
    238244
    239 *TODO*: explain examples.
     245[#XREF_Example_1 Example 1] shows a simple hit, which is encoded in one Data View of type ''Generic Hits'' embedded within a Resource. The type of the Data View is identified by the MIME type `application/x-clarin-fcs-hits+xml`. The Resource is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`.
     246
     247[#XREF_Example_2 Example 2] shows a hit encoded as a Resource Fragment embedded within a Resource. The actual hit is again encoded as one Data View of type ''Generic Hits''. The hit is not directly referenceable, but the Resource in which the hit occurred is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`. In contrast to Example 1, the Endpoint decided to provide a "semantically richer" encoding and embedded the hit using a Resource Fragment within the Resource to indicate that the hit is a part of a larger resource, e.g. a sentence in a text document.
     248
     249The most complex [#XREF_Example_3 Example 3] is similar to Example 2, i.e. it shows a hit encoded as one ''Generic Hits'' Data View in a Resource Fragment that is embedded in a Resource. In contrast to Example 2, another Data View of type ''CMDI'' is embedded directly within the Resource. An Endpoint can use this type of Data View to directly provide CMDI metadata about the Resource to Clients.
     250All entities of the hit can be referenced by a persistent identifier and a URI. The complete Resource is referenceable by either the persistent identifier `http://hdl.handle.net/4711/08-15` or the URI `http://repos.example.org/file/text_08_15.html`, and the CMDI metadata record in the CMDI Data View is referenceable either by the persistent identifier `http://hdl.handle.net/4711/08-15-1` or the URI `http://repos.example.org/file/08_15_1.cmdi`. The actual hit in the Resource Fragment is also directly referenceable by either the persistent identifier `http://hdl.handle.net/4711/08-15-2` or the URI `http://repos.example.org/file/text_08_15.html#sentence2`.
    240251
    241252
     
    287298Endpoints and Clients `MUST` support CQL conformance ''Level 2'' (as defined in [#REF_OASIS_CQL OASIS-CQL, section 6]), i.e. be able to ''parse'' (Endpoints) or ''serialize'' (Clients) all of CQL and respond with appropriate error messages to the search/retrieve protocol interface.
    288299
    289 '''NOTE''': this does ''not imply'', that Endpoints are ''required'' support for all of CQL, but rather that they are able to ''parse'' all of CQL and generate the appropriate error message, if a query includes a feature they do not support.
     300'''NOTE''': this does ''not imply'' that Endpoints are ''required'' to support all of CQL, but rather that they are able to ''parse'' all of CQL and generate the appropriate error message if a query includes a feature they do not support.
    290301
    291302Endpoints `MUST` generate diagnostics according to [#REF_SRU_12 OASIS-SRU-12, Appendix C] for error conditions or to indicate unsupported features. Unfortunately, the OASIS specification does not provide a comprehensive list of diagnostics for CQL-related errors. Therefore, Endpoints `MUST` use diagnostics from [#REF_LOC_DIAG LOC-DIAG, section "Diagnostics Relating to CQL"] for CQL-related errors.
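
A rough sketch of how an Endpoint might build such a diagnostic. It assumes the SRU diagnostic namespace `http://www.loc.gov/zing/srw/diagnostic/` and the `info:srw/diagnostic/1/` URI scheme, with code 48 meaning "Query feature unsupported"; these values should be checked against the referenced documents. The helper name `make_diagnostic` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of SRU diagnostics; verify against the referenced specifications.
DIAG_NS = "http://www.loc.gov/zing/srw/diagnostic/"


def make_diagnostic(code, message, details=None):
    """Build a <diag:diagnostic> element for the given numeric diagnostic code."""
    ET.register_namespace("diag", DIAG_NS)
    diag = ET.Element("{%s}diagnostic" % DIAG_NS)
    ET.SubElement(diag, "{%s}uri" % DIAG_NS).text = "info:srw/diagnostic/1/%d" % code
    if details is not None:
        ET.SubElement(diag, "{%s}details" % DIAG_NS).text = details
    ET.SubElement(diag, "{%s}message" % DIAG_NS).text = message
    return diag


# Example: the query parsed correctly but uses a CQL feature the Endpoint does not support.
unsupported = make_diagnostic(48, "Query feature unsupported", details="prox")
print(ET.tostring(unsupported, encoding="unicode"))
```

Keeping parsing and feature support separate in this way matches the note above: the Endpoint parses the full query first and only then reports the unsupported feature.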