Changes between Version 15 and Version 16 of FCS-Specification-ScrapBook


Timestamp:
02/04/14 16:58:44 (10 years ago)
Author:
oschonef
Comment:

--

  • FCS-Specification-ScrapBook

    v15 v16  
    66 3. Basic KWIC records have no provision for multiple "highlight" hits
    77 4. Clear recommendation for using Resource and !ResourceFragment
    8 
     8 5. What about recursiveness in Resource (see current schema)? What is the use case?
    99
    1010== General ideas / design goals towards better specification ==
     
    1515 2. Better structure of document (and don't include aggregation stuff; that's a different specification; implementors of endpoints should not need to worry about aggregator implementation)
    1616 3. Keep XML sanity always in mind (so there are no namespace issues as in CMDI)
    17  4. Honor and use extension hooks provided by SRU/CQL
     17 4. Drop the recursiveness of Resource; content models should be: `Resource (DataView*, ResourceFragment*)` and `ResourceFragment (DataView*)`
     18 5. Honor and use extension hooks provided by SRU/CQL
    1819
    1920== Proposal for new specification ==
     
    131132 ''Basic profile''::
    132133   Endpoints `MUST` support ''term-only'' queries. \\
    133    Endpoints `SHOULD` support ''terms'' combined with boolean operator (''AND'' and ''OR'') queries, including subqueries. Endpoints `MAY` support the ''NOT'' or ''PROX'' operators. If an endpoint does not support such a query, it `MUST` return an appropriate error message using the appropriate SRU diagnostic. \\
     134   Endpoints `SHOULD` support ''terms'' combined with boolean operators (''AND'' and ''OR''), including subqueries. Endpoints `MAY` support the ''NOT'' or ''PROX'' operators. If an Endpoint does not support a query, i.e. the query uses operators that the Endpoint does not support, it `MUST` return an error message using the appropriate SRU diagnostic. \\
    134135   Examples for valid CQL queries :
    135136{{{
     
    142143cat AND (mouse OR "lazy dog")
    143144}}}
    144    The endpoint is `MUST` perform the query on an annotation tier, that makes the most sense for the user, i.e. the textual content for a text corpus resource or the orthographic transcription of a spoken language corpus. Endpoints are `RECOMMENDED` to perform the query case-sensitive.\\
     145   The Endpoint `MUST` perform the query on the annotation tier that makes the most sense for the user, e.g. the textual content of a text corpus resource or the orthographic transcription of a spoken language corpus. Endpoints `SHOULD` perform the query case-sensitively.\\
    145146   Endpoints `MUST NOT` silently accept queries that include CQL features other than ''term-only'' queries and ''terms'' combined with boolean operators, e.g. queries involving context sets.
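
The rule above (reject unsupported features instead of silently accepting them) can be sketched in a few lines. This is only a rough illustration and not a CQL parser: the function name `unsupported_features`, the token pattern and the label `relation-or-modifier` are assumptions made for this example, and only symbolic relations and modifiers (`=`, `<`, `>`, `/`) are detected, not word relations or sort clauses. Mapping the result to a concrete SRU diagnostic is left to the Endpoint.

```python
import re

# Hypothetical tokenizer: quoted strings, parentheses, relation/modifier
# symbols, or bare words. It only approximates real CQL tokenization.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[()]|[<>=/]+|[^\s()<>=/"]+')

BASIC_BOOLEANS = {"AND", "OR"}       # Endpoints SHOULD support these
OPTIONAL_BOOLEANS = {"NOT", "PROX"}  # Endpoints MAY support these


def unsupported_features(query, supported=BASIC_BOOLEANS):
    """Return a sorted list of features used in `query` that this
    (hypothetical) Endpoint does not support."""
    found = set()
    for tok in TOKEN.findall(query):
        if tok.startswith('"') or tok in ("(", ")"):
            continue  # quoted terms and grouping are always fine
        upper = tok.upper()
        if upper in BASIC_BOOLEANS | OPTIONAL_BOOLEANS:
            if upper not in supported:
                found.add(upper)
        elif re.fullmatch(r"[<>=/]+", tok):
            # Relations and modifiers imply index or context set usage,
            # which is outside the Basic profile.
            found.add("relation-or-modifier")
    return sorted(found)


if __name__ == "__main__":
    for q in ['cat AND (mouse OR "lazy dog")', "cat NOT dog", "dc.title = cat"]:
        print(q, "->", unsupported_features(q))
```

An Endpoint that does implement ''NOT'' would pass `supported={"AND", "OR", "NOT"}`; a non-empty result means the query should be answered with a diagnostic rather than executed.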
    146147
     
    177178
    178179=== Result Format ===
    179 CLARIN-FCS uses a customized format for returning results. The following section describes the result result format and section [#query Performing Queries and returning results].
     180CLARIN-FCS uses a customized format for returning results. ''Resource'' and ''Resource Fragments'' serve as containers for hit results, which are presented in one or more ''Data Views''. The following section describes the result and data view format; section [#query Performing Queries and returning results] describes how hits are embedded within SRU responses.
    180181
    181182==== Resource and !ResourceFragment ====
    182183To encode search results, CLARIN-FCS supports two building blocks:
    183184 Resources::
    184    A ''resource'' is an searchable entity at an Endpoint, such as a text corpus, ABC
     185   A ''Resource'' is a searchable entity at an Endpoint, such as a text corpus or a multi-modal corpus. A resource `SHOULD` be a self-contained unit, i.e. not a sentence in a text corpus or a time interval in an audio transcription.
    185186 Resource Fragments::
    186    Yada ...
    187 
    188 
    189 ''' WIP: will be reformulated '''
    190 
    191 In CLARIN-FCS, each {{{<sru:record>}}} element represents one hit within the ''resource'', which is encoded as a {{{<fcs:Resource>}}} element. Each resource shall be identified by a persistent identifier (or, less preferably, an endpoint unique URI). The correct resource to return here is the most precise unit of data that is directly addressable as a "whole".  The hit may contain a ''resource fragment'', which is encoded as a {{{<fcs:ResourceFragment>}}} element. The resource fragment shall be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using resource fragments is optional, but encouraged.
    192 
    193 The actual hit within a resource is encoded in a ''data view'' format and is serialized as a {{{<fcs:DataView>}}} element inside either the {{{<fcs:Resource>}}} or {{{<fcs:ResourceFragment>}}} element. Each hit may be serialized as multiple data views; however, the keyword-in-context (KWIC) data view is mandatory within the resource fragment (if applicable), or otherwise within the resource (if there is no reasonable resource fragment). Other data views should be put in a place that is logical for their content (as is to be determined by the endpoint). E.g. a metadata data view would most likely be put directly under a resource. On the other hand, a data view representing some annotation layers directly around the hit is more likely to belong within the resource fragment.
    194 
    195 Each entity (i.e. {{{<fcs:Resource>}}}, {{{<fcs:ResourceFragment>}}} or {{{<fcs:DataView>}}} element) contains a {{{ref}}} attribute, which points to the original data represented by the resource, resource fragment, or data view as well as possible. It should always be possible to directly link to the resource itself. Worst case this will be a web-page describing a corpus or collection (including instruction on how to obtain it). Best case it directly links to a specific file or part of a resource in which the hit was obtained. The latter is not always possible, and when possible often constrained by licensing issues. Endpoints should provide links that are as specific as possible/logical.
    196 
    197 For CLARIN-FCS, a custom record schema has been defined. The ''record schema identifier'' for this schema is {{{http://clarin.eu/fcs/1.0}}} and the appropriate XML Schema can be found at [source:FederatedSearch/Resource.xsd Resource.xsd] ([source:FederatedSearch/Resource.xsd?format=txt download]).
    198 
     187   A ''Resource Fragment'' is a smaller unit within a ''Resource'', e.g. a sentence in a text corpus or a time interval in an audio transcription.
     188
     189Each ''Resource'' `SHOULD` be identified by a persistent identifier. Alternatively, a ''Resource'' `MAY` be identified by an endpoint-unique URI. A ''Resource'' `SHOULD` be the most precise unit of data that is directly addressable as a "whole". A ''Resource'' `SHOULD` contain a ''Resource Fragment'' if the hit consists of just a part of the ''Resource'' unit, e.g. if the hit is a sentence within a large text. A ''Resource Fragment'' `SHOULD` be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using ''Resource Fragments'' is `OPTIONAL`, but Endpoints are encouraged to use them. If an Endpoint encodes a hit with a ''Resource Fragment'', the actual hit `SHOULD` be encoded as a ''Data View'' that is enclosed in a ''Resource Fragment''.
     190
     191Endpoints `SHOULD` always provide a link to the resource itself, i.e. by supplying the persistent identifier of the ''Resource'' or providing a URI to reference the ''Resource''. If direct linking is not possible, e.g. due to licensing issues, the Endpoint `SHOULD` use a URI to link to a web page describing the corpus or collection (including instructions on how to obtain it). Endpoints `SHOULD` provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed, the ''Resource Fragment'' `SHOULD NOT` contain a persistent identifier or a URI.
     192
     193''Resource'' and ''Resource Fragment'' are serialized in XML, and Endpoints `MUST` generate responses that are valid according to the XML schema "[source:FederatedSearch/Resource.xsd Resource.xsd]" ([source:FederatedSearch/Resource.xsd?format=txt download]). A ''Resource'' is encoded in the form of a `<fcs:Resource>` element, a ''Resource Fragment'' in the form of a `<fcs:ResourceFragment>` element. The content of a ''Data View'' is wrapped in a `<fcs:DataView>` element. `<fcs:Resource>` is the top-level element and `MAY` contain zero or more `<fcs:DataView>` elements and zero or more `<fcs:ResourceFragment>` elements. A `<fcs:ResourceFragment>` element `MUST` contain one or more `<fcs:DataView>` elements. The elements `<fcs:Resource>`, `<fcs:ResourceFragment>` and `<fcs:DataView>` `MAY` carry a `@pid` and/or a `@ref` attribute, which allows linking to the original data represented by the resource, resource fragment, or data view. A `@pid` attribute `MUST` contain a valid persistent identifier; a `@ref` attribute `MUST` contain a valid URI (without the additional semantics of being a persistent reference).
     194
     195Endpoints `MUST` use the identifier `http://clarin.eu/fcs/1.0` for the ''responseItemType'' (= content for the `<sru:recordSchema>` element) in SRU responses.
     196
     197Endpoints `MAY` serialize hits as multiple ''Data Views''; however, they `MUST` provide the Generic Hits (HITS) ''Data View'', either within a ''Resource Fragment'' (if applicable) or otherwise directly within the ''Resource'' (if there is no reasonable resource fragment). Other ''Data Views'' `SHOULD` be put in a place that is logical for their content (as is to be determined by the Endpoint), e.g. a metadata data view would most likely be put directly under a ''Resource'', and a ''Data View'' representing some annotation layers directly around the hit is more likely to belong within a ''Resource Fragment''.
     198
     199Some examples:
     200 * [=#XREF_Example_1]Example 1: a ''Resource'' with a ''Data View''
     201{{{#!xml
     202<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="http://hdl.handle.net/4711/00-15">
     203  <fcs:DataView type="application/x-clarin-fcs-hits+xml">
     204      <!-- data view content omitted -->
     205  </fcs:DataView>
     206</fcs:Resource>
     207}}}
     208 * [=#XREF_Example_2]Example 2: a ''Resource'' with a ''Resource Fragment'', that has a ''Data View''
     209{{{#!xml
     210<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="http://hdl.handle.net/4711/00-15">
     211  <fcs:ResourceFragment>
     212    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
     213      <!-- data view content omitted -->
     214    </fcs:DataView>
     215  </fcs:ResourceFragment>
     216</fcs:Resource>
     217}}}
     218 * [=#XREF_Example_3]Example 3: a ''Resource'' with a ''Data View'' and a ''Resource Fragment'', that has a ''Data View''
     219{{{#!xml
     220<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0"
     221              pid="http://hdl.handle.net/4711/00-15" ref="http://repos.example.org/file/text_00_15.html">
     222  <fcs:DataView type="application/x-cmdi+xml"
     223                pid="http://hdl.handle.net/4711/00-15-1" ref="http://repos.example.org/file/00_15_1.cmdi">
     224      <!-- data view content omitted -->
     225  </fcs:DataView>
     226  <fcs:ResourceFragment pid="http://hdl.handle.net/4711/00-15-2" ref="http://repos.example.org/file/text_00_15.html#sentence2">
     227    <fcs:DataView type="application/x-clarin-fcs-hits+xml">
     228      <!-- data view content omitted -->
     229    </fcs:DataView>
     230  </fcs:ResourceFragment>
     231</fcs:Resource>
     232}}}
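
The content model described above can be spot-checked mechanically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; it is no substitute for validating against Resource.xsd. The function name `check_resource` and the wording of the problem messages are assumptions made for this illustration, the HITS MIME type is taken from the examples, and neither element order nor `@pid`/`@ref` values are checked.

```python
import xml.etree.ElementTree as ET

FCS_NS = "http://clarin.eu/fcs/1.0"              # namespace from the examples
HITS_TYPE = "application/x-clarin-fcs-hits+xml"  # type used in the examples


def q(local):
    """Build the namespace-qualified tag name ElementTree expects."""
    return f"{{{FCS_NS}}}{local}"


def check_resource(xml_text):
    """Return a list of content-model problems (hypothetical helper).

    An empty list means this sketch found nothing wrong.
    """
    root = ET.fromstring(xml_text)
    if root.tag != q("Resource"):
        return [f"top-level element is {root.tag}, expected fcs:Resource"]

    problems = []
    # Resource may contain only DataView and ResourceFragment elements.
    for child in root:
        if child.tag not in (q("DataView"), q("ResourceFragment")):
            problems.append(f"unexpected child of Resource: {child.tag}")

    # ResourceFragment must contain one or more DataView elements.
    for fragment in root.findall(q("ResourceFragment")):
        children = list(fragment)
        if not any(c.tag == q("DataView") for c in children):
            problems.append("ResourceFragment without a DataView")
        for c in children:
            if c.tag != q("DataView"):
                problems.append(f"unexpected child of ResourceFragment: {c.tag}")

    # The HITS data view is mandatory, in the fragment or in the resource.
    if not any(v.get("type") == HITS_TYPE for v in root.iter(q("DataView"))):
        problems.append("no HITS DataView found")
    return problems


if __name__ == "__main__":
    example = (
        '<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" '
        'pid="http://hdl.handle.net/4711/00-15">'
        '<fcs:ResourceFragment>'
        '<fcs:DataView type="application/x-clarin-fcs-hits+xml"/>'
        '</fcs:ResourceFragment>'
        '</fcs:Resource>'
    )
    print(check_resource(example))
```

Under these assumptions, all three examples above pass the check, while a `<fcs:ResourceFragment>` without any `<fcs:DataView>` child is reported as a problem.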
     233
     234*TODO*: explain examples.
    199235
    200236