Changes between Version 26 and Version 27 of FCS-Specification-ScrapBook


Timestamp:
02/11/14 09:51:40 (10 years ago)
Author:
oschonef
Comment:

--

  • FCS-Specification-ScrapBook

    v26 v27  
    22
    33== Issues with current document ==
    4  1. Uncomprehensible and not well structures :(
     4 1. Uncomprehensible and not well structured :(
    55 2. Resource enumeration (aka scan on fcs.resource) rather complex and unintuitive
    66 3. Basic KWIC records has no provision for multiple "highlight" hits
     
    1515 2. Better structure of document (and don't include aggregation stuff; that's a different specification; implementors of endpoints should not need to worry about aggregator implementation)
    1616 3. Keep XML sanity always in mind (so there are no namespace issues as in CMDI)
    17  4. Drop the recursiveness of Resource, content models should be: `Resource (DataView*, ResourceFragment*)` and `ResourceFragment (DataView*)`
    18  5. Drop the KWIC data view in favor of HITS data view; the latter will allow for multiple hits
    19  6. Honor and use extension hooks provided by SRU/CQL
    20  7. Non-normative stuff
     17 4. Drop resource enumeration in favor of endpoint resource description
     18 5. Drop the recursiveness of Resource, content models should be: `Resource (DataView*, ResourceFragment*)` and `ResourceFragment (DataView*)`
     19 6. Drop the KWIC data view in favor of HITS data view; the latter will allow for multiple hits
     20 7. Honor and use extension hooks provided by SRU/CQL
     21 8. Non-normative stuff
    2122   1. Endpoint specific extension hooks, e.g. to avoid tag abuse of !DataView. Resource.xsd could provide an extension hook, so arbitrary XML could also be embedded.
    2223   2. Clients can put query parameters at @ref to allow hit highlighting on their systems
     
    118119
    119120 LOC-SRU12[=#REF_LOC_SRU_12]::
    120     SRU Version 1.2: SRU !Search/Retrieve Operation, Library of Congress,\\
     121    SRU Version 1.2: SRU !Search/Retrieve Operation, Library of Congress, \\
    121122    [http://www.loc.gov/standards/sru/sru-1-2.html]
    122123
     
    125126    [http://www.loc.gov/standards/sru/diagnostics/diagnosticsList.html]
    126127
     128=== Non-Normative References ===
     129 RFC6838[=#REF_RFC_6838]::
     130    Media Type Specifications and Registration Procedures, IETF RFC 6838, January 2013, \\
     131    [http://www.ietf.org/rfc/rfc6838.txt]
     132 RFC3023[=#REF_RFC_3023]::
     133    XML Media Types, IETF RFC 3023, January 2001, \\
     134    [http://www.ietf.org/rfc/rfc3023.txt]
    127135
    128136== CLARIN-FCS Interface Specification ==
     
    167175 ''Basic profile''::
    168176   Endpoints `MUST` support ''term-only'' queries. \\
    169    Endpoints `SHOULD` support ''terms'' combined with boolean operator (''AND'' and ''OR'') queries, including subqueries. Endpoints `MAY` support the ''NOT'' or ''PROX'' operators. If an endpoint does not support a query, i.e. the used operators are not supported by the Endpoint, it `MUST` return an appropriate error message using the appropriate SRU diagnostic. \\
     177   Endpoints `SHOULD` support ''terms'' combined with boolean operator queries (''AND'' and ''OR''), including subqueries. Endpoints `MAY` also support ''NOT'' or ''PROX'' operator queries. If an endpoint does not support a query, i.e. the used operators are not supported by the Endpoint, it `MUST` return an appropriate error message using the appropriate SRU diagnostic. \\
    170178   Examples for valid CQL queries for the ''basic profile'':
    171179{{{
     
    185193   '''NOTE''': the extended profile is not yet defined and will be part of a future CLARIN-FCS specification.
    186194
    187 Endpoints and Clients `MUST` support the ''basic profile''. Endpoints and Clients `MUST NOT` claim to support the ''extended profile''.
     195Endpoints and Clients `MUST` support the ''basic profile''. For now, Endpoints and Clients `MUST NOT` claim to support the ''extended profile''.
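
As a rough illustration of the ''basic profile'' rules above, the following Python sketch sorts a CQL query into the support level it would need from an Endpoint. It is a naive tokenizer, not a CQL parser; the function name and the three return labels are hypothetical and not part of the specification.

```python
import re

# Operators named in the basic profile text, grouped by the SHOULD/MAY wording.
SHOULD_OPERATORS = {"AND", "OR"}
MAY_OPERATORS = {"NOT", "PROX"}

# Quoted strings are matched first so that operator words inside quotes are ignored.
_TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[()]|[^\s()"]+')


def classify_basic_query(query: str) -> str:
    """Return 'term-only' (MUST), 'boolean' (SHOULD) or 'optional-operator' (MAY).

    Hypothetical helper: it only looks at which operator words occur and does
    not validate the query syntax.
    """
    level = "term-only"
    for token in _TOKEN.findall(query):
        if token.startswith('"'):
            continue  # quoted term, never an operator
        word = token.split("/")[0].upper()  # drop modifiers such as prox/unit=word
        if word in MAY_OPERATORS:
            return "optional-operator"
        if word in SHOULD_OPERATORS:
            level = "boolean"
    return level
```

An Endpoint that only implements the `MUST` and `SHOULD` levels could use such a check to decide when to answer with an SRU diagnostic instead of running the query.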
    188196
    189197
     
    198206   A ''Resource Fragment'' is a smaller unit within a ''Resource'', e.g. a sentence in a text corpus or a time interval in an audio transcription.
    199207
    200 Each Resource `SHOULD` be identified by a persistent identifier. A Resource `MAY` be identified by an endpoint unique URI. A Resource `SHOULD` be the most precise unit of data that is directly addressable as a "whole". A Resource `SHOULD` contain a Resource Fragment, if the hit consists of just a part of the Resource unit, if the hit is a sentence within a large text. A Resource Fragment `SHOULD` be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using Resource Fragments is `OPTIONAL`, but Endpoints are encouraged to use them. If an Endpoint encodes a hit with a Resource Fragment, the actual hit `SHOULD` be encoded as a Data View that is encoded in a Resource Fragment.
    201 
    202 Endpoints `SHOULD` always provide a link to the resource itself, i.e. by supplying the persistent identifier of the Resource or providing an URI. If direct linking is not possible, i.e. due to licensing issues, the Endpoints `SHOULD` use a URI to link to a web-page describing the corpus or collection,including instruction on how to obtain it. Endpoints `SHOULD` provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed, the Resource Fragment `SHOULD NOT` contain a persistent identifier or an URI.
     208A Resource `SHOULD` be the most precise unit of data that is directly addressable as a "whole". A Resource `SHOULD` contain a Resource Fragment if the hit consists of just a part of the Resource unit, e.g. if the hit is a sentence within a large text. A Resource Fragment `SHOULD` be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using Resource Fragments is `OPTIONAL`, but Endpoints are encouraged to use them. If an Endpoint encodes a hit with a Resource Fragment, the actual hit `SHOULD` be encoded as a Data View that is embedded in that Resource Fragment.
     209
     210Endpoints `SHOULD` always provide a link to the resource itself, i.e. each Resource or Resource Fragment `SHOULD` be identified by a persistent identifier or an Endpoint-unique URI. Even if direct linking is not possible, e.g. due to licensing issues, Endpoints `SHOULD` provide a URI that links to a web page describing the corpus or collection, including instructions on how to obtain it. Endpoints `SHOULD` provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed directly, the Resource Fragment `SHOULD NOT` contain a persistent identifier or a URI.
    203211
    204212If an Endpoint can provide both a persistent identifier and a URI for either a Resource or a Resource Fragment, it `SHOULD` provide both. When working with results, Clients `SHOULD` prefer persistent identifiers over regular URIs.
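
The Client-side preference described above can be sketched in a few lines of Python. This is only an illustration: the `pid` attribute appears in the examples of this document, while the `ref` attribute name for the plain URI and the helper name `preferred_link` are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def preferred_link(element: ET.Element) -> Optional[str]:
    """Return the persistent identifier if present, otherwise the URI, otherwise None.

    Assumes the persistent identifier is carried in @pid and the regular URI
    in @ref (the latter attribute name is an assumption).
    """
    return element.get("pid") or element.get("ref")
```

A Client would call this for a Resource and for each Resource Fragment, falling back to the enclosing Resource when the fragment itself carries no link.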
     
    212220Endpoints `MAY` serialize hits as multiple Data Views; however, they `MUST` provide the Generic Hits (HITS) Data View either encoded within a Resource Fragment (if applicable), or otherwise within the Resource (if there is no reasonable Resource Fragment). Other Data Views `SHOULD` be put in a place that is logical for their content (as is to be determined by the Endpoint), e.g. a metadata Data View would most likely be put directly below the Resource, whereas a Data View representing some annotation layers directly around the hit is more likely to belong within a Resource Fragment.
    213221
    214 Some examples:
    215  * [=#XREF_Example_1]Example 1: a ''Resource'' with a ''Data View''
     222[=#REF_Example_1]Example 1:
    216223{{{#!xml
    217224<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="http://hdl.handle.net/4711/00-15">
     
    221228</fcs:Resource>
    222229}}}
    223  * [=#XREF_Example_2]Example 2: a ''Resource'' with a ''Resource Fragment'', that has a ''Data View''
     230This example shows a simple hit, which is encoded in one Data View of type ''Generic Hits'' embedded within a Resource. The type of the Data View is identified by the MIME type `application/x-clarin-fcs-hits+xml`. The Resource is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`.
     231
     232[=#REF_Example_2]Example 2:
    224233{{{#!xml
    225234<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="http://hdl.handle.net/4711/08-15">
     
    231240</fcs:Resource>
    232241}}}
    233  * [=#XREF_Example_2]Example 3: a ''Resource'' with a ''Data View'' and a ''Resource Fragment'', that has a ''Data View''
     242This example shows a hit encoded as a Resource Fragment embedded within a Resource. The actual hit is again encoded as one Data View of type ''Generic Hits''. The hit is not directly referenceable, but the Resource in which the hit occurred is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`. In contrast to [#REF_Example_1 Example 1], the endpoint decided to provide a "semantically richer" encoding and embedded the hit using a Resource Fragment within the Resource to indicate that the hit is a part of a larger resource, e.g. a sentence in a text document.
     243
     244[=#REF_Example_3]Example 3:
    234245{{{#!xml
    235246<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0"
     
    246257</fcs:Resource>
    247258}}}
    248 
    249 [#XREF_Example_2 Example 1] shows a simple hit, which is encoded in one Data View of type ''Generic Hits'' embedded within a Resource. The type of the Data View is identified by the MIME type `application/x-clarin-fcs-hits+xml`. The Resource is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`.
    250 
    251 [#XREF_Example_2 Example 2] shows a hit encoded as a Resource Fragment embedded within a Resource. The actual hit is again encoded as one Data View of type ''Generic Hits''. The hit is not directly referenceable, but the Resource, in which hit occurred, is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`. In contrast to Example 1, the endpoint decided to provide a "semantically richer" encoding and embedded the hit using a Resource Fragment within the Resource to indicate that the hit is a part of a larger resource, e.g. a sentence in a text document.
    252 
    253 The most complex [#XREF_Example_3 Example 3] is similar to Example 2, i.e. it shows a hit is encoded as one ''Generic Hits'' Data View in a Resource Fragment, that is embedded in a Resource. In contrast to Example 2, another Data View of type ''CMDI'' is embedded directly within the Resource. An Endpoint can use this type of Data View to directly provide CMDI metadata about the Resource to Clients.
     259The most complex example is similar to [#REF_Example_2 Example 2], i.e. it shows a hit encoded as one ''Generic Hits'' Data View in a Resource Fragment that is embedded in a Resource. In contrast to Example 2, another Data View of type ''CMDI'' is embedded directly within the Resource. An Endpoint can use this type of Data View to directly provide CMDI metadata about the Resource to Clients.
    254260All entities of the hit can be referenced by a persistent identifier and a URI. The complete Resource is referenceable by either the persistent identifier `http://hdl.handle.net/4711/08-15` or the URI `http://repos.example.org/file/text_08_15.html` and the CMDI metadata record in the CMDI Data View is referenceable either by the persistent identifier `http://hdl.handle.net/4711/08-15-1` or the URI `http://repos.example.org/file/08_15_1.cmdi`. The actual hit in the Resource Fragment is also directly referenceable by either the persistent identifier `http://hdl.handle.net/4711/00-15-2` or the URI `http://repos.example.org/file/text_08_15.html#sentence2`.
    255261
    256262
    257263==== Data View ====
    258 A ''Data View'' serves as a container for representing search results within CLARIN-FCS. Data Views are designed to allow for different representations of results, i.e they are deliberately kept open to allow further extensions with more supported data view formats.
    259 
     264A ''Data View'' serves as a container for representing search results within CLARIN-FCS. Data Views are designed to allow for different representations of results, i.e. they are deliberately kept open to allow further extensions with more supported data view formats. Each Data View is identified by a MIME type ([#REF_RFC_6838 RFC6838], [#REF_RFC_3023 RFC3023]). If no existing MIME type can be used, implementors `SHOULD` define a proper private MIME type. The type of the Data View is recorded in the `@type` attribute of the `<fcs:DataView>` element.
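
To illustrate how a Client might use the `@type` attribute, the following Python sketch groups the Data Views of one Resource by their MIME type. The namespace, the `pid` value and the Generic Hits MIME type are taken from the examples in this document; the sample payload, the `fcs:ResourceFragment` element name and the function name are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

FCS_NS = "http://clarin.eu/fcs/1.0"
HITS_TYPE = "application/x-clarin-fcs-hits+xml"

# Hypothetical record; the Data View payload is only a placeholder.
SAMPLE = f"""
<fcs:Resource xmlns:fcs="{FCS_NS}" pid="http://hdl.handle.net/4711/08-15">
  <fcs:ResourceFragment>
    <fcs:DataView type="{HITS_TYPE}">placeholder payload</fcs:DataView>
  </fcs:ResourceFragment>
</fcs:Resource>
"""


def data_views_by_type(resource_xml: str) -> dict:
    """Map each MIME type to the list of DataView elements that carry it."""
    root = ET.fromstring(resource_xml)
    views = {}
    # iter() walks the whole subtree, so Data Views directly below the
    # Resource and those inside a Resource Fragment are both found.
    for view in root.iter(f"{{{FCS_NS}}}DataView"):
        views.setdefault(view.get("type"), []).append(view)
    return views
```

A Client can then pick the entry for the Generic Hits type, which every Endpoint has to provide, and treat all other types as optional.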
     265
     266The following formats are defined as part of the specification:
     267 Generic Hits (HITS)::
     268   Yada ...
     269 Component Metadata (CMDI)::
     270   Yada ...
     271 Images (IMG)::
     272   Yada ...
     273   
    260274*WIP*
    261 The type of each data view is identified by the {{{type}}} attribute of the {{{<fcs:DataView>}}} element. The value if defined to be a [http://en.wikipedia.org/wiki/MIME_Type MIME type]. If no existing MIME type can be used, implementors are encouraged to define a properer private mime type. The following formats are currently being considered:
     275The type of each data view is identified by the {{{type}}} attribute of the {{{<fcs:DataView>}}} element. The value is defined to be a [http://en.wikipedia.org/wiki/MIME_Type MIME type]. The following formats are currently being considered:
    262276 Keyword-In-Context (KWIC)::
    263277   Description: a keyword-in-context view, where each hit should be presented within the context of a complete sentence (if possible) or any other reasonable unit of context (e.g. if sentences cannot be determined by the endpoint). The keyword-in-context data view is '''mandatory''' for all endpoints. The appropriate XML schema can be found at [source:FederatedSearch/Resource-KWIC.xsd Resource-KWIC.xsd] ([source:FederatedSearch/Resource-KWIC.xsd?format=txt download]). \\
     
    282296
    283297
     298=== Endpoint Description and Identification ===
     299
     300Yada Yada Yada ...
     301
     302
     303== CLARIN-FCS to SRU/CQL binding ==
     304
    284305=== SRU/CQL ===
    285306SRU (!Search/Retrieve via URL) specifies a general communication protocol for searching and retrieving records, and CQL (Contextual Query Language) specifies an extensible query language. CLARIN-FCS is built on SRU 1.2. A subsequent specification may be built on SRU 2.0.
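
As a minimal sketch of what this means on the wire, the Python snippet below builds an SRU 1.2 ''searchRetrieve'' request URL for a CQL query. The parameter names `operation`, `version`, `query` and `maximumRecords` come from SRU 1.2; the endpoint URL used in the usage note and the function name are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(endpoint: str, query: str, maximum_records: int = 10) -> str:
    """Build an SRU 1.2 searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        "maximumRecords": str(maximum_records),
    }
    # urlencode percent-encodes the values and encodes spaces as '+'.
    return f"{endpoint}?{urlencode(params)}"
```

For example, `build_search_url("http://repos.example.org/sru", "cat AND dog")` yields a URL whose query string carries the CQL query in encoded form.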
     
    306327
    307328
    308 == CLARIN-FCS to SRU/CQL binding ==
    309 
    310 Yada yada yada ...
    311 
    312329=== Endpoint Identification ===
    313330