Changes between Version 26 and Version 27 of FCS-Specification-ScrapBook
Timestamp: 02/11/14 09:51:40 (10 years ago)
--- FCS-Specification-ScrapBook (v26)
+++ FCS-Specification-ScrapBook (v27)
@@ -2,5 +2,5 @@
 
 == Issues with current document ==
-1. Uncomprehensible and not well structure s:(
+1. Uncomprehensible and not well structured :(
 2. Resource enumeration (aka scan on fcs.resource) rather complex and unintuitive
 3. Basic KWIC records has no provision for multiple "highlight" hits
@@ -15,8 +15,9 @@
 2. Better structure of document (and don't include aggregation stuff; that's a different specification; implementors of endpoints should not need to worry about aggregator implementation)
 3. Keep XML sanity always in mind (so there are no namespace issues as in CMDI)
-4. Drop the recursiveness of Resource, content models should be: `Resource (DataView*, ResourceFragment*)` and `ResourceFragment (DataView*)`
-5. Drop the KWIC data view in favor of HITS data view; the latter will allow for multiple hits
-6. Honor and use extension hooks provided by SRU/CQL
-7. Non-normative stuff
+4. Drop resource enumeration in favor of endpoint resource description
+5. Drop the recursiveness of Resource, content models should be: `Resource (DataView*, ResourceFragment*)` and `ResourceFragment (DataView*)`
+6. Drop the KWIC data view in favor of HITS data view; the latter will allow for multiple hits
+7. Honor and use extension hooks provided by SRU/CQL
+8. Non-normative stuff
 1. Endpoint specific extension hooks, e.g. to avoid tag abuse of !DataView. Resource.xsd could provide an extension hook, so arbitrary XML could also be embedded.
 2. Clients can put query parameters at @ref to allow hit highlighting on their systems
@@ -118,5 +119,5 @@
 
 LOC-SRU12[=#REF_LOC_SRU_12]::
-SRU Version 1.2: SRU !Search/Retrieve Operation, Library of Congress, \\
+SRU Version 1.2: SRU !Search/Retrieve Operation, Library of Congress, \\
 [http://www.loc.gov/standards/sru/sru-1-2.html]
 
@@ -125,4 +126,11 @@
 [http://www.loc.gov/standards/sru/diagnostics/diagnosticsList.html]
 
+=== Non-Normative References ===
+RFC6838[=#REF_RFC_6838]::
+Media Type Specifications and Registration Procedures, IETF RFC 6838, January 2013, \\
+[http://www.ietf.org/rfc/rfc6838.txt]
+RFC3023[=#REF_RFC_3023]::
+XML Media Types, IETF RFC 3023, January 2001, \\
+[http://www.ietf.org/rfc/rfc3023.txt]
 
 == CLARIN-FCS Interface Specification ==
@@ -167,5 +175,5 @@
 ''Basic profile''::
 Endpoints `MUST` support ''term-only'' queries. \\
-Endpoints `SHOULD` support ''terms'' combined with boolean operator (''AND'' and ''OR'') queries, including subqueries. Endpoints `MAY` support the ''NOT'' or ''PROX'' operators. If an endpoint does not support a query, i.e. the used operators are not supported by the Endpoint, it `MUST` return an appropriate error message using the appropriate SRU diagnostic. \\
+Endpoints `SHOULD` support ''terms'' combined with boolean operator queries (''AND'' and ''OR''), including subqueries. Endpoints `MAY` also support ''NOT'' or ''PROX'' operator queries. If an endpoint does not support a query, i.e. the used operators are not supported by the Endpoint, it `MUST` return an appropriate error message using the appropriate SRU diagnostic. \\
 Examples for valid CQL queries for the ''basic profile'':
 {{{
@@ -185,5 +193,5 @@
 '''NOTE''': the extended profile is not yet defined and will be part of a future CLARIN-FCS specification.
 
-Endpoints and Clients `MUST` support the ''basic profile''. Endpoints and Clients `MUST NOT` claim to support the ''extended profile''.
+Endpoints and Clients `MUST` support the ''basic profile''. For now, Endpoints and Clients `MUST NOT` claim to support the ''extended profile''.
 
 
@@ -198,7 +206,7 @@
 A ''Resource Fragment'' is smaller unit in a ''Resource'', i.e. a sentence in a text corpus or a time interval in an audio transcription.
 
-Each Resource `SHOULD` be identified by a persistent identifier. A Resource `MAY` be identified by an endpoint unique URI.A Resource `SHOULD` be the most precise unit of data that is directly addressable as a "whole". A Resource `SHOULD` contain a Resource Fragment, if the hit consists of just a part of the Resource unit, if the hit is a sentence within a large text. A Resource Fragment `SHOULD` be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using Resource Fragments is `OPTIONAL`, but Endpoints are encouraged to use them. If an Endpoint encodes a hit with a Resource Fragment, the actual hit `SHOULD` be encoded as a Data View that is encoded in a Resource Fragment.
-
-Endpoints `SHOULD` always provide a link to the resource itself, i.e. by supplying the persistent identifier of the Resource or providing an URI. If direct linking is not possible, i.e. due to licensing issues, the Endpoints `SHOULD` use a URI to link to a web-page describing the corpus or collection,including instruction on how to obtain it. Endpoints `SHOULD` provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed, the Resource Fragment `SHOULD NOT` contain a persistent identifier or an URI.
+A Resource `SHOULD` be the most precise unit of data that is directly addressable as a "whole". A Resource `SHOULD` contain a Resource Fragment, if the hit consists of just a part of the Resource unit, if the hit is a sentence within a large text. A Resource Fragment `SHOULD` be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using Resource Fragments is `OPTIONAL`, but Endpoints are encouraged to use them. If an Endpoint encodes a hit with a Resource Fragment, the actual hit `SHOULD` be encoded as a Data View that is encoded in a Resource Fragment.
+
+Endpoints `SHOULD` always provide a links to the resource itself, i.e. each Resource or Resource Fragment `SHOULD` be identified by a persistent identifier or providing an Endpoint unique URI. Even if direct linking is not possible, i.e. due to licensing issues, the Endpoints `SHOULD` provide a URI to link to a web-page describing the corpus or collection, including instruction on how to obtain it. Endpoints `SHOULD` provide links that are as specific as possible (and logical), i.e. if a sentence within a resource cannot be addressed directly, the Resource Fragment `SHOULD NOT` contain a persistent identifier or an URI.
 
 If an Endpoint can provide both, a persistent identifier as well as an URI, for either Resource or Resource Fragment, they `SHOULD` provide both. When working with results, Clients `SHOULD` prefer persistent identifiers over regular URIs.
@@ -212,1 +220,1 @@
 Endpoints `MAY` serialize hits as multiple Data Views, however they `MUST` provide the Generic Hits (HITS) Data View either encoded as a Resource Fragment (if applicable), or otherwise within the Resource (if there is no reasonable resource fragment). Other Data Views `SHOULD` be put in a place that is logical for their content (as is to be determined by the Endpoint), e.g. a metadata data view would most likely be put directly below Resource and a Data View representing some annotation layers directly around the hit is more likely to belong within a Resource Fragment.
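Reviewer note: the placement rule for the Generic Hits Data View (inside a Resource Fragment if applicable, otherwise directly within the Resource) could be handled on the client side roughly as in the Python sketch below. This is only an illustration under stated assumptions: the helper name `find_hits_view`, the sample document and its placeholder payload are hypothetical, and the `fcs:ResourceFragment` element name is inferred from the content model quoted in the diff. The namespace URI, the `type` attribute and the MIME type are the ones quoted in the diff.

```python
import xml.etree.ElementTree as ET

FCS_NS = "http://clarin.eu/fcs/1.0"               # namespace URI used in the examples
HITS_TYPE = "application/x-clarin-fcs-hits+xml"   # MIME type of the Generic Hits Data View


def _q(name):
    """Build a namespace-qualified tag name in ElementTree's {uri}local form."""
    return f"{{{FCS_NS}}}{name}"


def find_hits_view(resource):
    """Return (container, data_view) for the first HITS Data View, or None.

    Hypothetical helper: a Resource Fragment is checked first, then the
    Resource itself, mirroring the placement rule in the text.
    """
    for fragment in resource.findall(_q("ResourceFragment")):
        for view in fragment.findall(_q("DataView")):
            if view.get("type") == HITS_TYPE:
                return fragment, view
    for view in resource.findall(_q("DataView")):
        if view.get("type") == HITS_TYPE:
            return resource, view
    return None


# Hypothetical sample; the Data View payload is a placeholder, not the real HITS format.
SAMPLE = """
<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="http://hdl.handle.net/4711/08-15">
  <fcs:ResourceFragment>
    <fcs:DataView type="application/x-clarin-fcs-hits+xml">placeholder payload</fcs:DataView>
  </fcs:ResourceFragment>
</fcs:Resource>
"""

if __name__ == "__main__":
    found = find_hits_view(ET.fromstring(SAMPLE.strip()))
    if found is not None:
        container, view = found
        # Strip the namespace part of the tag for display.
        print(container.tag.split("}")[1], "->", view.text)
```

Checking the Resource Fragment first reflects the "if applicable" wording; a client that treats both placements as equivalent could simply search all descendant Data Views instead.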
@@ -213,5 +221,4 @@
 
-Some examples:
-* [=#XREF_Example_1]Example 1: a ''Resource'' with a ''Data View''
+[=#REF_Example_1]Example 1:
 {{{#!xml
 <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="http://hdl.handle.net/4711/00-15">
@@ -221,5 +228,7 @@
 </fcs:Resource>
 }}}
-* [=#XREF_Example_2]Example 2: a ''Resource'' with a ''Resource Fragment'', that has a ''Data View''
+This example shows a simple hit, which is encoded in one Data View of type ''Generic Hits'' embedded within a Resource. The type of the Data View is identified by the MIME type `application/x-clarin-fcs-hits+xml`. The Resource is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`.
+
+[=#REF_Example_2]Example 2:
 {{{#!xml
 <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="http://hdl.handle.net/4711/08-15">
@@ -231,5 +240,7 @@
 </fcs:Resource>
 }}}
-* [=#XREF_Example_2]Example 3: a ''Resource'' with a ''Data View'' and a ''Resource Fragment'', that has a ''Data View''
+This example shows a hit encoded as a Resource Fragment embedded within a Resource. The actual hit is again encoded as one Data View of type ''Generic Hits''. The hit is not directly referenceable, but the Resource, in which hit occurred, is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`. In contrast to [#REF_Example_1 Example 1], the endpoint decided to provide a "semantically richer" encoding and embedded the hit using a Resource Fragment within the Resource to indicate that the hit is a part of a larger resource, e.g. a sentence in a text document.
+
+[=#REF_Example_3]Example 3:
 {{{#!xml
 <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0"
@@ -246,18 +257,21 @@
 </fcs:Resource>
 }}}
-
-[#XREF_Example_2 Example 1] shows a simple hit, which is encoded in one Data View of type ''Generic Hits'' embedded within a Resource. The type of the Data View is identified by the MIME type `application/x-clarin-fcs-hits+xml`. The Resource is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`.
-
-[#XREF_Example_2 Example 2] shows a hit encoded as a Resource Fragment embedded within a Resource. The actual hit is again encoded as one Data View of type ''Generic Hits''. The hit is not directly referenceable, but the Resource, in which hit occurred, is referenceable by the persistent identifier `http://hdl.handle.net/4711/08-15`. In contrast to Example 1, the endpoint decided to provide a "semantically richer" encoding and embedded the hit using a Resource Fragment within the Resource to indicate that the hit is a part of a larger resource, e.g. a sentence in a text document.
-
-The most complex [#XREF_Example_3 Example 3] is similar to Example 2, i.e. it shows a hit is encoded as one ''Generic Hits'' Data View in a Resource Fragment, that is embedded in a Resource. In contrast to Example 2, another Data View of type ''CMDI'' is embedded directly within the Resource. An Endpoint can use this type of Data View to directly provide CMDI metadata about the Resource to Clients.
+The most complex example is similar to [#REF_Example_2 Example 2], i.e. it shows a hit is encoded as one ''Generic Hits'' Data View in a Resource Fragment, that is embedded in a Resource. In contrast to Example 2, another Data View of type ''CMDI'' is embedded directly within the Resource. An Endpoint can use this type of Data View to directly provide CMDI metadata about the Resource to Clients.
 All entities of the Hit can be referenced by a persistent identifier and an URI. The complete Resource is referenceable by either the persistent identifier `http://hdl.handle.net/4711/08-15` or the URI `http://repos.example.org/file/text_08_15.html` and the CMDI metadata record in the CMDI Data View is referenceable either by the persistent identifier `http://hdl.handle.net/4711/08-15-1` or the URI `http://repos.example.org/file/08_15_1.cmdi`. The actual hit in the Resource Fragment is also directly referenceable by either the persistent identifier `http://hdl.handle.net/4711/00-15-2` or the URI `http://repos.example.org/file/text_08_15.html#sentence2`.
 
 
 ==== Data View ====
-A ''Data View'' serves as a container for representing search results within CLARIN-FCS. Data Views are designed to allow for different representations of results, i.e they are deliberately kept open to allow further extensions with more supported data view formats.
-
+A ''Data View'' serves as a container for representing search results within CLARIN-FCS. Data Views are designed to allow for different representations of results, i.e they are deliberately kept open to allow further extensions with more supported data view formats. Each Data View is identified by a MIME type ([#REF_RFC_6838 RFC6838], [#REF_RFC_3023 RFC3023]). If no existing MIME type can be used, implementors `SHOULD` define a properer private mime type. The type if the Data View is recorded in the `@type` attribute if the `<fcs:DataView>` element.
+
+The following formats are defined as part of the specification:
+Generic Hits (HITS)::
+Yada ...
+Component Metadata (CMDI)::
+Yada ...
+Images (IMG)::
+Yada ...
+
 *WIP*
-The type of each data view is identified by the {{{type}}} attribute of the {{{<fcs:DataView>}}} element. The value if defined to be a [http://en.wikipedia.org/wiki/MIME_Type MIME type]. If no existing MIME type can be used, implementors are encouraged to define a properer private mime type.The following formats are currently being considered:
+The type of each data view is identified by the {{{type}}} attribute of the {{{<fcs:DataView>}}} element. The value if defined to be a [http://en.wikipedia.org/wiki/MIME_Type MIME type]. The following formats are currently being considered:
 Keyword-In-Context (KWIC)::
 Description: a keyword-in-context view, where each hit should be presented within the context of a complete sentence (if possible) or any other reasonable unit of context (e.g. if sentences cannot be determined by the endpoint). The keyword-in-context data view is '''mandatory''' for all endpoints. The appropriate XML schema can be found at [source:FederatedSearch/Resource-KWIC.xsd Resource-KWIC.xsd] ([source:FederatedSearch/Resource-KWIC.xsd?format=txt download]). \\
@@ -282,4 +296,11 @@
 
 
+=== Endpoint Description and Identification ===
+
+Yada Yada Yada ...
+
+
+== CLARIN-FCS to SRU/CQL binding ==
+
 === SRU/CQL ===
 SRU (!Search/Retrieve via URL) specifies a general communication protocol for searching and retrieving records and the CQL (Contextual Query Language) specifies a extensible query language. CLARIN-FCS is built on SRU 1.2. A subsequent specification may be built on SRU 2.0.
@@ -306,8 +327,4 @@
 
 
-== CLARIN-FCS to SRU/CQL binding ==
-
-Yada yada yada ...
-
 === Endpoint Identification ===
 