8 | | |
9 | | The design of the federated search system consists of two major parts: |
10 | | 1. The communication protocol and the query language. |
11 | | 2. The aggregator / user interface. |
12 | | {{{#!comment |
13 | | |
14 | | NOTE TO EDITOR: rework! |
15 | | |
16 | | This document deals with both parts. The first part of this document deals with the user interface and the global design thoughts, whereas the second part deals with the technical specification (communication protocol and query language). |
17 | | }}} |
18 | | |
19 | | The federated search depends on the specification and implementation of the VLO, the metadata search, the virtual collection registry, and CMDI. |
20 | | |
21 | | In general each CLARIN-center participating will provide at least the following services: |
22 | | * Provide one or more resources |
23 | | * Support Content-search within those resources |
24 | | * Return search-hits in the agreed-upon format |
25 | | * Support query-expansion if possible |
26 | | * Support the selection of a sub-part of the offered resources to perform content-search on that sup-part |
27 | | * Provide support for the sub-part selection by providing CMDI metadata at the same, reasonable , granularity |
28 | | |
29 | | = Global Design Thoughts = |
30 | | |
31 | | == Overview == |
32 | | |
33 | | {{{#!comment |
34 | | |
35 | | NOTE TO EDITOR: redundant? |
36 | | |
37 | | The design of the federated search system consists of two major parts: |
38 | | 1. The communication protocol and the query language. |
39 | | 2. The aggregator / user interface. |
40 | | This part discusses the user-interface, wishes we have for the user interface, and how this affects or is affected by the protocol choice. |
41 | | }}} |
42 | | |
43 | | The base of the communication protocol is SRU and the base of the query language is CQL. The wiki on trac.clarin.eu talks about the protocol and the query language in detail. We remark that for a proper understanding one needs to be familiar with the official documentation of the SRU/CQL standard, which can be found at http://www.loc.gov/standards/sru/. |
| 21 | The CLARIN-FCS depends on the specification and implementation of the Virtual Language Observatory (VLO), the metadata search, the Virtual Collection Registry (VCR), and CMDI. |
| 22 | |
| 23 | In general each CLARIN center participating within CLARIN-FCS will provide at least the following services: |
| 24 | * provide one or more resources |
| 25 | * support Content-search within those resources |
| 26 | * return search-hits in the agreed-upon format |
| 27 | * support query-expansion (if possible) |
| 28 | * support the selection of a sub-part of the offered resources to perform content-search on that sub-part |
| 29 | * provide support for the sub-part selection by providing CMDI metadata at the same, reasonable, granularity |
87 | | The user-interface (or aggregator) is a component that communicates with all endpoints within the federated search. Each endpoint offers some searchable resources. The content of these resources can be searched. Depending on the needs of the user and the capabilities of the endpoint several return formats are supported. The user-interface collects the responses from all the endpoints and shows these to the user. In order to facilitate this listing all endpoints will always return the results at least in the keyword-in-context format. |
88 | | |
89 | | The base of the communication protocol is SRU and the base of the query language is CQL. The wiki on trac.clarin.eu talks about the protocol and the query language in detail. We remark that it is helpful to be familiar with the official documentation of the SRU/CQL standard, which can be found at http://www.loc.gov/standards/sru/. |
90 | | |
91 | | == SRU == |
92 | | The SRU protocol supports three different operations: Explain, Scan, and SearchRetrieve. Of these, the Explain operation is the default. Parameters are given as HTTP GET or HTTP POST arguments. If no arguments are given the explain operation is performed. |
93 | | |
94 | | We again refer to the standard at http://www.loc.gov/standards/sru/ (version 1.2) where one can study the details of the protocol. Below we will only discuss the ways in which our implementations differ from the basic protocol, as well as the rest of our agreements (e.g., the return format, the manner in which we pass restrictions of the search-space, the way in which we list the available corpora). |
95 | | |
96 | | As a namespace for our extensions to the SRU XML we use “http://clarin.eu/fcs/1.0”. The schema that validated our extension is found at: http://www.clarin.eu/system/files/Resource.xsd |
97 | | |
98 | | === Explain === |
99 | | |
100 | | This basic request serves to announce server's capabilities and should allow the client to configure itself automatically. The explain response should, ideally, provide a list of ISOcatted indexes as possible search indexes. If there is no ISOcat equivalent the CCS-context* set is to be used. We provide a telling example (as seen within the context of the explain response as defined on the SRU/CQL website): |
101 | | |
| 71 | From a technical perspective, CLARIN-FCS consists of two major parts: |
| 72 | * the ''aggregator'': a user interface to perform queries and display search results |
| 73 | * several ''endpoints'': services provided by the CLARIN centers that participate in CLARIN-FCS |
| 74 | The ''aggregator'' is a component that communicates with the ''endpoints''; each center that participates in the federated content search provides an endpoint as a service. Each endpoint provides access to one or more searchable ''resources''. The content of these resources is searched with the ''query'' supplied to the endpoint. The endpoint returns the results of this query in one or more representations, called ''data views'', e.g. a keyword-in-context (KWIC) data view. The aggregator collects the responses from all the endpoints and displays them to the user. To facilitate this listing, all endpoints are required to support at least the keyword-in-context data view. |
| 75 | |
| 76 | == SRU/CQL == |
| 77 | The SRU protocol supports three different operations: ''Explain'', ''Scan'', and ''!SearchRetrieve''. Of these, the ''Explain'' operation is the default. Parameters are given either as HTTP GET or HTTP POST arguments. If no arguments are given the ''Explain'' operation shall be performed by an endpoint. |
| 78 | |
| 79 | Generally, CLARIN-FCS is built upon the SRU/CQL standard (version 1.2), which is available from [http://www.loc.gov/standards/sru/ http://www.loc.gov/standards/sru/]. The following sections describe how CLARIN-FCS is mapped to SRU/CQL and how the protocol is extended to meet the requirements for CLARIN-FCS (e.g., the return format, the manner in which we pass restrictions of the search-space, the way in which we list the available corpora). |
| 80 | |
| 81 | '''NOTE:''' Before starting to implement an endpoint at a center you are strongly encouraged to familiarize yourself with the SRU/CQL standard. |
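As a minimal sketch of how a client might address the three operations over HTTP GET: the base URL below is hypothetical, and the helper name `build_request` is an illustration rather than part of any CLARIN-FCS library. The parameter names `operation`, `version`, `scanClause` and `query` are those of SRU 1.2.

```python
from urllib.parse import urlencode

# Hypothetical endpoint base URL, used for illustration only.
BASE_URL = "http://example.org/sru"


def build_request(base_url, operation=None, **params):
    """Return a GET URL for an SRU 1.2 request.

    Without an operation the bare base URL is returned; an endpoint
    answers such a request with the Explain operation.
    """
    if operation is None:
        return base_url
    query = {"operation": operation, "version": "1.2"}
    query.update(params)
    return base_url + "?" + urlencode(query)


explain_url = build_request(BASE_URL)
scan_url = build_request(BASE_URL, "scan", scanClause="fcs.resource = root")
search_url = build_request(BASE_URL, "searchRetrieve", query="fcs.word = child")

for url in (explain_url, scan_url, search_url):
    print(url)
```

The same parameters could equally be sent as HTTP POST arguments; GET is used here only because the resulting URLs are easy to inspect.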
| 82 | |
| 83 | === Explain Operation === |
| 84 | This basic request serves to announce the endpoint's capabilities and should allow the client to configure itself automatically. The explain response should, ideally, provide a list of ISOcatted indexes as possible search indexes. If there is no ISOcat equivalent, the FCS-context* set is to be used. The following example illustrates this (as seen within the context of the explain response as defined on the SRU/CQL website): |
104 | | <set identifier="isocat.org/datcat" name="isocat"/> |
105 | | <set identifier="http://clarin.eu/fcs/1.0" name="fcs"/> |
106 | | |
107 | | <index id="?"> |
108 | | <title lang="en">Part of Speech</title> |
109 | | <map><name set="isocat">partOfSpeech</name></map> |
110 | | </index> |
111 | | |
112 | | <index id="?"> |
113 | | <title lang="en">Words</title> |
114 | | <map><name set="fcs">word</name></map> |
115 | | </index> |
116 | | |
117 | | <index id="?"> |
118 | | <title lang="en">Phonetics</title> |
119 | | <map><name set="fcs">phonetics</name></map> |
120 | | </index> |
| 87 | <set identifier="isocat.org/datcat" name="isocat"/> |
| 88 | <set identifier="http://clarin.eu/fcs/1.0" name="fcs"/> |
| 89 | <index id="?"> |
| 90 | <title lang="en">Part of Speech</title> |
| 91 | <map> |
| 92 | <name set="isocat">partOfSpeech</name> |
| 93 | </map> |
| 94 | </index> |
| 95 | <index id="?"> |
| 96 | <title lang="en">Words</title> |
| 97 | <map> |
| 98 | <name set="fcs">word</name> |
| 99 | </map> |
| 100 | </index> |
| 101 | <index id="?"> |
| 102 | <title lang="en">Phonetics</title> |
| 103 | <map> |
| 104 | <name set="fcs">phonetics</name> |
| 105 | </map> |
| 106 | </index> |
124 | | The three indexes that are defined to be searchable on the collections/parts in this endpoint are then: isocat.partOfSpeech, ccs.word, and ccs.phonetics. An example query could for instance be ccs.word = child. |
125 | | Note that the indexes given are completely dynamic, i.e., they can differ from endpoint to endpoint! We have a clear TODO here to agree on a common set of required/desirable indexes! |
126 | | |
127 | | The searchable indexes are used as a list in the “searchable tiers / tier-types” in the user-interface. It is up to the center to provide a complete list of searchable tier-types (e.g., full-text, transcription, morphological segmentation, part-of-speech annotation, semantic annotation, etc.) or searchable tier-names. We consider it obvious that selectable tier-types trumps tier-names, e.g., “tr1, tr, trans, trans1” as a list is worse than “transcription” as a list. However, if a subdivision of the tiers into tier-types is not available, the tier-names should be provided as the user might be familiar with the corpus and have some use for them. |
128 | | |
129 | | Furthermore, we may assume that annotation searches do not match running text and vice-versa. Generally this seems to be a reasonable assumption. However, there will be many corner cases where this assumption does not hold. E.g., the annotation “up” for a type of stroke (in the case of gesture-annotation) is quite likely to also occur in running text. So, while we can get decent results by simply searching, having the tier-type distinction would be better. |
130 | | |
131 | | === Scan === |
132 | | |
133 | | '''NOTE:''' this section will be slightly changed in the near future. |
134 | | |
135 | | We foresee the scan operation as a way of signaling to the calling program/user/aggregator the available resources available for searching at the endpoint. This in contrast to the definition in SRU, where scan is a way to browse a list of keywords. The value of the scanClause parameter should be '''fcs.resource'''. |
136 | | |
137 | | To this the endpoint will return a list of terms, which are searchable collections. Their identifiers can than be used to restrict the search by passing one (or more) as parameters in x-context in the searchRetrieve operation. |
138 | | |
139 | | Again, we provide a telling example: |
140 | | {{{#!xml |
141 | | <sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/" > |
| 110 | The three indexes that are defined to be searchable on the collections/parts in this endpoint are then: {{{isocat.partOfSpeech}}}, {{{fcs.word}}}, and {{{fcs.phonetics}}}. An example query could for instance be {{{fcs.word = child}}}. |
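The derivation of the qualified index names from the explain fragment can be sketched as follows. This is an illustration under simplifying assumptions: the fragment is wrapped in an `<indexInfo>` root and carries no XML namespaces so that it parses on its own, and the function name `qualified_index_names` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The example fragment, wrapped in a root element (an assumption made so
# that the snippet is a well-formed standalone document).
EXPLAIN_FRAGMENT = """
<indexInfo>
  <set identifier="isocat.org/datcat" name="isocat"/>
  <set identifier="http://clarin.eu/fcs/1.0" name="fcs"/>
  <index id="?">
    <title lang="en">Part of Speech</title>
    <map><name set="isocat">partOfSpeech</name></map>
  </index>
  <index id="?">
    <title lang="en">Words</title>
    <map><name set="fcs">word</name></map>
  </index>
  <index id="?">
    <title lang="en">Phonetics</title>
    <map><name set="fcs">phonetics</name></map>
  </index>
</indexInfo>
"""


def qualified_index_names(xml_text):
    """Return 'set.index' names for every index mapped to a declared set."""
    root = ET.fromstring(xml_text)
    declared_sets = {s.get("name") for s in root.iter("set")}
    names = []
    for name in root.iter("name"):
        context_set = name.get("set")
        if context_set in declared_sets:
            names.append(f"{context_set}.{name.text}")
    return names


print(qualified_index_names(EXPLAIN_FRAGMENT))
# ['isocat.partOfSpeech', 'fcs.word', 'fcs.phonetics']
```

A real explain response is namespaced and nested more deeply, so a client would need to adapt the element lookups accordingly.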
| 111 | |
| 112 | '''Note:''' The indexes given in the example are completely dynamic, i.e., they can differ from endpoint to endpoint. We have a clear TODO here to agree on a common set of required/desirable indexes! |
| 113 | |
| 114 | The searchable indexes are used as a list in the "searchable tiers / tier-types" in the user-interface. It is up to the center to provide a complete list of searchable tier-types (e.g., full-text, transcription, morphological segmentation, part-of-speech annotation, semantic annotation, etc.) or searchable tier-names. Selectable tier-types are clearly preferable to tier-names, e.g., "tr1, tr, trans, trans1" as a list is worse than "transcription" as a list. However, if a subdivision of the tiers into tier-types is not available, the tier-names should be provided, as the user might be familiar with the corpus and have some use for them. |
| 115 | |
| 116 | Furthermore, we may assume that annotation searches do not match running text and vice-versa. Generally this seems to be a reasonable assumption. However, there will be many corner cases where this assumption does not hold, e.g. the annotation "up" for a type of stroke (in the case of gesture-annotation) is quite likely to also occur in running text. So, while we can get decent results by simply searching, having the tier-type distinction would be better. |
| 117 | |
| 118 | === Scan Operation === |
| 119 | Within CLARIN-FCS the scan operation is used to allow the calling agent (i.e. the aggregator) to enumerate all resources available for searching at a given endpoint. This is in contrast to the definition in SRU, where Scan is a way to browse a list of keywords. |
| 120 | |
| 121 | The enumeration process works as follows: |
| 122 | 1. The agent performs an initial scan operation with a {{{scanClause}}} parameter of {{{fcs.resource = root}}}. |
| 123 | 2. The endpoint returns a {{{<sru:terms>}}} list whose {{{<sru:term>}}} elements denote searchable collections. The element {{{<sru:value>}}} of each {{{<sru:term>}}} element is used as an identifier for a collection at an endpoint. It must be a valid !MdSelfLink (which is defined as a URI). This identifier can then be used to restrict the search by passing one (or more) identifiers in the {{{x-cmd-context}}} parameter in the !SearchRetrieve operation. The optional {{{<sru:numberOfRecords>}}} element of the {{{<sru:term>}}} element may contain the (approximate) number of searchable resources within a collection; the optional element {{{<sru:displayTerm>}}} should contain the human-readable name of the collection. |
| 124 | 3. To check for sub-collections, the agent performs a [http://en.wikipedia.org/wiki/Tree_traversal tree traversal]. More specifically, the agent substitutes the identifier of a collection for the value {{{root}}} in the {{{scanClause}}} parameter and performs another scan operation. The result is analogous to the one from the initial scan. If this collection has no sub-resources, the endpoint must return no terms at all. |
| 125 | '''Note:''' Providing a sufficient answer to the scan operation with the initial query of {{{fcs.resource = root}}} is '''mandatory''', even if the endpoint only supports a single collection. In that case the endpoint must return only one {{{<sru:term>}}} within the {{{<sru:terms>}}}. \\ |
| 126 | '''Note:''' The value {{{root}}} used in the initial scan request is a reserved value and therefore cannot be used to identify any real resource at the endpoint (and it is not a URI anyway). |
| 127 | |
| 128 | Consider the following example of collections and sub-collections (with invented Handles for collection identifiers): |
| 129 | {{{#!text |
| 130 | +-"collection 1", hdl:4711/1 |
| 131 | | +-"collection 1.1", hdl:4711/4 |
| 132 | | +-"collection 1.2", hdl:4711/5 |
| 133 | +-"collection 2", hdl:4711/2 |
| 134 | +-"collection 3", hdl:4711/3 |
| 135 | +-"collection 3.1", hdl:4711/6 |
| 136 | }}} |
| 137 | The enumeration process will be performed as follows: |
| 138 | * request for {{{fcs.resource = root}}} returns terms for "collection 1", "collection 2" and "collection 3" |
| 139 | * request for {{{fcs.resource = hdl:4711/1}}} (= "collection 1") returns terms for "collection 1.1" and "collection 1.2" |
| 140 | * request for {{{fcs.resource = hdl:4711/4}}} (= "collection 1.1") returns no terms |
| 141 | * request for {{{fcs.resource = hdl:4711/5}}} (= "collection 1.2") returns no terms |
| 142 | * request for {{{fcs.resource = hdl:4711/2}}} (= "collection 2") returns no terms |
| 143 | * request for {{{fcs.resource = hdl:4711/3}}} (= "collection 3") returns terms for "collection 3.1" |
| 144 | * request for {{{fcs.resource = hdl:4711/6}}} (= "collection 3.1") returns no terms |
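The enumeration above can be sketched as a depth-first traversal. In this illustration the endpoint is replaced by an in-memory table built from the invented Handles of the example; the function `scan` is a hypothetical stand-in for a real scan request with `scanClause` set to `fcs.resource = <identifier>`.

```python
# Stand-in for the endpoint: maps the value used in the scanClause to the
# (identifier, display name) pairs of the terms that a scan would return.
COLLECTIONS = {
    "root": [
        ("hdl:4711/1", "collection 1"),
        ("hdl:4711/2", "collection 2"),
        ("hdl:4711/3", "collection 3"),
    ],
    "hdl:4711/1": [
        ("hdl:4711/4", "collection 1.1"),
        ("hdl:4711/5", "collection 1.2"),
    ],
    "hdl:4711/3": [("hdl:4711/6", "collection 3.1")],
}


def scan(resource_id):
    """Hypothetical replacement for a scan request; no terms means no sub-collections."""
    return COLLECTIONS.get(resource_id, [])


def enumerate_collections(resource_id="root", depth=0):
    """Yield (depth, identifier, name) for all collections, depth first."""
    for identifier, name in scan(resource_id):
        yield depth, identifier, name
        # Substitute the collection identifier for "root" and scan again.
        yield from enumerate_collections(identifier, depth + 1)


for depth, identifier, name in enumerate_collections():
    print("  " * depth + f"{name} ({identifier})")
```

The traversal issues one scan per collection plus the initial one, so the example tree requires seven requests in total.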
| 145 | |
| 146 | '''TODO''' resource-info stuff |
| 147 | |
| 148 | A real-world example of a response to a scan request with a {{{scanClause}}} parameter of {{{fcs.resource = root}}} could look like the following: |
| 149 | {{{#!xml |
| 150 | <sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/"> |
177 | | <sru:value>hdl:1839/00-0000-0000-0003-467E-9</sru:value> |
178 | | <sru:numberOfRecords>300</sru:numberOfRecords> |
179 | | <sru:displayTerm>Annotation types</sru:displayTerm> |
180 | | </sru:term> |
181 | | <sru:term> |
182 | | <sru:value>hdl:1839/00-0000-0000-0003-4682-F</sru:value> |
183 | | <sru:numberOfRecords>400</sru:numberOfRecords> |
184 | | <sru:displayTerm>Components</sru:displayTerm> |
185 | | </sru:term> |
186 | | <sru:term> |
187 | | <sru:value>hdl:1839/00-0000-0000-0003-4692-D</sru:value> |
188 | | <sru:numberOfRecords>350</sru:numberOfRecords> |
189 | | <sru:displayTerm>Regions</sru:displayTerm> |
| 181 | <sru:value>hdl:1839/00-0000-0000-0003-467E-9</sru:value> |
| 182 | <sru:numberOfRecords>300</sru:numberOfRecords> |
| 183 | <sru:displayTerm>Annotation types</sru:displayTerm> |
| 184 | </sru:term> |
| 185 | <sru:term> |
| 186 | <sru:value>hdl:1839/00-0000-0000-0003-4682-F</sru:value> |
| 187 | <sru:numberOfRecords>400</sru:numberOfRecords> |
| 188 | <sru:displayTerm>Components</sru:displayTerm> |
| 189 | </sru:term> |
| 190 | <sru:term> |
| 191 | <sru:value>hdl:1839/00-0000-0000-0003-4692-D</sru:value> |
| 192 | <sru:numberOfRecords>350</sru:numberOfRecords> |
| 193 | <sru:displayTerm>Regions</sru:displayTerm> |
201 | | So in this scan response the endpoint announces that the CGN-Corpus (identified with the unique MdSelfLink hdl:1839/00-0000-0000-0001-53A5-2) has 3 sub-collections: |
202 | | * Annotation types (containing 300 elements, identified by the MdSelfLink hdl:1839/00-0000-0000-0003-467E-9) |
203 | | * Components (containing 400 elements, identified by the MdSelfLink hdl:1839/00-0000-0000-0003-4682-F) |
204 | | * Regions (containing 350 elements, identified by the MdSelfLink hdl:1839/00-0000-0000-0003-4692-D) |
205 | | |
206 | | These results can then later on be used to restrict a content search to one (or more) (sub)collections. |
207 | | |
208 | | === !SearchRetrieve === |
209 | | |
210 | | The !SearchRetrieve operation is the operation that is used for searching in the resources that are provided by the endpoint. The responds provides an XML wrapper to a set of results. Here we follow the SRU standard (verison 1.2) down to the <sru:record> elements. Each <record> represents one hit of the query on the data. |
211 | | |
212 | | Within each record we use our own XML structure that matches the concept of searchable resources. Each record contains one resource. The resource has a PID. The correct resource to return here is the most precise unit of data that is directly addressable as a “whole”. This resource might contain a resourceFragment. The resourceFragment has an offset, i.e., a resource fragment is a subset of the resource that is addressable. Using a resourceFragment is optional, but encouraged. |
213 | | |
214 | | Within both the resource and the resourceFragment there can be dataView elements. At least one kwic dataView element is required, within the resourceFragment (if applicable), otherwise within the resource (if there is no resourceFragment). Other dataViews should be put in a place that is logical for their content (as is to be determined by the endpoint, e.g., a metadata dataView would most likely be directly under a resource. On the other hand a dataView representing some annotation layers directly around the hit is more likely to belong in the resourceFragment.) |
215 | | |
216 | | Each element (resource, resourceFragment, dataview) will contain a “ref” attribute, which points to the original data represented by the resource, fragment, or dataview as well as possible. It should always be possible to directly link to the resource itself. Worst case this will be a webpage describing a corpus or datacollection (including instruction on how to obtain it). Best case it directly links to a specific file or part of a resource in which the hit was obtained. The latter is not always possible, and when possible often constrained by licensing issues. We will strive to provide links that are as specific as possible/logical. |
217 | | |
218 | | We already define several allowed dataView formats below. Further extention in the future with more supported formats is deliberately kept open. Currently an active discussion is happening within the CLARIN-D community about which formats to use. |
219 | | |
220 | | Below we also cover a second topic related to the searchRetrieve operation, the expansion of search queries. |
221 | | |
222 | | == Return formats (DataView) == |
223 | | |
224 | | There are several dataviews agreed upon. Each dataView will have an attribute “type”, which has as value the type of dataView contained. It is possible to, in the future, add different dataviews if required. It is mandatody to support the KWIC dataview (as this type is fairly straightforward to show as a list of results). |
225 | | Our KWIC dataView looks as follows: |
226 | | {{{#!xml |
227 | | <fcs:DataView type="kwic"> |
228 | | <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic"> |
229 | | <kwic:c type="left">Some text with the </kwic:c> |
230 | | <kwic:kw>keyword</kwic:kw> |
231 | | <kwic:c type="right">highlighted.</kwic:c> |
232 | | </kwic:kwic> |
| 205 | So in this scan response the endpoint announces that the CGN-Corpus (identified with the unique !MdSelfLink hdl:1839/00-0000-0000-0001-53A5-2) has 3 sub-collections: |
| 206 | * Annotation types (containing 300 elements, identified by the !MdSelfLink hdl:1839/00-0000-0000-0003-467E-9) |
| 207 | * Components (containing 400 elements, identified by the !MdSelfLink hdl:1839/00-0000-0000-0003-4682-F) |
| 208 | * Regions (containing 350 elements, identified by the !MdSelfLink hdl:1839/00-0000-0000-0003-4692-D) |
| 209 | |
| 210 | These results can then later on be used to restrict a content search to one (or more) (sub-)collections. |
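A minimal sketch of how an aggregator might read such a scan response is given below. The sample document is reduced to the three terms listed above and assumes the standard SRU 1.2 layout (`<sru:terms>` containing `<sru:term>` elements); the function name `parse_terms` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"sru": "http://www.loc.gov/zing/srw/"}

# Reduced sample response, assumed to follow the SRU 1.2 scan response layout.
SCAN_RESPONSE = """<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:terms>
    <sru:term>
      <sru:value>hdl:1839/00-0000-0000-0003-467E-9</sru:value>
      <sru:numberOfRecords>300</sru:numberOfRecords>
      <sru:displayTerm>Annotation types</sru:displayTerm>
    </sru:term>
    <sru:term>
      <sru:value>hdl:1839/00-0000-0000-0003-4682-F</sru:value>
      <sru:numberOfRecords>400</sru:numberOfRecords>
      <sru:displayTerm>Components</sru:displayTerm>
    </sru:term>
    <sru:term>
      <sru:value>hdl:1839/00-0000-0000-0003-4692-D</sru:value>
      <sru:numberOfRecords>350</sru:numberOfRecords>
      <sru:displayTerm>Regions</sru:displayTerm>
    </sru:term>
  </sru:terms>
</sru:scanResponse>"""


def parse_terms(xml_text):
    """Return one dict per <sru:term>; the optional elements may be None."""
    root = ET.fromstring(xml_text)
    collections = []
    for term in root.findall("sru:terms/sru:term", NS):
        count = term.findtext("sru:numberOfRecords", default=None, namespaces=NS)
        collections.append({
            "id": term.findtext("sru:value", namespaces=NS),
            "name": term.findtext("sru:displayTerm", namespaces=NS),
            "count": int(count) if count is not None else None,
        })
    return collections


for collection in parse_terms(SCAN_RESPONSE):
    print(collection["id"], collection["name"], collection["count"])
```

The `id` values are the identifiers that can later be used to restrict a search to the corresponding (sub-)collections.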
| 211 | |
| 212 | === !SearchRetrieve Operation === |
| 213 | ==== Overview ==== |
| 214 | The !SearchRetrieve operation is used for searching in the resources that are provided by the endpoint. The SRU standard defines an XML structure for encoding the results down to the record level, i.e. {{{<sru:record>}}} elements. SRU allows records to be serialized in multiple formats, so-called ''record schemas''. For CLARIN-FCS, a custom record schema has been defined. The ''record schema identifier'' for this schema is {{{http://clarin.eu/fcs/1.0}}} and the corresponding XML Schema can be found at [http://www.clarin.eu/system/files/Resource.xsd http://www.clarin.eu/system/files/Resource.xsd]. |
| 215 | |
| 216 | Within CLARIN-FCS, each {{{<sru:record>}}} element represents one hit within a ''resource'', which is encoded as a {{{<fcs:Resource>}}} element. Each resource shall be identified by a persistent identifier (or, less preferably, an endpoint-unique URI). The correct resource to return here is the most precise unit of data that is directly addressable as a "whole". The hit may contain a ''resource fragment'', which is encoded as a {{{<fcs:ResourceFragment>}}} element. The resource fragment shall be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using resource fragments is optional, but encouraged. |
| 217 | |
| 218 | The actual hit within a resource is encoded in a ''data view'' format and is serialized as a {{{<fcs:DataView>}}} element inside either the {{{<fcs:Resource>}}} or the {{{<fcs:ResourceFragment>}}} element. Each hit may be serialized as multiple data views; however, the keyword-in-context (KWIC) data view is mandatory, within the resource fragment (if applicable) or otherwise within the resource (if there is no reasonable resource fragment). Other data views should be put in a place that is logical for their content, as determined by the endpoint. E.g. a metadata data view would most likely be put directly under a resource, whereas a data view representing some annotation layers directly around the hit is more likely to belong within the resource fragment. |
| 219 | |
| 220 | Each entity (i.e. {{{<fcs:Resource>}}}, {{{<fcs:ResourceFragment>}}} or {{{<fcs:DataView>}}} element) contains a {{{ref}}} attribute, which points to the original data represented by the resource, resource fragment, or data view as well as possible. It should always be possible to directly link to the resource itself. In the worst case this will be a web page describing a corpus or collection (including instructions on how to obtain it). In the best case it links directly to the specific file or part of a resource in which the hit was obtained. The latter is not always possible, and when possible often constrained by licensing issues. Endpoints should provide links that are as specific as possible/logical. |
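The lookup order for the mandatory KWIC data view (resource fragment first, then the resource itself) can be sketched as follows. This is an illustration, not a reference implementation: the function names are hypothetical, the sample record is taken from the Goethe example on this page, and the set of accepted `type` values is an assumption that covers both the MIME type `application/x-clarin-fcs-kwic+xml` and the short form `kwic` used in that example.

```python
import xml.etree.ElementTree as ET

NS = {
    "fcs": "http://clarin.eu/fcs/1.0",
    "kwic": "http://clarin.eu/fcs/1.0/kwic",
}
# Assumption: accept the MIME type as well as the short form "kwic".
KWIC_TYPES = {"application/x-clarin-fcs-kwic+xml", "kwic"}

RESOURCE = """<fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGI.04846"
    ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGI.04846">
  <fcs:DataView type="kwic">
    <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
      <kwic:c type="left">der</kwic:c>
      <kwic:kw>"Gott"</kwic:kw>
      <kwic:c type="right">leistet mir die beste Gesellschaft.</kwic:c>
    </kwic:kwic>
  </fcs:DataView>
</fcs:Resource>"""


def find_kwic(resource):
    """Return the <kwic:kwic> element, preferring the resource fragment."""
    fragment = resource.find("fcs:ResourceFragment", NS)
    for parent in (fragment, resource):
        if parent is None:
            continue
        for view in parent.findall("fcs:DataView", NS):
            if view.get("type") in KWIC_TYPES:
                return view.find("kwic:kwic", NS)
    return None


def kwic_parts(kwic):
    """Return (left context, keyword, right context) with whitespace trimmed."""
    left = kwic.findtext("kwic:c[@type='left']", default="", namespaces=NS)
    keyword = kwic.findtext("kwic:kw", default="", namespaces=NS)
    right = kwic.findtext("kwic:c[@type='right']", default="", namespaces=NS)
    return left.strip(), keyword.strip(), right.strip()


resource = ET.fromstring(RESOURCE)
print(resource.get("ref"))
print(kwic_parts(find_kwic(resource)))
```

Trimming the whitespace is a display choice made for this sketch; an aggregator that needs the exact context would keep the text as delivered.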
| 221 | |
| 222 | |
| 223 | ==== Data View and Data View formats ==== |
| 224 | The data views are designed to allow for different representations of search results within CLARIN-FCS. They are deliberately kept open to allow further extensions in the future with more supported data view formats. |
| 225 | |
| 226 | The type of each data view is identified by the {{{type}}} attribute of the {{{<fcs:DataView>}}} element. The value is defined to be a [http://en.wikipedia.org/wiki/MIME_Type MIME type]. If no existing MIME type can be used, implementors are encouraged to define a suitable private MIME type. The following formats are currently being considered: |
| 227 | Keyword-In-Context (KWIC):: |
| 228 | Description: a keyword-in-context view, where each hit should be presented within the context of a complete sentence (if possible) or any other reasonable unit of context (e.g. if sentences cannot be determined by the endpoint). The keyword-in-context data view is '''mandatory''' for all endpoints. \\ |
| 229 | Type: {{{application/x-clarin-fcs-kwic+xml}}} \\ |
| 230 | Example for a keyword-in-context data view: |
| 231 | {{{#!xml |
| 232 | <fcs:DataView type="application/x-clarin-fcs-kwic+xml"> |
| 233 | <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic"> |
| 234 | <kwic:c type="left">Some text with the </kwic:c> |
| 235 | <kwic:kw>keyword</kwic:kw> |
| 236 | <kwic:c type="right">highlighted.</kwic:c> |
| 237 | </kwic:kwic> |
235 | | |
236 | | Another possible dataView is currently CMDI. The metadata should be applicable to the specific context, i.e., if contained inside a resource element it should apply to the entire resource, and if contained inside a resourceFragment element is should apply specifically to the resourceFragment (i.e., more specialized than the metadata for the encompassing resource). |
237 | | |
238 | | Furthermore, the geographic format KML is valid as a dataView. (see http://www.opengeospatial.org/standards/kml for a specification). |
239 | | |
240 | | Finally, we also specify the possible inclusion of three different content-container dataViews: |
241 | | 1. Fulltext; this dataView simply contains a (reasonably sized) bunch of running text in between the dataView tags. By providing this format it allows the CLARIN centers that have the possibility of making available full-text un-annotated data to make available such data. Some examples here are the Goethe and the El-Pais corpora. |
242 | | 2. EAF: this dataView contains the EAF format as specified elsewhere. The EAF format is the foremost format for annotating time-aligned data such as on audio or video files and it is widely used by CLARIN partners. By providing such a format researchers can study detailed information on the hits found by the search architecture. |
243 | | 3. TCF: this dataView contains the TCF format as specified for the WEBLICHT program. By having data available in TCF (where applicable) we create a more integrated search architecture, where search results can be further processed in a weblicht toolchain. |
| 240 | CMDI metadata:: |
| 241 | Description: a CMDI metadata record applicable to the specific context, i.e., if contained inside a resource element it should apply to the entire resource, and if contained inside a resource fragment it should apply specifically to the resource fragment (i.e., more specialized than the metadata for the encompassing resource). \\ |
| 242 | Type: {{{application/x-cmdi+xml}}} \\ |
| 243 | Geolocation (KML):: |
| 244 | Description: A geographic location encoded in the Keyhole Markup Language. \\ |
| 245 | Type: {{{application/vnd.google-earth.kml+xml}}} |
| 246 | |
| 247 | Currently, the inclusion of three different content-container data views is being discussed: |
| 248 | 1. Fulltext: this data view simply contains a (reasonably sized) chunk of running text. This format would allow CLARIN centers to make un-annotated full-text data available. Some examples here are the Goethe and the El-Pais corpora. |
| 249 | 2. EAF: this data view contains the EAF format. The EAF format is the foremost format for annotating time-aligned data such as audio or video files, and it is widely used by CLARIN partners. By providing such a format, researchers can study detailed information on the hits found by the search architecture. |
| 250 | 3. TCF: this data view contains the TCF format as specified for the !WebLicht program. The format would allow users to further process search results in a !WebLicht tool chain. |
| 251 | |
| 252 | |
| 253 | '''TODO''' CONTINUE FROM HERE |
252 | | <sru:version>1.2</sru:version> |
253 | | <sru:numberOfRecords>100</sru:numberOfRecords> |
254 | | <sru:records> |
255 | | <sru:record> |
256 | | <sru:recordSchema>http://clarin.eu/fcs/1.0</sru:recordSchema> |
257 | | <sru:recordPacking>xml</sru:recordPacking> |
258 | | <sru:recordData> |
259 | | <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGA.00000" |
260 | | ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGA.00000"> |
261 | | <fcs:DataView type="kwic"> |
262 | | <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic"> |
263 | | <kwic:c type="left"> und so will ich denn hier auch noch anführen, daß |
264 | | ich in diesem Elend das neckische Gelübde getan: man solle, wenn ich |
265 | | uns erlöst und mich wieder zu Hause sähe, von mir niemals wieder |
266 | | einen Klagelaut vernehmen über den meine freiere Zimmeraussicht |
267 | | beschränkenden Nachbargiebel, den ich vielmehr jetzt recht sehnlich |
268 | | zu erblicken wünsche; ferner wollt' ich mich über Mißbehagen und |
269 | | Langeweile im deutschen Theater nie wieder beklagen, wo man doch |
270 | | immer </kwic:c> |
271 | | <kwic:kw>Gott</kwic:kw> |
272 | | <kwic:c type="right"> danken könne, unter Dach zu sein, was auch auf der |
273 | | Bühne vorgehe. </kwic:c> |
274 | | </kwic:kwic> |
275 | | </fcs:DataView> |
276 | | </fcs:Resource> |
277 | | </sru:recordData> |
278 | | <sru:recordPosition>1</sru:recordPosition> |
279 | | </sru:record> |
280 | | <!-- Some Records skipped --> |
281 | | <sru:record> |
282 | | <sru:recordSchema>http://clarin.eu/fcs/1.0</sru:recordSchema> |
283 | | <sru:recordPacking>xml</sru:recordPacking> |
284 | | <sru:recordData> |
285 | | <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGI.04846" |
286 | | ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGI.04846"> |
287 | | <fcs:DataView type="kwic"> |
288 | | <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic"> |
289 | | <kwic:c type="left">der</kwic:c> |
290 | | <kwic:kw>"Gott"</kwic:kw> |
291 | | <kwic:c type="right">leistet mir die beste Gesellschaft.</kwic:c> |
292 | | </kwic:kwic> |
293 | | </fcs:DataView> |
294 | | </fcs:Resource> |
295 | | </sru:recordData> |
296 | | <sru:recordPosition>100</sru:recordPosition> |
297 | | </sru:record> |
298 | | </sru:records> |
299 | | <sru:echoedSearchRetrieveRequest> |
300 | | <sru:version>1.2</sru:version> |
301 | | <sru:query>Gott</sru:query> |
302 | | <sru:xQuery xmlns="http://www.loc.gov/zing/cql/xcql/"> |
303 | | <searchClause> |
304 | | <index>cql.serverChoice</index> |
305 | | <relation> |
306 | | <value>=</value> |
307 | | </relation> |
308 | | <term>Gott</term> |
309 | | </searchClause> |
310 | | </sru:xQuery> |
311 | | <sru:baseUrl>http://clarin.ids-mannheim.de/cosmassru</sru:baseUrl> |
312 | | </sru:echoedSearchRetrieveRequest> |
| 261 | <sru:version>1.2</sru:version> |
| 262 | <sru:numberOfRecords>100</sru:numberOfRecords> |
| 263 | <sru:records> |
| 264 | <sru:record> |
| 265 | <sru:recordSchema>http://clarin.eu/fcs/1.0</sru:recordSchema> |
| 266 | <sru:recordPacking>xml</sru:recordPacking> |
| 267 | <sru:recordData> |
| 268 | <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGA.00000" |
| 269 | ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGA.00000"> |
| 270 | <fcs:DataView type="kwic"> |
| 271 | <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic"> |
| 272 | <kwic:c type="left"> |
| 273 | und so will ich denn hier auch noch anführen, daß ich in diesem Elend das neckische |
| 274 | Gelübde getan: man solle, wenn ich uns erlöst und mich wieder zu Hause sähe, von mir |
| 275 | niemals wieder einen Klagelaut vernehmen über den meine freiere Zimmeraussicht |
| 276 | beschränkenden Nachbargiebel, den ich vielmehr jetzt recht sehnlich zu erblicken |
| 277 | wünsche; ferner wollt' ich mich über Mißbehagen und Langeweile im deutschen Theater |
| 278 | nie wieder beklagen, wo man doch immer |
| 279 | </kwic:c> |
| 280 | <kwic:kw> |
| 281 | Gott |
| 282 | </kwic:kw> |
| 283 | <kwic:c type="right"> |
| 284 | danken könne, unter Dach zu sein, was auch auf der Bühne vorgehe. |
| 285 | </kwic:c> |
| 286 | </kwic:kwic> |
| 287 | </fcs:DataView> |
| 288 | </fcs:Resource> |
| 289 | </sru:recordData> |
| 290 | <sru:recordPosition>1</sru:recordPosition> |
| 291 | </sru:record> |
| 292 | <!-- some records skipped --> |
| 293 | <sru:record> |
| 294 | <sru:recordSchema>http://clarin.eu/fcs/1.0</sru:recordSchema> |
| 295 | <sru:recordPacking>xml</sru:recordPacking> |
| 296 | <sru:recordData> |
| 297 | <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGI.04846" |
| 298 | ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGI.04846"> |
| 299 | <fcs:DataView type="kwic"> |
| 300 | <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic"> |
| 301 | <kwic:c type="left"> der </kwic:c> |
| 302 | <kwic:kw> "Gott" </kwic:kw> |
| 303 | <kwic:c type="right"> leistet mir die beste Gesellschaft. </kwic:c> |
| 304 | </kwic:kwic> |
| 305 | </fcs:DataView> |
| 306 | </fcs:Resource> |
| 307 | </sru:recordData> |
| 308 | <sru:recordPosition>100</sru:recordPosition> |
| 309 | </sru:record> |
| 310 | </sru:records> |
| 311 | <sru:echoedSearchRetrieveRequest> |
| 312 | <sru:version>1.2</sru:version> |
| 313 | <sru:query>Gott</sru:query> |
| 314 | <sru:xQuery xmlns="http://www.loc.gov/zing/cql/xcql/"> |
| 315 | <searchClause> |
| 316 | <index>cql.serverChoice</index> |
| 317 | <relation> |
| 318 | <value>=</value> |
| 319 | </relation> |
| 320 | <term>Gott</term> |
| 321 | </searchClause> |
| 322 | </sru:xQuery> |
| 323 | <sru:baseUrl>http://clarin.ids-mannheim.de/cosmassru</sru:baseUrl> |
| 324 | </sru:echoedSearchRetrieveRequest> |