Changes between Version 24 and Version 25 of FCS-specification


Timestamp:
12/18/12 16:51:14 (11 years ago)
Author:
oschonef
Comment:

Update FCS Spec (work-in-progress)

  • FCS-specification

    v24 v25  
    1 = CLARIN Federated Search =
     1= CLARIN Federated Content Search (CLARIN-FCS) =
    22[[PageOutline(1-3)]]
    3 Authors: Herman Stehouwer, Dieter van Uytvanck \\
     3Authors: Herman Stehouwer, Dieter van Uytvanck, Oliver Schonefeld \\
    44Responsible: Herman Stehouwer \\
    5 Purpose: Provide an overview of the FCS and serve as a guide for implementation.
     5Purpose: Provide an overview of the CLARIN-FCS system and serve as a guide for implementation.
     6
     7== Introduction ==
     8
     9CLARIN Federated Content Search (CLARIN-FCS) is built upon the SRU/CQL standard. Search/Retrieval via URL (SRU) specifies a general communication protocol for searching and retrieving records, and the Contextual Query Language (CQL) specifies an extensible query language. This document specifies how CLARIN-FCS uses and extends SRU/CQL. For a proper understanding one needs to be familiar with the official documentation of the SRU/CQL standard, which can be found at http://www.loc.gov/standards/sru/.
     10
     11The design of the CLARIN-FCS system consists of two major parts:
     12 1. the communication protocol and the query language.
     13 2. the aggregator / user interface.
     14
     15This document consists of two parts: the first part deals with the global design thoughts, whereas the second part deals with the technical specification and provides an implementation guide for endpoints.
     16
     17
     18= Global Design Thoughts =
    619
    720== Overview ==
    8 
    9 The design of the federated search system consists of two major parts:
    10  1. The communication protocol and the query language.
    11  2. The aggregator / user interface.
    12 {{{#!comment
    13 
    14 NOTE TO EDITOR: rework!
    15 
    16 This document deals with both parts. The first part of this document deals with the user interface and the global design thoughts, whereas the second part deals with the technical specification (communication protocol and query language).
    17 }}}
    18 
    19 The federated search depends on the specification and implementation of the VLO, the metadata search, the virtual collection registry, and CMDI.
    20 
    21 In general each CLARIN-center participating will provide at least the following services:
    22  * Provide one or more resources
    23  * Support Content-search within those resources
    24  * Return search-hits in the agreed-upon format
    25  * Support query-expansion if possible
    26  * Support the selection of a sub-part of the offered resources to perform content-search on that sup-part
    27  * Provide support for the sub-part selection by providing CMDI metadata at the same, reasonable , granularity
    28  
    29 = Global Design Thoughts =
    30 
    31 == Overview ==
    32  
    33 {{{#!comment
    34 
    35 NOTE TO EDITOR: redundant?
    36 
    37 The design of the federated search system consists of two major parts:
    38  1. The communication protocol and the query language.
    39  2. The aggregator / user interface.
    40 This part discusses the user-interface, wishes we have for the user interface, and how this affects or is affected by the protocol choice. 
    41 }}}
    42 
    43 The base of the communication protocol is SRU and the base of the query language is CQL. The wiki on trac.clarin.eu talks about the protocol and the query language in detail. We remark that for a proper understanding one needs to be familiar with the official documentation of the SRU/CQL standard, which can be found at http://www.loc.gov/standards/sru/.
     21The CLARIN-FCS depends on the specification and implementation of the Virtual Language Observatory (VLO), the metadata search, the Virtual Collection Registry (VCR), and CMDI.
     22
     23In general each CLARIN center participating within CLARIN-FCS will provide at least the following services:
     24 * provide one or more resources
     25 * support Content-search within those resources
     26 * return search-hits in the agreed-upon format
     27 * support query-expansion (if possible)
     28 * support the selection of a sub-part of the offered resources to perform content-search on that sub-part
     29 * provide support for the sub-part selection by providing CMDI metadata at the same, reasonable, granularity
    4430
    4531== Corpus Selection ==
     
    5036In the future we see ourselves building a shared research infrastructure, where the search functionality is also usable within other programs (by directly calling endpoints) and in combination with the metadata search, VLO, and virtual collection registry. In order to enable the extension of the scope selection into a useful infrastructure several conditions need to be met:
    5137 1. CMDI metadata records need to be available for the resources at a useful granularity.
    52  2. Those CMDI records should contain a unique identifier.
    53  3. The endpoints need to understand their own, corresponding unique identifiers.
     38 2. those CMDI records should contain a unique identifier.
     39 3. the endpoints need to understand their own, corresponding unique identifiers.
    5440
    5541== Queries ==
    56 
    57 The query entry field is where the user enters the query. These queries are passed on directly to the endpoints corresponding to the selected corpora (as above). We note that as we use CQL quite complex queries can be entered as well as simple keyword searches. These queries should be simple first (just one string such as “laufen”), but we also should allow for a simple syntax allowing to specify attributes of the string (tiers) such as indicated in the following examples: “laufen and css.pos=noun”, “laufen and css.ges=stroke”. This needs to be specified and should remain simple enough to start with.
     42The query entry field is where the user enters the query. These queries are passed on directly to the endpoints corresponding to the selected corpora (as above). We note that as we use CQL quite complex queries can be entered as well as simple keyword searches. These queries should be simple first (just one string such as "laufen"), but we also should allow for a simple syntax allowing to specify attributes of the string (tiers) such as indicated in the following examples: "laufen and css.pos=noun", "laufen and css.ges=stroke". This needs to be specified and should remain simple enough to start with.
    5843Of course it is up to the local search engines whether they can resolve this to useful queries in their repositories and how they can do it.
    5944
    60 == Tagsets & Expanding the tags ==
     45== Tagsets and expanding the tags ==
    6146Users will want to search for specific properties encoded in annotations. These annotations, from different corpora, may store the same concept in different ways. For instance in gesture encoding, morphological segmentation encoding, part-of-speech tags,  or semantic information.
    6247
    63 This is the issue of tag expansion. If “no-expansion” has been chosen, then the query should be taken literally and no available relations should be used. If “expansion” has been chosen then relations should be used. We argue that that expansion of the tag to all locally applicable tags by making use of RELcat  should happen locally at the endpoint. After all, endpoints are meant to be used programmatically by many clients, not just our demonstrator. It’s the local repository who best understand how things can be extended in a useful way. Having only the relevant expansion happen at the endpoint would make using them in infrastructures easier and more attractive..
     48This is the issue of tag expansion. If "no-expansion" has been chosen, then the query should be taken literally and no available relations should be used. If "expansion" has been chosen, then relations should be used. We argue that expansion of the tag to all locally applicable tags by making use of RELcat should happen locally at the endpoint. After all, endpoints are meant to be used programmatically by many clients, not just our demonstrator. It is the local repository that best understands how things can be extended in a useful way. Having only the relevant expansion happen at the endpoint would make using endpoints in infrastructures easier and more attractive.
    6449
    6550== Return format ==
     
    7156 * !ResourceType = !SearchService
    7257 * mimetype = application/sru+xml
    73 
    7458{{{#!xml
    7559 <ResourceProxy id="d55">
     
    8569
    8670== General ==
    87 The user-interface (or aggregator) is a component that communicates with all endpoints within the federated search. Each endpoint offers some searchable resources. The content of these resources can be searched. Depending on the needs of the user and the capabilities of the endpoint several return formats are supported. The user-interface collects the responses from all the endpoints and shows these to the user. In order to facilitate this listing all endpoints will always return the results at least in the keyword-in-context format.
    88 
    89 The base of the communication protocol is SRU and the base of the query language is CQL. The wiki on trac.clarin.eu talks about the protocol and the query language in detail. We remark that it is helpful to be familiar with the official documentation of the SRU/CQL standard, which can be found at http://www.loc.gov/standards/sru/.
    90 
    91 == SRU ==
    92 The SRU protocol supports three different operations: Explain, Scan, and SearchRetrieve. Of these, the Explain operation is the default. Parameters are given as HTTP GET or HTTP POST arguments. If no arguments are given the explain operation is performed.
    93 
    94 We again refer to the standard at http://www.loc.gov/standards/sru/ (version 1.2) where one can study the details of the protocol. Below we will only discuss the ways in which our implementations differ from the basic protocol, as well as the rest of our agreements (e.g., the return format, the manner in which we pass restrictions of the search-space, the way in which we list the available corpora).
    95 
    96 As a namespace for our extensions to the SRU XML we use “http://clarin.eu/fcs/1.0”. The schema that validated our extension is found at: http://www.clarin.eu/system/files/Resource.xsd
    97 
    98 === Explain ===
    99 
    100 This basic request serves to announce server's capabilities and should allow the client to configure itself automatically. The explain response should, ideally, provide a list of ISOcatted indexes as possible search indexes. If there is no ISOcat equivalent the CCS-context*  set is to be used. We provide a telling example (as seen within the context of the explain response as defined on the SRU/CQL website):
    101 
     71From a technical perspective, CLARIN-FCS consists of two major parts:
     72 * the ''aggregator'':  a user interface to perform queries and display search results
     73 * several ''endpoints'': services provided by the CLARIN centers that wish to participate in CLARIN-FCS
     74The ''aggregator'' is a component that communicates with the ''endpoints'', which are provided as a service by all centers that wish to participate in the federated content search. Each endpoint provides access to one or more searchable ''resources''. The content of these resources is searched with the ''query'' supplied to the endpoint. The endpoint returns results for this query in one or more representations, called ''data views'', e.g. a keyword-in-context (KWIC) data view. The aggregator collects the responses from all the endpoints and displays them to the user. To facilitate this listing, all endpoints are required to support returning results at least in the keyword-in-context data view.
     75
     76== SRU/CQL ==
     77The SRU protocol supports three different operations: ''Explain'', ''Scan'', and ''!SearchRetrieve''. Of these, the ''Explain'' operation is the default. Parameters are given either as HTTP GET or HTTP POST arguments. If no arguments are given the ''Explain'' operation shall be performed by an endpoint.
     78
     79Generally, CLARIN-FCS is built upon the SRU/CQL standard (version 1.2), which is available from [http://www.loc.gov/standards/sru/ http://www.loc.gov/standards/sru/]. The following sections describe how CLARIN-FCS is mapped to SRU/CQL and how the protocol is extended to meet the requirements for CLARIN-FCS (e.g., the return format, the manner in which we pass restrictions of the search-space, the way in which we list the available corpora).
     80
     81'''NOTE:''' Before starting to implement an endpoint at a center, you are strongly encouraged to familiarize yourself with the SRU/CQL standard.
     82
     83=== Explain Operation ===
     84This basic request serves to announce the endpoint's capabilities and should allow the client to configure itself automatically. The explain response should, ideally, provide a list of ISOcatted indexes as possible search indexes. If there is no ISOcat equivalent, the FCS-context* set is to be used. We provide a telling example (as seen within the context of the explain response as defined on the SRU/CQL website):
    10285{{{#!xml
    10386<indexInfo>
    104 <set identifier="isocat.org/datcat" name="isocat"/>
    105 <set identifier="http://clarin.eu/fcs/1.0" name="fcs"/>   
    106 
    107 <index id="?">
    108 <title lang="en">Part of Speech</title>   
    109 <map><name set="isocat">partOfSpeech</name></map>
    110     </index>
    111 
    112     <index id="?">
    113       <title lang="en">Words</title>
    114       <map><name set="fcs">word</name></map>
    115     </index>
    116 
    117     <index id="?">
    118       <title lang="en">Phonetics</title>
    119       <map><name set="fcs">phonetics</name></map>
    120     </index>         
     87  <set identifier="isocat.org/datcat" name="isocat"/>
     88  <set identifier="http://clarin.eu/fcs/1.0" name="fcs"/>   
     89  <index id="?">
     90    <title lang="en">Part of Speech</title>   
     91    <map>
     92      <name set="isocat">partOfSpeech</name>
     93    </map>
     94  </index>
     95  <index id="?">
     96    <title lang="en">Words</title>
     97    <map>
     98      <name set="fcs">word</name>
     99    </map>
     100  </index>
     101  <index id="?">
     102    <title lang="en">Phonetics</title>
     103    <map>
     104      <name set="fcs">phonetics</name>
     105    </map>
     106  </index>         
    121107</indexInfo>
    122108}}}
    123109
    124 The three indexes that are defined to be searchable on the collections/parts in this endpoint are then: isocat.partOfSpeech, ccs.word, and ccs.phonetics. An example query could for instance be ccs.word = child.
    125 Note that the indexes given are completely dynamic, i.e., they can differ from endpoint to endpoint! We have a clear TODO here to agree on a common set of required/desirable indexes!
    126 
    127 The searchable indexes are used as a list in the “searchable tiers  / tier-types” in the user-interface.  It is up to the center to provide a complete list of searchable tier-types (e.g., full-text, transcription, morphological segmentation, part-of-speech annotation, semantic annotation, etc.) or searchable tier-names. We consider it obvious that selectable tier-types trumps tier-names, e.g., “tr1, tr, trans, trans1” as a list is worse than “transcription” as a list. However, if a subdivision of the tiers into tier-types is not available, the tier-names should be provided as the user might be familiar with the corpus and have some use for them.
    128 
    129 Furthermore, we may assume that annotation searches do not match running text and vice-versa. Generally this seems to be a reasonable assumption. However, there will be many corner cases where this assumption does not hold. E.g., the annotation “up” for a type of stroke (in the case of gesture-annotation) is quite likely to also occur in running text. So, while we can get decent results by simply searching, having the tier-type distinction would be better.
    130 
    131 === Scan ===
    132 
    133 '''NOTE:''' this section will be slightly changed in the near future.
    134 
    135 We foresee the scan operation as a way of signaling to the calling program/user/aggregator the available resources available for searching at the endpoint. This in contrast to the definition in SRU, where scan is a way to browse a list of keywords. The value of the scanClause parameter should be '''fcs.resource'''.
    136 
    137 To this the endpoint will return a list of terms, which are searchable collections. Their identifiers can than be used to restrict the search by passing one (or more) as parameters in x-context in the searchRetrieve operation.
    138 
    139 Again, we provide a telling example:
    140 {{{#!xml
    141 <sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/" >
     110The three indexes that are defined to be searchable on the collections/parts in this endpoint are then: {{{isocat.partOfSpeech}}}, {{{fcs.word}}}, and {{{fcs.phonetics}}}. An example query could for instance be {{{fcs.word = child}}}.
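A minimal sketch of how a client might derive these qualified index names from the {{{<indexInfo>}}} block. It assumes the fragment is parsed without an XML namespace, exactly as printed in the example; in a real explain response the elements would normally sit inside a namespaced explain record, so the lookups would need adjusting. The function name {{{qualified_indexes}}} is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <indexInfo> fragment from the example (compact form).
INDEX_INFO = """
<indexInfo>
  <set identifier="isocat.org/datcat" name="isocat"/>
  <set identifier="http://clarin.eu/fcs/1.0" name="fcs"/>
  <index id="?">
    <title lang="en">Part of Speech</title>
    <map><name set="isocat">partOfSpeech</name></map>
  </index>
  <index id="?">
    <title lang="en">Words</title>
    <map><name set="fcs">word</name></map>
  </index>
  <index id="?">
    <title lang="en">Phonetics</title>
    <map><name set="fcs">phonetics</name></map>
  </index>
</indexInfo>
"""


def qualified_indexes(index_info_xml):
    """Return 'set.name' strings for every index that maps to a declared set."""
    root = ET.fromstring(index_info_xml.strip())
    declared_sets = {s.get("name") for s in root.findall("set")}
    result = []
    for index in root.findall("index"):
        for name in index.findall("map/name"):
            prefix = name.get("set")
            # Skip names that refer to a context set the endpoint did not declare.
            if prefix in declared_sets and name.text:
                result.append(f"{prefix}.{name.text.strip()}")
    return result


print(qualified_indexes(INDEX_INFO))
```

The resulting names can be used directly on the left-hand side of a CQL search clause such as {{{fcs.word = child}}}.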
     111
     112'''Note:''' The indexes given in the example are completely dynamic, i.e., they can differ from endpoint to endpoint. We have a clear TODO here to agree on a common set of required/desirable indexes!
     113
     114The searchable indexes are used as a list in the "searchable tiers / tier-types" in the user-interface. It is up to the center to provide a complete list of searchable tier-types (e.g., full-text, transcription, morphological segmentation, part-of-speech annotation, semantic annotation, etc.) or searchable tier-names. Selectable tier-types clearly trump tier-names, e.g., "tr1, tr, trans, trans1" as a list is worse than "transcription" as a list. However, if a subdivision of the tiers into tier-types is not available, the tier-names should be provided, as the user might be familiar with the corpus and have some use for them.
     115
     116Furthermore, we may assume that annotation searches do not match running text and vice-versa. Generally this seems to be a reasonable assumption. However, there will be many corner cases where this assumption does not hold, e.g. the annotation "up" for a type of stroke (in the case of gesture-annotation) is quite likely to also occur in running text. So, while we can get decent results by simply searching, having the tier-type distinction would be better.
     117
     118=== Scan Operation ===
     119Within CLARIN-FCS the scan operation is used to allow the calling agent (i.e. the aggregator) to enumerate all resources available for searching at a given endpoint. This is in contrast to the definition in SRU, where scan is a way to browse a list of keywords.
     120
     121The enumeration process works as follows:
     122 1. The agent performs an initial scan operation with a {{{scanClause}}} parameter of {{{fcs.resource = root}}}.
     123 2. The endpoint will return a list of terms ({{{<sru:term>}}} elements within {{{<sru:terms>}}}), which denote searchable collections. The element {{{<sru:value>}}} of each {{{<sru:term>}}} element is used as an identifier for a collection at an endpoint. It must be a valid !MdSelfLink (which is defined as a URI). This identifier can then be used to restrict the search by passing one (or more) identifiers in the {{{x-cmd-context}}} parameter of the !SearchRetrieve operation. The optional {{{<sru:numberOfRecords>}}} element of the {{{<sru:term>}}} element may contain the (approximate) number of searchable resources within a collection; the optional element {{{<sru:displayTerm>}}} should contain the human-readable name of the collection.
     124 3. To check for sub-collections, the agent performs a [http://en.wikipedia.org/wiki/Tree_traversal tree traversal]. More specifically, the agent substitutes the identifier of a collection for the value {{{root}}} in the {{{scanClause}}} parameter and performs another scan operation. The result is analogous to the one from the initial scan. If this collection has no sub-collections, the endpoint must return no terms at all.
     125'''Note:''' Providing a sufficient answer to the scan operation with the initial query of {{{fcs.resource = root}}} is '''mandatory''', even if the endpoint only supports a single collection. In that case the endpoint must return exactly one {{{<sru:term>}}} within the {{{<sru:terms>}}}. \\
     126'''Note:''' The value {{{root}}} used in the initial scan request is a reserved value and therefore cannot be used to identify any real resource at the endpoint (and it is not a URI anyway).
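As an illustration of the request side, the following sketch builds the URL for a scan request. The endpoint address {{{http://endpoint.example.org/sru}}} and the helper name {{{scan_url}}} are hypothetical; the parameter names {{{operation}}}, {{{version}}} and {{{scanClause}}} are those of SRU 1.2, and the lower-case operation value {{{scan}}} is assumed to be what the endpoint expects.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint address, used for illustration only.
ENDPOINT = "http://endpoint.example.org/sru"


def scan_url(endpoint, resource="root", version="1.2"):
    """Build the URL of a scan request for one level of the collection tree."""
    params = {
        "operation": "scan",
        "version": version,
        # 'root' requests the top-level collections; any other value is
        # the identifier of a collection returned by an earlier scan.
        "scanClause": f"fcs.resource = {resource}",
    }
    # urlencode percent-encodes the clause, so '=', ':' and '/' are safe.
    return f"{endpoint}?{urlencode(params)}"


url = scan_url(ENDPOINT, "hdl:4711/1")
decoded = parse_qs(urlsplit(url).query)
print(decoded["scanClause"][0])
```

Encoding the whole clause matters because collection identifiers are URIs and routinely contain characters that are reserved in a query string.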
     127
     128Given the following example of collections and sub-collections (with invented Handles for collection identifiers).
     129{{{#!text
     130+-"collection 1", hdl:4711/1
     131| +-"collection 1.1", hdl:4711/4
     132| +-"collection 1.2", hdl:4711/5
     133+-"collection 2", hdl:4711/2
     134+-"collection 3", hdl:4711/3
     135  +-"collection 3.1", hdl:4711/6
     136}}}
     137The enumeration process will be performed as follows:
     138 * request for {{{fcs.resource = root}}} returns terms for "collection 1", "collection 2" and "collection 3"
     139 * request for {{{fcs.resource = hdl:4711/1}}} (= "collection 1") returns terms for "collection 1.1" and "collection 1.2"
     140 * request for {{{fcs.resource = hdl:4711/4}}} (= "collection 1.1") returns no terms
     141 * request for {{{fcs.resource = hdl:4711/5}}} (= "collection 1.2") returns no terms
     142 * request for {{{fcs.resource = hdl:4711/2}}} (= "collection 2") returns no terms
     143 * request for {{{fcs.resource = hdl:4711/3}}} (= "collection 3") returns terms for "collection 3.1"
     144 * request for {{{fcs.resource = hdl:4711/6}}} (= "collection 3.1") returns no terms
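
The enumeration can be sketched as a depth-first traversal. In this sketch the network call is replaced by a stand-in function, {{{fake_scan}}}, that answers from a dictionary mirroring the example tree; a real agent would instead issue a scan request and read the {{{<sru:value>}}} of each returned term. All names in the code are hypothetical.

```python
# Stand-in for the endpoint: maps a scanned identifier to the identifiers
# of its direct sub-collections (the invented Handles from the example).
EXAMPLE_TREE = {
    "root": ["hdl:4711/1", "hdl:4711/2", "hdl:4711/3"],
    "hdl:4711/1": ["hdl:4711/4", "hdl:4711/5"],
    "hdl:4711/3": ["hdl:4711/6"],
}

REQUESTS = []  # records the order in which scan requests are issued


def fake_scan(resource):
    """Pretend to scan 'fcs.resource = <resource>'; leaves return no terms."""
    REQUESTS.append(resource)
    return EXAMPLE_TREE.get(resource, [])


def enumerate_collections(scan, resource="root", depth=0):
    """Depth-first enumeration; returns (depth, identifier) pairs."""
    found = []
    for child in scan(resource):
        found.append((depth, child))
        # Scan each collection in turn to discover its sub-collections.
        found.extend(enumerate_collections(scan, child, depth + 1))
    return found


collections = enumerate_collections(fake_scan)
for depth, identifier in collections:
    print("  " * depth + identifier)
```

Every collection costs one additional request, including the leaves, because the agent cannot know that a collection has no sub-collections until a scan returns no terms.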
     145
     146'''TODO''' resource-info stuff
     147
     148A real-world example of a response to a scan request with a {{{scanClause}}} parameter of {{{fcs.resource = root}}} could look like the following:
     149{{{#!xml
     150<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/">
    142151<sru:version>1.2</sru:version>
    143152  <sru:terms> 
    144153    <sru:term>
    145           <sru:value>hdl:1839/00-0000-0000-0001-53A5-2</sru:value>
    146           <sru:numberOfRecords>12098</sru:numberOfRecords>
    147           <sru:displayTerm>The CGN-Corpus (Corpus Gesproken Nederlands)</sru:displayTerm>
    148     </sru:term>
    149     <sru:term>
    150           <sru:value>http://corpus1.mpi.nl/qfs1/media-archive/mirrored_corpora/childes/Corpusstructure/childes.imdi</sru:value>
    151           <sru:numberOfRecords>42</sru:numberOfRecords>
    152           <sru:displayTerm>Childes corpus</sru:displayTerm>
     154      <sru:value>hdl:1839/00-0000-0000-0001-53A5-2</sru:value>
     155      <sru:numberOfRecords>12098</sru:numberOfRecords>
     156      <sru:displayTerm>The CGN-Corpus (Corpus Gesproken Nederlands)</sru:displayTerm>
     157    </sru:term>
     158    <sru:term>
     159      <sru:value>http://corpus1.mpi.nl/qfs1/media-archive/mirrored_corpora/childes/Corpusstructure/childes.imdi</sru:value>
     160      <sru:numberOfRecords>42</sru:numberOfRecords>
     161      <sru:displayTerm>Childes corpus</sru:displayTerm>
    153162    </sru:term>
    154163  </sru:terms>
     
    162171}}}
    163172
    164 Note that the values in the sru:value elements should be valid [http://www.clarin.eu/faq/3460 MdSelfLink]. These MdSelfLinks should also be available from within the matching CMDI metadata file (via a reference in the Header section - see also below under "Restricting the search").
    165 
    166 '''Note:''' Giving an answer to the scan operation with the query fcs.resource is obligatory. Even if there is only 1 collection available, in that case the endpoint returns only one term.
    167 
    168 Additionally it is possible (but not obligatory) to perform extra Scan operations to retrieve subcollections, as in a [http://en.wikipedia.org/wiki/Tree_traversal tree traversal] algorithm.
    169 
    170 E.g. to find out the subcollections of the CGN-Corpus in the example above one would perform the following scan operation: http://clarin_srucql_endpoint?operation=Scan&version=1.2&scanClause=fcs.resource=hdl:1839/00-0000-0000-0001-53A5-2
    171 
    172 {{{#!xml
    173 <sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/" >
     173Note that the values in the {{{<sru:value>}}} elements should be valid [http://www.clarin.eu/faq/3460 MdSelfLink]. These !MdSelfLinks should also be available from within the matching CMDI metadata file (via a reference in the Header section - see also below under "Restricting the search").
     174
     175To find out about sub-collections of the CGN-Corpus from the preceding example, the agent would perform a scan request with a {{{scanClause}}} parameter of {{{fcs.resource = hdl:1839/00-0000-0000-0001-53A5-2}}} and the result could look like the following:
     176{{{#!xml
     177<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/">
    174178<sru:version>1.2</sru:version>
    175179  <sru:terms> 
    176180    <sru:term>
    177           <sru:value>hdl:1839/00-0000-0000-0003-467E-9</sru:value>
    178           <sru:numberOfRecords>300</sru:numberOfRecords>
    179           <sru:displayTerm>Annotation types</sru:displayTerm>
    180     </sru:term>
    181     <sru:term>
    182           <sru:value>hdl:1839/00-0000-0000-0003-4682-F</sru:value>
    183           <sru:numberOfRecords>400</sru:numberOfRecords>
    184           <sru:displayTerm>Components</sru:displayTerm>
    185     </sru:term>
    186     <sru:term>
    187           <sru:value>hdl:1839/00-0000-0000-0003-4692-D</sru:value>
    188           <sru:numberOfRecords>350</sru:numberOfRecords>
    189           <sru:displayTerm>Regions</sru:displayTerm>
     181      <sru:value>hdl:1839/00-0000-0000-0003-467E-9</sru:value>
     182      <sru:numberOfRecords>300</sru:numberOfRecords>
     183      <sru:displayTerm>Annotation types</sru:displayTerm>
     184    </sru:term>
     185    <sru:term>
     186      <sru:value>hdl:1839/00-0000-0000-0003-4682-F</sru:value>
     187      <sru:numberOfRecords>400</sru:numberOfRecords>
     188      <sru:displayTerm>Components</sru:displayTerm>
     189    </sru:term>
     190    <sru:term>
     191      <sru:value>hdl:1839/00-0000-0000-0003-4692-D</sru:value>
     192      <sru:numberOfRecords>350</sru:numberOfRecords>
     193      <sru:displayTerm>Regions</sru:displayTerm>
    190194    </sru:term>
    191195  </sru:terms>
     
    199203}}}
    200204
    201 So in this scan response the endpoint announces that the CGN-Corpus (identified with the unique MdSelfLink hdl:1839/00-0000-0000-0001-53A5-2) has 3 sub-collections:
    202  * Annotation types (containing 300 elements, identified by the MdSelfLink hdl:1839/00-0000-0000-0003-467E-9)
    203  * Components (containing 400 elements, identified by the MdSelfLink hdl:1839/00-0000-0000-0003-4682-F)
    204  * Regions (containing 350 elements, identified by the MdSelfLink hdl:1839/00-0000-0000-0003-4692-D)
    205 
    206 These results can then later on be used to restrict a content search to one (or more) (sub)collections.
    207 
    208 === !SearchRetrieve ===
    209 
    210 The !SearchRetrieve operation is the operation that is used for searching in the resources that are provided by the endpoint. The responds provides an XML wrapper to a set of results. Here we follow the SRU standard (verison 1.2) down to the <sru:record> elements. Each <record> represents one hit of the query on the data.
    211 
    212 Within each record we use our own XML structure that matches the concept of searchable resources. Each record contains one resource. The resource has a PID. The correct resource to return here is the most precise unit of data that is directly addressable as a “whole”.  This resource might contain a resourceFragment. The resourceFragment has an offset, i.e., a resource fragment is a subset of the resource that is addressable. Using a resourceFragment is optional, but encouraged.
    213 
    214 Within both the resource and the resourceFragment there can be dataView elements. At least one kwic  dataView element is required, within the resourceFragment (if applicable), otherwise within the resource (if there is no resourceFragment). Other dataViews should be put in a place that is logical for their content (as is to be determined by the endpoint, e.g., a metadata dataView would most likely be directly under a resource. On the other hand a dataView representing some annotation layers directly around the hit is more likely to belong in the resourceFragment.)
    215 
    216 Each element (resource, resourceFragment, dataview) will contain a “ref” attribute, which points to the original data represented by the resource, fragment, or dataview as well as possible. It should always be possible to directly link to the resource itself. Worst case this will be a webpage describing a corpus or datacollection (including instruction on how to obtain it). Best case it directly links to a specific file or part of a resource in which the hit was obtained. The latter is not always possible, and when possible often constrained by licensing issues. We will strive to provide links that are as specific as possible/logical.
    217 
    218 We already define several allowed dataView formats below. Further extention in the future with more supported formats is deliberately kept open. Currently an active discussion is happening within the CLARIN-D community about which formats to use.
    219 
    220 Below we also cover a second topic related to the searchRetrieve operation, the expansion of search queries.
    221 
    222 == Return formats (DataView) ==
    223 
    224 There are several dataviews agreed upon. Each dataView will have an attribute “type”, which has as value the type of dataView contained. It is possible to, in the future, add different dataviews if required. It is mandatody to support the KWIC dataview (as this type is fairly straightforward to show as a list of results).
    225 Our KWIC dataView looks as follows:
    226 {{{#!xml
    227 <fcs:DataView type="kwic">
    228         <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
    229                 <kwic:c type="left">Some text with the </kwic:c>
    230                 <kwic:kw>keyword</kwic:kw>
    231                 <kwic:c type="right">highlighted.</kwic:c>
    232         </kwic:kwic>
     205So in this scan response the endpoint announces that the CGN-Corpus (identified with the unique !MdSelfLink hdl:1839/00-0000-0000-0001-53A5-2) has 3 sub-collections:
     206 * Annotation types (containing 300 elements, identified by the !MdSelfLink hdl:1839/00-0000-0000-0003-467E-9)
     207 * Components (containing 400 elements, identified by the !MdSelfLink hdl:1839/00-0000-0000-0003-4682-F)
     208 * Regions (containing 350 elements, identified by the !MdSelfLink hdl:1839/00-0000-0000-0003-4692-D)
     209
     210These results can then later on be used to restrict a content search to one (or more) (sub-)collections.
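
A minimal sketch of how an agent might read such a scan response and then restrict a search to one of the returned sub-collections. The sample response is shortened to the elements discussed above, and the endpoint address and function names are hypothetical. Only a single identifier is passed in {{{x-cmd-context}}} here, since how several identifiers are combined in that parameter is not described in this section.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SRU_NS = {"sru": "http://www.loc.gov/zing/srw/"}

# Shortened version of the scan response shown above.
SCAN_RESPONSE = """
<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:terms>
    <sru:term>
      <sru:value>hdl:1839/00-0000-0000-0003-467E-9</sru:value>
      <sru:numberOfRecords>300</sru:numberOfRecords>
      <sru:displayTerm>Annotation types</sru:displayTerm>
    </sru:term>
    <sru:term>
      <sru:value>hdl:1839/00-0000-0000-0003-4682-F</sru:value>
      <sru:numberOfRecords>400</sru:numberOfRecords>
      <sru:displayTerm>Components</sru:displayTerm>
    </sru:term>
    <sru:term>
      <sru:value>hdl:1839/00-0000-0000-0003-4692-D</sru:value>
      <sru:numberOfRecords>350</sru:numberOfRecords>
      <sru:displayTerm>Regions</sru:displayTerm>
    </sru:term>
  </sru:terms>
</sru:scanResponse>
"""


def parse_terms(xml_text):
    """Return (identifier, display name, record count) for each term."""
    root = ET.fromstring(xml_text.strip())
    terms = []
    for term in root.findall("sru:terms/sru:term", SRU_NS):
        value = term.findtext("sru:value", namespaces=SRU_NS)
        # displayTerm and numberOfRecords are optional, so allow None.
        name = term.findtext("sru:displayTerm", namespaces=SRU_NS)
        count = term.findtext("sru:numberOfRecords", namespaces=SRU_NS)
        terms.append((value, name, int(count) if count else None))
    return terms


def search_url(endpoint, query, context):
    """Build a searchRetrieve URL restricted to one (sub-)collection."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        "x-cmd-context": context,
    }
    return f"{endpoint}?{urlencode(params)}"


terms = parse_terms(SCAN_RESPONSE)
restricted = search_url("http://endpoint.example.org/sru", "child", terms[0][0])
print(restricted)
```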
     211
     212=== !SearchRetrieve Operation ===
     213==== Overview ====
     214The !SearchRetrieve operation is the operation that is used for searching in the resources that are provided by the endpoint. The SRU standard defines an XML structure for encoding the results down to the record level, i.e. {{{<sru:record>}}} elements. SRU allows records to be serialized in multiple formats, so-called ''record schemas''. For CLARIN-FCS, a custom record schema has been defined. The ''record schema identifier'' for this schema is {{{http://clarin.eu/fcs/1.0}}} and the appropriate XML Schema can be found at [http://www.clarin.eu/system/files/Resource.xsd http://www.clarin.eu/system/files/Resource.xsd].
     215
     216Within CLARIN-FCS, each {{{<sru:record>}}} element represents one hit within the ''resource'', which is encoded as a {{{<fcs:Resource>}}} element. Each resource shall be identified by a persistent identifier (or, less preferably, an endpoint-unique URI). The correct resource to return here is the most precise unit of data that is directly addressable as a "whole". The hit may contain a ''resource fragment'', which is encoded as a {{{<fcs:ResourceFragment>}}} element. The resource fragment shall be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using resource fragments is optional, but encouraged.
     217
     218The actual hit within a resource is encoded in a ''data view'' format and is serialized as a {{{<fcs:DataView>}}} element inside either the {{{<fcs:Resource>}}} or the {{{<fcs:ResourceFragment>}}} element. Each hit may be serialized as multiple data views; however, the keyword-in-context (KWIC) data view is mandatory. It is placed within the resource fragment (if applicable), or otherwise within the resource (if there is no reasonable resource fragment). Other data views should be put in a place that is logical for their content, as determined by the endpoint. For example, a metadata data view would most likely be put directly under a resource, whereas a data view representing some annotation layers directly around the hit more likely belongs within the resource fragment.
     219
     220Each entity (i.e. {{{<fcs:Resource>}}}, {{{<fcs:ResourceFragment>}}} or {{{<fcs:DataView>}}} element) contains a {{{ref}}} attribute, which points as precisely as possible to the original data represented by the resource, resource fragment, or data view. It should always be possible to link directly to the resource itself. In the worst case, this will be a web page describing a corpus or collection (including instructions on how to obtain it). In the best case, it links directly to the specific file or part of a resource in which the hit was found. The latter is not always possible and, when possible, is often constrained by licensing issues. Endpoints should provide links that are as specific as is possible and sensible.
     221
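As a minimal sketch of the nesting described above (not a normative serialization), the following Python snippet builds an {{{<fcs:Resource>}}} element with one KWIC data view, placed inside an {{{<fcs:ResourceFragment>}}} when a fragment reference is available and directly inside the resource otherwise. The function name {{{build_resource}}}, the {{{example.org}}} values and the choice to put only a {{{ref}}} attribute on the fragment are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

FCS_NS = "http://clarin.eu/fcs/1.0"
KWIC_NS = "http://clarin.eu/fcs/1.0/kwic"
KWIC_TYPE = "application/x-clarin-fcs-kwic+xml"

# Make the serializer emit the familiar "fcs:" and "kwic:" prefixes.
ET.register_namespace("fcs", FCS_NS)
ET.register_namespace("kwic", KWIC_NS)


def build_resource(pid, ref, left, keyword, right, fragment_ref=None):
    """Build an <fcs:Resource> element holding one KWIC hit (illustrative helper).

    If fragment_ref is given, the KWIC data view is nested in an
    <fcs:ResourceFragment>; otherwise it sits directly in the resource.
    """
    resource = ET.Element(f"{{{FCS_NS}}}Resource", {"pid": pid, "ref": ref})
    parent = resource
    if fragment_ref is not None:
        parent = ET.SubElement(
            resource, f"{{{FCS_NS}}}ResourceFragment", {"ref": fragment_ref}
        )
    view = ET.SubElement(parent, f"{{{FCS_NS}}}DataView", {"type": KWIC_TYPE})
    kwic = ET.SubElement(view, f"{{{KWIC_NS}}}kwic")
    ET.SubElement(kwic, f"{{{KWIC_NS}}}c", {"type": "left"}).text = left
    ET.SubElement(kwic, f"{{{KWIC_NS}}}kw").text = keyword
    ET.SubElement(kwic, f"{{{KWIC_NS}}}c", {"type": "right"}).text = right
    return resource


# Hypothetical identifiers, for demonstration only.
example = build_resource(
    pid="hdl:example-pid",
    ref="http://example.org/corpus/doc1",
    left="Some text with the ",
    keyword="keyword",
    right=" highlighted.",
    fragment_ref="http://example.org/corpus/doc1#s42",
)
print(ET.tostring(example, encoding="unicode"))
```

Calling the helper without {{{fragment_ref}}} yields the fallback layout, in which the KWIC data view is a direct child of the resource.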
     222
     223==== Data View and Data View formats ====
     224The data views are designed to allow for different representations of search results within CLARIN-FCS. They are deliberately kept open to allow further extensions in the future with more supported data view formats.
     225
     226The type of each data view is identified by the {{{type}}} attribute of the {{{<fcs:DataView>}}} element. The value is defined to be a [http://en.wikipedia.org/wiki/MIME_Type MIME type]. If no existing MIME type can be used, implementors are encouraged to define a proper private MIME type. The following formats are currently being considered:
     227 Keyword-In-Context (KWIC)::
     228   Description: a keyword-in-context view, where each hit should be presented within the context of a complete sentence (if possible) or any other reasonable unit of context (e.g. if sentences cannot be determined by the endpoint). The keyword-in-context data view is '''mandatory''' for all endpoints. \\
     229   Type: {{{application/x-clarin-fcs-kwic+xml}}} \\
     230   Example for a keyword-in-context data view:
     231{{{#!xml
     232<fcs:DataView type="application/x-clarin-fcs-kwic+xml">
     233  <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
     234    <kwic:c type="left">Some text with the </kwic:c>
     235    <kwic:kw>keyword</kwic:kw>
     236    <kwic:c type="right">highlighted.</kwic:c>
     237  </kwic:kwic>
    233238</fcs:DataView>
    234239}}}
    235 
    236 Another possible dataView is currently CMDI. The metadata should be applicable to the specific context, i.e., if contained inside a resource element it should apply to the entire resource, and if contained inside a resourceFragment element it should apply specifically to the resourceFragment (i.e., more specialized than the metadata for the encompassing resource).
    237 
    238 Furthermore, the geographic format KML is valid as a dataView. (see http://www.opengeospatial.org/standards/kml for a specification).
    239 
    240 Finally, we also specify the possible inclusion of three different content-container dataViews:
    241  1. Fulltext; this dataView simply contains a (reasonably sized) bunch of running text in between the dataView tags. By providing this format it allows the CLARIN centers that have the possibility of making available full-text un-annotated data to make available such data. Some examples here are the Goethe and the El-Pais corpora.
    242  2. EAF: this dataView contains the EAF format as specified elsewhere. The EAF format is the foremost format for annotating time-aligned data such as on audio or video files and it is widely used by CLARIN partners. By providing such a format researchers can study detailed information on the hits found by the search architecture.
    243  3. TCF: this dataView contains the TCF format as specified for the WEBLICHT program. By having data available in TCF (where applicable) we create a more integrated search architecture, where search results can be further processed in a weblicht toolchain.
     240  CMDI metadata::
     241    Description: a CMDI metadata record applicable to the specific context, i.e., if contained inside a resource element it should apply to the entire resource, and if contained inside a resource fragment it should apply specifically to the resource fragment (i.e., more specialized than the metadata for the encompassing resource). \\
     242    Type: {{{application/x-cmdi+xml}}} \\
     243  Geolocation (KML)::
     244    Description: A geographic location encoded in the Keyhole Markup Language. \\
     245    Type: {{{application/vnd.google-earth.kml+xml}}}
     246
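Because the {{{type}}} attribute carries a MIME type, a client can decide per data view whether it knows how to handle it. The following Python sketch illustrates this for the mandatory KWIC format; the function name {{{read_kwic}}} is hypothetical, and the {{{xmlns:fcs}}} declaration is added to the sample only so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

NS = {
    "fcs": "http://clarin.eu/fcs/1.0",
    "kwic": "http://clarin.eu/fcs/1.0/kwic",
}
KWIC_TYPE = "application/x-clarin-fcs-kwic+xml"


def read_kwic(data_view):
    """Return (left, keyword, right) for a KWIC data view, or None otherwise."""
    if data_view.get("type") != KWIC_TYPE:
        return None  # some other data view format; not handled in this sketch
    kwic = data_view.find("kwic:kwic", NS)
    if kwic is None:
        return None
    left = kwic.findtext("kwic:c[@type='left']", default="", namespaces=NS)
    keyword = kwic.findtext("kwic:kw", default="", namespaces=NS)
    right = kwic.findtext("kwic:c[@type='right']", default="", namespaces=NS)
    return left, keyword, right


sample = """\
<fcs:DataView xmlns:fcs="http://clarin.eu/fcs/1.0"
              type="application/x-clarin-fcs-kwic+xml">
  <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
    <kwic:c type="left">Some text with the </kwic:c>
    <kwic:kw>keyword</kwic:kw>
    <kwic:c type="right">highlighted.</kwic:c>
  </kwic:kwic>
</fcs:DataView>"""

print(read_kwic(ET.fromstring(sample)))
# prints ('Some text with the ', 'keyword', 'highlighted.')
```

Returning {{{None}}} for unknown types is one possible design choice: it lets a client skip data views it does not understand while still relying on the KWIC view being present.
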
     247Currently, the inclusion of three different content-container data views is being discussed:
     248 1. Fulltext: this data view simply contains a (reasonably sized) chunk of running text. This format would allow CLARIN centers to make un-annotated full-text data available. Some examples here are the Goethe and the El-Pais corpora.
     249 2. EAF: this data view contains the EAF format. The EAF format is the foremost format for annotating time-aligned data such as audio or video files, and it is widely used by CLARIN partners. By providing such a format, researchers can study detailed information on the hits found by the search architecture.
     250 3. TCF: this data view contains the TCF format as specified for the !WebLicht program. The format would allow users to further process data within !WebLicht.
     251
     252
     253'''TODO''' CONTINUE FROM HERE
    244254
    245255The following brief example shows the correct format of a searchRetrieve response with two records that include only the KWIC data view.
     
    248258{{{#!xml
    249259<?xml version="1.0" encoding="UTF-8"?>
    250 
    251260<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
    252         <sru:version>1.2</sru:version>
    253         <sru:numberOfRecords>100</sru:numberOfRecords>
    254         <sru:records>
    255                 <sru:record>
    256                         <sru:recordSchema>http://clarin.eu/fcs/1.0</sru:recordSchema>
    257                         <sru:recordPacking>xml</sru:recordPacking>
    258                         <sru:recordData>
    259                                 <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGA.00000"
    260                                         ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGA.00000">
    261                                         <fcs:DataView type="kwic">
    262                                                 <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
    263                                                         <kwic:c type="left"> und so will ich denn hier auch noch anführen, daß
    264                                                                 ich in diesem Elend das neckische Gelübde getan: man solle, wenn ich
    265                                                                 uns erlöst und mich wieder zu Hause sähe, von mir niemals wieder
    266                                                                 einen Klagelaut vernehmen über den meine freiere Zimmeraussicht
    267                                                                 beschränkenden Nachbargiebel, den ich vielmehr jetzt recht sehnlich
    268                                                                 zu erblicken wünsche; ferner wollt' ich mich über Mißbehagen und
    269                                                                 Langeweile im deutschen Theater nie wieder beklagen, wo man doch
    270                                                                 immer </kwic:c>
    271                                                         <kwic:kw>Gott</kwic:kw>
    272                                                         <kwic:c type="right"> danken könne, unter Dach zu sein, was auch auf der
    273                                                                 Bühne vorgehe. </kwic:c>
    274                                                 </kwic:kwic>
    275                                         </fcs:DataView>
    276                                 </fcs:Resource>
    277                         </sru:recordData>
    278                         <sru:recordPosition>1</sru:recordPosition>
    279                 </sru:record>
    280                 <!-- Some Records skipped -->
    281                 <sru:record>
    282                         <sru:recordSchema>http://clarin.eu/fcs/1.0</sru:recordSchema>
    283                         <sru:recordPacking>xml</sru:recordPacking>
    284                         <sru:recordData>
    285                                 <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGI.04846"
    286                                         ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGI.04846">
    287                                         <fcs:DataView type="kwic">
    288                                                 <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
    289                                                         <kwic:c type="left">der</kwic:c>
    290                                                         <kwic:kw>"Gott"</kwic:kw>
    291                                                         <kwic:c type="right">leistet mir die beste Gesellschaft.</kwic:c>
    292                                                 </kwic:kwic>
    293                                         </fcs:DataView>
    294                                 </fcs:Resource>
    295                         </sru:recordData>
    296                         <sru:recordPosition>100</sru:recordPosition>
    297                 </sru:record>
    298         </sru:records>
    299         <sru:echoedSearchRetrieveRequest>
    300                 <sru:version>1.2</sru:version>
    301                 <sru:query>Gott</sru:query>
    302                 <sru:xQuery xmlns="http://www.loc.gov/zing/cql/xcql/">
    303                         <searchClause>
    304                                 <index>cql.serverChoice</index>
    305                                 <relation>
    306                                         <value>=</value>
    307                                 </relation>
    308                                 <term>Gott</term>
    309                         </searchClause>
    310                 </sru:xQuery>
    311                 <sru:baseUrl>http://clarin.ids-mannheim.de/cosmassru</sru:baseUrl>
    312         </sru:echoedSearchRetrieveRequest>
     261  <sru:version>1.2</sru:version>
     262  <sru:numberOfRecords>100</sru:numberOfRecords>
     263  <sru:records>
     264    <sru:record>
     265      <sru:recordSchema>http://clarin.eu/fcs/1.0</sru:recordSchema>
     266      <sru:recordPacking>xml</sru:recordPacking>
     267      <sru:recordData>
     268        <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGA.00000"
     269                      ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGA.00000">
     270          <fcs:DataView type="kwic">
     271            <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
     272              <kwic:c type="left">
     273                und so will ich denn hier auch noch anführen, daß ich in diesem Elend das neckische
     274                Gelübde getan: man solle, wenn ich uns erlöst und mich wieder zu Hause sähe, von mir
     275                niemals wieder einen Klagelaut vernehmen über den meine freiere Zimmeraussicht
     276                beschränkenden Nachbargiebel, den ich vielmehr jetzt recht sehnlich zu erblicken
     277                wünsche; ferner wollt' ich mich über Mißbehagen und Langeweile im deutschen Theater
     278                nie wieder beklagen, wo man doch immer
     279              </kwic:c>
     280              <kwic:kw>
     281                Gott
     282              </kwic:kw>
     283              <kwic:c type="right">
     284                danken könne, unter Dach zu sein, was auch auf der Bühne vorgehe.
     285              </kwic:c>
     286            </kwic:kwic>
     287          </fcs:DataView>
     288        </fcs:Resource>
     289      </sru:recordData>
     290      <sru:recordPosition>1</sru:recordPosition>
     291    </sru:record>
     292    <!-- some records skipped -->
     293    <sru:record>
     294      <sru:recordSchema>http://clarin.eu/fcs/1.0</sru:recordSchema>
     295      <sru:recordPacking>xml</sru:recordPacking>
     296      <sru:recordData>
     297        <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGI.04846"
     298                      ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGI.04846">
     299          <fcs:DataView type="kwic">
     300            <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
     301              <kwic:c type="left"> der </kwic:c>
     302              <kwic:kw> "Gott" </kwic:kw>
     303              <kwic:c type="right"> leistet mir die beste Gesellschaft. </kwic:c>
     304            </kwic:kwic>
     305          </fcs:DataView>
     306        </fcs:Resource>
      307      </sru:recordData>
      308      <sru:recordPosition>100</sru:recordPosition>
      309    </sru:record>
     310  </sru:records>
     311  <sru:echoedSearchRetrieveRequest>
     312    <sru:version>1.2</sru:version>
     313    <sru:query>Gott</sru:query>
     314    <sru:xQuery xmlns="http://www.loc.gov/zing/cql/xcql/">
     315      <searchClause>
     316        <index>cql.serverChoice</index>
     317        <relation>
     318          <value>=</value>
     319        </relation>
     320        <term>Gott</term>
     321      </searchClause>
     322    </sru:xQuery>
     323    <sru:baseUrl>http://clarin.ids-mannheim.de/cosmassru</sru:baseUrl>
     324  </sru:echoedSearchRetrieveRequest>
    313325</sru:searchRetrieveResponse>
    314 
    315 }}}
     326}}}
     327
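To illustrate how a client might consume such a response, the following Python sketch walks the {{{<sru:record>}}} elements and collects the KWIC parts of each hit. It is a rough outline under stated assumptions rather than a reference implementation: the names {{{iter_hits}}} and {{{squeeze}}} are hypothetical, the sample response is shortened, and the KWIC element is located by name instead of by the {{{type}}} attribute so that the sketch does not depend on which type value an endpoint sends.

```python
import xml.etree.ElementTree as ET

NS = {
    "sru": "http://www.loc.gov/zing/srw/",
    "fcs": "http://clarin.eu/fcs/1.0",
    "kwic": "http://clarin.eu/fcs/1.0/kwic",
}


def squeeze(text):
    """Collapse runs of whitespace, since pretty-printed XML pads the text."""
    return " ".join((text or "").split())


def iter_hits(response_xml):
    """Yield one dict per <sru:record> that carries a KWIC data view."""
    root = ET.fromstring(response_xml)
    for record in root.iterfind("sru:records/sru:record", NS):
        resource = record.find("sru:recordData/fcs:Resource", NS)
        if resource is None:
            continue
        # The KWIC view may sit in the resource or in a resource fragment.
        kwic = resource.find(".//kwic:kwic", NS)
        if kwic is None:
            continue
        position = record.findtext("sru:recordPosition", default="0", namespaces=NS)
        yield {
            "position": int(position),
            "pid": resource.get("pid"),
            "ref": resource.get("ref"),
            "left": squeeze(kwic.findtext("kwic:c[@type='left']", namespaces=NS)),
            "keyword": squeeze(kwic.findtext("kwic:kw", namespaces=NS)),
            "right": squeeze(kwic.findtext("kwic:c[@type='right']", namespaces=NS)),
        }


# Shortened sample modelled on the response above.
sample_response = """\
<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:numberOfRecords>100</sru:numberOfRecords>
  <sru:records>
    <sru:record>
      <sru:recordSchema>http://clarin.eu/fcs/1.0</sru:recordSchema>
      <sru:recordPacking>xml</sru:recordPacking>
      <sru:recordData>
        <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGI.04846"
                      ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGI.04846">
          <fcs:DataView type="kwic">
            <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
              <kwic:c type="left"> der </kwic:c>
              <kwic:kw> "Gott" </kwic:kw>
              <kwic:c type="right"> leistet mir die beste Gesellschaft. </kwic:c>
            </kwic:kwic>
          </fcs:DataView>
        </fcs:Resource>
      </sru:recordData>
      <sru:recordPosition>100</sru:recordPosition>
    </sru:record>
  </sru:records>
</sru:searchRetrieveResponse>"""

for hit in iter_hits(sample_response):
    print(hit["position"], hit["pid"], "|", hit["left"], "[", hit["keyword"], "]", hit["right"])
```

Whitespace is normalized here only for display; a client that needs exact character offsets would have to keep the original text untouched.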
    316328
    317329== Query Expansion ==