
Version 23 (modified by dietuyt, 12 years ago)

Added reference to the ResourceProxy SearchService type

CLARIN Federated Search

Authors: Herman Stehouwer, Dieter van Uytvanck
Responsible: Herman Stehouwer
Purpose: Provide an overview of the FCS and serve as a guide for implementation.

Overview

The design of the federated search system consists of two major parts: (1) the communication protocol and the query language, and (2) the aggregator / user interface. This document deals with both. Its first part covers the user interface and the global design thoughts; its second part covers the technical specification (communication protocol and query language).

The federated search depends on the specification and implementation of the VLO, the metadata search, the virtual collection registry, and CMDI.

In general, each participating CLARIN center will provide at least the following services:

  • Provide one or more resources
  • Support Content-search within those resources
  • Return search-hits in the agreed-upon format
  • Support query-expansion if possible
  • Support the selection of a sub-part of the offered resources, so that content-search can be performed on that sub-part only
  • Provide support for the sub-part selection by providing CMDI metadata at the same, reasonable, granularity

Global Design Thoughts / The Aggregator

Overview

The design of the federated search system consists of two major parts: (1) the communication protocol and the query language, and (2) the aggregator / user interface. This part discusses the user interface, the wishes we have for it, and how it affects or is affected by the protocol choice.

The base of the communication protocol is SRU and the base of the query language is CQL. The wiki on trac.clarin.eu talks about the protocol and the query language in detail. We remark that for a proper understanding one needs to be familiar with the official documentation of the SRU/CQL standard, which can be found at http://www.loc.gov/standards/sru/.

Corpus Selection

We plan to show a list of available corpora (collections) that are searchable by the aggregator. Probably we need to allow people to make a rough selection first between types of collections (this could be relevant later). Each end-point serves one or more of these corpora/resources. If the endpoint serves more than one corpus/collection it must support the domain/scope selection method (as explained on the clarin.eu wiki).

Furthermore, after selecting one or more corpora the user can further constrain the search by determining which tiers or tier-types he or she wants to search in. The tiers or tier-types available for search are reported by the endpoints.

In the future we see ourselves building a shared research infrastructure, where the search functionality is also usable within other programs (by directly calling endpoints) and in combination with the metadata search, VLO, and virtual collection registry. In order to enable the extension of the scope selection into a useful infrastructure, several conditions need to be met: (1) CMDI metadata records need to be available for the resources at a useful granularity; (2) those CMDI records should contain a unique identifier; (3) the endpoints need to understand their own corresponding unique identifiers.

Queries

The query entry field is where the user enters the query. These queries are passed on directly to the endpoints corresponding to the selected corpora (as above). Because we use CQL, quite complex queries can be entered as well as simple keyword searches. Initially, queries should be simple (a single string such as “laufen”), but we should also allow a simple syntax for specifying attributes of the string (tiers), as in the following examples: “laufen and css.pos=noun”, “laufen and css.ges=stroke”. This syntax still needs to be specified and should remain simple to start with. It is up to the local search engines whether and how they can resolve such queries into useful queries on their repositories.

Tagsets & Expanding the tags

Users will want to search for specific properties encoded in annotations. These annotations, from different corpora, may store the same concept in different ways. For instance in gesture encoding, morphological segmentation encoding, part-of-speech tags, or semantic information.

This is the issue of tag expansion. If “no-expansion” has been chosen, then the query should be taken literally and no available relations should be used. If “expansion” has been chosen, then relations should be used. We argue that the expansion of a tag to all locally applicable tags, making use of RELcat, should happen locally at the endpoint. After all, endpoints are meant to be used programmatically by many clients, not just our demonstrator. It is the local repository that best understands how things can be extended in a useful way. Having only the relevant expansion happen at the endpoint would make endpoints easier and more attractive to use in infrastructures.
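The two modes can be illustrated with a minimal sketch. The relation table and the local tags in it are hypothetical; a real endpoint would derive them from its RELcat relations rather than from a literal dictionary.

```python
# Hypothetical relation table: common category -> tags used in the local corpus.
# In a real endpoint these relations would come from RELcat, not a literal dict.
LOCAL_RELATIONS = {
    "noun": ["N", "NN"],
    "stroke": ["S"],
}


def expand_tag(tag, expansion):
    """Return the list of tags to search for, given the user's expansion choice."""
    if not expansion:
        # "no-expansion": take the query literally.
        return [tag]
    # "expansion": add every locally applicable related tag.
    return [tag] + LOCAL_RELATIONS.get(tag, [])


print(expand_tag("noun", expansion=False))
print(expand_tag("noun", expansion=True))
```

Because the table lives at the endpoint, a client only has to pass the tag and the expansion choice; it never needs to know the local tagset.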

Technical details aggregator

The aggregator consists of two parts:

  • a backend, responsible for all communication with the end points
  • a frontend, the web GUI that provides an end user interface to the functionality of the backend

Input for the aggregator backend

The aggregator backend is also an SRU/CQL server, however it does not function as an endpoint. Instead it distributes the incoming queries to the endpoints it knows and then aggregates the results.
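The distribute-and-aggregate behaviour can be sketched as follows. This is only an illustration: the `search` callable (which would perform the actual SRU request and parse the hits), the example endpoint URLs, and the policy of silently skipping failing endpoints are all assumptions, not part of this specification.

```python
from concurrent.futures import ThreadPoolExecutor


def aggregate(query, endpoints, search):
    """Send `query` to every endpoint via `search` and merge the hit lists."""

    def safe_search(url):
        try:
            return url, search(url, query)
        except Exception:
            # Assumed policy: a failing endpoint must not break the whole search.
            return url, []

    # Query the endpoints in parallel; pool.map keeps the input order.
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        results = list(pool.map(safe_search, endpoints))

    # Flatten into one list and remember which endpoint produced each hit.
    return [{"endpoint": url, **hit} for url, hits in results for hit in hits]


def fake_search(url, query):
    """Stand-in for a real SRU call, used only to make the sketch runnable."""
    if "broken" in url:
        raise RuntimeError("endpoint unavailable")
    return [{"keyword": query, "pid": url + "/doc1"}]


endpoints = [
    "http://endpoint-a.example.org",
    "http://broken.example.org",
    "http://endpoint-b.example.org",
]
hits = aggregate("bellen", endpoints, fake_search)
print(hits)
```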

  • How does the aggregator know the endpoints?
    • It knows the endpoints by querying the CLARIN center registry?
    • It also accepts links to CLARIN-compatible endpoints explicitly given as a parameter (see below)
  • How can the aggregator backend restrict a content search to certain (sub)collections?
    • With a list of [MdSelfLink, endpoint URL] pairs in JSON, sent as the x-aggregation-context parameter of searchRetrieve

An example of the JSON pairs:

{
    "hdl:1839/00-0000-0000-0003-467E-9": "http://cqlservlet.mpi.nl",
    "hdl:1839/00-0000-0000-0003-4682-F": "http://cqlservlet.mpi.nl",
    "hdl:1839/00-0000-0000-0003-4692-D": "http://cqlservlet.mpi.nl"
}

An example of the JSON directly above used to restrict the search at the aggregator (using HTTP POST, and searching for the string "bellen"):

POST http://aggregator.clarin.eu
operation: searchRetrieve
version: 1.2
query: bellen
x-aggregation-context: {"hdl:1839/00-0000-0000-0003-467E-9":"http://cqlservlet.mpi.nl","hdl:1839/00-0000-0000-0003-4682-F": "http://cqlservlet.mpi.nl","hdl:1839/00-0000-0000-0003-4692-D":"http://cqlservlet.mpi.nl"}
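A client could assemble such a request body along the following lines. This is a sketch under the assumption that the parameters are sent form-encoded (application/x-www-form-urlencoded); the function name is ours and nothing is actually sent over the network.

```python
import json
from urllib.parse import urlencode, parse_qs


def build_aggregator_post_body(query, context_pairs):
    """Return the form-encoded body of a searchRetrieve POST to the aggregator."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        # Compact JSON object mapping MdSelfLink -> endpoint URL.
        "x-aggregation-context": json.dumps(context_pairs, separators=(",", ":")),
    }
    return urlencode(params)


pairs = {
    "hdl:1839/00-0000-0000-0003-467E-9": "http://cqlservlet.mpi.nl",
    "hdl:1839/00-0000-0000-0003-4682-F": "http://cqlservlet.mpi.nl",
    "hdl:1839/00-0000-0000-0003-4692-D": "http://cqlservlet.mpi.nl",
}
body = build_aggregator_post_body("bellen", pairs)
print(body)
```

Encoding the JSON through `urlencode` takes care of the quotes, colons, and slashes that would otherwise be unsafe in a request body.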

Note 1: the endpoint URL needs to be registered in the CenterRegistry (to prevent the risk of DDoS attacks and privilege escalations via the aggregator).

Note 2: Scalability:

  • an example POST of 100,000 pairs could result in a request body of about 5 MB (should work)
  • the most expensive operation will take place at the end points: correctly restricting the search given a list of metadata collections
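Before an endpoint can restrict the search, the aggregator has to turn the pairs into one restriction per endpoint. A minimal sketch of that step, assuming each endpoint then receives its MdSelfLinks as a comma-separated x-context value; the second endpoint and its identifier are hypothetical.

```python
from collections import defaultdict


def group_by_endpoint(context_pairs):
    """Map each endpoint URL to the comma-separated x-context value for it."""
    grouped = defaultdict(list)
    for md_self_link, endpoint_url in context_pairs.items():
        grouped[endpoint_url].append(md_self_link)
    return {url: ",".join(links) for url, links in grouped.items()}


pairs = {
    "hdl:1839/00-0000-0000-0003-467E-9": "http://cqlservlet.mpi.nl",
    "hdl:1839/00-0000-0000-0003-4682-F": "http://cqlservlet.mpi.nl",
    # Hypothetical second endpoint and collection identifier, for illustration only.
    "hdl:0000/hypothetical-collection": "http://endpoint.example.org",
}
per_endpoint = group_by_endpoint(pairs)
for url, x_context in per_endpoint.items():
    print(url, "->", x_context)
```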

Referring to an SRU endpoint from a CMDI file

This can be done with a ResourceProxy where:

  • ResourceType = SearchService
  • mimetype = application/sru+xml
 <ResourceProxy id="d55">
   <ResourceType mimetype="application/sru+xml">SearchService</ResourceType>
   <ResourceRef>http://cqlservlet.mpi.nl/</ResourceRef>
 </ResourceProxy>

For a complete example file see: http://www.clarin.eu/cmd/example/example-cgn-sru.cmdi
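A client could discover the SRU endpoint from such a record roughly as follows. The sketch parses an un-namespaced fragment like the one shown above; real CMDI files typically declare an XML namespace, which this sketch ignores. The wrapper element and the second proxy are added here for illustration only.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment: the first proxy is the one from the text,
# the wrapper element and the second proxy are assumptions.
FRAGMENT = """
<ResourceProxyList>
  <ResourceProxy id="d55">
    <ResourceType mimetype="application/sru+xml">SearchService</ResourceType>
    <ResourceRef>http://cqlservlet.mpi.nl/</ResourceRef>
  </ResourceProxy>
  <ResourceProxy id="d56">
    <ResourceType>Resource</ResourceType>
    <ResourceRef>hdl:1839/00-0000-0000-0001-53A5-2</ResourceRef>
  </ResourceProxy>
</ResourceProxyList>
"""


def find_search_services(xml_text):
    """Return the ResourceRef of every proxy typed as an SRU SearchService."""
    root = ET.fromstring(xml_text)
    endpoints = []
    for proxy in root.iter("ResourceProxy"):
        rtype = proxy.find("ResourceType")
        ref = proxy.find("ResourceRef")
        if rtype is None or ref is None:
            continue
        is_search = (rtype.text or "").strip() == "SearchService"
        is_sru = rtype.get("mimetype") == "application/sru+xml"
        if is_search and is_sru:
            endpoints.append((ref.text or "").strip())
    return endpoints


print(find_search_services(FRAGMENT))
```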

Return format

We have defined the use of several return formats. These include keyword in context (KWIC), KML, TCF, EAF, and full text. More information on the formats that are supported in the system can be found in the technical design section.

Technical Design / Implementation Guide

The user-interface (or aggregator) is a component that communicates with all endpoints within the federated search. Each endpoint offers some searchable resources, whose content can be searched. Depending on the needs of the user and the capabilities of the endpoint, several return formats are supported. The user-interface collects the responses from all the endpoints and shows these to the user. To facilitate this listing, all endpoints will always return the results in at least the keyword-in-context format.

The base of the communication protocol is SRU and the base of the query language is CQL. The wiki on trac.clarin.eu talks about the protocol and the query language in detail. We remark that it is helpful to be familiar with the official documentation of the SRU/CQL standard, which can be found at http://www.loc.gov/standards/sru/.

SRU

The SRU protocol supports three operations: explain, scan, and searchRetrieve. Parameters are given as HTTP GET or HTTP POST arguments. The explain operation is the default: if no arguments are given, explain is performed.

We again refer to the standard at http://www.loc.gov/standards/sru/ (version 1.2) where one can study the details of the protocol. Below we will only discuss the ways in which our implementations differ from the basic protocol, as well as the rest of our agreements (e.g., the return format, the manner in which we pass restrictions of the search-space, the way in which we list the available corpora).

As a namespace for our extensions to the SRU XML we use “http://clarin.eu/fcs/1.0”. The schema that validates our extension is found at: http://www.clarin.eu/system/files/Resource.xsd

Explain

This basic request serves to announce the server's capabilities and should allow the client to configure itself automatically. The explain response should, ideally, provide a list of indexes registered in ISOcat as possible search indexes. If there is no ISOcat equivalent, the FCS context set (named fcs in the example) is to be used. We provide a telling example (as seen within the context of the explain response as defined on the SRU/CQL website):

<indexInfo>
    <set identifier="isocat.org/datcat" name="isocat"/>
    <set identifier="http://clarin.eu/fcs/1.0" name="fcs"/>

    <index id="?">
      <title lang="en">Part of Speech</title>
      <map><name set="isocat">partOfSpeech</name></map>
    </index>

    <index id="?">
      <title lang="en">Words</title>
      <map><name set="fcs">word</name></map>
    </index>

    <index id="?">
      <title lang="en">Phonetics</title>
      <map><name set="fcs">phonetics</name></map>
    </index>
</indexInfo>

The three indexes that are defined to be searchable on the collections/parts in this endpoint are then: isocat.partOfSpeech, fcs.word, and fcs.phonetics. An example query could, for instance, be fcs.word = child. Note that the indexes given are completely dynamic, i.e., they can differ from endpoint to endpoint! We have a clear TODO here to agree on a common set of required/desirable indexes!
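A client configuring itself from the explain response could derive the qualified index names from the set prefix and name declared in each map element. The sketch below works on the un-namespaced fragment from the example; a real explain response wraps this in further (namespaced) elements, which are ignored here.

```python
import xml.etree.ElementTree as ET

INDEX_INFO = """
<indexInfo>
    <set identifier="isocat.org/datcat" name="isocat"/>
    <set identifier="http://clarin.eu/fcs/1.0" name="fcs"/>
    <index id="?">
      <title lang="en">Part of Speech</title>
      <map><name set="isocat">partOfSpeech</name></map>
    </index>
    <index id="?">
      <title lang="en">Words</title>
      <map><name set="fcs">word</name></map>
    </index>
    <index id="?">
      <title lang="en">Phonetics</title>
      <map><name set="fcs">phonetics</name></map>
    </index>
</indexInfo>
"""


def searchable_indexes(xml_text):
    """Return (qualified index name, English title) pairs from an indexInfo block."""
    root = ET.fromstring(xml_text)
    indexes = []
    for index in root.iter("index"):
        name = index.find("map/name")
        if name is None:
            continue
        qualified = f"{name.get('set')}.{(name.text or '').strip()}"
        indexes.append((qualified, index.findtext("title", default="")))
    return indexes


for qualified, title in searchable_indexes(INDEX_INFO):
    print(qualified, "-", title)
```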

The searchable indexes are used as a list in the “searchable tiers / tier-types” in the user-interface. It is up to the center to provide a complete list of searchable tier-types (e.g., full-text, transcription, morphological segmentation, part-of-speech annotation, semantic annotation, etc.) or searchable tier-names. We consider it obvious that selectable tier-types are preferable to tier-names, e.g., “tr1, tr, trans, trans1” as a list is worse than “transcription” as a list. However, if a subdivision of the tiers into tier-types is not available, the tier-names should be provided, as the user might be familiar with the corpus and have some use for them.

Furthermore, we may assume that annotation searches do not match running text and vice-versa. Generally this seems to be a reasonable assumption. However, there will be many corner cases where this assumption does not hold. E.g., the annotation “up” for a type of stroke (in the case of gesture-annotation) is quite likely to also occur in running text. So, while we can get decent results by simply searching, having the tier-type distinction would be better.

Scan

NOTE: this section will be slightly changed in the near future.

We foresee the scan operation as a way of signaling to the calling program/user/aggregator which resources are available for searching at the endpoint. This is in contrast to the definition in SRU, where scan is a way to browse a list of keywords. The value of the scanClause parameter should be fcs.resource.

In response, the endpoint will return a list of terms, which are the searchable collections. Their identifiers can then be used to restrict the search by passing one (or more) of them in the x-context parameter of the searchRetrieve operation.

Again, we provide a telling example:

<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/" > 
<sru:version>1.2</sru:version> 
  <sru:terms>  
    <sru:term> 
          <sru:value>hdl:1839/00-0000-0000-0001-53A5-2</sru:value> 
          <sru:numberOfRecords>12098</sru:numberOfRecords> 
          <sru:displayTerm>The CGN-Corpus (Corpus Gesproken Nederlands)</sru:displayTerm> 
    </sru:term> 
    <sru:term> 
          <sru:value>http://corpus1.mpi.nl/qfs1/media-archive/mirrored_corpora/childes/Corpusstructure/childes.imdi</sru:value> 
          <sru:numberOfRecords>42</sru:numberOfRecords> 
          <sru:displayTerm>Childes corpus</sru:displayTerm> 
    </sru:term> 
  </sru:terms> 
  <sru:echoedScanRequest> 
    <sru:version>1.2</sru:version> 
    <sru:scanClause>fcs.resource</sru:scanClause> 
    <sru:responsePosition></sru:responsePosition> 
    <sru:maximumTerms>42</sru:maximumTerms> 
  </sru:echoedScanRequest> 
</sru:scanResponse>

Note that the values in the sru:value elements should be valid MdSelfLinks. These MdSelfLinks should also be available from within the matching CMDI metadata file (via a reference in the Header section; see also below under "Restricting the search").

Note: Answering the scan operation with the scanClause fcs.resource is obligatory, even if only one collection is available; in that case the endpoint returns a single term.
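A client could read the list of collections from such a scan response as sketched below. The sample is an abbreviated copy of the response shown above; the function and field names are our own.

```python
import xml.etree.ElementTree as ET

SRU_NS = {"sru": "http://www.loc.gov/zing/srw/"}

SCAN_RESPONSE = """
<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:terms>
    <sru:term>
      <sru:value>hdl:1839/00-0000-0000-0001-53A5-2</sru:value>
      <sru:numberOfRecords>12098</sru:numberOfRecords>
      <sru:displayTerm>The CGN-Corpus (Corpus Gesproken Nederlands)</sru:displayTerm>
    </sru:term>
    <sru:term>
      <sru:value>http://corpus1.mpi.nl/qfs1/media-archive/mirrored_corpora/childes/Corpusstructure/childes.imdi</sru:value>
      <sru:numberOfRecords>42</sru:numberOfRecords>
      <sru:displayTerm>Childes corpus</sru:displayTerm>
    </sru:term>
  </sru:terms>
</sru:scanResponse>
"""


def parse_scan_terms(xml_text):
    """Return one dict per searchable collection listed in a scan response."""
    root = ET.fromstring(xml_text)
    collections = []
    for term in root.findall("sru:terms/sru:term", SRU_NS):
        collections.append({
            # The MdSelfLink that identifies the collection.
            "id": term.findtext("sru:value", default="", namespaces=SRU_NS).strip(),
            "label": term.findtext("sru:displayTerm", default="", namespaces=SRU_NS).strip(),
            "count": int(term.findtext("sru:numberOfRecords", default="0", namespaces=SRU_NS)),
        })
    return collections


for collection in parse_scan_terms(SCAN_RESPONSE):
    print(collection["label"], collection["count"], collection["id"])
```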

Additionally it is possible (but not obligatory) to perform extra Scan operations to retrieve subcollections, as in a tree traversal algorithm.

E.g., to find out the subcollections of the CGN-Corpus in the example above, one would perform the following scan operation: http://clarin_srucql_endpoint?operation=scan&version=1.2&scanClause=fcs.resource=hdl:1839/00-0000-0000-0001-53A5-2

<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/" > 
<sru:version>1.2</sru:version> 
  <sru:terms>  
    <sru:term> 
          <sru:value>hdl:1839/00-0000-0000-0003-467E-9</sru:value> 
          <sru:numberOfRecords>300</sru:numberOfRecords> 
          <sru:displayTerm>Annotation types</sru:displayTerm> 
    </sru:term> 
    <sru:term> 
          <sru:value>hdl:1839/00-0000-0000-0003-4682-F</sru:value> 
          <sru:numberOfRecords>400</sru:numberOfRecords> 
          <sru:displayTerm>Components</sru:displayTerm> 
    </sru:term> 
    <sru:term> 
          <sru:value>hdl:1839/00-0000-0000-0003-4692-D</sru:value> 
          <sru:numberOfRecords>350</sru:numberOfRecords> 
          <sru:displayTerm>Regions</sru:displayTerm> 
    </sru:term> 
  </sru:terms> 
  <sru:echoedScanRequest> 
    <sru:version>1.2</sru:version> 
    <sru:scanClause>fcs.resource=hdl:1839/00-0000-0000-0001-53A5-2</sru:scanClause> 
    <sru:responsePosition></sru:responsePosition> 
    <sru:maximumTerms>42</sru:maximumTerms> 
  </sru:echoedScanRequest> 
</sru:scanResponse>

So in this scan response the endpoint announces that the CGN-Corpus (identified with the unique MdSelfLink hdl:1839/00-0000-0000-0001-53A5-2) has 3 subcollections:

  • Annotation types (containing 300 elements, identified by the MdSelfLink hdl:1839/00-0000-0000-0003-467E-9)
  • Components (containing 400 elements, identified by the MdSelfLink hdl:1839/00-0000-0000-0003-4682-F)
  • Regions (containing 350 elements, identified by the MdSelfLink hdl:1839/00-0000-0000-0003-4692-D)

These results can then later on be used to restrict a content search to one (or more) (sub)collections.
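The tree traversal mentioned above can be sketched as a depth-first walk. Several things here are assumptions: the `fetch_terms` callable (which would perform the HTTP request and parse the scan response), the lowercase operation name `scan` used in the URL, the depth limit, and the idea that a collection without subcollections yields an empty term list.

```python
from urllib.parse import urlencode


def scan_url(endpoint, parent=None):
    """Build the scan URL for the top level or for the children of `parent`."""
    clause = "fcs.resource" if parent is None else f"fcs.resource={parent}"
    params = {"operation": "scan", "version": "1.2", "scanClause": clause}
    return f"{endpoint}?{urlencode(params)}"


def walk_collections(endpoint, fetch_terms, parent=None, depth=0, max_depth=3):
    """Yield (depth, id, label) for every (sub)collection, depth-first."""
    if depth > max_depth:
        return
    for term in fetch_terms(scan_url(endpoint, parent)):
        yield depth, term["id"], term["label"]
        yield from walk_collections(endpoint, fetch_terms, term["id"], depth + 1, max_depth)


# In-memory stand-in for real scan requests, mirroring the examples above.
ENDPOINT = "http://clarin_srucql_endpoint"
CGN = "hdl:1839/00-0000-0000-0001-53A5-2"
FAKE_RESPONSES = {
    scan_url(ENDPOINT): [
        {"id": CGN, "label": "The CGN-Corpus (Corpus Gesproken Nederlands)"},
    ],
    scan_url(ENDPOINT, CGN): [
        {"id": "hdl:1839/00-0000-0000-0003-467E-9", "label": "Annotation types"},
        {"id": "hdl:1839/00-0000-0000-0003-4682-F", "label": "Components"},
        {"id": "hdl:1839/00-0000-0000-0003-4692-D", "label": "Regions"},
    ],
}


def fake_fetch(url):
    # Unknown URLs behave like leaf collections: no subcollections.
    return FAKE_RESPONSES.get(url, [])


tree = list(walk_collections(ENDPOINT, fake_fetch))
for depth, identifier, label in tree:
    print("  " * depth + label, identifier)
```

Note that `urlencode` percent-encodes the `=`, `:`, and `/` characters inside the scanClause value, which keeps the identifier from being confused with URL syntax.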

SearchRetrieve

The searchRetrieve operation is used for searching in the resources that are provided by the endpoint. The response provides an XML wrapper around a set of results. Here we follow the SRU standard (version 1.2) down to the <sru:record> elements. Each <sru:record> represents one hit of the query on the data.

Within each record we use our own XML structure that matches the concept of searchable resources. Each record contains one resource. The resource has a PID. The correct resource to return here is the most precise unit of data that is directly addressable as a “whole”. This resource might contain a resourceFragment. The resourceFragment has an offset, i.e., a resource fragment is a subset of the resource that is addressable. Using a resourceFragment is optional, but encouraged.

Within both the resource and the resourceFragment there can be dataView elements. At least one kwic dataView element is required: within the resourceFragment if there is one, otherwise within the resource. Other dataViews should be put in a place that is logical for their content, as determined by the endpoint. For example, a metadata dataView would most likely sit directly under a resource, whereas a dataView representing some annotation layers directly around the hit is more likely to belong in the resourceFragment.

Each element (resource, resourceFragment, dataView) will contain a “ref” attribute, which points, as precisely as possible, to the original data represented by the resource, fragment, or dataView. It should always be possible to link directly to the resource itself. In the worst case this will be a web page describing a corpus or data collection (including instructions on how to obtain it). In the best case it links directly to the specific file or part of a resource in which the hit was found. The latter is not always possible and, when possible, is often constrained by licensing issues. We will strive to provide links that are as specific as is possible and sensible.

We already define several allowed dataView formats below. Future extension with more supported formats is deliberately kept open. Currently there is an active discussion within the CLARIN-D community about which formats to use.

Below we also cover a second topic related to the searchRetrieve operation, the expansion of search queries.

Return formats (DataView)

There are several dataViews agreed upon. Each dataView has an attribute “type”, whose value names the type of dataView contained. Further dataViews can be added in the future if required. It is mandatory to support the KWIC dataView (as this type is fairly straightforward to show as a list of results). Our KWIC dataView looks as follows:

<fcs:DataView type="kwic">
        <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
                <kwic:c type="left">Some text with the </kwic:c>
                <kwic:kw>keyword</kwic:kw>
                <kwic:c type="right">highlighted.</kwic:c>
        </kwic:kwic>
</fcs:DataView>
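On the endpoint side, such a KWIC dataView could be produced as sketched below. The element and namespace names are taken from the example above; the function name is ours, and building the element with an XML library (rather than string concatenation) is a design choice so that special characters in the hit context are escaped automatically.

```python
import xml.etree.ElementTree as ET

FCS_NS = "http://clarin.eu/fcs/1.0"
KWIC_NS = "http://clarin.eu/fcs/1.0/kwic"

# Make the serializer use the same prefixes as the examples in this document.
ET.register_namespace("fcs", FCS_NS)
ET.register_namespace("kwic", KWIC_NS)


def kwic_dataview(left, keyword, right):
    """Build an fcs:DataView element of type "kwic" for one hit."""
    view = ET.Element(f"{{{FCS_NS}}}DataView", {"type": "kwic"})
    kwic = ET.SubElement(view, f"{{{KWIC_NS}}}kwic")
    left_el = ET.SubElement(kwic, f"{{{KWIC_NS}}}c", {"type": "left"})
    left_el.text = left
    kw_el = ET.SubElement(kwic, f"{{{KWIC_NS}}}kw")
    kw_el.text = keyword
    right_el = ET.SubElement(kwic, f"{{{KWIC_NS}}}c", {"type": "right"})
    right_el.text = right
    return view


view = kwic_dataview("Some text with the ", "keyword", " highlighted.")
xml_text = ET.tostring(view, encoding="unicode")
print(xml_text)
```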

Another possible dataView is currently CMDI. The metadata should be applicable to the specific context, i.e., if contained inside a resource element it should apply to the entire resource, and if contained inside a resourceFragment element it should apply specifically to the resourceFragment (i.e., be more specialized than the metadata for the encompassing resource).

Furthermore, the geographic format KML is valid as a dataView (see http://www.opengeospatial.org/standards/kml for a specification).

Finally, we also specify the possible inclusion of three different content-container dataViews:

  • Fulltext: this dataView simply contains a (reasonably sized) stretch of running text between the dataView tags. It allows CLARIN centers that are able to offer un-annotated full-text data to make such data available. Some examples are the Goethe and the El-Pais corpora.
  • EAF: this dataView contains the EAF format as specified elsewhere. EAF is the foremost format for annotating time-aligned data such as audio or video files, and it is widely used by CLARIN partners. With this format researchers can study detailed information on the hits found by the search architecture.
  • TCF: this dataView contains the TCF format as specified for WebLicht. By having data available in TCF (where applicable) we create a more integrated search architecture, where search results can be further processed in a WebLicht toolchain.

We give two clear but brief examples of the correct format of a searchRetrieve response that includes just the KWIC DataView type.

First example, based on http://clarin.ids-mannheim.de/cosmassru?operation=searchRetrieve&version=1.2&query=Gott (but edited).

<?xml version="1.0" encoding="UTF-8"?>

<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
        <sru:version>1.2</sru:version>
        <sru:numberOfRecords>100</sru:numberOfRecords>
        <sru:records>
                <sru:record>
                        <sru:recordSchema>http://clarin.eu/fcs/1.0</sru:recordSchema>
                        <sru:recordPacking>xml</sru:recordPacking>
                        <sru:recordData>
                                <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGA.00000"
                                        ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGA.00000">
                                        <fcs:DataView type="kwic">
                                                <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
                                                        <kwic:c type="left"> und so will ich denn hier auch noch anführen, daß
                                                                ich in diesem Elend das neckische Gelübde getan: man solle, wenn ich
                                                                uns erlöst und mich wieder zu Hause sähe, von mir niemals wieder
                                                                einen Klagelaut vernehmen über den meine freiere Zimmeraussicht
                                                                beschränkenden Nachbargiebel, den ich vielmehr jetzt recht sehnlich
                                                                zu erblicken wünsche; ferner wollt' ich mich über Mißbehagen und
                                                                Langeweile im deutschen Theater nie wieder beklagen, wo man doch
                                                                immer </kwic:c>
                                                        <kwic:kw>Gott</kwic:kw>
                                                        <kwic:c type="right"> danken könne, unter Dach zu sein, was auch auf der
                                                                Bühne vorgehe. </kwic:c>
                                                </kwic:kwic>
                                        </fcs:DataView>
                                </fcs:Resource>
                        </sru:recordData>
                        <sru:recordPosition>1</sru:recordPosition>
                </sru:record>
                <!-- Some Records skipped -->
                <sru:record>
                        <sru:recordSchema>http://clarin.eu/fcs/1.0</sru:recordSchema>
                        <sru:recordPacking>xml</sru:recordPacking>
                        <sru:recordData>
                                <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGI.04846"
                                        ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGI.04846">
                                        <fcs:DataView type="kwic">
                                                <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
                                                        <kwic:c type="left">der</kwic:c>
                                                        <kwic:kw>"Gott"</kwic:kw>
                                                        <kwic:c type="right">leistet mir die beste Gesellschaft.</kwic:c>
                                                </kwic:kwic>
                                        </fcs:DataView>
                                </fcs:Resource>
                        </sru:recordData>
                        <sru:recordPosition>100</sru:recordPosition>
                </sru:record>
        </sru:records>
        <sru:echoedSearchRetrieveRequest>
                <sru:version>1.2</sru:version>
                <sru:query>Gott</sru:query>
                <sru:xQuery xmlns="http://www.loc.gov/zing/cql/xcql/">
                        <searchClause>
                                <index>cql.serverChoice</index>
                                <relation>
                                        <value>=</value>
                                </relation>
                                <term>Gott</term>
                        </searchClause>
                </sru:xQuery>
                <sru:baseUrl>http://clarin.ids-mannheim.de/cosmassru</sru:baseUrl>
        </sru:echoedSearchRetrieveRequest>
</sru:searchRetrieveResponse>
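On the client side, the hits in such a response could be extracted as sketched below. The sample is an abbreviated version of the response above with a single record; the function and field names are ours. As a simplification, the sketch takes only the first KWIC dataView found inside each resource, whether it sits directly under the resource or inside a resourceFragment.

```python
import xml.etree.ElementTree as ET

NS = {
    "sru": "http://www.loc.gov/zing/srw/",
    "fcs": "http://clarin.eu/fcs/1.0",
    "kwic": "http://clarin.eu/fcs/1.0/kwic",
}

RESPONSE = """
<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:numberOfRecords>100</sru:numberOfRecords>
  <sru:records>
    <sru:record>
      <sru:recordSchema>http://clarin.eu/fcs/1.0</sru:recordSchema>
      <sru:recordPacking>xml</sru:recordPacking>
      <sru:recordData>
        <fcs:Resource xmlns:fcs="http://clarin.eu/fcs/1.0" pid="GOE/AGI.04846"
            ref="http://clarin.ids-mannheim.de/cosmassru/redirect/GOE/AGI.04846">
          <fcs:DataView type="kwic">
            <kwic:kwic xmlns:kwic="http://clarin.eu/fcs/1.0/kwic">
              <kwic:c type="left">der</kwic:c>
              <kwic:kw>"Gott"</kwic:kw>
              <kwic:c type="right">leistet mir die beste Gesellschaft.</kwic:c>
            </kwic:kwic>
          </fcs:DataView>
        </fcs:Resource>
      </sru:recordData>
      <sru:recordPosition>100</sru:recordPosition>
    </sru:record>
  </sru:records>
</sru:searchRetrieveResponse>
"""


def squeeze(text):
    """Collapse the line breaks and indentation found in pretty-printed XML."""
    return " ".join((text or "").split())


def extract_hits(xml_text):
    """Return pid, ref, and KWIC parts for every fcs:Resource in the response."""
    root = ET.fromstring(xml_text)
    hits = []
    for resource in root.iterfind(".//fcs:Resource", NS):
        kwic = resource.find(".//fcs:DataView[@type='kwic']/kwic:kwic", NS)
        if kwic is None:
            continue  # no KWIC view: nothing to list for this record
        hits.append({
            "pid": resource.get("pid"),
            "ref": resource.get("ref"),
            "left": squeeze(kwic.findtext("kwic:c[@type='left']", "", NS)),
            "keyword": squeeze(kwic.findtext("kwic:kw", "", NS)),
            "right": squeeze(kwic.findtext("kwic:c[@type='right']", "", NS)),
        })
    return hits


for hit in extract_hits(RESPONSE):
    print(hit["pid"], "|", hit["left"], "[" + hit["keyword"] + "]", hit["right"])
```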

Query Expansion

For many search tasks the problem persists that resources are differently annotated. For instance, in case of sign language data several resources use different encodings for the observed signs and gestures. Regardless, when using a search infrastructure it should be possible to locate all (and only the) relevant examples.

Next to gestures there are also many other annotation layers on which the annotation systems differ, e.g., morphological segmentation, parts-of-speech, sense disambiguation, named entity labeling.

It is desirable that the endpoints contain the option to expand such annotations by having them expressed as ISOcat categories. If we use ISOcat the owner of the data can use RELcat to define expansion networks between their local categories and common categories. Using such relations ensures that there is a reasonably high accuracy of the mappings as it keeps the owner of the data and the endpoints in control of them.

For example the ISOcat category for nouns can be found at the following url: http://www.isocat.org/datcat/DC-1333.

Restricting the search

As mentioned above, it is possible to restrict the content search to specific (sub)collections. A comma-separated list of their MdSelfLinks can be passed in the x-context parameter. We note that this list can get extensive. However, clients calling the SRU endpoint can use HTTP POST instead of HTTP GET, so that the parameters are passed in the body of the request instead of in the URL. As a consequence, the practical limit on the number of MdSelfLinks and other arguments becomes much higher.

For example, in order to search only for occurrences of the string "bellen" within the CGN corpus as offered by our hypothetical endpoint (used throughout this document), we could format the searchRetrieve query as follows:

http://clarin_sru_endpoint?version=1.2&operation=searchRetrieve&x-context=hdl:1839/00-0000-0000-0001-53A5-2&query=bellen

In short, we pass a comma-separated list of MdSelfLinks (in this case consisting of only one) in the x-context parameter to restrict the search to the (sub)corpus as described in the CMDI files that have those MdSelfLinks.
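A client could build such a restricted request as sketched below. The function name is ours, and the endpoint URL is the hypothetical one used throughout this document. Unlike the hand-written URL above, `urlencode` percent-encodes the identifier, which decodes back to the same value on the server side.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def search_url(endpoint, query, collections=()):
    """Build a searchRetrieve URL, optionally restricted to some (sub)collections."""
    params = {"version": "1.2", "operation": "searchRetrieve", "query": query}
    if collections:
        # x-context carries the MdSelfLinks as one comma-separated value.
        params["x-context"] = ",".join(collections)
    return f"{endpoint}?{urlencode(params)}"


url = search_url(
    "http://clarin_sru_endpoint",
    "bellen",
    ["hdl:1839/00-0000-0000-0001-53A5-2"],
)
print(url)

# Decoding the query string gives back the original parameter values.
print(parse_qs(urlsplit(url).query))
```

For long collection lists the same parameters would be sent in the body of an HTTP POST instead of in the URL.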