
CLARIN Federated Search

Authors: Herman Stehouwer, Dieter van Uytvanck
Responsible: Herman Stehouwer
Purpose: Provide an overview of the FCS and serve as a guide for implementation.

Overview

The design of the federated search system consists of two major parts:

  1. The communication protocol and the query language.
  2. The aggregator / user interface.

This document deals with both parts. The first part covers the user interface and the global design thoughts, whereas the second part covers the technical specification (communication protocol and query language).

The federated search depends on the specification and implementation of the VLO, the metadata search, the virtual collection registry, and CMDI.

In general, each participating CLARIN center will provide at least the following services:

  • Provide one or more resources
  • Support Content-search within those resources
  • Return search-hits in the agreed-upon format
  • Support query-expansion if possible
  • Support the selection of a sub-part of the offered resources to perform content-search on that sub-part
  • Provide support for the sub-part selection by providing CMDI metadata at the same, reasonable, granularity

Global Design Thoughts / The Aggregator

Overview

The design of the federated search system consists of two major parts: (1) the communication protocol and the query language, and (2) the aggregator / user interface. This part discusses the user interface, the requirements we have for it, and how these affect or are affected by the protocol choice.

The base of the communication protocol is SRU and the base of the query language is CQL. The wiki on trac.clarin.eu talks about the protocol and the query language in detail. We remark that for a proper understanding one needs to be familiar with the official documentation of the SRU/CQL standard, which can be found at http://www.loc.gov/standards/sru/.

Corpus Selection

We plan to show a list of available corpora (collections) that are searchable via the aggregator. We probably need to allow people to first make a rough selection between types of collections (this could become relevant later). Each endpoint serves one or more of these corpora/resources. If an endpoint serves more than one corpus/collection, it must support the domain/scope selection method (as explained on the clarin.eu wiki).

Furthermore, after selecting one or more corpora the user can further constrain the search by determining which tiers or tier-types he or she wants to search in. The tiers or tier-types available for search are reported by the endpoints.

In the future we see ourselves building a shared research infrastructure, where the search functionality is also usable within other programs (by directly calling endpoints) and in combination with the metadata search, the VLO, and the virtual collection registry. In order to enable the extension of the scope selection into a useful infrastructure, several conditions need to be met:

  1. CMDI metadata records need to be available for the resources at a useful granularity.
  2. Those CMDI records should contain a unique identifier.
  3. The endpoints need to understand their own, corresponding unique identifiers.

Queries

The query entry field is where the user enters the query. These queries are passed on directly to the endpoints corresponding to the selected corpora (as above). We note that, since we use CQL, quite complex queries can be entered as well as simple keyword searches. Queries should be simple to start with (just one string such as “laufen”), but we should also allow a simple syntax for specifying attributes of the string (tiers), as in the following examples: “laufen and ccs.pos=noun”, “laufen and ccs.ges=stroke”. This syntax still needs to be specified and should remain simple. Of course it is up to the local search engines whether and how they can resolve such queries to useful queries in their repositories.
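For illustration purposes only (this is a sketch, not part of the specification; the helper name is ours), an aggregator or client could combine a simple search term with optional tier constraints into a CQL query string as follows, using the index names from the examples above:

def build_cql(term, constraints=None):
    """Combine a search term with optional tier constraints into a CQL query.

    `constraints` maps an index name (e.g. "ccs.pos") to a value.
    Terms containing spaces are quoted, as CQL requires.
    """
    def quote(value):
        return '"%s"' % value if " " in value else value

    parts = [quote(term)]
    for index, value in (constraints or {}).items():
        parts.append("%s = %s" % (index, quote(value)))
    return " and ".join(parts)

# Reproduces the examples from the text:
print(build_cql("laufen"))                         # laufen
print(build_cql("laufen", {"ccs.pos": "noun"}))    # laufen and ccs.pos = noun
print(build_cql("laufen", {"ccs.ges": "stroke"}))  # laufen and ccs.ges = stroke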

Tagsets & Expanding the tags

Users will want to search for specific properties encoded in annotations. Annotations from different corpora may encode the same concept in different ways, for instance in gesture encoding, morphological segmentation, part-of-speech tags, or semantic information.

This is the issue of tag expansion. If “no expansion” has been chosen, the query should be taken literally and no available relations should be used. If “expansion” has been chosen, relations should be used. We argue that the expansion of a tag to all locally applicable tags, making use of RELcat, should happen locally at the endpoint. After all, endpoints are meant to be used programmatically by many clients, not just our demonstrator, and the local repository best understands how tags can be expanded in a useful way. Having the relevant expansion happen at the endpoint makes endpoints easier and more attractive to use in other infrastructures.

Return format

We have defined the use of several return formats. These include keyword-in-context, KML, TCF, EAF, and full text. More information on the formats supported in the system can be found in the technical design section.

Technical Design / Implementation Guide

The user-interface (or aggregator) is a component that communicates with all endpoints within the federated search. Each endpoint offers a set of searchable resources whose content can be searched. Depending on the needs of the user and the capabilities of the endpoint, several return formats are supported. The user-interface collects the responses from all the endpoints and shows these to the user. In order to facilitate this listing, all endpoints will always return their results at least in the keyword-in-context format.
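The following Python sketch illustrates this fan-out pattern (the endpoint URLs and helper names are placeholders of ours, not part of the specification): the same searchRetrieve request is sent to all selected endpoints in parallel and the responses are collected, skipping endpoints that fail or time out.

from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.parse
import urllib.request

# Placeholder endpoint URLs; in practice these come from the list of participating centers.
ENDPOINTS = [
    "https://centre-a.example.org/sru",
    "https://centre-b.example.org/sru",
]

def search_retrieve(endpoint, cql_query, timeout=30):
    """Send one SRU 1.2 searchRetrieve request and return the raw XML response."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "maximumRecords": "10",
    }
    url = endpoint + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

def aggregate(cql_query):
    """Query all endpoints in parallel; skip endpoints that fail or time out."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
        futures = {pool.submit(search_retrieve, ep, cql_query): ep for ep in ENDPOINTS}
        for future in as_completed(futures):
            endpoint = futures[future]
            try:
                results[endpoint] = future.result()
            except Exception as error:  # e.g. network or HTTP error
                print("skipping %s: %s" % (endpoint, error))
    return results

responses = aggregate("ccs.word = child")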

The base of the communication protocol is SRU and the base of the query language is CQL. The wiki on trac.clarin.eu talks about the protocol and the query language in detail. We remark that it is helpful to be familiar with the official documentation of the SRU/CQL standard, which can be found at http://www.loc.gov/standards/sru/.

SRU

The SRU protocol supports three different operations: Explain, Scan, and SearchRetrieve. Of these, the Explain operation is the default. Parameters are given as HTTP GET or HTTP POST arguments. If no arguments are given, the explain operation is performed.
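As an informal illustration of how the parameters select the operation (a sketch only; the endpoint URL is a placeholder), a client could build the request URLs as follows:

import urllib.parse

# Placeholder endpoint base URL.
ENDPOINT = "https://example-centre.example.org/sru"

def sru_url(params=None):
    """Build an SRU request URL; without parameters the endpoint answers with explain."""
    if not params:
        return ENDPOINT
    return ENDPOINT + "?" + urllib.parse.urlencode(params)

# Explain (also the default when no arguments are given):
explain_url = sru_url({"operation": "explain", "version": "1.2"})

# Scan, here used to list the searchable collections (see "Scan" below):
scan_url = sru_url({"operation": "scan", "version": "1.2",
                    "scanClause": "cmd.collection"})

# SearchRetrieve with a CQL query:
search_url = sru_url({"operation": "searchRetrieve", "version": "1.2",
                      "query": "ccs.word = child", "maximumRecords": "10"})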

We again refer to the standard at http://www.loc.gov/standards/sru/ (version 1.2) where one can study the details of the protocol. Below we will only discuss the ways in which our implementations differ from the basic protocol, as well as the rest of our agreements (e.g., the return format, the manner in which we pass restrictions of the search-space, the way in which we list the available corpora).

As a namespace for our extensions to the SRU XML we use “http://clarin.eu/fcs/1.0”. The schema that validates our extension can be found at: http://www.clarin.eu/system/files/Resource.xsd

Explain

This basic request serves to announce the server's capabilities and should allow the client to configure itself automatically. The explain response should, ideally, provide a list of ISOcat-based indexes as possible search indexes. If there is no ISOcat equivalent, the CCS context set is to be used. We provide a telling example (as seen within the context of the explain response as defined on the SRU/CQL website):

<indexInfo>
  <set identifier="isocat.org/datcat" name="isocat"/>
  <set identifier="clarin.eu/schema/ccs-v1.0" name="ccs"/>

  <index id="?">
    <title lang="en">Part of Speech</title>
    <map><name set="isocat">partOfSpeech</name></map>
  </index>

  <index id="?">
    <title lang="en">Words</title>
    <map><name set="ccs">word</name></map>
  </index>

  <index id="?">
    <title lang="en">Phonetics</title>
    <map><name set="ccs">phonetics</name></map>
  </index>
</indexInfo>

The three indexes that are defined to be searchable on the collections/parts in this endpoint are then: isocat.partOfSpeech, ccs.word, and ccs.phonetics. An example query could for instance be ccs.word = child.

The searchable indexes are used as a list in the “searchable tiers / tier-types” in the user-interface. It is up to the center to provide a complete list of searchable tier-types (e.g., full-text, transcription, morphological segmentation, part-of-speech annotation, semantic annotation, etc.) or searchable tier-names. We consider it obvious that selectable tier-types trump tier-names, e.g., “tr1, tr, trans, trans1” as a list is worse than “transcription” as a list. However, if a subdivision of the tiers into tier-types is not available, the tier-names should be provided, as the user might be familiar with the corpus and have some use for them.
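To show how this list can be derived, here is a sketch (ours, not normative; it assumes Python 3.8+ for the {*} namespace wildcards) that pulls the searchable indexes out of the <indexInfo> part of an explain response:

import xml.etree.ElementTree as ET

def searchable_indexes(explain_xml):
    """Extract (set, name, title) triples for every <index> in an explain response.

    Matching is namespace-agnostic because the index information is wrapped
    in SRU/ZeeRex namespaces in a full explain response.
    """
    indexes = []
    root = ET.fromstring(explain_xml)
    for index in root.findall(".//{*}index"):
        name_el = index.find("{*}map/{*}name")
        title_el = index.find("{*}title")
        if name_el is None:
            continue
        indexes.append((name_el.get("set", ""),
                        name_el.text,
                        title_el.text if title_el is not None else ""))
    return indexes

# Applied to the example above this yields:
# [("isocat", "partOfSpeech", "Part of Speech"),
#  ("ccs", "word", "Words"),
#  ("ccs", "phonetics", "Phonetics")]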

Furthermore, we may assume that annotation searches do not match running text and vice-versa. Generally this seems to be a reasonable assumption. However, there will be many corner cases where this assumption does not hold. E.g., the annotation “up” for a type of stroke (in the case of gesture-annotation) is quite likely to also occur in running text. So, while we can get decent results by simply searching, having the tier-type distinction would be better.

Scan

We foresee the scan operation as a way of signaling to the calling program/user/aggregator the resources available for searching at the endpoint. This is in contrast to the definition in SRU, where scan is a way to browse a list of keywords. The value of the scanClause parameter should be cmd.collection.

In response, the endpoint will return a list of terms, which are the searchable collections. Their identifiers can then be used to restrict the search by passing one (or more) of them in the x-cmd-context parameter of the searchRetrieve operation.

Again, we provide a telling example:

<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:terms>
    <sru:term>
      <sru:value>MPI86949#</sru:value>
      <sru:numberOfRecords>12098</sru:numberOfRecords>
      <sru:displayTerm>The CGN-Corpus (Corpus Gesproken Nederlands)</sru:displayTerm>
    </sru:term>
    <sru:term>
      <sru:value>MPI1296694#</sru:value>
      <sru:numberOfRecords>42</sru:numberOfRecords>
      <sru:displayTerm>Childes corpus</sru:displayTerm>
    </sru:term>
  </sru:terms>
  <sru:echoedScanRequest>
    <sru:version>1.2</sru:version>
    <sru:scanClause>cmd.collection</sru:scanClause>
    <sru:responsePosition></sru:responsePosition>
    <sru:maximumTerms>42</sru:maximumTerms>
  </sru:echoedScanRequest>
</sru:scanResponse>

Note that the values in the sru:value elements should be valid PIDs. These PIDs are ideally also available from within the matching CMDI metadata file (see also below under “Restricting the search”).
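For the aggregator side, a minimal sketch (the helper name is ours) of reading the list of searchable collections from such a scan response:

import xml.etree.ElementTree as ET

SRU_NS = {"sru": "http://www.loc.gov/zing/srw/"}

def collections_from_scan(scan_xml):
    """Return (pid, display name, number of records) triples from a scan response."""
    root = ET.fromstring(scan_xml)
    collections = []
    for term in root.findall(".//sru:terms/sru:term", SRU_NS):
        pid = term.findtext("sru:value", default="", namespaces=SRU_NS)
        name = term.findtext("sru:displayTerm", default="", namespaces=SRU_NS)
        count = term.findtext("sru:numberOfRecords", default="0", namespaces=SRU_NS)
        collections.append((pid, name, int(count)))
    return collections

# Applied to the example above this yields:
# [("MPI86949#", "The CGN-Corpus (Corpus Gesproken Nederlands)", 12098),
#  ("MPI1296694#", "Childes corpus", 42)]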

SearchRetrieve

The searchRetrieve operation is used for searching in the resources that are provided by the endpoint. The response provides an XML wrapper around a set of results. Here we follow the SRU standard (version 1.2) down to the <sru:record> elements. Each <sru:record> represents one hit of the query on the data.

Within each record we use our own XML structure that matches the concept of searchable resources. Each record contains one resource. The resource has a PID. The correct resource to return here is the most precise unit of data that is directly addressable as a “whole”. This resource might contain a resourceFragment. The resourceFragment has an offset, i.e., a resource fragment is a subset of the resource that is addressable. Using a resourceFragment is optional, but encouraged.

Within both the resource and the resourceFragment there can be dataView elements. At least one KWIC dataView element is required, within the resourceFragment if applicable, otherwise within the resource. Other dataViews should be put in a place that is logical for their content, as determined by the endpoint; e.g., a metadata dataView would most likely sit directly under a resource, whereas a dataView representing annotation layers directly around the hit is more likely to belong in the resourceFragment.

Each element (resource, resourceFragment, dataView) will contain a “ref” attribute, which points to the original data represented by the resource, fragment, or dataView as precisely as possible. It should always be possible to directly link to the resource itself. In the worst case this will be a webpage describing a corpus or data collection (including instructions on how to obtain it). In the best case it directly links to the specific file or part of a resource in which the hit was obtained. The latter is not always possible, and when possible it is often constrained by licensing issues. We will strive to provide links that are as specific as possible/logical.
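To make this structure more tangible for implementers of the aggregator, here is a sketch (ours, not normative) that walks the records of a searchRetrieve response and extracts, per hit, the resource reference and the mandatory KWIC dataView. Element names are matched case-insensitively on their local names, because the exact casing and namespaces are defined by the Resource.xsd schema referenced above.

import xml.etree.ElementTree as ET

def local_name(element):
    """Tag name without its namespace, lower-cased for comparison."""
    return element.tag.rsplit("}", 1)[-1].lower()

def hits_from_search_response(response_xml):
    """Return a list of {"ref": ..., "kwic": (left, keyword, right)} dicts, one per record."""
    root = ET.fromstring(response_xml)
    hits = []
    for record in root.iter():
        if local_name(record) != "record":
            continue
        hit = {"ref": None, "kwic": None}
        for element in record.iter():
            name = local_name(element)
            if name == "resource" and element.get("ref"):
                hit["ref"] = element.get("ref")
            elif name == "dataview" and element.get("type") == "kwic":
                left = keyword = right = ""
                for child in element:
                    if local_name(child) == "kw":
                        keyword = child.text or ""
                    elif local_name(child) == "c" and child.get("type") == "left":
                        left = child.text or ""
                    elif local_name(child) == "c" and child.get("type") == "right":
                        right = child.text or ""
                hit["kwic"] = (left, keyword, right)
        hits.append(hit)
    return hits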

We define several allowed dataView formats below. Further extension with more supported formats in the future is deliberately kept open. Currently an active discussion is taking place within the CLARIN-D community about which formats to use.

Below we also cover a second topic related to the searchRetrieve operation, the expansion of search queries.

Return formats (DataView)

There are several dataViews agreed upon. Each dataView will have an attribute “type”, which has as its value the type of dataView contained. It is possible to add different dataViews in the future if required. It is mandatory to support the KWIC dataView (as this type is fairly straightforward to show as a list of results). Our KWIC dataView looks as follows:

<ccs:DataView type="kwic">
  <c type="left">Some text with </c>
  <kw>keyword</kw>
  <c type="right">highlighted</c>
</ccs:DataView>

Another possible dataView is currently CMDI. The metadata should be applicable to the specific context, i.e., if contained inside a resource element it should apply to the entire resource, and if contained inside a resourceFragment element it should apply specifically to that resourceFragment (i.e., be more specialized than the metadata for the encompassing resource).

Furthermore, the geographic format KML is valid as a dataView. (see http://www.opengeospatial.org/standards/kml for a specification).

Finally, we also specify the possible inclusion of three different content-container dataViews:

  1. Full text: this dataView simply contains a (reasonably sized) stretch of running text between the dataView tags. This format allows CLARIN centers that can make full-text, un-annotated data available to do so. Some examples here are the Goethe and the El-Pais corpora.
  2. EAF: this dataView contains the EAF format as specified elsewhere. The EAF format is the foremost format for annotating time-aligned data such as audio or video files and it is widely used by CLARIN partners. By providing such a format, researchers can study detailed information on the hits found by the search architecture.
  3. TCF: this dataView contains the TCF format as specified for the WebLicht program. By having data available in TCF (where applicable) we create a more integrated search architecture, where search results can be further processed in a WebLicht toolchain.

Query Expansion

For many search tasks the problem persists that resources are differently annotated. For instance, in case of sign language data several resources use different encodings for the observed signs and gestures. Regardless, when using a search infrastructure it should be possible to locate all (and only the) relevant examples.

Next to gestures there are also many other annotation layers on which the annotation systems differ, e.g., morphological segmentation, parts-of-speech, sense disambiguation, named entity labeling.

It is desirable that endpoints offer the option to expand such annotations by having them expressed as ISOcat categories. If we use ISOcat, the owner of the data can use RELcat to define expansion networks between their local categories and common categories. Using such relations keeps the owner of the data and the endpoints in control of the mappings, which ensures a reasonably high accuracy.

For example the ISOcat category for nouns can be found at the following url: http://www.isocat.org/datcat/DC-1333.
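By way of illustration only (the mapping table, tag values, and helper names below are hypothetical; in practice such relations would be obtained from RELcat rather than hard-coded), an endpoint could expand an ISOcat-based constraint into its local tagset like this:

# Hypothetical mapping from an ISOcat data category (here: noun, DC-1333)
# to the tags used in a local tagset; in reality these relations would come
# from RELcat, not from a hard-coded table.
LOCAL_EQUIVALENTS = {
    "http://www.isocat.org/datcat/DC-1333": ["NOUN", "NN", "N"],
}

def expand_value(datcat_uri, expansion=True):
    """Return the local tags to search for; with "no-expansion" the value is taken literally."""
    if not expansion:
        return [datcat_uri]
    return LOCAL_EQUIVALENTS.get(datcat_uri, [datcat_uri])

def expand_clause(index, datcat_uri, expansion=True):
    """Build a disjunctive CQL clause over all local equivalents of the category."""
    values = expand_value(datcat_uri, expansion)
    return " or ".join('%s = "%s"' % (index, value) for value in values)

# expand_clause("isocat.partOfSpeech", "http://www.isocat.org/datcat/DC-1333")
# -> 'isocat.partOfSpeech = "NOUN" or isocat.partOfSpeech = "NN" or isocat.partOfSpeech = "N"'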

Restricting the search

It is possible to restrict the search to a list of PIDs that match resources. These are passed in the x-cmd-context parameter. We note that this list can get extensive; however, we foresee two ways to deal with this problem.

First, clients can use HTTP POST instead of HTTP GET, which means that the parameters are passed in the body of the request instead of in the URL. As a consequence, the practical limit on the number of PIDs and other arguments becomes much higher.

Second, we propose the implementation of a virtual collection registry (VCR). A PID could point to a specific virtual collection, which would result in the endpoint querying the virtual collection about the search-domain (the part of it relevant for the specific endpoint). Discussions on the VCR are active and ongoing within CLARIN-D.

However, it is clear that the VCR will have at least two types of collections: (1) intentional collections, i.e., collections which represent the will of the user for specific records such as a metadata query, and (2) explicit collections, i.e., collections which simply contain a list of PIDs.
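Returning to the first option above (passing x-cmd-context via HTTP POST), the following sketch shows what such a request could look like. The endpoint URL and PIDs are placeholders, and since the exact list syntax for multiple PIDs is not fixed here, a comma-separated list is assumed.

import urllib.parse
import urllib.request

# Placeholder endpoint and PIDs; the PIDs would normally come from a scan
# response or from a virtual collection.
ENDPOINT = "https://example-centre.example.org/sru"
SELECTED_PIDS = ["MPI86949#", "MPI1296694#"]

params = {
    "operation": "searchRetrieve",
    "version": "1.2",
    "query": "ccs.word = child",
    # Assumption: multiple PIDs are passed as a comma-separated list.
    "x-cmd-context": ",".join(SELECTED_PIDS),
}

# HTTP POST puts the (potentially long) parameter list in the request body
# instead of the URL, which avoids URL length limits.
data = urllib.parse.urlencode(params).encode("utf-8")
request = urllib.request.Request(ENDPOINT, data=data, method="POST")
with urllib.request.urlopen(request) as response:
    result_xml = response.read().decode("utf-8")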