Version 41 (modified by Oliver Schonefeld, 10 years ago)


Warning: This information is deprecated! Please find current information at the Core Specification and supplementary Data Views Specification pages.

CLARIN Federated Content Search (CLARIN-FCS)

Authors: Herman Stehouwer, Dieter van Uytvanck, Oliver Schonefeld
Responsible: Herman Stehouwer
Purpose: Provide an overview of the CLARIN-FCS system and serve as a guide for implementation.


CLARIN Federated Content Search (CLARIN-FCS) is built upon the SRU/CQL standard. Search/Retrieval via URL (SRU) specifies a general communication protocol for searching and retrieving records, and the Contextual Query Language (CQL) specifies an extensible query language. This document specifies how CLARIN-FCS uses and extends SRU/CQL. For a proper understanding, one needs to be familiar with the official documentation of the SRU/CQL standard.

The design of the CLARIN-FCS system consists of two major parts:

  1. the communication protocol and the query language.
  2. the aggregator / user interface.

This document consists of two parts: the first part deals with the global design thoughts, whereas the second part deals with the technical specification and provides an implementation guide for endpoints.

Global Design Thoughts


The CLARIN-FCS depends on the specification and implementation of the Virtual Language Observatory (VLO), the metadata search, the Virtual Collection Registry (VCR), and CMDI.

In general each CLARIN center participating within CLARIN-FCS will provide at least the following services:

  • provide one or more resources
  • support Content-search within those resources
  • return search-hits in the agreed-upon format
  • support query-expansion (if possible)
  • support the selection of a sub-part of the offered resources to perform content-search on that sub-part
  • provide support for the sub-part selection by providing CMDI metadata at the same, reasonable, granularity

Corpus Selection

We plan to show a list of available corpora (collections) that are searchable by the aggregator. Probably we need to allow people to make a rough selection first between types of collections (this could be relevant later). Each end-point serves one or more of these corpora/resources. If the endpoint serves more than one corpus/collection it must support the domain/scope selection method (as explained on the wiki).

Furthermore, after selecting one or more corpora the user can further constrain the search by determining which tiers or tier-types he or she wants to search in. The tiers or tier-types available for search are reported by the endpoints.

In the future we see ourselves building a shared research infrastructure, where the search functionality is also usable within other programs (by directly calling endpoints) and in combination with the metadata search, VLO, and virtual collection registry. In order to enable the extension of the scope selection into a useful infrastructure several conditions need to be met:

  1. CMDI metadata records need to be available for the resources at a useful granularity.
  2. those CMDI records should contain a unique identifier.
  3. the endpoints need to understand their own, corresponding unique identifiers.


The query entry field is where the user enters the query. These queries are passed on directly to the endpoints corresponding to the selected corpora (as above). Because we use CQL, quite complex queries can be entered as well as simple keyword searches. Queries should be simple at first (just one string such as "laufen"), but we should also allow a simple syntax for specifying attributes of the string (tiers), as in the following examples: "laufen and css.pos=noun", "laufen and css.ges=stroke". This syntax still needs to be specified and should remain simple to start with. Of course, it is up to the local search engines whether and how they can resolve these into useful queries on their repositories.

Tagsets and expanding the tags

Users will want to search for specific properties encoded in annotations. These annotations, from different corpora, may store the same concept in different ways. This applies, for instance, to gesture encoding, morphological segmentation encoding, part-of-speech tags, and semantic information.

This is the issue of tag expansion. If "no-expansion" has been chosen, the query should be taken literally and no available relations should be used. If "expansion" has been chosen, relations should be used. We argue that the expansion of a tag to all locally applicable tags by making use of RELcat should happen locally at the endpoint. After all, endpoints are meant to be used programmatically by many clients, not just our demonstrator. It is the local repository that best understands how things can be extended in a useful way. Having only the relevant expansion happen at the endpoint would make using endpoints in infrastructures easier and more attractive.

Return format

We have defined the use of several return formats. These include keyword-in-context (KWIC), KML, TCF, EAF, and full-text. More information on the formats that are supported in the system can be found in the data view section of the technical design section.

Referring to an SRU endpoint from a CMDI file

This can be done with a ResourceProxy with ResourceType set to SearchService and mimetype set to application/sru+xml:

 <ResourceProxy id="d55">
   <ResourceType mimetype="application/sru+xml">SearchService</ResourceType>

For a complete example file see:

Technical Design / Implementation Guide


From a technical perspective, CLARIN-FCS consists of two major parts:

  • the aggregator: a user interface to perform queries and display search results
  • several endpoints: services provided by the CLARIN centers that wish to participate in CLARIN-FCS

The aggregator is a component that communicates with components called endpoints, which are provided as a service by all centers that wish to participate in the federated content search. Each endpoint provides access to one or more searchable resources. The content of these resources is searched with the query supplied to the endpoint. The endpoint returns results to this query in one or more representations, called data views, e.g. a keyword-in-context (KWIC) data view. The aggregator collects the responses from all the endpoints and displays them to the user. To facilitate this listing, all endpoints are required to support returning results at least in the keyword-in-context data view.


The SRU protocol supports three different operations: Explain, Scan, and SearchRetrieve. Of these, the Explain operation is the default. Parameters are given either as HTTP GET or HTTP POST arguments. If no arguments are given the Explain operation shall be performed by an endpoint.
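The default behaviour described above can be sketched as follows. This is only an illustration, not part of the specification: the function name resolve_operation is hypothetical, and treating a request that carries parameters but no operation parameter as Explain is an assumption made for the sketch.

```python
from urllib.parse import parse_qs

# The three operations named in the text.
SUPPORTED_OPERATIONS = {"explain", "scan", "searchRetrieve"}


def resolve_operation(query_string: str) -> str:
    """Return the SRU operation requested by a raw GET/POST parameter string.

    An empty parameter string selects Explain, the default operation.
    """
    params = parse_qs(query_string, keep_blank_values=True)
    if not params:
        return "explain"
    # Assumption of this sketch: a missing 'operation' also falls back to Explain.
    operation = params.get("operation", ["explain"])[0]
    if operation not in SUPPORTED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation!r}")
    return operation


if __name__ == "__main__":
    print(resolve_operation(""))
    print(resolve_operation("operation=scan&version=1.2&scanClause=fcs.resource%3Droot"))
```

Because the same parameter string is parsed regardless of transport, the helper could be used for both HTTP GET query strings and form-encoded HTTP POST bodies.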

Generally, CLARIN-FCS is built upon the SRU/CQL standard (version 1.2). The following sections describe how CLARIN-FCS is mapped to SRU/CQL and how the protocol is extended to meet the requirements for CLARIN-FCS (e.g., the return format, the manner in which we pass restrictions of the search space, the way in which we list the available corpora).

NOTE: Before starting to implement an endpoint at a center, developers are strongly encouraged to familiarize themselves with the SRU/CQL standard.

Explain Operation

This basic request serves to announce an endpoint's capabilities and should allow the client to configure itself automatically. The explain response should, ideally, provide a list of ISOcatted indexes as possible search indexes. If there is no ISOcat equivalent, the FCS-context* set is to be used. The following XML fragment shows an example of defining various indexes (formatted for brevity):

  <set identifier="" name="isocat"/>
  <set identifier="" name="fcs"/>
  <index id="?">
    <title lang="en">Part of Speech</title>
    <map>
      <name set="isocat" dcr:datcat="">partOfSpeech</name>
    </map>
  </index>
  <index id="?">
    <title lang="en">Words</title>
    <map>
      <name set="fcs" dcr:datcat="">word</name>
    </map>
  </index>
  <index id="?">
    <title lang="en">Phonetics</title>
    <map>
      <name set="fcs">phonetics</name>
    </map>
  </index>

The three indexes that are defined to be searchable on the collections/parts in this endpoint are then: isocat.partOfSpeech, fcs.word, and fcs.phonetics. An example query could for instance be fcs.word = child.
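A minimal sketch of how a client might derive these qualified index names from an explain fragment. The wrapping indexInfo element, the urn:example set identifiers, and the function name are assumptions made for illustration; a real explain response is namespaced and carries more detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical, un-namespaced fragment modelled on the example in the text.
INDEX_INFO = """
<indexInfo>
  <set identifier="urn:example:isocat" name="isocat"/>
  <set identifier="urn:example:fcs" name="fcs"/>
  <index><title lang="en">Part of Speech</title>
    <map><name set="isocat">partOfSpeech</name></map></index>
  <index><title lang="en">Words</title>
    <map><name set="fcs">word</name></map></index>
  <index><title lang="en">Phonetics</title>
    <map><name set="fcs">phonetics</name></map></index>
</indexInfo>
"""


def qualified_index_names(xml_text: str) -> list:
    """Return 'set.name' strings for every index name in the fragment."""
    root = ET.fromstring(xml_text)
    declared_sets = {s.get("name") for s in root.iter("set")}
    names = []
    for name in root.iter("name"):
        prefix = name.get("set")
        if prefix not in declared_sets:
            raise ValueError(f"index refers to undeclared set: {prefix!r}")
        names.append(f"{prefix}.{(name.text or '').strip()}")
    return names


if __name__ == "__main__":
    for qualified in qualified_index_names(INDEX_INFO):
        print(qualified)
```

The resulting names are what a client would place on the left-hand side of a CQL clause such as fcs.word = child.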

Note: The indexes given in the example are completely dynamic, i.e., they can differ from endpoint to endpoint. We have a clear TODO here to agree on a common set of required/desirable indexes!

The searchable indexes are used as a list in the "searchable tiers / tier-types" in the user interface. It is up to the center to provide a complete list of searchable tier-types (e.g., full-text, transcription, morphological segmentation, part-of-speech annotation, semantic annotation, etc.) or searchable tier-names. Selectable tier-types are clearly preferable to tier-names, e.g., "tr1, tr, trans, trans1" as a list is worse than "transcription" as a list. However, if a subdivision of the tiers into tier-types is not available, the tier-names should be provided, as the user might be familiar with the corpus and have some use for them.

Furthermore, we may assume that annotation searches do not match running text and vice-versa. Generally this seems to be a reasonable assumption. However, there will be many corner cases where this assumption does not hold, e.g. the annotation "up" for a type of stroke (in the case of gesture-annotation) is quite likely to also occur in running text. So, while we can get decent results by simply searching, having the tier-type distinction would be better.

Scan Operation

Enumerating resources

Within CLARIN-FCS, the scan operation is used to allow the calling agent (i.e. the aggregator) to enumerate all resources available for searching at a given endpoint. This is in contrast to the definition in SRU, where Scan is a way to browse a list of keywords.

The enumeration process works as follows:

  1. The agent performs an initial scan operation with a scanClause parameter of fcs.resource = root.
  2. The endpoint will return a list of <sru:term> elements (inside <sru:terms>), which denote searchable collections. The element <sru:value> of each <sru:term> element is used as an identifier for a collection at an endpoint. It must be a valid MdSelfLink (which is defined as a URI). This identifier can then be used to restrict the search by passing one (or more) of them in the x-cmd-context parameter of the SearchRetrieve operation. The optional <sru:numberOfRecords> element of the <sru:term> element may contain the (approximate) number of searchable resources within a collection; the optional element <sru:displayTerm> should contain the human-readable name of the collection.
  3. To check for sub-collections, the agent performs a tree traversal. More specifically, an agent substitutes the identifier of a collection for the value root in the scanClause parameter and performs another scan operation. The result is analogous to the one from the initial scan. If this collection has no sub-resources, the endpoint must return no terms at all.

Note: Providing a sufficient answer to the scan operation with the initial query of fcs.resource = root is mandatory, even if the endpoint only supports a single collection. In that case the endpoint must return only one <sru:term> within the <sru:terms>.
Note: The value root used in the initial scan request is a reserved value and therefore cannot be used to identify any real resource at the endpoint (and it is not a URI anyway).

Consider the following example of collections and sub-collections (with invented Handles for collection identifiers):

+-"collection 1", hdl:4711/1
| +-"collection 1.1", hdl:4711/4
| +-"collection 1.2", hdl:4711/5
+-"collection 2", hdl:4711/2
+-"collection 3", hdl:4711/3
  +-"collection 3.1", hdl:4711/6

The enumeration process will be performed as follows:

  • request for fcs.resource = root returns terms for "collection 1", "collection 2" and "collection 3"
  • request for fcs.resource = hdl:4711/1 (= "collection 1") returns terms for "collection 1.1" and "collection 1.2"
  • request for fcs.resource = hdl:4711/4 (= "collection 1.1") returns no terms
  • request for fcs.resource = hdl:4711/5 (= "collection 1.2") returns no terms
  • request for fcs.resource = hdl:4711/2 (= "collection 2") returns no terms
  • request for fcs.resource = hdl:4711/3 (= "collection 3") returns terms for "collection 3.1"
  • request for fcs.resource = hdl:4711/6 (= "collection 3.1") returns no terms
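The traversal listed above can be sketched as a small recursive routine. This is an illustration under stated assumptions: fake_scan stands in for a real scan request and simply looks up the invented example tree, and the names enumerate_resources and EXAMPLE_TREE are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Term = Tuple[str, str]  # (collection identifier, display name)
ROOT = "root"  # reserved value used in the initial scan request

# The invented example tree from the text, keyed by the scanned identifier.
EXAMPLE_TREE: Dict[str, List[Term]] = {
    ROOT: [("hdl:4711/1", "collection 1"),
           ("hdl:4711/2", "collection 2"),
           ("hdl:4711/3", "collection 3")],
    "hdl:4711/1": [("hdl:4711/4", "collection 1.1"),
                   ("hdl:4711/5", "collection 1.2")],
    "hdl:4711/3": [("hdl:4711/6", "collection 3.1")],
}


def fake_scan(resource_id: str) -> List[Term]:
    """Stand-in for a scan with scanClause 'fcs.resource = <resource_id>'."""
    return EXAMPLE_TREE.get(resource_id, [])


def enumerate_resources(scan: Callable[[str], List[Term]],
                        resource_id: str = ROOT,
                        depth: int = 0,
                        seen: Optional[Set[str]] = None) -> List[Tuple[int, str, str]]:
    """Depth-first traversal returning (depth, identifier, name) tuples."""
    seen = set() if seen is None else seen
    found = []
    for identifier, name in scan(resource_id):
        if identifier in seen:  # guard against endpoints that report cycles
            continue
        seen.add(identifier)
        found.append((depth, identifier, name))
        found.extend(enumerate_resources(scan, identifier, depth + 1, seen))
    return found


if __name__ == "__main__":
    for depth, identifier, name in enumerate_resources(fake_scan):
        print("  " * depth + f"{name} ({identifier})")
```

A collection for which the scan returns no terms ends the recursion on that branch, which mirrors the "returns no terms" requests in the list.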

A real-world example of a response to a scan request with a scanClause parameter of fcs.resource = root could look like the following (formatted for brevity):

<sru:scanResponse xmlns:sru=""> 
      <sru:displayTerm>The CGN-Corpus (Corpus Gesproken Nederlands)</sru:displayTerm> 
      <sru:displayTerm>Childes corpus</sru:displayTerm> 

Note that the values in the <sru:value> elements should be valid MdSelfLinks. These MdSelfLinks should also be available from within the matching CMDI metadata file (via a reference in the Header section; see also below under "Restricting the search").

To find out about sub-collections of the CGN-Corpus from the preceding example, the agent would perform a scan request with a scanClause parameter of fcs.resource = hdl:1839/00-0000-0000-0001-53A5-2 and the result could look like the following (formatted for brevity):

<sru:scanResponse xmlns:sru=""> 
      <sru:displayTerm>Annotation types</sru:displayTerm> 

So in this scan response the endpoint announces that the CGN-Corpus (identified with the unique MdSelfLink hdl:1839/00-0000-0000-0001-53A5-2) has 3 sub-collections:

  • Annotation types (containing 300 elements, identified by the MdSelfLink hdl:1839/00-0000-0000-0003-467E-9)
  • Components (containing 400 elements, identified by the MdSelfLink hdl:1839/00-0000-0000-0003-4682-F)
  • Regions (containing 350 elements, identified by the MdSelfLink hdl:1839/00-0000-0000-0003-4692-D)

These results can then later on be used to restrict the content search to one (or more) (sub-)collections.
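A minimal sketch of how an agent might read such a scan response. The sample document is reconstructed from the three sub-collections listed above; the namespace URI is the one commonly used for SRU 1.2 responses and is an assumption here, as are the function and key names.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace commonly used by SRU 1.2 responses.
SRU_NS = {"sru": "http://www.loc.gov/zing/srw/"}

SAMPLE_SCAN_RESPONSE = """<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:terms>
    <sru:term>
      <sru:value>hdl:1839/00-0000-0000-0003-467E-9</sru:value>
      <sru:numberOfRecords>300</sru:numberOfRecords>
      <sru:displayTerm>Annotation types</sru:displayTerm>
    </sru:term>
    <sru:term>
      <sru:value>hdl:1839/00-0000-0000-0003-4682-F</sru:value>
      <sru:numberOfRecords>400</sru:numberOfRecords>
      <sru:displayTerm>Components</sru:displayTerm>
    </sru:term>
    <sru:term>
      <sru:value>hdl:1839/00-0000-0000-0003-4692-D</sru:value>
      <sru:numberOfRecords>350</sru:numberOfRecords>
      <sru:displayTerm>Regions</sru:displayTerm>
    </sru:term>
  </sru:terms>
</sru:scanResponse>"""


def parse_scan_terms(xml_text: str) -> list:
    """Return one dict per <sru:term>: identifier, optional count and name."""
    root = ET.fromstring(xml_text)
    terms = []
    for term in root.findall("./sru:terms/sru:term", SRU_NS):
        count = term.findtext("sru:numberOfRecords", namespaces=SRU_NS)
        terms.append({
            "id": term.findtext("sru:value", namespaces=SRU_NS),
            # numberOfRecords and displayTerm are optional in the response.
            "count": int(count) if count is not None else None,
            "name": term.findtext("sru:displayTerm", namespaces=SRU_NS),
        })
    return terms


if __name__ == "__main__":
    for entry in parse_scan_terms(SAMPLE_SCAN_RESPONSE):
        print(entry["name"], entry["count"], entry["id"])
```

The id values are what an agent would later pass in the x-cmd-context parameter to restrict a search.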

Extended Resource Information

Re-using the elements that SRU provides in an <sru:term> does not allow more sophisticated information about a resource to be conveyed, so a light-weight extension has been defined. Its sole purpose is to provide a more extended set of information about a resource that the aggregator can leverage to provide better information to the end user.

If an agent performs a scan request with the additional parameter x-cmd-resource-info set to the value true, a CLARIN-FCS conforming endpoint should put a <ResourceInfo> element into the <sru:extraTermData> element inside the <sru:term> element of the scan response. The appropriate XML schema for the <ResourceInfo> can be found at Scan-Resource-Info.xsd (download). The following XML fragment shows an example for <ResourceInfo>, which is supposed to be put inside the <sru:extraTermData> element (formatted for brevity):

<ResourceInfo xmlns="" hasSubResources="false">
    <Title xml:lang="en">Example Corpus 1</Title>
    <Title xml:lang="de">Beispiel-Korpus 1</Title>
    <Description xml:lang="en">A corpus of literature text.</Description>
    <Description xml:lang="de">Ein Korpus von Literatur-Texten.</Description>

Each <ResourceInfo> element may contain the following data:

  • one or more <Title> elements; each with a mandatory xml:lang attribute: a human readable title of the resource.
  • zero or more <Description> elements; each with a mandatory xml:lang attribute: a short (= one sentence) description of the resource. It is intended to be used, e.g., for a tooltip in the aggregator GUI.
  • zero or one <LandingPageURI>: a link to a landing page for this resource. It is intended to be used, e.g., for linking to the resource from the aggregator.
  • one mandatory <Languages> element containing one or more <Language> elements: this shall be the enumeration of all (relevant) languages within the resource. Each language is to be represented as an ISO 639-3 three-letter language code.
  • the element <ResourceInfo> may carry an optional hasSubResources attribute: this attribute has a boolean value and hints to the aggregator whether a recursive scan on this collection would yield more information about sub-resources of this resource, i.e. it should be set to true if the resource contains sub-resources. If the resource does not contain any sub-resources, this attribute can be omitted or set to false.

NOTE: An endpoint must not send this data unless explicitly asked by the agent by adding the x-cmd-resource-info parameter to the request. If an endpoint sends this data unasked, it violates the SRU specification.

NOTE: Of course, the aggregator could also analyze the CMDI records for the resources, but there is currently no generally available CLARIN component to perform this task; therefore this method is currently proposed. Endpoint implementations are free in how they store this information, e.g. centers could opt to process their own metadata to extract the necessary information, configure it statically in their endpoint implementation, or use any other mechanism to provide this information.
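A minimal sketch of how an aggregator might read a <ResourceInfo> element and check the constraints listed above. The namespace URI is a placeholder (the real one is defined by Scan-Resource-Info.xsd), the language code deu in the sample is invented, and the attribute name follows the spelling used in the example fragment.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace; the real one is defined by Scan-Resource-Info.xsd.
RI_NS = "urn:example:fcs-resource-info"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_RESOURCE_INFO = f"""<ResourceInfo xmlns="{RI_NS}" hasSubResources="false">
  <Title xml:lang="en">Example Corpus 1</Title>
  <Title xml:lang="de">Beispiel-Korpus 1</Title>
  <Description xml:lang="en">A corpus of literature text.</Description>
  <Languages><Language>deu</Language></Languages>
</ResourceInfo>"""


def _q(tag: str) -> str:
    """Qualify a local element name with the placeholder namespace."""
    return f"{{{RI_NS}}}{tag}"


def parse_resource_info(xml_text: str) -> dict:
    """Extract titles, descriptions, languages and the sub-resource hint."""
    root = ET.fromstring(xml_text)
    titles = {t.get(XML_LANG): t.text for t in root.findall(_q("Title"))}
    descriptions = {d.get(XML_LANG): d.text for d in root.findall(_q("Description"))}
    languages = [lang.text for lang in root.findall(f"{_q('Languages')}/{_q('Language')}")]
    if not titles:
        raise ValueError("at least one <Title> is required")
    if not languages:
        raise ValueError("<Languages> with at least one <Language> is required")
    return {
        "titles": titles,
        "descriptions": descriptions,
        "landing_page": root.findtext(_q("LandingPageURI")),  # optional
        "languages": languages,
        # An omitted attribute is treated like 'false'.
        "has_sub_resources": root.get("hasSubResources", "false") == "true",
    }


if __name__ == "__main__":
    print(parse_resource_info(SAMPLE_RESOURCE_INFO))
```

An aggregator could use has_sub_resources to decide whether a further recursive scan on the collection is worthwhile.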

SearchRetrieve Operation


The SearchRetrieve operation is the operation that is used for searching in the resources that are provided by the endpoint. The SRU standard defines an XML structure for encoding the results down to the record level, i.e. <sru:record> elements. SRU allows records to be serialized in multiple formats, so-called record schemas. For CLARIN-FCS, a custom record schema has been defined. This schema has its own record schema identifier, and the appropriate XML Schema can be found at Resource.xsd (download).

In CLARIN-FCS, each <sru:record> element represents one hit within the resource, which is encoded as a <fcs:Resource> element. Each resource shall be identified by a persistent identifier (or, less preferably, an endpoint-unique URI). The correct resource to return here is the most precise unit of data that is directly addressable as a "whole". The hit may contain a resource fragment, which is encoded as a <fcs:ResourceFragment> element. The resource fragment shall be addressable within a resource, i.e. it has an offset or a resource-internal identifier. Using resource fragments is optional, but encouraged.

The actual hit within a resource is encoded in a data view format and is serialized as a <fcs:DataView> element inside either the <fcs:Resource> or <fcs:ResourceFragment> element. Each hit may be serialized as multiple data views; however, the keyword-in-context (KWIC) data view is mandatory and is placed within the resource fragment (if applicable), or otherwise within the resource (if there is no reasonable resource fragment). Other data views should be put in a place that is logical for their content (as determined by the endpoint). E.g. a metadata data view would most likely be put directly under a resource. On the other hand, a data view representing some annotation layers directly around the hit is more likely to belong within the resource fragment.
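The placement rule for the mandatory KWIC data view can be sketched as a lookup that prefers the resource fragment and falls back to the resource. This is an illustration only: both namespace URIs are placeholders, and the pid value and the keyword in the sample record are invented.

```python
import xml.etree.ElementTree as ET

# Placeholder namespaces; the real ones are defined by the CLARIN-FCS schemas.
FCS_NS = "urn:example:fcs"
KWIC_NS = "urn:example:fcs-kwic"
KWIC_TYPE = "application/x-clarin-fcs-kwic+xml"

SAMPLE_RESOURCE = f"""<fcs:Resource xmlns:fcs="{FCS_NS}" xmlns:kwic="{KWIC_NS}" pid="hdl:4711/1">
  <fcs:ResourceFragment>
    <fcs:DataView type="{KWIC_TYPE}">
      <kwic:kwic>
        <kwic:c type="left">Some text with the </kwic:c>
        <kwic:kw>keyword</kwic:kw>
        <kwic:c type="right"> highlighted.</kwic:c>
      </kwic:kwic>
    </fcs:DataView>
  </fcs:ResourceFragment>
</fcs:Resource>"""


def find_kwic_view(resource: ET.Element):
    """Return the KWIC data view, preferring the resource fragment."""
    ns = {"fcs": FCS_NS}
    for path in ("fcs:ResourceFragment/fcs:DataView", "fcs:DataView"):
        for view in resource.findall(path, ns):
            if view.get("type") == KWIC_TYPE:
                return view
    return None


def kwic_parts(view: ET.Element) -> tuple:
    """Return (left context, keyword, right context) of a KWIC data view."""
    ns = {"kwic": KWIC_NS}
    left = view.findtext("kwic:kwic/kwic:c[@type='left']", default="", namespaces=ns)
    keyword = view.findtext("kwic:kwic/kwic:kw", default="", namespaces=ns)
    right = view.findtext("kwic:kwic/kwic:c[@type='right']", default="", namespaces=ns)
    return left, keyword, right


if __name__ == "__main__":
    resource = ET.fromstring(SAMPLE_RESOURCE)
    print(kwic_parts(find_kwic_view(resource)))
```

An aggregator could treat a return value of None from find_kwic_view as a sign that the endpoint does not provide the mandatory data view for that hit.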

Each entity (i.e. <fcs:Resource>, <fcs:ResourceFragment> or <fcs:DataView> element) contains a ref attribute, which points, as well as possible, to the original data represented by the resource, resource fragment, or data view. It should always be possible to directly link to the resource itself. In the worst case, this will be a web page describing a corpus or collection (including instructions on how to obtain it). In the best case, it links directly to a specific file or part of a resource in which the hit was obtained. The latter is not always possible and, when possible, is often constrained by licensing issues. Endpoints should provide links that are as specific as possible/logical.

Data View and Data View formats

The data views are designed to allow for different representations of search results within CLARIN-FCS. They are deliberately kept open to allow further extensions in the future with more supported data view formats.

The type of each data view is identified by the type attribute of the <fcs:DataView> element. The value is defined to be a MIME type. If no existing MIME type can be used, implementors are encouraged to define an appropriate private MIME type. The following formats are currently being considered:

Keyword-In-Context (KWIC)
Description: a keyword-in-context view, where each hit should be presented within the context of a complete sentence (if possible) or any other reasonable unit of context (e.g. if sentences cannot be determined by the endpoint). The keyword-in-context data view is mandatory for all endpoints. The appropriate XML schema can be found at Resource-KWIC.xsd (download).
Type: application/x-clarin-fcs-kwic+xml
Example for a keyword-in-context data view (formatted for brevity):
<fcs:DataView type="application/x-clarin-fcs-kwic+xml">
  <kwic:kwic xmlns:kwic="">
    <kwic:c type="left">Some text with the </kwic:c>
    <kwic:c type="right">highlighted.</kwic:c>
CMDI metadata
Description: a CMDI metadata record applicable to the specific context, i.e., if contained inside a resource element it should apply to the entire resource, and if contained inside a resource fragment it should apply specifically to the resource fragment (i.e., more specialized than the metadata for the encompassing resource). The minimal XML schema for CMDI records can be found at minimal-cmdi.xsd (download). For further information refer to the Component Metadata pages.
Type: application/x-cmdi+xml
Geolocation (KML)
Description: A geographic location encoded in the Keyhole Markup Language. Please refer to the KML standard for further information and the current KML XML schema.
Type: application/

Currently, the inclusion of three different content-container data views is being discussed:

  1. Fulltext: this data view simply contains a (reasonably sized) chunk of running text. This format would allow CLARIN centers to make un-annotated full-text data available. Some examples here are the Goethe and the El-Pais corpora.
  2. EAF: this data view contains the EAF format. The EAF format is the foremost format for annotating time-aligned data such as audio or video files, and it is widely used by CLARIN partners. By providing such a format, researchers can study detailed information on the hits found by the search architecture.
  3. TCF: this data view contains the TCF format as specified for the WebLicht program. The format would allow users to further process data within WebLicht.

Restricting the search

In general, an endpoint is required to accept an unrestricted search and perform the search operation on all resources. An agent can ask an endpoint to restrict the search to a sub-set of the available resources. This is done by passing a comma-separated list of resource identifiers (MdSelfLinks obtained from a scan operation) in the x-cmd-context parameter of the SearchRetrieve request. The list can get extensive, but an agent can use HTTP POST instead of HTTP GET, which means that the parameters would be passed in the body of the HTTP request instead of appending them to the URL. This has the consequence that the practical limit for the number of MdSelfLinks and other arguments becomes a lot higher.

For example, in order to only search for the occurrence of the string "bellen" within the CGN corpus as offered by a (hypothetical) endpoint, the query could be formatted as follows:
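A minimal sketch of how an agent might assemble such a restricted request and switch from HTTP GET to HTTP POST when the identifier list grows. The endpoint URL is hypothetical, the 2000-character threshold is an arbitrary assumption, and the function names are invented for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "https://endpoint.example.org/sru"  # hypothetical endpoint URL
CGN = "hdl:1839/00-0000-0000-0001-53A5-2"      # CGN identifier used in the text


def build_search_request(endpoint, query, contexts=(), max_get_length=2000):
    """Return (method, target URL, body) for a SearchRetrieve request.

    Resource identifiers are passed as a comma-separated list in
    x-cmd-context. The length threshold is an assumption of this sketch.
    """
    params = {"operation": "searchRetrieve", "version": "1.2", "query": query}
    if contexts:
        params["x-cmd-context"] = ",".join(contexts)
    encoded = urlencode(params)
    url = f"{endpoint}?{encoded}"
    if len(url) <= max_get_length:
        return "GET", url, None
    # Long identifier lists travel in the request body instead of the URL.
    return "POST", endpoint, encoded


def requested_contexts(method, target, body):
    """Recover the identifier list from either request form."""
    raw = urlsplit(target).query if method == "GET" else body
    values = parse_qs(raw).get("x-cmd-context", [])
    return values[0].split(",") if values else []


if __name__ == "__main__":
    method, target, body = build_search_request(ENDPOINT, "bellen", [CGN])
    print(method, target)
```

Omitting the contexts argument yields an unrestricted search, which every endpoint is required to accept.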

Query Expansion

For many search tasks the problem persists that resources are differently annotated. For instance, in case of sign language data several resources use different encodings for the observed signs and gestures. Regardless, when using a search infrastructure it should be possible to locate all (and only the) relevant examples.

Next to gestures there are also many other annotation layers on which the annotation systems differ, e.g., morphological segmentation, parts-of-speech, sense disambiguation, named entity labeling, etc.

It is desirable that the endpoints contain the option to expand such annotations by having them expressed as ISOcat categories. If ISOcat is used the owner of the data can use RELcat to define expansion networks between their local categories and common categories. Using such relations ensures that there is a reasonably high accuracy of the mappings as it keeps the owner of the data and the endpoints in control of them.

For example the ISOcat category for nouns can be found at the following URL:

A real-world example of a response to a search retrieve request with a query parameter of Gott could look like the following (formatted for brevity):

<?xml version="1.0" encoding="UTF-8"?>
<sru:searchRetrieveResponse xmlns:sru="">
        <fcs:Resource xmlns:fcs="" pid="GOE/AGA.00000"
          <fcs:DataView type="kwic">
            <kwic:kwic xmlns:kwic="">
              <kwic:c type="left">
                und so will ich denn hier auch noch anführen, daß ich in diesem Elend das neckische
                Gelübde getan: man solle, wenn ich uns erlöst und mich wieder zu Hause sähe, von mir
                niemals wieder einen Klagelaut vernehmen über den meine freiere Zimmeraussicht
                beschränkenden Nachbargiebel, den ich vielmehr jetzt recht sehnlich zu erblicken
                wünsche; ferner wollt' ich mich über Mißbehagen und Langeweile im deutschen Theater
                nie wieder beklagen, wo man doch immer
              <kwic:c type="right">
                danken könne, unter Dach zu sein, was auch auf der Bühne vorgehe.
    <!-- some records skipped -->
        <fcs:Resource xmlns:fcs="" pid="GOE/AGI.04846"
          <fcs:DataView type="kwic">
            <kwic:kwic xmlns:kwic="">
              <kwic:c type="left"> der </kwic:c>
              <kwic:kw> "Gott" </kwic:kw>
              <kwic:c type="right"> leistet mir die beste Gesellschaft. </kwic:c>
    <sru:xQuery xmlns="">

SRU Aggregator

The aggregator consists of 2 parts:

  • the backend, responsible for performing all necessary communication with the endpoints
  • the frontend, a web-based GUI that provides an interface for end users to the functionality of the backend

Input for the aggregator backend

The aggregator backend is also an SRU/CQL server, however it does not function as an endpoint. Instead it distributes the incoming queries to the endpoints it knows and then aggregates the results.

The aggregator leverages two mechanisms to provide it with the list of endpoints it is supposed to talk to:

  1. It knows the endpoints by querying the CLARIN center registry.
  2. It also accepts links to CLARIN-compatible endpoints given explicitly in the x-aggregation-context parameter (see JSON example and notes below).

The aggregator provides two mechanisms to restrict the search to certain (sub-)collections:

  1. The client sends a map from endpoint URL to a list of resource identifiers (e.g. MdSelfLinks), encoded in JSON, as the x-aggregation-context parameter of the SearchRetrieve request
  2. In case the client only knows the MdSelfLink it can retrieve the endpoint URL via the ResourceProxy of the type SearchService with the mimetype application/sru+xml that can be found when resolving the MdSelfLink

The following example illustrates the JSON structure (formatted for brevity):

  "": [
  "": [
  "": [

The aggregator is asked to search the resources identified by "hdl:1839/00-0000-0000-0003-467E-9", "hdl:1839/00-0000-0000-0003-4682-F" and "hdl:1839/00-0000-0000-0003-4692-D" at endpoint "" and the resource identified by "hdl:4711/1" at endpoint "". The aggregator is asked to perform an unrestricted query at endpoint "". This is signaled to the aggregator by leaving the list of resource identifiers for the respective endpoint empty.
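The JSON structure described above, and its use as the x-aggregation-context parameter of a form-encoded POST body, can be sketched as follows. The three endpoint URLs are hypothetical stand-ins; the resource identifiers are the ones named in the text, and the function names are invented.

```python
import json
from urllib.parse import parse_qs, urlencode

# Hypothetical endpoint URLs mapped to the resource identifiers from the text.
AGGREGATION_CONTEXT = {
    "https://endpoint-a.example.org/sru": [
        "hdl:1839/00-0000-0000-0003-467E-9",
        "hdl:1839/00-0000-0000-0003-4682-F",
        "hdl:1839/00-0000-0000-0003-4692-D",
    ],
    "https://endpoint-b.example.org/sru": ["hdl:4711/1"],
    "https://endpoint-c.example.org/sru": [],  # empty list = unrestricted search
}


def build_aggregator_post_body(query: str, context: dict) -> str:
    """Form-encode a SearchRetrieve request carrying the JSON context map."""
    return urlencode({
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        "x-aggregation-context": json.dumps(context, separators=(",", ":")),
    })


def parse_aggregation_context(body: str) -> dict:
    """Decode the context map again, as the aggregator backend would."""
    raw = parse_qs(body).get("x-aggregation-context", ["{}"])[0]
    return json.loads(raw)


def unrestricted_endpoints(context: dict) -> list:
    """Endpoints whose identifier list is empty are searched without restriction."""
    return [url for url, identifiers in context.items() if not identifiers]


if __name__ == "__main__":
    body = build_aggregator_post_body("bellen", AGGREGATION_CONTEXT)
    print(len(body), "bytes")
    print(unrestricted_endpoints(parse_aggregation_context(body)))
```

Keeping the map keyed by endpoint URL lets the backend forward each identifier list as the x-cmd-context value of the request it sends to that endpoint.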

A real-world example to restrict the search at the aggregator (using HTTP POST, and searching for the string "bellen") could look like the following (formatted for brevity):


operation = searchRetrieve &
version = 1.2 &
query = bellen &
x-aggregation-context = {
  "": [

Note: The aggregator should usually only use endpoint URLs that are registered in the CenterRegistry, i.e. filter all endpoint URLs against the list of endpoints in the CenterRegistry. This is to prevent the risk of (DDoS) attacks on third-party web servers. Possibly, certain privileged users could have the option to add arbitrary URLs to the list of endpoints to be queried by the aggregator.
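A minimal sketch of the filtering step suggested in this note. The function names, the example URLs, and the simple normalization (trimming whitespace and a trailing slash) are assumptions for illustration; how the registered list is obtained from the CenterRegistry is left out.

```python
def normalize(url: str) -> str:
    """Trim whitespace and a trailing slash so trivially different URLs match."""
    return url.strip().rstrip("/")


def filter_registered(context: dict, registered: list, allow_unregistered: bool = False):
    """Split a requested context map into accepted and rejected endpoints.

    'registered' is the endpoint list known from the registry;
    'allow_unregistered' models the privileged-user exception.
    """
    known = {normalize(url) for url in registered}
    accepted, rejected = {}, []
    for url, identifiers in context.items():
        if allow_unregistered or normalize(url) in known:
            accepted[url] = list(identifiers)
        else:
            rejected.append(url)
    return accepted, rejected


if __name__ == "__main__":
    requested = {
        "https://endpoint-a.example.org/sru/": ["hdl:4711/1"],
        "https://unknown.example.net/sru": [],
    }
    registered = ["https://endpoint-a.example.org/sru"]
    print(filter_registered(requested, registered))
```

Returning the rejected URLs separately allows the aggregator to report them to the caller instead of dropping them silently.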
Note: A few thoughts about scalability: a list of about 100,000 pairs could result in a POST body size of about 5 MB, which should work with any reasonable web server and Servlet container; the most expensive operation will take place at the endpoints, i.e. correctly restricting the search given a list of collection identifiers (MdSelfLinks) and performing the search.