
Version 44 (modified by vronk, 12 years ago)

--

Towards a Specification of the FCS-API for individual Content Providers

There is a more current version: FCS-specification

Target Audience: Technical Staff of Content Providers

Specification of the interface that Content Providers willing to join the federated search have to implement. It builds upon the SRU/CQL protocol, but concentrates mainly on the specific agreements made on top of the protocol.

While this page discusses the details of individual features, FCS-FeatureMatrix is a more compact yet (eventually) complete summary of the features to be implemented.

SRU basics

(We have to ensure that our specification is compatible with the established implementation and usage of the protocol.)

Context Sets

The protocol allows custom "context sets" (~ namespaces) to be defined, to which new indices, relations and operators can be bound. A number of context sets are already defined, at least some of which we should also support (e.g. dublincore).

We propose further context sets to accommodate our special needs:

isocat - isocat.org/datcat
for Data Categories defined in ISOcat Data Category Registry
fcs - clarin.eu/fcs/1.0
CLARIN Content Search - for indices on content (Annotation Tiers)
cmd - clarin.eu/cmd
CLARIN/Component Metadata - for metadata based indices.

How to integrate existing context sets that provide metadata fields, like dublincore, needs to be elaborated further, i.e. what the relation between the cmd and dc context sets would be. The problem is that CMD shall be the context set for all indices that are conceivable/usable based on the CMD profiles, and dublincore, for example, is also defined as a profile in CMD. On the other hand, dublincore is a model/format of its own, is widely used in the established federated search world (libraries, harvesting etc.), and in particular already has its own context set in SRU/CQL. So it seems unacceptable to force the repositories to recode this (dc: to cmd:).
The question seems to be whether dc.title and cmd.dc.title can be seen as equivalent. (It is clear, however, that cmd.title is not strictly equivalent but rather an (ambiguous) superset, because it would mean the title element from all profiles.)

Explain operation

This basic request serves to announce the server's capabilities and should allow the client to configure itself automatically.

The explain response should, ideally, provide a list of indexes mapped to ISOcat data categories as possible search indexes. If there is no ISOcat equivalent, the CCS context set is to be used.

Example (tentative):

 <indexInfo>
   <set identifier="isocat.org/datcat" name="isocat"/>
   <set identifier="clarin.eu/schema/ccs-v1.0" name="ccs"/>

   <index id="?">
     <title lang="en">Part of Speech</title>
     <map><name set="isocat">partOfSpeech</name></map>
   </index>

   <index id="?">
     <title lang="en">Words</title>
     <map><name set="ccs">words</name></map>
   </index>

   <index id="?">
     <title lang="en">Phonetics</title>
     <map><name set="ccs">phonetics</name></map>
   </index>

 </indexInfo>
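
How a client could configure itself from such a response can be sketched as follows. This is only an illustration under assumptions: it parses the tentative fragment above exactly as written (without an XML namespace, whereas a real explain record is wrapped in further elements), and the function name `list_indexes` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the tentative example; no XML namespace is assumed here.
INDEX_INFO = """
<indexInfo>
  <set identifier="isocat.org/datcat" name="isocat"/>
  <set identifier="clarin.eu/schema/ccs-v1.0" name="ccs"/>
  <index id="?">
    <title lang="en">Part of Speech</title>
    <map><name set="isocat">partOfSpeech</name></map>
  </index>
  <index id="?">
    <title lang="en">Words</title>
    <map><name set="ccs">words</name></map>
  </index>
  <index id="?">
    <title lang="en">Phonetics</title>
    <map><name set="ccs">phonetics</name></map>
  </index>
</indexInfo>
"""


def list_indexes(index_info_xml):
    """Return (qualified index name, title, context set identifier) tuples."""
    root = ET.fromstring(index_info_xml)
    # Map the short context set name to its identifier.
    sets = {s.get("name"): s.get("identifier") for s in root.findall("set")}
    result = []
    for index in root.findall("index"):
        title = index.findtext("title")
        for name in index.findall("map/name"):
            set_name = name.get("set")
            result.append((f"{set_name}.{name.text}", title, sets.get(set_name)))
    return result


indexes = list_indexes(INDEX_INFO)
for qualified, title, identifier in indexes:
    print(qualified, "|", title, "|", identifier)
```

The qualified names (context set name, dot, index name) are the form in which the indexes would appear in a CQL query.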

searchRetrieve operation

The operation to send the actual query.

Search result

The common ground is the <sru:searchRetrieveResponse> defined by the protocol, down to the <sru:record> wrapping element. The proposal is to continue below that with a generic structure that can encompass "all" the various types of information. (Please also consider the draft of the schema FederatedSearch/ccsResource.xsd):

 <ccs:Resource pid="{pid of the resource}">
    <!-- pid on the metadata DataView is optional -->
    <ccs:DataView type="metadata" pid="{PID of the CMD-record}" schema="">
       <!-- this is for metadata provided directly by the content provider;
            CMD-format is preferred if available, -->
       <CMD>...
       </CMD>
       <!-- but other recognized formats, like dublincore, should be accepted as well: -->
       <dc:title></dc:title>
       <dc:creator></dc:creator>

       <!-- in the worst case (if the metadata-record consists just of a key-value-list)
            a serialized form may be used: -->
       <ccs:f key="{any metadata-key}">{any metadata value}</ccs:f>
    </ccs:DataView>
    <ccs:ResourceFragment pid="{offset relative to the parent resource PID}">
       <ccs:DataView type="metadata" schema="">
          <!-- metadata pertaining to the specific (matching) fragment,
               like metadata on the "current" speaker, or matching chapter of a book -->
       </ccs:DataView>

       <ccs:DataView type="kwic">
          <c type="left">Some text with </c>
          <kw>keyword</kw>
          <c type="right">highlighted</c>
       </ccs:DataView>

       <ccs:DataView type="text/xml" schema="{meertens/schema}"><meertens:any/>
       </ccs:DataView>

       <ccs:DataView type="image/jpeg" ref="{link to the data}">
       </ccs:DataView>
    </ccs:ResourceFragment>
 </ccs:Resource>
Resource
Element representing a resource, carrying the identifier. It may represent anything that has a PID (and an MDRecord). In particular, it may also be a collection aggregating other Resources. Allowed children are: Resource, ResourceFragment and DataView
ResourceFragment
A part of a resource without its own PID, i.e. something addressable with: PID of the Resource + Fragment Identifier. The Fragment Identifier to be used depends on the resource type; it may be an XPointer, timecode, sequence-offset, etc. Allowed children are: DataView
DataView
Element carrying the typed data. Content can be anything in another namespace, and it can be either inline or referenced via the @pid/@ref-attributes.
@pid
Attribute allowed on every element, identifying the Resource/ResourceFragment/DataView.
PID should also be resolvable, so the @ref-attribute
@ref
Attribute allowed on every element, referring to the Resource/ResourceFragment/DataView via a URL.
TODO: Decide if @pid and @ref shall both be used, and if yes, then if they can be used in parallel.
@type
Attribute on DataView identifying the type of the content. Regarding the allowed values for this attribute: some basic ones are described in the next chapter (#DataViews), but there seems to be a semantic overlap with mime-types, thus the proposal to allow mime-types as values of the @type-attribute.
If the description via @type is not sufficient, the DataView-element may carry a @schema-attribute to further specify the format of the (referenced) content:
@schema
Attribute on DataView referencing the schema that describes the content of given DataView.
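
The nesting rules above (Resource may contain Resource, ResourceFragment and DataView; ResourceFragment may contain only DataView) suggest a simple recursive walk on the client side. The following is a minimal sketch, not a reference implementation. The namespace URI bound to the ccs prefix is a placeholder, since this page does not define one, and `collect_data_views` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

CCS_NS = "http://example.org/ccs"  # placeholder URI, NOT defined by this page
NS = {"ccs": CCS_NS}

RECORD = f"""
<ccs:Resource xmlns:ccs="{CCS_NS}" pid="res-1">
  <ccs:DataView type="metadata" pid="md-1"/>
  <ccs:ResourceFragment pid="frag-7">
    <ccs:DataView type="kwic">
      <c type="left">Some text with </c><kw>keyword</kw><c type="right"> highlighted</c>
    </ccs:DataView>
    <ccs:DataView type="image/jpeg" ref="http://example.org/scan.jpg"/>
  </ccs:ResourceFragment>
</ccs:Resource>
"""


def collect_data_views(element, path=()):
    """Yield (pid path, @type, 'inline' or 'referenced') for every DataView."""
    here = path + (element.get("pid"),)
    for view in element.findall("ccs:DataView", NS):
        # A DataView with child elements or text carries its content inline;
        # otherwise the content is assumed to be reachable via @pid/@ref.
        has_content = len(view) > 0 or bool((view.text or "").strip())
        yield here, view.get("type"), "inline" if has_content else "referenced"
    # Descend into nested Resources and ResourceFragments.
    for child_tag in ("ccs:Resource", "ccs:ResourceFragment"):
        for child in element.findall(child_tag, NS):
            yield from collect_data_views(child, here)


views = list(collect_data_views(ET.fromstring(RECORD)))
for pid_path, view_type, mode in views:
    print("/".join(pid_path), view_type, mode)
```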

(An alternative would be to flatten the structure, but for now, we go with the nested one. Example of a flat structure:

<sru:recordData>
   <ccs:ResourcePID>{PID of the resource}</ccs:ResourcePID>
   <ccs:ResourceFragmentPID>{identifier of the resource-fragment (relative to Resource-PID?)}</ccs:ResourceFragmentPID>
   <!-- CMDPID is optional -->
   <ccs:CMDPID>{PID of the CMD-record}</ccs:CMDPID>

   <dc:title>{title of the resource}</dc:title>
   <!-- basically any metadata-fields as is standard in SRU-world -->
   ...

   <ccs:DataView type="kwic">Some text with <kw>keyword</kw> highlighted</ccs:DataView>
   <ccs:DataView type="text/xml"><meertens:any/>
   </ccs:DataView>
   <ccs:DataView type="image/jpeg" ref="{optional link to the data}"></ccs:DataView>
</sru:recordData>

)

Data Views

Here we propose several types of DataViews. The actual format has yet to be defined for most of them, but we should reuse existing formats where possible, so we should look at existing practices and data, while avoiding overspecializing on some specific format. (The CLARIN deliverable Interoperability and Standards (2,7 MB) can be used as a starting point.) For a discussion of the relationship between data types and the corresponding "Viewers", i.e. means of displaying the information to the user, see Viewable.

title
A simple text that can serve as a representative for the given Resource or ResourceFragment. It could be the title of a book or chapter, the name of the Resource, or a lemma in a lexicon.
kwic
Keyword in context: both the keyword and the (left/right) context are wrapped in elements, to avoid mixed content.
<ccs:DataView type="kwic">
  <c type="left">Some text with </c><kw>keyword</kw> <c type="right">highlighted</c>
</ccs:DataView>         
Alternatively, every token could be wrapped in an element:
<ccs:DataView type="kwic">
    <t id="t1">Some</t> 
    <t id="t2">text</t> 
    <t id="t3" >with</t> 
    <t id="t4" kw="1">keyword</t> 
    <t id="t5">highlighted</t>
</ccs:DataView>         
This comes close to the way the text is encoded in TCF and would accordingly allow adding (stand-off) annotation layers (lemma, POS, but also syntactic annotations). This shall be a separate DataView-type.
content
A basic DataView-type, to be used when nothing more specific applies. In particular, this is to be used if only plain text can be delivered as a result, without distinguishing the keyword and the context (as required in the kwic-DataView).
 <ccs:DataView type="content">Some text with the matching keyword. However the keyword is NOT highlighted</ccs:DataView>
metadata
If there is some associated metadata (like bibliographic information about the source of the hit), this is to be encoded in a separate element:
  <ccs:DataView type="metadata">
optional (but strongly encouraged) element carrying metadata about the Resource or ResourceFragment.

The metadata can be inline or referenced via attributes @pid or @ref. This can be basically any kind of md-record, but order of preference is:

  1. CMD-record
  2. some recognized format like dublincore, OLAC
  3. a "serialized" format with flat list of <f key="{field-name}">-elements
Geographic data
A geographic location, either as coordinates or some location (street, city, place). One established format is KML
Lexicon Entry
An entry from a lexicon, dictionary or similar: a lemma with some information about it. There are well-established formats for lexical and terminological data, like Lexical Markup Framework (LMF) or Terminological Markup Framework (TMF - ISO16642).
List
A list of things. These are not primary resources, but rather derived information usually aggregations / frequency lists. This is similar to the scan-operation, or in other words: the result of a scan-operation is also such a list.
  [<key, number, link?>]
Example:
  Haus  45
  Liebe 60
This should also cover nested lists.
Matrix
A matrix containing things. Table as a special type of matrix? Multidimensional? To be defined.
Annotated Text
A bunch of annotated text. We start by supporting the TCF and EAF formats, as they have existing viewers. For an example EAF file see: sample file. For now we have annotations/eaf and annotations/tcf as types.
<ccs:DataView type="annotations/eaf" ref="http://corpus1.mpi.nl/qfs1/media-archive/demo/pewi/Annotations/elan-example1.eaf">
</ccs:DataView>
Syntax tree
A special type of annotation. There are dedicated formats for syntactic annotation (Penn Treebank, NeGra Format, SynAF). TCF can also describe syntax trees.
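
Of the DataViews listed above, kwic is the one with a concrete proposed structure, so a client-side reading of it can be sketched. The sketch below is an assumption-laden illustration: the namespace URI for the ccs prefix is a placeholder, the inner elements (c, kw, t) are taken to be in no namespace as in the examples, and `kwic_parts` is a hypothetical helper. It accepts both proposed encodings.

```python
import xml.etree.ElementTree as ET

# The ccs namespace URI below is a placeholder, not defined by this page.
CONTEXT_VARIANT = """
<ccs:DataView xmlns:ccs="http://example.org/ccs" type="kwic">
  <c type="left">Some text with </c><kw>keyword</kw> <c type="right">highlighted</c>
</ccs:DataView>
"""

TOKEN_VARIANT = """
<ccs:DataView xmlns:ccs="http://example.org/ccs" type="kwic">
  <t id="t1">Some</t>
  <t id="t2">text</t>
  <t id="t3">with</t>
  <t id="t4" kw="1">keyword</t>
  <t id="t5">highlighted</t>
</ccs:DataView>
"""


def kwic_parts(data_view_xml):
    """Return (left context, keyword, right context) for a kwic DataView."""
    view = ET.fromstring(data_view_xml)
    tokens = view.findall("t")
    if tokens:
        # Token variant: the keyword token carries kw="1".
        words = [t.text or "" for t in tokens]
        hit = next(i for i, t in enumerate(tokens) if t.get("kw") == "1")
        return " ".join(words[:hit]), words[hit], " ".join(words[hit + 1:])
    # Context variant: <c type="left">, <kw>, <c type="right">.
    left = "".join(c.text or "" for c in view.findall("c[@type='left']"))
    right = "".join(c.text or "" for c in view.findall("c[@type='right']"))
    return left.strip(), view.findtext("kw", default=""), right.strip()


context_result = kwic_parts(CONTEXT_VARIANT)
token_result = kwic_parts(TOKEN_VARIANT)
print(context_result)
print(token_result)
```

Both encodings yield the same three parts here, which is one argument for treating the token variant as a richer DataView-type rather than a replacement.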

Restricting the search by collections

Restricting the search space shall be done via x-cmd-context-parameter (obsoleting: x-cmd-collections). See more under SearchContext
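
A searchRetrieve request restricted in this way could be assembled as in the following sketch. The base URL is a placeholder, `search_url` is a hypothetical helper, and the sample context value is borrowed from the scan example further down this page. The parameters operation, version, query and maximumRecords are standard SRU 1.2 request parameters.

```python
from urllib.parse import urlencode


def search_url(base_url, cql_query, context=None, maximum_records=10):
    """Build an SRU 1.2 searchRetrieve URL, optionally restricted by context."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "maximumRecords": str(maximum_records),
    }
    if context is not None:
        # Extension parameter proposed on this page for restricting the search.
        params["x-cmd-context"] = context
    return base_url + "?" + urlencode(params)


url = search_url(
    "http://example.org/sru",          # placeholder endpoint
    'isocat.partOfSpeech = "noun"',    # illustrative CQL query
    context="MPI86949#",
)
print(url)
```

`urlencode` takes care of escaping the spaces, quotes and the `#` in the values.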

Scan operation

Note: This section is not discussed/"peer-reviewed" yet. It's just my proposal. Matej

The SRU protocol foresees the `scan`-operation to find out the available terms in an index. We propose to use the scan-operation also to find out which collections are available at the provider. The request would look like this:

 ?operation=scan&scanClause=cmd.collection

The retrieved values can be subsequently used as values of the x-cmd-context-parameter in the search-request, to restrict the search to specific Resources/Collections.

This basic scenario wouldn't require any changes in the protocol, just a slight change in the interpretation of the request: when specifying a starting point of the search in the scanClause, it would have to be interpreted as a parent-node in a tree rather than a term in a flat list:

 ?operation=scan&scanClause=cmd.collection={cmd-pid of a parent-collection}
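
Both request forms can be produced by one small helper, sketched below. The endpoint URL and the helper name `scan_url` are assumptions; SRU 1.2 names the operation `scan` and carries the index (and optional starting term) in `scanClause`. Any CQL quoting the parent identifier might need is not handled here.

```python
from urllib.parse import urlencode


def scan_url(base_url, index="cmd.collection", parent=None):
    """Build a scan request; `parent` optionally names the parent collection."""
    # Without a parent the whole index is scanned; with one, the clause
    # takes the form index=parent, as proposed in the text.
    clause = index if parent is None else f"{index}={parent}"
    params = {"operation": "scan", "version": "1.2", "scanClause": clause}
    return base_url + "?" + urlencode(params)


top_level = scan_url("http://example.org/sru")                      # placeholder endpoint
children = scan_url("http://example.org/sru", parent="MPI86949#")   # sample identifier
print(top_level)
print(children)
```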

However, resources in repositories are usually structured in a tree of collections, much like a file system. We could already traverse such a tree with the method described above, by calling it separately on every node to get its children. However, if we want to allow a more efficient retrieval of the nested structures, we will need an appropriate extension to the result format as well as an extension parameter in the request.

First, an additional parameter would be needed to define the desired depth of the response. SRU dictates x- as the prefix for extension parameters, so x-cmd-maximum-depth would be one possibility.

Second, a more profound change would be needed in the result. The following is an example of a valid scan-response:

<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/" > 
<sru:version>1.2</sru:version> 
  <sru:terms>  
    <sru:term> 
          <sru:value>MPI86949#</sru:value> <!-- collection-identifier (CMD-PID?) -->
          <sru:numberOfRecords>12098</sru:numberOfRecords> 
          <sru:displayTerm>The CGN-Corpus (Corpus Gesproken Nederlands)</sru:displayTerm> 
    </sru:term> 
    <sru:term> 
          <sru:value>MPI1296694#</sru:value> 
          <sru:numberOfRecords>42</sru:numberOfRecords> 
          <sru:displayTerm>Childes corpus</sru:displayTerm> 
    </sru:term> 
  </sru:terms> 
  <sru:echoedScanRequest> 
    <sru:version>1.2</sru:version> 
    <sru:scanClause>cmd.collection</sru:scanClause> 
    <sru:responsePosition></sru:responsePosition> 
    <sru:maximumTerms>42</sru:maximumTerms> 
  </sru:echoedScanRequest> 
</sru:scanResponse>

Additionally, the protocol provides the element <sru:extraTermData> for additional information. So one possibility seems to be to recursively allow the <sru:terms>-list inside this element, yielding the following example:

<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/" > 
<sru:version>1.2</sru:version> 
  <sru:terms>  
    <sru:term> 
       <sru:value>#c1</sru:value> 
       <sru:numberOfRecords>1200</sru:numberOfRecords> 
       <sru:displayTerm>Nested Corpus</sru:displayTerm> 
       <sru:extraTermData>
          <fcs:numberOfCollections>2</fcs:numberOfCollections>
          <sru:terms>
             <sru:term>
                <sru:value>#c1-1</sru:value> 
                <sru:numberOfRecords>300</sru:numberOfRecords> 
                <sru:displayTerm>Subcorpus 1</sru:displayTerm> 
             </sru:term>      
             <sru:term>
                <sru:value>#c1-2</sru:value> 
                <sru:numberOfRecords>900</sru:numberOfRecords> 
                <sru:displayTerm>Subcorpus 2</sru:displayTerm> 
             </sru:term>    
          </sru:terms>  
      </sru:extraTermData>  
   </sru:term> 
   <sru:term> 
          <sru:value>MPI1296694#</sru:value> 
          <sru:numberOfRecords>42</sru:numberOfRecords> 
          <sru:displayTerm>Childes corpus</sru:displayTerm> 
   </sru:term>  
  </sru:terms> 
</sru:scanResponse>
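
A client reading such a nested response could recurse through <sru:extraTermData>, as in this sketch. It is an illustration of the proposal only: the sample is a shortened copy of the example above (the fcs:numberOfCollections element is left out because this page binds no namespace to the fcs prefix), and `parse_terms` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {"sru": "http://www.loc.gov/zing/srw/"}

SCAN_RESPONSE = """
<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:terms>
    <sru:term>
      <sru:value>#c1</sru:value>
      <sru:numberOfRecords>1200</sru:numberOfRecords>
      <sru:displayTerm>Nested Corpus</sru:displayTerm>
      <sru:extraTermData>
        <sru:terms>
          <sru:term>
            <sru:value>#c1-1</sru:value>
            <sru:numberOfRecords>300</sru:numberOfRecords>
            <sru:displayTerm>Subcorpus 1</sru:displayTerm>
          </sru:term>
          <sru:term>
            <sru:value>#c1-2</sru:value>
            <sru:numberOfRecords>900</sru:numberOfRecords>
            <sru:displayTerm>Subcorpus 2</sru:displayTerm>
          </sru:term>
        </sru:terms>
      </sru:extraTermData>
    </sru:term>
    <sru:term>
      <sru:value>MPI1296694#</sru:value>
      <sru:numberOfRecords>42</sru:numberOfRecords>
      <sru:displayTerm>Childes corpus</sru:displayTerm>
    </sru:term>
  </sru:terms>
</sru:scanResponse>
"""


def parse_terms(terms_element):
    """Turn an <sru:terms> list into nested dicts, following extraTermData."""
    result = []
    for term in terms_element.findall("sru:term", NS):
        nested = term.find("sru:extraTermData/sru:terms", NS)
        result.append({
            "value": term.findtext("sru:value", namespaces=NS),
            "label": term.findtext("sru:displayTerm", namespaces=NS),
            "records": int(term.findtext("sru:numberOfRecords", namespaces=NS)),
            # Flat responses simply produce empty child lists.
            "children": parse_terms(nested) if nested is not None else [],
        })
    return result


root = ET.fromstring(SCAN_RESPONSE)
tree = parse_terms(root.find("sru:terms", NS))
for node in tree:
    print(node["value"], node["label"], len(node["children"]))
```

The same function handles the flat response shown earlier, since a term without <sru:extraTermData> just gets an empty list of children.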

Configuration issues

Requirements for the endpoint (as detected when trying to access our endpoints via yaz-client, an SRU-based tool):

  • Content-Type: text/xml for the responses
  • simple base-path (everything after domain is interpreted as database-name (and slashes are escaped))
    So this works:
    http://corpus3.aac.ac.at/ddconsru
    
    While this does not:
    http://corpus3.aac.ac.at/ddc/sru
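
The base-path requirement can be checked mechanically before registering an endpoint. The following is a small sketch under the assumption that "simple" means at most one path segment after the domain; `has_simple_base_path` is a hypothetical helper, and an empty path is treated as acceptable because the observation above does not cover that case.

```python
from urllib.parse import urlsplit


def has_simple_base_path(endpoint_url):
    """True if the path after the domain contains no further slashes."""
    path = urlsplit(endpoint_url).path.strip("/")
    # A single segment (or an empty path) can serve as the database name.
    return "/" not in path


works = has_simple_base_path("http://corpus3.aac.ac.at/ddconsru")
fails = has_simple_base_path("http://corpus3.aac.ac.at/ddc/sru")
print(works, fails)
```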