Version 38 (modified by vronk, 13 years ago)

added kwic-DataView alternatives

Towards a Specification of the FCS-API for individual Content Providers

Target Audience: Technical Staff of Content Providers

Specification of the interface that Content Providers willing to join the federated search have to implement. It builds upon the SRU/CQL protocol, but concentrates mainly on the specific agreements on top of the protocol.

While this page discusses details of individual features, FCS-FeatureMatrix is a more compact yet (to be) complete summary of the features to be implemented.

SRU basics

(We have to ensure that our specification is compatible with the established implementation and usage of the protocol.)

Context Sets

The protocol allows defining custom "context sets" (~ namespaces) to which new indices, relations and operators can be bound. A number of context sets are already defined, at least some of which we should also support (e.g. dublincore).

We propose further context sets to accommodate our special needs:

isocat - isocat.org/datcat
for Data Categories defined in ISOcat Data Category Registry
ccs - clarin.eu/schema/ccs-v1.0
CLARIN Content Search - for indices on content (Annotation Tiers)
cmd - clarin.eu/schema/cmd-v1.0
CLARIN/Component Metadata - for metadata based indices.

It needs to be elaborated further how to integrate existing context sets that provide metadata fields, such as dublincore, i.e. what the relation between the cmd and dc context sets would be. The problem is that CMD shall be the context set for all indices that are conceivable/usable based on the CMD profiles, and dublincore, for example, is also defined as a profile in CMD. On the other hand, dublincore is a model/format of its own, is widely used in the established federated search world (libraries, harvesting etc.), and in particular already has its own context set in SRU/CQL. So it seems unacceptable to force the repositories to recode this (dc: to cmd:).
The question seems to be whether dc.title and cmd.dc.title can be seen as equivalent. (It is clear, however, that cmd.title is not strictly equivalent but rather an (ambiguous) superset, because it would mean the title element from all profiles.)

Explain operation

This basic request serves to announce the server's capabilities and should allow the client to configure itself automatically.

The explain response should, ideally, provide a list of ISOCatted indexes as possible search indexes. If there is no ISOCat equivalent, the CCS context set is to be used.

Example (tentative):

 <indexInfo>
   <set identifier="isocat.org/datcat" name="isocat"/>
   <set identifier="clarin.eu/schema/ccs-v1.0" name="ccs"/>

   <index id="?">
     <title lang="en">Part of Speech</title>
     <map><name set="isocat">partOfSpeech</name></map>
   </index>

   <index id="?">
     <title lang="en">Words</title>
     <map><name set="ccs">words</name></map>
   </index>

   <index id="?">
     <title lang="en">Phonetics</title>
     <map><name set="ccs">phonetics</name></map>
   </index>

 </indexInfo>
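
To illustrate how a client could configure itself from such an explain response, here is a minimal Python sketch that reads the index list and resolves each index to its context set. It assumes the un-namespaced element names of the tentative example above; a real explain record would normally carry namespaces, and the function name list_indexes is purely illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the tentative example above (index ids are placeholders).
INDEX_INFO = """
<indexInfo>
  <set identifier="isocat.org/datcat" name="isocat"/>
  <set identifier="clarin.eu/schema/ccs-v1.0" name="ccs"/>
  <index id="?">
    <title lang="en">Part of Speech</title>
    <map><name set="isocat">partOfSpeech</name></map>
  </index>
  <index id="?">
    <title lang="en">Words</title>
    <map><name set="ccs">words</name></map>
  </index>
</indexInfo>
"""


def list_indexes(index_info_xml):
    """Return (qualified index name, title, context set identifier) tuples."""
    root = ET.fromstring(index_info_xml)
    # Map the short context-set names to their identifiers.
    sets = {s.get("name"): s.get("identifier") for s in root.findall("set")}
    result = []
    for index in root.findall("index"):
        title = index.findtext("title")
        for name in index.findall("map/name"):
            prefix = name.get("set")
            result.append((f"{prefix}.{name.text}", title, sets.get(prefix)))
    return result


indexes = list_indexes(INDEX_INFO)
for qualified, title, identifier in indexes:
    print(qualified, "|", title, "|", identifier)
```

The qualified names (context set prefix plus index name) are the form in which a client would use the indexes in a CQL query.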

searchRetrieve operation

The operation to send the actual query.
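
As a rough illustration, a searchRetrieve request can be assembled from the standard SRU 1.2 parameters (operation, version, query, startRecord, maximumRecords). The sketch below is only an assumption of how a client might do this: the endpoint http://example.org/sru, the query value "noun" and the helper name build_search_url are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(base_url, cql_query, start_record=1, maximum_records=10, extra=None):
    """Build a searchRetrieve URL from the standard SRU 1.2 request parameters."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    if extra:
        # Extension parameters (SRU reserves the "x-" prefix for these).
        params.update(extra)
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and query value, for illustration only.
url = build_search_url("http://example.org/sru", 'isocat.partOfSpeech = "noun"')
print(url)

# Decoding the query string again shows that the CQL query survives URL encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```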

Search result

The common ground is the <sru:searchRetrieveResponse> defined by the protocol. This goes down to the <sru:record> wrapping element. The proposal is to continue below that level with a generic structure able to encompass "all" the various types of information (but please also look at the draft of the schema FederatedSearch/ccsResource.xsd):

 <ccs:Resource pid="{pid of the resource}">
    <ccs:Metadata cmd-link="{PID of the CMD-record}">  /* cmd-link is optional */
        /* this is for metadata provided directly by the content provider 
         * NOT the CMD  metadata.  */
      <ccs:f key="{any metadata-key}">{any metadata value}</ccs:f> /* is this any useful? */
        /* or rather direct metadata-fields like: */
       <dc:title></dc:title>
       <dc:author></dc:author>
       ....
    </ccs:Metadata>
    <ccs:ResourceFragment pid="{offset relative to the parent resource PID}">
       <ccs:Metadata>   /* metadata pertaining to the specific (matching) fragment, like metadata on the "current" speaker */
       </ccs:Metadata>

      <ccs:DataView type="kwic">Some text with <kw>keyword</kw> highlighted</ccs:DataView>		
    
      <ccs:DataView type="text/xml"><meertens:any/>     
      </ccs:DataView>		

      <ccs:DataView type="image/jpeg" href="{optional link  to the data}" >
      </ccs:DataView>  
    </ccs:ResourceFragment>
 </ccs:Resource>
Resource
element representing a resource, carrying the identifier. It may represent anything that has a PID (and a MDRecord). So in particular it may also be collections, aggregating other Resources. Allowed children are: Resource, ResourceFragment, Metadata and DataView
ResourceFragment
A part of a resource without its own PID, i.e. something addressable with: PID of the Resource + Fragment Identifier. The Fragment Identifier to be used depends on the resource type; it may be an XPointer, timecode, sequence-offset, etc. Allowed children are: Metadata and DataView
DataView
The element carrying the typed data. The content can be anything from another namespace. It must be possible to provide the content either inline or by reference, which is important for images and AV files.
Metadata
optional element carrying metadata about the Resource or ResourceFragment. It can carry an optional attribute cmd-link with the PID of a CMD-record. (This only makes sense for Resource/Metadata)

Although the original idea was to "serialize" all such metadata-fields in a <f key="{field-name}">-element, I now prefer reusing existing namespaces.

<dc:title> seems preferable to <f key="dc:title">, right?

However, this nested approach does not seem directly compatible with the established SRU-based systems, which rather work on flat fields. While this can be overcome by providing converter XSL stylesheets, the information we need seems expressible in a flat structure as well, which makes the more complex (nested) approach questionable:

<sru:recordData>
   <ccs:ResourcePID>{PID of the resource}</ccs:ResourcePID> 
   <ccs:ResourceFragmentPID>{identifier of the resource-fragment (relative to Resource-PID?)}</ccs:ResourceFragmentPID> 
   <ccs:CMDPID>{PID of the CMD-record}</ccs:CMDPID>  /* optional */

   <dc:title>{title of the resource}</dc:title>  
     /* basically any metadata-fields as is standard in SRU-world */
   ... 
  
   <ccs:DataView type="kwic">Some text with <kw>keyword</kw> highlighted</ccs:DataView>		
   <ccs:DataView type="text/xml"><meertens:any/>     
    </ccs:DataView>		
   <ccs:DataView type="image/jpeg" href="{optional link  to the data}" ></ccs:DataView>  
</sru:recordData>
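
One practical argument for the flat variant is that a client can read it with a handful of direct lookups. The following Python sketch is illustrative only: the ccs namespace URI is an assumption (derived from the context set identifier, not an agreed namespace), and the PID values, title, image link and the helper name read_record are made up.

```python
import xml.etree.ElementTree as ET

NS = {
    "sru": "http://www.loc.gov/zing/srw/",
    "ccs": "http://clarin.eu/schema/ccs-v1.0",  # assumed namespace URI, not agreed
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up record following the flat structure sketched above.
RECORD = """
<sru:recordData xmlns:sru="http://www.loc.gov/zing/srw/"
                xmlns:ccs="http://clarin.eu/schema/ccs-v1.0"
                xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ccs:ResourcePID>hdl:0000/example</ccs:ResourcePID>
  <ccs:ResourceFragmentPID>t4</ccs:ResourceFragmentPID>
  <dc:title>Example resource</dc:title>
  <ccs:DataView type="kwic">Some text with <kw>keyword</kw> highlighted</ccs:DataView>
  <ccs:DataView type="image/jpeg" href="http://example.org/page1.jpg"/>
</sru:recordData>
"""


def read_record(record_xml):
    """Collect the identifiers, title and data views of one flat record."""
    root = ET.fromstring(record_xml)
    record = {
        "resource": root.findtext("ccs:ResourcePID", namespaces=NS),
        "fragment": root.findtext("ccs:ResourceFragmentPID", namespaces=NS),
        "cmd": root.findtext("ccs:CMDPID", namespaces=NS),  # optional, may be None
        "title": root.findtext("dc:title", namespaces=NS),
        "views": {},
    }
    for view in root.findall("ccs:DataView", NS):
        # Referenced content keeps its link; inline content is flattened to text.
        inline_text = "".join(view.itertext()).strip()
        record["views"][view.get("type")] = view.get("href") or inline_text
    return record


record = read_record(RECORD)
print(record)
```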

Data Views

Here we propose several types of DataViews. The actual format has yet to be defined for most of them. We should reuse existing formats where possible, so we should look at existing practices and data, while avoiding overspecializing on one specific format. (The CLARIN deliverable Interoperability and Standards can be used as a starting point.) The relationship between data types and the corresponding "Viewers", i.e. the means of displaying the information to the user, is discussed under Viewable.

All dataviews of specific types have to be the same in all implementations. That is, if a service presents results as KWIC, that should be the same KWIC in all services.

kwic
Keyword in context
<ccs:DataView type="kwic">
   Junker Frauenlob , purre knix plautz - Ihr seid ein komischer Kauz - Habt ein Bärtlein von Haaren schwarz , 
   Ziehet es aus mit einem Tropfen Harz Prrrr - ho wird das lang , Kling klang - g - a - d -
   <kw>e</kw> , Scheiden thut weh - der Daus
</ccs:DataView>
Alternatively, to avoid mixed content, the context could be enclosed in separate elements as well:
 <ccs:DataView type="kwic"><c>Some text with </c><kw>keyword</kw> <c>highlighted</c></ccs:DataView>		
Or in the extreme form, every token is wrapped in an element:
 <ccs:DataView type="kwic">
     <t id="t1">Some</t> 
     <t id="t2">text</t> 
     <t id="t3" >with</t> 
     <t id="t4" kw="1">keyword</t> 
     <t id="t5">highlighted</t>
 </ccs:DataView>		
This comes close to the way the text is encoded in TCF and would accordingly allow adding (stand-off) annotation layers (lemma, POS, but also syntactic annotations).
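
With the fully tokenized form, a viewer can rebuild the classic three-column KWIC display (left context, keyword, right context) without handling mixed content. A minimal Python sketch, assuming the t elements and the kw="1" marker from the example above; the ccs prefix is omitted to keep the snippet self-contained, and split_kwic is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Tokenized KWIC example from above, without the ccs namespace prefix.
KWIC = """
<DataView type="kwic">
  <t id="t1">Some</t>
  <t id="t2">text</t>
  <t id="t3">with</t>
  <t id="t4" kw="1">keyword</t>
  <t id="t5">highlighted</t>
</DataView>
"""


def split_kwic(dataview_xml):
    """Split the tokens into left context, keyword and right context."""
    left, keyword, right = [], [], []
    for token in ET.fromstring(dataview_xml).findall("t"):
        if token.get("kw") == "1":
            keyword.append(token.text)
        elif keyword:
            # Everything after the first keyword token counts as right context.
            right.append(token.text)
        else:
            left.append(token.text)
    return " ".join(left), " ".join(keyword), " ".join(right)


left, keyword, right = split_kwic(KWIC)
print(f"{left} [{keyword}] {right}")
```

The sketch assumes that keyword tokens are contiguous; a hit spanning non-adjacent tokens would need a different representation.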

If there is some associated metadata (like bibliographic information about the source of the hit), this is to be encoded in a separate element <ccs:Metadata>.

Geographic data
A geographic location, either as coordinates or some location (street, city, place). One established format is KML.
Lexicon Entry
An entry from a lexicon, dictionary or similar: a lemma with some information about it. There are well-established formats for lexical and terminological data, like Lexical Markup Framework (LMF) or Terminological Markup Framework (TMF - ISO16642).
List
A list of things. These are not primary resources, but rather derived information, usually aggregations or frequency lists. This is similar to the scan operation, or in other words: the result of a scan operation is also such a list.
  [<key, number, link?>]
Example:
  Haus  45
  Liebe 60
This should also cover nested lists.
Matrix
A matrix containing things. Table as a special type of matrix? Multidimensional? To be defined.
Annotated Text
A bunch of annotated text. We start by supporting the TCF and EAF formats, as they have existing viewers. For an example EAF file see: sample file. For now we have annotations/eaf and annotations/tcf as types.
<ccs:DataView type="annotations/eaf" ref="http://corpus1.mpi.nl/qfs1/media-archive/demo/pewi/Annotations/elan-example1.eaf">
</ccs:DataView>
Syntax tree
A special type of annotation. There are dedicated formats for syntactic annotation (Penn Treebank, NeGra format, SynAF). TCF can also describe syntax trees.

Restricting the search by collections

Restricting the search space shall be done via the x-cmd-domain (or x-cmd-context?) parameter (obsoleting: x-cmd-collections). See more under SearchContext.

Scan operation

As an extension to normal SRU the scan response defines a list of searchable collections/domains available at the provider. As a scanClause argument cmd.domains should be used.

Example (tentative):

<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/"
          xmlns:diag="http://www.loc.gov/zing/srw/diagnostic/"
          xmlns:myServer="http://myServer.com/"> 
<sru:version>1.2</sru:version> 
  <sru:terms> 
 
    <sru:term> 
          <sru:value>MPI86949#</sru:value> 
          <sru:numberOfRecords>42</sru:numberOfRecords> 
          <sru:displayTerm>The CGN-Corpus (Corpus Gesproken Nederlands)</sru:displayTerm> 
    </sru:term> 
    <sru:term> 
          <sru:value>MPI556280#</sru:value> 
          <sru:numberOfRecords>42</sru:numberOfRecords> 
          <sru:displayTerm>ESF corpus</sru:displayTerm> 
    </sru:term> 
    <sru:term> 
          <sru:value>MPI214746#</sru:value> 
          <sru:numberOfRecords>42</sru:numberOfRecords> 
          <sru:displayTerm>IFA corpus</sru:displayTerm> 
    </sru:term> 
    <sru:term> 
          <sru:value>MPI1296694#</sru:value> 
          <sru:numberOfRecords>42</sru:numberOfRecords> 
          <sru:displayTerm>Childes corpus</sru:displayTerm> 
    </sru:term> 
    <sru:term> 
          <sru:value>MPI1259419#</sru:value> 
          <sru:numberOfRecords>42</sru:numberOfRecords> 
          <sru:displayTerm>Talkbank corpus</sru:displayTerm> 
    </sru:term> 
 
  </sru:terms> 
  <sru:echoedScanRequest> 
    <sru:version>1.2</sru:version> 
    <sru:scanClause>cmd.collections</sru:scanClause> 
    <sru:responsePosition></sru:responsePosition> 
    <sru:maximumTerms>42</sru:maximumTerms> 
  </sru:echoedScanRequest> 
</sru:scanResponse>
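
A client could turn such a scan response into a list of selectable collections roughly as follows. This is a sketch under the assumption that the response uses the SRU namespace shown in the example; the sample is a shortened copy of it, and list_collections is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SRU_NS = {"sru": "http://www.loc.gov/zing/srw/"}

# Shortened copy of the tentative scan response above.
SCAN_RESPONSE = """
<sru:scanResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:version>1.2</sru:version>
  <sru:terms>
    <sru:term>
      <sru:value>MPI86949#</sru:value>
      <sru:numberOfRecords>42</sru:numberOfRecords>
      <sru:displayTerm>The CGN-Corpus (Corpus Gesproken Nederlands)</sru:displayTerm>
    </sru:term>
    <sru:term>
      <sru:value>MPI556280#</sru:value>
      <sru:numberOfRecords>42</sru:numberOfRecords>
      <sru:displayTerm>ESF corpus</sru:displayTerm>
    </sru:term>
  </sru:terms>
</sru:scanResponse>
"""


def list_collections(scan_xml):
    """Return one dict per scanned term: identifier, label and record count."""
    root = ET.fromstring(scan_xml)
    collections = []
    for term in root.findall("sru:terms/sru:term", SRU_NS):
        count = term.findtext("sru:numberOfRecords", default="0", namespaces=SRU_NS)
        collections.append({
            "id": term.findtext("sru:value", namespaces=SRU_NS),
            "label": term.findtext("sru:displayTerm", namespaces=SRU_NS),
            "records": int(count),
        })
    return collections


collections = list_collections(SCAN_RESPONSE)
for collection in collections:
    print(collection["id"], "|", collection["label"], "|", collection["records"])
```

The id values are what a client would pass back to the provider when restricting a search to one of the collections.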

Configuration issues

Requirements for the endpoint (as detected when trying to access our endpoints via the SRU-based tool yaz-client):

  • Content-Type: text/xml for the responses
  • simple base-path (everything after the domain is interpreted as the database name, and slashes are escaped)
    So this works:
    http://corpus3.aac.ac.at/ddconsru
    
    While this does not:
    http://corpus3.aac.ac.at/ddc/sru
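
The base-path requirement can be checked mechanically before registering an endpoint. The sketch below encodes only the rule stated above (no slash inside the part after the domain); treating a single trailing slash as harmless is an assumption, and has_simple_base_path is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def has_simple_base_path(endpoint_url):
    """True if the path after the domain consists of at most one segment."""
    # Assumption: leading and trailing slashes do not count as path separators.
    path = urlsplit(endpoint_url).path.strip("/")
    return "/" not in path


works = has_simple_base_path("http://corpus3.aac.ac.at/ddconsru")
fails = has_simple_base_path("http://corpus3.aac.ac.at/ddc/sru")
print(works, fails)
```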