
FCS Aggregator

The core component of the FederatedSearch Infrastructure is the aggregator - a service that accepts queries, distributes them to the target repositories, collects and merges the partial results, and passes an aggregated result back to the client.

The aggregator is to be considered a component that is related to, but separate from, a web application that would allow users to perform a federated search. Such an application is discussed under EDC-Workbench.

Specification

A REST-based service that performs the aggregation (also known as metasearch).

The main operation is

  search(  query ,  search-contexts[] ,  params[]? )
query
CQL-Query
search-contexts
list of target repositories to pass the query to. This can be the result of a metadata search, or manual user selection.
It could be encoded as the x-cmd-context parameter, although this is a complex issue (see SearchContext).
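As an illustration only, the operation above could be encoded as an SRU-style GET request. This is a minimal sketch, not the specified interface: the base URL, the function name `build_search_url` and the comma-joined encoding of several contexts are assumptions made for the example; only the x-cmd-context parameter name is taken from the text.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, search_contexts, params=None):
    """Encode search(query, search-contexts[], params[]?) as an SRU-style URL."""
    fields = {
        "operation": "searchRetrieve",  # standard SRU operation name
        "version": "1.2",
        "query": query,                 # the CQL query string
    }
    if search_contexts:
        # Assumption: several contexts are joined with commas in one parameter.
        fields["x-cmd-context"] = ",".join(search_contexts)
    # Optional extra parameters (the params[]? argument), e.g. maximumRecords.
    fields.update(params or {})
    return base_url + "?" + urlencode(fields)


if __name__ == "__main__":
    # Hypothetical aggregator address and context identifiers.
    print(build_search_url("http://example.org/aggregator",
                           'dc.title = "Haus"',
                           ["repo-a", "repo-b"],
                           {"maximumRecords": "10"}))
```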

One open question is whether this service should provide the FCS/SRU protocol itself. While this seems desirable for interface consistency within the federation and would make life easier for clients, it may be difficult to combine with the requirement of delivering data as they come in, i.e. being able to handle partial results.
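The partial-results requirement can be sketched as follows: the query is sent to all targets in parallel and each target's answer is handed on as soon as it arrives, instead of waiting for the slowest one. This is only a sketch under assumptions: `fetch` is a hypothetical callable that queries a single target repository, and merging or ranking of the records is left out.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def search(query, search_contexts, fetch):
    """Yield (context, records, error) as soon as each target has answered.

    `fetch(context, query)` is a hypothetical callable that queries one
    target repository and returns a list of records.
    """
    workers = max(1, len(search_contexts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Send the same query to every target in parallel.
        futures = {pool.submit(fetch, ctx, query): ctx for ctx in search_contexts}
        # as_completed() hands back futures in the order they finish,
        # so fast targets are reported before slow ones.
        for future in as_completed(futures):
            ctx = futures[future]
            try:
                yield ctx, future.result(), None
            except Exception as exc:
                # A failing target must not block the results of the others.
                yield ctx, [], exc
```

A client would iterate over `search(...)` and update its display after every yielded partial result.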

Interface Mockup

See Mockup

Implementations

MPI's demonstrator aggregator

There is now one quick-and-dirty implementation by Herman Stehouwer hosted at mpi.nl: http://lux17.mpi.nl/ds/fedsearch/. Although it is valuable for producing first aggregated results, it probably should not be used directly as a basis for further development, as it was put together in a hurry. It also lumps together the service and the "web application", i.e. the web user interface, which we want to avoid.

The next step for this prototype would be to split the service and web application functionality into two separate components.

Pazpar2/YAZ

Currently, however, we are concentrating on the very mature and widely used Zebra/YAZ/MasterKey framework. Pazpar2 in particular - the metasearching middleware - seems a great candidate for exactly what we intend. The framework was originally built on the powerful but complex Z39.50 protocol, but most components (in particular Pazpar2) also speak SRU/SRW.

Pazpar2 is able to query SRU-based services, but it does not offer an SRU interface itself for the aggregated result. Instead it exposes its own session-bound webservice protocol (HTTP requests with XML responses), and a session-bound protocol of this kind seems necessary to allow partial results.

Although Pazpar2 is primarily a service, it also comes with a quite usable AJAX-based sample search user interface.
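For orientation, a client such as the sample user interface drives Pazpar2 through HTTP commands like init, search, stat and show, polling stat/show within a session while results are still coming in. The sketch below only builds the request URLs; the endpoint address is an assumption for illustration, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: address of the Pazpar2 webservice; adjust to the local setup.
PZ2_ENDPOINT = "http://localhost:9004/search.pz2"


def pz2_url(command, session=None, **params):
    """Build the URL for one Pazpar2 webservice command."""
    fields = {"command": command}
    if session is not None:
        fields["session"] = session
    fields.update(params)
    return PZ2_ENDPOINT + "?" + urlencode(fields)


def session_urls(session, query, start=0, num=20):
    """URLs a client would request, in order, within one session."""
    return [
        pz2_url("search", session, query=query),         # start the metasearch
        pz2_url("stat", session),                        # poll overall progress
        pz2_url("show", session, start=start, num=num),  # fetch merged records so far
    ]


if __name__ == "__main__":
    print(pz2_url("init"))  # the response to init carries the session id
    for url in session_urls("42", "Haus"):
        print(url)
```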

[Screenshot: the Pazpar2 demo client with an FCS client and KWIC display]

There is currently one instance of Pazpar2 running on the clarin-at server: http://clarin.aac.ac.at/pazpar2/jsdemo1/, connecting to:

This configuration demonstrates the following aspects:

  1. connect to one of the FCS-SEs (namely our C4/DDC-service).
  2. allow combined search in FCS-SEs and other established targets from the SRU/Z39.50 world (e.g. LoC)
  3. return the content result ( DataView[@type='kwic'] )
  4. highlight the keyword in the content result (this means handling markup in the field values). Read more about the tweaks and limitations in the next subsection.

Pazpar2's usability for FCS

By means of XSLT preprocessing, Pazpar2 can ingest almost any XML data, but the data have to be squeezed into the typical flat field structure (as in Solr/Lucene). Pazpar2's internal representation of a record is a list of <pz:metadata type="{field-name}"> fields, which are transformed in the output into <md-{field-name}> elements.
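The mapping from internal fields to output elements can be imitated in a few lines. This is a sketch of the naming convention described above, not Pazpar2's own code; the <hit> wrapper, the sample record and the function name are assumptions made for the example. Note how nested markup inside a field collapses to plain text in a flat field model.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the pz: prefix in Pazpar2's internal record format.
PZ_NS = "http://www.indexdata.com/pazpar2/1.0"


def to_output_record(internal_xml):
    """Turn <pz:metadata type="x"> fields into flat <md-x> elements."""
    record = ET.fromstring(internal_xml)
    hit = ET.Element("hit")  # assumption: wrapper element for one record
    for field in record.findall(f"{{{PZ_NS}}}metadata"):
        out = ET.SubElement(hit, "md-" + field.get("type"))
        # itertext() drops nested tags and keeps only their text,
        # mimicking a flat field that can hold simple text only.
        out.text = "".join(field.itertext())
    return ET.tostring(hit, encoding="unicode")


SAMPLE = (
    f'<pz:record xmlns:pz="{PZ_NS}">'
    '<pz:metadata type="title">Faust</pz:metadata>'
    '<pz:metadata type="kwic">das <kw>Haus</kw> am See</pz:metadata>'
    '</pz:record>'
)

if __name__ == "__main__":
    print(to_output_record(SAMPLE))
```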

The main problem, however, seems to be that Pazpar2 is not able to pass XML data through inside the fields. So while we could map individual <DataView>s into corresponding custom Pazpar2 fields, the content of the fields would have to be simple text. As we want to convey all kinds of complex (XML) data structures in our response, this seems like a blocking issue. I posted a request regarding this on the yazlist and plan to investigate it further.

Possible workarounds/solutions:

xml-content only by reference
Example for this approach:
 <DataView type="annotation-eaf" ref="{link to Data-file}" />
 <!-- would be converted in Pazpar2's output into: -->
 <md-annotation-eaf>{link to Data-file}</md-annotation-eaf>
Although our proposed data model caters for referencing content as an alternative to embedding it (and this would be a good exercise towards treating everything as an addressable Resource), having only this possibility is certainly very limiting. For example, we would not be able to process even the basic kwic-DataView inline (as it is supposed to mark the keyword in the context with a <kw> element).
escape xml
make the XML look like text by escaping the tags. This sounds like an ugly hack, but it would be easy to implement. In fact, it is already implemented in the clarin.aac.ac.at instance.
MasterKey Service-Proxy
another component of the Index Data MasterKey suite that is designed to interact with (sit on top of) Pazpar2 and extend the service with further functionality. Its modular architecture - plugins forming processing chains - should allow pre- and postprocessing of the results to work around this limitation.
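The "escape xml" workaround listed above can be sketched as a round trip: the markup is stored as escaped text in a flat field, and the client turns it back into elements after receiving it. This is a minimal illustration under assumptions, not the code running on the clarin.aac.ac.at instance; the function names and the <fragment> wrapper are hypothetical.

```python
import xml.etree.ElementTree as ET


def pack_as_text(field_name, xml_fragment):
    """Store an XML fragment as plain (escaped) text in a flat md- field."""
    field = ET.Element("md-" + field_name)
    # Assigning the markup as text makes the serializer escape <, > and &.
    field.text = xml_fragment
    return ET.tostring(field, encoding="unicode")


def unpack_markup(field_xml):
    """Client side: recover the original elements from the escaped text."""
    text = ET.fromstring(field_xml).text or ""
    # Wrap in a root element, since the fragment may be mixed content.
    return ET.fromstring("<fragment>" + text + "</fragment>")


if __name__ == "__main__":
    packed = pack_as_text("kwic", "das <kw>Haus</kw> am See")
    print(packed)
    restored = unpack_markup(packed)
    print(restored.find("kw").text)
```

The price of this approach is that every client has to know which fields carry escaped markup and must parse them a second time.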

Remarks on installing pazpar2/yaz-client on Linux

(OS: openSUSE 11.2)

Software packages were available via the openSUSE distribution/YaST, but they were outdated (3.0.44), so they were uninstalled. There are also various packages in the Index Data repository, but they kept missing some libraries.

So the installation was done from source:

  1. downloaded the latest sources of yaz-4.1.7 and pazpar2-1.5.6.
  2. tried a plain build:
     ./configure 
     make  
     make install
    
    But when started, yaz-client and pazpar2 both failed because of a missing shared library: "error while loading shared libraries: libyaz_icu.so.4:" (although the library was available under /usr/local/lib)
  3. then tried various configurations (always running make uninstall in between); the one that finally worked:
     ./configure --disable-shared --with-icu --with-xml2 --with-xslt
    
    --disable-shared turns off the shared objects; according to the docs, the --with-xml2 and --with-xslt options are necessary for SRU support.
  4. set up edu.xml as the targets configuration in pazpar2/etc/default.xml
  5. started pazpar2:
     cwd:pazpar2/etc> ../src/pazpar2 -f pazpar2.cfg
    
  6. copied pazpar2/www/test1 and jsdemo to /srv/www/htdocs
  7. added a reverse proxy to the Apache setup (in httpd.conf.local;
    see also pazpar2-docs#apache2proxy):
     ProxyPass /pazpar2 http://localhost:9004
     ProxyVia Off
     ProxyPassReverse /pazpar2 http://localhost:9004
     ProxyPassReverseCookieDomain localhost corpus5.aac.oeaw.ac.at
     ProxyPassReverseCookiePath / /pazpar2
    
  8. try under: http://clarin.aac.ac.at/pazpar2/jsdemo1/
Last modified on 07/26/12 10:06:12
