FCS Aggregator
The core component of the FederatedSearch infrastructure is the aggregator: a service that accepts queries, distributes them to the target repositories, collects and merges the partial results, and passes an aggregated result back to the client.
It should be considered a component related to, but separate from, a web application that lets users perform a federated search. Such an application is discussed under EDC-Workbench.
Specification
A REST-based service that performs the aggregation (also known as metasearch).
The main operation is
search( query , search-contexts[] , params[]? )
- query
- CQL-Query
- search-contexts
- list of target repositories to pass the query to. This can be the result of a metadata search, or manual user selection.
It could be encoded as the x-cmd-context parameter (this is a complex issue, however; see SearchContext).
One open question is whether this service should itself provide the FCS/SRU protocol. While this seems desirable for interface consistency within the federation and would make life easier for clients, it may be difficult to combine with the requirement of feeding data to the client as it comes in, i.e. being able to handle partial results.
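To make the search operation and the partial-results requirement more concrete, here is a minimal Python sketch. It is only an illustration under assumptions: each target repository is modelled as a hypothetical callable that takes the CQL query and returns a list of hits, and the names Target, merge and the "timeout" parameter are invented for this example; none of it is part of the specification.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical target type: takes a CQL query string, returns a list of hits.
Target = Callable[[str], List[str]]


def search(query: str,
           search_contexts: Dict[str, Target],
           params: Optional[dict] = None) -> Iterator[Tuple[str, List[str]]]:
    """Send the query to every target and yield (target, hits) as each one answers."""
    params = params or {}
    timeout = params.get("timeout")  # assumed parameter, in seconds; None = no limit
    with ThreadPoolExecutor(max_workers=max(1, len(search_contexts))) as pool:
        futures = {pool.submit(target, query): name
                   for name, target in search_contexts.items()}
        # as_completed hands back futures in the order in which they finish,
        # so the caller sees partial results before slow targets have answered.
        for future in as_completed(futures, timeout=timeout):
            name = futures[future]
            try:
                yield name, future.result()
            except Exception:
                # A failing repository should not break the whole search.
                yield name, []


def merge(partials) -> List[Tuple[str, str]]:
    """Merge partial results into one flat list of (target, hit) pairs."""
    return [(name, hit) for name, hits in partials for hit in hits]


if __name__ == "__main__":
    demo_targets = {
        "repo-a": lambda q: [q + " (hit from repo-a)"],
        "repo-b": lambda q: [],
    }
    for name, hits in search("lemma=Haus", demo_targets):
        print(name, hits)
```

A client that can handle partial results would consume the generator directly; a client behind a plain request/response interface such as SRU would have to wait for merge() over all targets, which is exactly the tension described above.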
Interface Mockup
See Mockup
Implementations
MPI's demonstrator aggregator
There is now one quick-and-dirty implementation by Herman Stehouwer, hosted at mpi.nl: http://lux17.mpi.nl/ds/fedsearch/. Although valuable for producing first aggregated results, it probably should not be used directly as a basis for further development, as it was put together very hastily. It also lumps together the service and the "web application", i.e. the web user interface, which we want to avoid.
The next step for this prototype would be to split the service and web application functionality into two separate components.
Pazpar2/YAZ
Currently, however, we are concentrating on the very mature and widely used Zebra/YAZ/MasterKey framework.
Pazpar2 in particular, the metasearching middleware, seems a great candidate for exactly what we intend. The framework was originally built on the powerful but complex Z39.50 protocol, but most components (in particular Pazpar2) also speak SRU/SRW.
Pazpar2 is able to query SRU-based services, but it does not itself offer an SRU interface for the aggregated result; rather, it still uses the more powerful (session-bound) Z39.50 protocol, which seems necessary to allow partial results.
Although Pazpar2 is primarily a service, it also comes with a quite usable AJAX-based sample search user interface.
There is currently one instance of Pazpar2 running on the clarin-at server: http://clarin.aac.ac.at/pazpar2/jsdemo1/, connecting to:
This configuration demonstrates the following aspects:
- connect to one of the FCS-SEs (namely our C4/DDC service)
- allow combined search in FCS-SEs and other established targets from the SRU/Z39.50 world (e.g. LoC)
- return the content result ( DataView[@type='kwic'] )
- highlight the keyword in the content result (this means handling markup in the field values)
(Read more about the tweaks and limitations in the next subsection.)
Pazpar2's usability for FCS
By means of XSLT preprocessing, Pazpar2 can ingest almost any XML data, but the data has to be squeezed into the typical flat field structure (as in Solr/Lucene). Pazpar2's internal representation of a record is a list of <pz:metadata type="{field-name}"> fields, which are transformed in the output into <md-{field-name}> elements.
The main problem, however, seems to be that it is not able to pass XML data through inside the fields.
So while we could map individual <DataView>s to corresponding custom Pazpar2 fields, the content of the fields would have to be simple text.
As we want to convey all kinds of complex (XML) data structures in our response, this seems like a blocking issue.
I posted a request regarding this on the yazlist and plan to investigate it further.
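The effect of the flat field structure can be illustrated with a small Python sketch. This is not Pazpar2's code or its XSLT, only an imitation of the outcome under assumptions: the sample record, the <Resource> wrapper and the flatten() helper are hypothetical, and only the <DataView> element and the md-{field-name} naming come from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical FCS record fragment with a keyword marked up inside the KWIC view.
RECORD = """<Resource>
  <DataView type="kwic">the quick <kw>brown</kw> fox</DataView>
</Resource>"""


def flatten(record_xml: str) -> dict:
    """Map each DataView to a flat md-{type} field that holds plain text only."""
    fields = {}
    for view in ET.fromstring(record_xml).iter("DataView"):
        # itertext() keeps the character data but drops the tags themselves,
        # which mirrors what happens when a field may only contain simple text.
        fields["md-" + view.get("type")] = "".join(view.itertext()).strip()
    return fields


if __name__ == "__main__":
    print(flatten(RECORD))
```

The text of the view survives, but the <kw> markup that marks the keyword is gone, which is the limitation described above.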
Possible workarounds/solutions:
- xml-content only by reference
- Example for this approach:
<DataView type="annotation-eaf" ref="{link to Data-file}" />
would be converted in Pazpar2's output into:
<md-annotation-eaf>{link to Data-file}</md-annotation-eaf>
Although our proposed data model caters for referencing content as an alternative to embedding it (and this would be a good exercise towards treating everything as an addressable Resource), having only this possibility is certainly very limiting. For example, we would not be able to process inline even the basic kwic DataView (as it has to mark the keyword in the context with a <kw> element).
- escape xml
- make the xml look like text by escaping the tags. This sounds like an ugly hack, but it is easy to implement. In fact, it is already implemented in the clarin.aac.ac.at instance.
- MasterKey Service-Proxy
- another component of the Index Data MasterKey suite, designed to interact with (sit on top of) Pazpar2 and extend the service with further functionality. Its modular architecture, with plugins forming processing chains, should make it possible to pre- and postprocess the results to work around this limitation.
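The "escape xml" workaround can be sketched in a few lines of Python. This is only an illustration of the idea, not the code running on the clarin-at instance: the helper names to_field() and from_field() and the sample KWIC string are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Hypothetical KWIC content with the keyword marked by a <kw> element.
KWIC = "the quick <kw>brown</kw> fox"


def to_field(markup: str) -> str:
    """Service side: make the markup look like text by escaping the tags."""
    return escape(markup)


def from_field(text: str) -> ET.Element:
    """Client side: restore the markup and parse it back into a DataView."""
    return ET.fromstring('<DataView type="kwic">' + unescape(text) + "</DataView>")


if __name__ == "__main__":
    field = to_field(KWIC)
    print(field)                             # escaped, safe for a flat text field
    print(from_field(field).find("kw").text)  # keyword recovered on the client
```

The price of this approach is that every client has to know which fields carry escaped markup and must unescape and parse them itself.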
Remarks on Installation of pazpar2/yaz-client on Linux
(on openSUSE 11.2)
Software packages were available via the openSUSE distribution/YaST, but they were outdated (3.0.44), so I uninstalled them. There are also various packages in the Index Data repository, but they kept missing some libraries.
So:
- downloaded the latest sources of yaz-4.1.7 and pazpar2-1.5.6.
- tried the simple:
./configure
make
make install
But yaz-client and pazpar2 then failed with a missing shared library: "error while loading shared libraries: libyaz_icu.so.4:" (although it was available under /usr/local/lib).
- then tried various configurations (always with make uninstall in between); the one that finally worked:
./configure --disable-shared --with-icu --with-xml2 --with-xslt
This disables shared objects; according to the docs, the --with-xml2 and --with-xslt options are necessary for SRU support.
- set up edu.xml as the targets configuration in pazpar2/etc/default.xml
- started pazpar2:
cwd:pazpar2/etc> ../src/pazpar2 -f pazpar2.cfg
- copied pazpar2/www/test1,jsdemo to /srv/www/htdocs
- added a reverse proxy to the Apache setup (in httpd.conf.local) (see also pazpar2-docs#apache2proxy):
ProxyPass /pazpar2 http://localhost:9004
ProxyVia Off
ProxyPassReverse /pazpar2 http://localhost:9004
ProxyPassReverseCookieDomain localhost corpus5.aac.oeaw.ac.at
ProxyPassReverseCookiePath / /pazpar2
- try under: http://clarin.aac.ac.at/pazpar2/jsdemo1/
Attachments (1)
- screen_clarinPazpar2_fcs.png (65.7 KB): screenshot of the Pazpar2 demo client with an FCS client and KWIC display