
Warning: This information is deprecated! Please find current information at the Core Specification and supplementary Data Views Specification pages. An up-to-date list of endpoints is available from the Centre Registry. For internal work, the taskforce has an additional list of endpoints.

FCS Endpoints

This page was started as a preliminary collection of information about individual repositories/Centres/providers/endpoints within FederatedSearch.

In the meantime, the Centre Registry collects and provides the information about the available repositories (see below) and is intended to become the sole source of information about the individual endpoints. However, since not all endpoints provide a description within the Centre Registry yet, and also for reference, the list of endpoints below is being kept for the time being. This page also covers issues relevant to FCS-endpoints. (But see also FCS-specification.)

Announcing the endpoints

There are two ways of announcing an FCS-endpoint:

  1. CenterProfile published via the Centre Registry
  2. SearchService-ResourceProxy in a CMD record of a resource/collection

CenterProfile in the Centre Registry

In an SOA, a separate registry service is required to announce the individual services/endpoints. Within CLARIN, the Centre Registry has been introduced to collect and expose organizational/administrative and technical information about individual CLARIN Centres.

This information is encoded as a dedicated CMD record according to the CenterProfile [clarin.eu:cr1:p_1320657629667].

These descriptions shall cover all aspects (especially all available services) of a given Centre, so the information about the FCS-endpoint is just one part of the description. The following is a snippet of a CenterProfile linking to the FCS-endpoint:

 <WebReference>
   <Website>http://weblicht.sfs.uni-tuebingen.de/rws/sru/</Website>
   <Description>CQL</Description>
 </WebReference>
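
As a rough sketch of how a client might read such a snippet, the following Python code collects the Website value of every WebReference whose Description is CQL. It assumes the elements carry no XML namespace, as in the snippet above; a complete CenterProfile is a full CMD record and would most likely need namespace handling. The wrapping `Centre` element and the function name `find_cql_endpoints` are hypothetical and only serve the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element around the WebReference snippet shown above.
SNIPPET = """
<Centre>
  <WebReference>
    <Website>http://weblicht.sfs.uni-tuebingen.de/rws/sru/</Website>
    <Description>CQL</Description>
  </WebReference>
</Centre>
"""


def find_cql_endpoints(xml_text):
    """Return the Website of every WebReference described as 'CQL'."""
    root = ET.fromstring(xml_text)
    endpoints = []
    for ref in root.iter("WebReference"):
        description = (ref.findtext("Description") or "").strip()
        website = (ref.findtext("Website") or "").strip()
        if description == "CQL" and website:
            endpoints.append(website)
    return endpoints


print(find_cql_endpoints(SNIPPET))
```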

(Sidenote: There is a semantic/functional overlap between such a CMD record for the endpoint and the SRU explain response of the given service.
TODO: clarify the interaction and dependencies.)

Discussion: replicating Metadata about resources

There is an ongoing discussion about how much information about the resources provided via the endpoint should be included in the endpoint description (the foremost example being the Language), as the resources are already described in separate CMD records.

So it would be "cleaner" to just link to the separate CMD records of a collection or a resource, which would provide information like ResourceType, Language, available AnnotationTiers, Time/Space Coverage etc. However, that would put an undue burden on the aggregator/client, which would have to crawl through and resolve multiple metadata records and make sense of the heterogeneous structure of the CMD records to get to the required information.

Thus, for now, Language (and possibly a few other basic fields) will be added to the endpoint description. But there are plans (and prototyping work at Meertens) to combine the metadata and content search, which would allow filtering the content search by any metadata query.

SearchService-ResourceProxy

Another way of linking to an FCS-endpoint (bottom-up) is to add a SearchService-ResourceProxy to the CMD record of any collection/resource.

The obvious semantics is that the stated FCS-endpoint provides search capabilities over the given resource. However, because the SearchService-ResourceProxy only leads to the "nearest" endpoint, which may expose multiple resources, the endpoint has to accept the parameter x-context to restrict the search to the given resource. A client/aggregator invoking a search for a given resource passes the PID of the resource as the x-context parameter.
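
To make the role of x-context concrete, here is a minimal sketch of how a client could assemble such a request URL. It assumes the endpoint speaks SRU 1.2 with the standard `operation`, `version`, `query` and `maximumRecords` parameters; the function name `build_search_url` and the PID in the example are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(endpoint, cql_query, resource_pid=None, max_records=10):
    """Build an SRU searchRetrieve URL, optionally restricted via x-context."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",  # assumption: the endpoint supports SRU 1.2
        "query": cql_query,
        "maximumRecords": str(max_records),
    }
    if resource_pid:
        # Restrict the search to one resource exposed by the endpoint.
        params["x-context"] = resource_pid
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode(params)


# The PID below is made up for illustration only.
print(build_search_url("http://cqlservlet.mpi.nl/", "dog", "hdl:1234/example"))
```

Omitting `resource_pid` leaves out x-context altogether, so the search runs over whatever the endpoint exposes by default.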

This method is especially important to support the metadata/content search: with the information about the respective SearchService inside the CMD descriptions of the resources, it will be easy to run any metadata query, extract a list of FCS-endpoints from the result set, and pass it e.g. to the FCS-Aggregator to continue searching in the content.
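
The extraction step could look roughly like the sketch below. It assumes each CMD record lists its proxies as ResourceProxy elements with ResourceType and ResourceRef children, and it matches on local element names so that the exact CMD namespace does not matter. The sample record, its URLs and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened CMD record with one SearchService proxy.
RECORD = """<CMD xmlns="http://www.clarin.eu/cmd/">
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="s1">
        <ResourceType>SearchService</ResourceType>
        <ResourceRef>http://cqlservlet.mpi.nl/</ResourceRef>
      </ResourceProxy>
      <ResourceProxy id="r1">
        <ResourceType>Resource</ResourceType>
        <ResourceRef>http://example.org/resource</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
</CMD>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def search_service_endpoints(cmd_records):
    """Collect the distinct SearchService URLs found in a list of CMD records."""
    endpoints = []
    for record in cmd_records:
        for element in ET.fromstring(record).iter():
            if local_name(element.tag) != "ResourceProxy":
                continue
            fields = {local_name(c.tag): (c.text or "").strip() for c in element}
            ref = fields.get("ResourceRef")
            if fields.get("ResourceType") == "SearchService" and ref:
                if ref not in endpoints:
                    endpoints.append(ref)
    return endpoints


print(search_service_endpoints([RECORD, RECORD]))
```

The resulting list is what would be handed over to the aggregator; duplicates are dropped because several resources may point to the same endpoint.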

Implementing Endpoints

Here is a preliminary list of services available (or planned) within FederatedSearch:

Provider | DB/Collection | link | type | language | status
KNAW | mimore | http://www.meertens.knaw.nl/mimore/srucql/ | dialects Lexicon? | Dutch dialects | online, level 0, open issues, sample query
MPI | IMDI-subset | http://cqlservlet.mpi.nl/ | spoken corpora | Dutch, German, English, French, Swedish | online, level 0, open issues, sample query
INL | INL | http://gysseling.corpus.taalbanknederlands.inl.nl/gyssru/ | text corpus | historic Dutch | online, level 0, open issues, sample query
INL | INL | http://brievenalsbuit.inl.nl/zbsru/ | text corpus | historic Dutch | online, level 0, open issues, sample query
DANS | Lieffering | http://srucql.dans.knaw.nl | historical text corpus | Dutch, French | online, level 1, no known issues, sample query
ICLTT | C4 | http://corpus3.aac.ac.at/ddconsru | historical text corpus | German | online, level 0, open issues, sample query
UPF | El Pais Newspaper Corpus 2005 | http://gilmere.upf.edu/pais_sru | text corpus | Spanish | online, allows index-based queries (e.g. queries only on content ccs.content=... sample query)
Uni Tübingen | Tübingen Baumbank des Deutschen - Diachrones Corpus | http://weblicht.sfs.uni-tuebingen.de/rws/cqp-ws/cqp/sru | text corpus | German | online, level 0, sample query
IDS | Goethe | http://clarin.ids-mannheim.de/cosmassru | text corpus | German | online, level 0, sample query, enumerate corpora
IDS | TextGrid Digital Library (Literature folder) | http://clarin.ids-mannheim.de/digibibsru | text corpus | German | online, level 0, highly experimental, sample query, enumerate corpora
Uni Leipzig | Leipzig Corpora Collection, Deutscher Wortschatz | http://clarinws.informatik.uni-leipzig.de:8080/CQL | text corpus | multiple languages | online, level 1, sample query, WIP
BBAW | DTA Corpus | http://dspin.dwds.de:8088/cgi-bin/FCS-Endpoint/DTA_SRU | historical text corpus | German | online, level 0, sample query
BBAW | Dingler Online | http://dspin.dwds.de:8088/cgi-bin/FCS-Endpoint/DTA_SRU | historical text corpus | German | online, level 0, sample query
BBAW | C4 Corpus | http://dspin.dwds.de:8088/DDC-C4/sru | reference corpus | German | online, level 0, sample query
UdS | GRUG Treebank | http://fedora.clarin-d.uni-saarland.de/sru/ | parallel Treebank | German | online, level 0, sample query
UdS | Saarbrücken Cookbook Corpora (SaCoCo) | http://fedora.clarin-d.uni-saarland.de/sru2/ | diachronic corpus | German | online, level 0, sample query
UdS | Middle Polish Diachrone Lemmatised Corpus (PolDiLemma) | http://fedora.clarin-d.uni-saarland.de/sru3/ | diachronic corpus | Polish (1501/1800) | online, level 0, sample query
BAS | Various | http://clarin.phonetik.uni-muenchen.de/BASSRU | speech corpora | German, English, Japanese | online, level 0, sample query

List of corpora per endpoint.

MPI for Psycholinguistics (cqlservlet.mpi.nl)

INL (gysseling.corpus.taalbanknederlands.inl.nl/cqlwebapp/cql)

Candidate Services

Provider | DB/Collection | link | type | language | status
OTA | BNC ? | http://ota.oucs.ox.ac.uk/ | text corpus | English | potential
OTA | TRACTOR-archive | | text corpus? | Central and East EU | potential
BAS | Speech Corpora | | spoken corpora | German | planned
Gothenburg Uni | Spraakbanken | http://litteraturbanken.se/ | literary texts | Swedish? | planned?
Gothenburg Uni | Spraakbanken | http://demosb.spraakdata.gu.se/korp/ | text corpus | Swedish? | planned?
UPF | El Pais Newspaper Corpus (Metadata) | http://gilmere.upf.edu/girona_sru | text corpus | Catalan | only metadata (date, lang), sample query?
ICLTT | CMDI-MDService | http://clarin.aac.ac.at/MDService2/sru | Metadata | multiple langs | online, but very sloppy, too different

The following are some reference SRU services:

Provider | endpoint | info
Library of Congress | http://z3950.loc.gov:7090/voyager | ?
Oxford English Dictionary | http://www.oed.com/srupage | info about OED-SRU service
Gutenberg (metadata) provided by indexdata | http://opencontent.indexdata.com/gutenberg | project Open Content

See also for DE endpoint candidates: http://www.clarin-d.de/mwiki/index.php/Corpora_f%C3%BCr_den_Federated_Content_Search

SRU/CQL conformance testing

An SRU Server Tester for testing basic protocol conformance is available at: http://alcme.oclc.org/srw/SRUServerTester.html

A preliminary CLARIN FCS SRU/CQL tester (requires authentication) is available at: http://clarin.ids-mannheim.de/srutest/

Current Issues

At the time of writing (3rd of May) there are some issues with all of the SRU/CQL implementations. We list them here per service.

http://www.meertens.knaw.nl/mimore/srucql/

  • Does not implement x-cmd-collections
  • <sru:recordData/> is a closed element instead of wrapping around the <css:Resource>
  • How to generate link to resource containing hit?
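
The empty `<sru:recordData/>` problem listed above can be detected mechanically. The following is a minimal sketch that reports which records of a searchRetrieve response have no content inside recordData. It assumes the SRU 1.1/1.2 response namespace `http://www.loc.gov/zing/srw/`; the sample response, the `Resource` namespace in it and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

SRU_NS = "{http://www.loc.gov/zing/srw/}"  # assumption: SRU 1.1/1.2 responses

# Hypothetical response: the first record is empty, the second wraps a Resource.
RESPONSE = """<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:records>
    <sru:record>
      <sru:recordData/>
    </sru:record>
    <sru:record>
      <sru:recordData>
        <Resource xmlns="http://example.org/hypothetical"/>
      </sru:recordData>
    </sru:record>
  </sru:records>
</sru:searchRetrieveResponse>"""


def empty_record_positions(response_xml):
    """Return 1-based positions of records whose recordData has no child element."""
    root = ET.fromstring(response_xml)
    empty = []
    for position, record in enumerate(root.iter(SRU_NS + "record"), start=1):
        data = record.find(SRU_NS + "recordData")
        if data is None or len(data) == 0:
            empty.append(position)
    return empty


print(empty_record_positions(RESPONSE))
```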

http://cqlservlet.mpi.nl/

  • The content in the DataView element is not XML-ified. It should at least wrap elements around the hit.

http://corpus3.aac.ac.at/ddconsru

  • Does not implement scan on cmd.collections
  • How to generate link to resource containing hit?

http://gilmere.upf.edu/girona_sru

  • Encodes the searchRetrieve response in DCU format (and not our format)
  • Does not seem to use x-cmd-collections

http://clarin.aac.ac.at/MDService2/sru

  • searchRetrieve returns a completely different XML structure.
  • scan on cmd.collections generates a NullPointerException

Requirements

Monitoring
An associated service must regularly check the availability of the services (and perhaps more). A simple Nagios plugin is available.
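
To illustrate what such a check might involve, here is a minimal sketch of an availability probe that maps the result onto the conventional Nagios plugin exit codes (0 for OK, 2 for CRITICAL). It is not the plugin mentioned above; the function name `check_endpoint` and the injectable `fetch` callable are assumptions made so that the logic can be exercised without network access.

```python
import urllib.error
import urllib.request

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def check_endpoint(url, fetch=None, timeout=10):
    """Return (exit_code, message) describing whether the endpoint answers."""

    def default_fetch(target):
        # Plain HTTP GET; returns the HTTP status code of the response.
        with urllib.request.urlopen(target, timeout=timeout) as response:
            return response.status

    fetch = fetch or default_fetch
    try:
        status = fetch(url)
    except (urllib.error.URLError, OSError) as error:
        return CRITICAL, f"CRITICAL - {url} unreachable: {error}"
    if status == 200:
        return OK, f"OK - {url} answered with HTTP 200"
    return CRITICAL, f"CRITICAL - {url} answered with HTTP {status}"


# Offline demonstration with a stubbed fetch that pretends the endpoint is up.
code, message = check_endpoint("http://example.org/sru", fetch=lambda target: 200)
print(code, message)
```

When used as an actual plugin, the script would print the message and exit with the returned code, for example via `sys.exit(code)`.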

Further (future) Issues

  • Integration with MD-search
  • Integration with VLO
  • Integration with the Virtual Collection Registry
  • Aggregator/endpoint.

We note that all three will ideally be integrated in the same manner. For this to happen there are a few conditions that MUST be met:

  • All centres generate nice CMDI
  • This CMDI has the same granularity as the search constrainability in the endpoint
  • Everyone includes a unique set identifier in the CMDI files (e.g., the node-id mapping hdl.net thing at the MPI).
  • The endpoints understand the unique identifiers from their corresponding CMDI files!
Last modified on 02/13/15 13:24:06