wiki:FCS-Endpoints

Version 36 (modified by dietuyt, 12 years ago)

Repository Registry

We need a service/registry that provides information about the available repositories.

Until we have that formalized, we will collect the information about individual providers here.

Implementing Endpoints

Here is a preliminary list of services available (or planned) within the Federated Search (EDC):

Provider	DB/Collection	link	type	language	status
KNAW	mimore	http://www.meertens.knaw.nl/mimore/srucql/	dialects Lexicon?	Dutch dialects	online, level 0, open issues, sample query
MPI	IMDI-subset	http://cqlservlet.mpi.nl/	spoken corpora	Dutch, German, English, French, Swedish	online, level 0, open issues
INL	INL?	http://gysseling.corpus.taalbanknederlands.inl.nl/cqlwebapp/cql	text corpus	historic Dutch	online, level 0, open issues, sample query
DANS	Lieffering	http://srucql.dans.knaw.nl	historical text corpus	Dutch, French	online, level 1, no known issues, sample query
ICLTT	C4	http://corpus3.aac.ac.at/ddconsru	historical text corpus	German	online, level 0, open issues, sample query
UPF	El Pais Newspaper Corpus 2005	http://gilmere.upf.edu/pais_sru	text corpus	Spanish	online, allows index-based queries (e.g. queries only on content: `ccs.content=...`), sample query
Uni Tübingen	Tübingen Baumbank des Deutschen - Diachrones Corpus	http://weblicht.sfs.uni-tuebingen.de/rws/cqp-ws/cqp/sru	text corpus	German	online, level 0, sample query
IDS	Goethe	http://clarin.ids-mannheim.de/cosmassru	text corpus	German	online, level 0, sample query
IDS	TextGrid Digital Library (Literature folder)	http://clarin.ids-mannheim.de/digibibsru	text corpus	German	online, level 0, sample query, experimental
Uni Leipzig	Leipzig Corpora Collection, Deutscher Wortschatz	http://clarinws.informatik.uni-leipzig.de:8080/CQL	text corpus	multiple languages	online, level 1, sample query, WIP

List of corpora per endpoint.

MPI for Psycholinguistics (cqlservlet.mpi.nl)

CGN (metadata, design): First language speakers (Dutch), spoken, 9 million tokens
ESF (metadata, design): Second language learners (Dutch, German, English, French, Swedish), spoken, at least 500,000 tokens
IFA (metadata, design): Dutch (hand-segmented), spoken, 50,000 tokens
Childes (metadata, design): Child language of Talkbank, spoken, about 43 languages, 28,000 CHAT files, millions of tokens
Talkbank (metadata, design): Spoken language, about 37 languages, 14,000 CHAT files, a few million tokens

Candidate Services

Provider	DB/Collection	link	type	language	status
OTA	BNC ?	http://ota.oucs.ox.ac.uk/	text corpus	English	potential
OTA	TRACTOR-archive		text corpus?	Central and East EU	potential
BAS	Speech Corpora		spoken corpora	German	planned
Gothenburg Uni	Spraakbanken	http://litteraturbanken.se/	literary texts	Swedish?	planned?
Gothenburg Uni	Spraakbanken	http://demosb.spraakdata.gu.se/korp/	text corpus	Swedish?	planned?
UPF	El Pais Newspaper Corpus (Metadata)	http://gilmere.upf.edu/girona_sru	text corpus	Catalan	only metadata (date, lang), sample query?
ICLTT	CMDI-MDService	http://clarin.aac.ac.at/MDService2/sru	Metadata	multiple langs	online, but very sloppy, too different

The following are some reference SRU services:

Provider	endpoint	info
Library of Congress	http://z3950.loc.gov:7090/voyager	?
Oxford English Dictionary	http://www.oed.com/srupage	info about OED-SRU service
Gutenberg (metadata) provided by indexdata	http://opencontent.indexdata.com/gutenberg	project Open Content

For DE endpoint candidates, see also: http://www.clarin-d.de/mwiki/index.php/Corpora_f%C3%BCr_den_Federated_Content_Search

An SRU Server Tester for testing basic protocol conformance is available at: http://alcme.oclc.org/srw/SRUServerTester.html
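
Alongside the tester, the three SRU operations can be tried by hand with plain HTTP GET requests. The sketch below only builds the request URLs. The parameter names `operation`, `version`, `query`, `scanClause` and `maximumRecords` are standard SRU. The function names, the default of SRU version 1.2, the search term `Haus` and the comma-separated value format for `x-cmd-collections` are assumptions made for illustration.

```python
from urllib.parse import urlencode

SRU_VERSION = "1.2"  # assumption: the endpoints speak SRU 1.2


def sru_url(base_url, operation, params=None):
    """Build an SRU GET request URL for the given operation."""
    query = {"operation": operation, "version": SRU_VERSION}
    query.update(params or {})
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(query)


def explain_url(base_url):
    return sru_url(base_url, "explain")


def scan_url(base_url, scan_clause):
    return sru_url(base_url, "scan", {"scanClause": scan_clause})


def search_url(base_url, cql_query, maximum_records=10, collections=None):
    params = {"query": cql_query, "maximumRecords": maximum_records}
    if collections:
        # hypothetical value format: comma-separated collection identifiers
        params["x-cmd-collections"] = ",".join(collections)
    return sru_url(base_url, "searchRetrieve", params)


if __name__ == "__main__":
    base = "http://cqlservlet.mpi.nl/"
    print(explain_url(base))
    print(scan_url(base, "cmd.collections"))
    print(search_url(base, "Haus"))  # "Haus" is an arbitrary example term
```

Pasting the printed URLs into a browser is usually enough to see whether an endpoint answers at all and what kind of XML it returns.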

Current Issues

At the time of writing (3rd of May), all of the SRU/CQL implementations have some issues. They are listed here per service.

http://www.meertens.knaw.nl/mimore/srucql/

Does not implement x-cmd-collections
<sru:recordData/> is an empty (self-closing) element instead of wrapping the <css:Resource>
How to generate link to resource containing hit?
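
The empty `recordData` issue above can be detected mechanically. The following sketch counts `recordData` elements in a searchRetrieve response that have neither text nor child elements. It assumes the SRU 1.1/1.2 response namespace `http://www.loc.gov/zing/srw/`. The function name and the sample response are made up for illustration and are not taken from the mimore endpoint.

```python
import xml.etree.ElementTree as ET

SRU_NS = "http://www.loc.gov/zing/srw/"  # assumption: SRU 1.1/1.2 namespace


def empty_record_data(response_xml):
    """Return (empty, total) counts of recordData elements in a response."""
    root = ET.fromstring(response_xml)
    elements = root.findall(".//{%s}recordData" % SRU_NS)
    empty = [
        el for el in elements
        if len(el) == 0 and not (el.text or "").strip()
    ]
    return len(empty), len(elements)


# Invented sample: one empty recordData, one that wraps its content.
SAMPLE = """<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:records>
    <sru:record><sru:recordData/></sru:record>
    <sru:record><sru:recordData><hit>wrapped</hit></sru:recordData></sru:record>
  </sru:records>
</sru:searchRetrieveResponse>"""

if __name__ == "__main__":
    empty, total = empty_record_data(SAMPLE)
    print("%d of %d recordData elements are empty" % (empty, total))
```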

http://cqlservlet.mpi.nl/

Content in the DataView element is not XML-ified. It should at least wrap elements around the hit.

http://corpus3.aac.ac.at/ddconsru

Does not implement scan on cmd.collections
How to generate link to resource containing hit?

http://gilmere.upf.edu/girona_sru

Encodes the searchRetrieve response in DCU format (and not our format)
Does not seem to use x-cmd-collections

http://clarin.aac.ac.at/MDService2/sru

searchRetrieve returns a completely different XML structure.
scan on cmd.collections generates a NullPointerException

http://ds.dev.clarin.inl.nl/cqlwebapp/cql

Doesn't seem to return any results.

Requirements

Monitoring: an associated service must regularly check the availability of the services (and perhaps more). A simple Nagios plugin is available.
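
A minimal sketch of such a check could look like the following. It follows the Nagios plugin convention of exit codes 0 (OK), 1 (WARNING) and 2 (CRITICAL). This is not the plugin mentioned above. The function names, the injectable `fetch` callable and the rule that an explain request must return XML whose root element is `explainResponse` are assumptions.

```python
import sys
import urllib.request
import xml.etree.ElementTree as ET

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes


def http_fetch(url, timeout=10):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def check_endpoint(base_url, fetch=http_fetch):
    """Return (exit_code, message) for one SRU endpoint."""
    separator = "&" if "?" in base_url else "?"
    url = base_url + separator + "operation=explain&version=1.2"
    try:
        body = fetch(url)
    except OSError as error:  # URLError and timeouts are OSError subclasses
        return CRITICAL, "CRITICAL - %s unreachable: %s" % (base_url, error)
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return WARNING, "WARNING - %s did not return well-formed XML" % base_url
    # Compare the local name only, so the namespace does not matter.
    if root.tag.rsplit("}", 1)[-1] != "explainResponse":
        return WARNING, "WARNING - %s returned <%s>" % (base_url, root.tag)
    return OK, "OK - %s answered the explain request" % base_url


if __name__ == "__main__" and len(sys.argv) == 2:
    code, message = check_endpoint(sys.argv[1])
    print(message)
    sys.exit(code)
```

Passing `fetch` as a parameter keeps the check testable without network access. A deeper check (the "perhaps more") could additionally issue a scan or searchRetrieve request and inspect the result.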

Ideas on implementation of the Registry

As the list above already suggests, a simple list of endpoints will not be enough. Even if we rely as much as possible on auto-configuration based on the explain record, additional information (metadata) needs to be stored and provided.
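
To illustrate what auto-configuration from the explain record could yield, the sketch below pulls the database title and the index names out of an explain response. It assumes the record uses the ZeeRex schema (namespace `http://explain.z3950.org/dtd/2.0/`), the usual explain format in SRU. The sample record, including its title and its `cql.serverChoice` index, is invented and not taken from any of the listed endpoints.

```python
import xml.etree.ElementTree as ET

ZR = "{http://explain.z3950.org/dtd/2.0/}"  # assumption: ZeeRex 2.0 explain


def summarize_explain(explain_xml):
    """Extract the database title and index names from an explain response."""
    root = ET.fromstring(explain_xml)
    title = root.find(".//%sdatabaseInfo/%stitle" % (ZR, ZR))
    indexes = []
    for index in root.iter(ZR + "index"):
        # ZeeRex layout: <index><map><name set="...">...</name></map></index>
        for name in index.iter(ZR + "name"):
            set_name = name.get("set")
            indexes.append("%s.%s" % (set_name, name.text) if set_name else name.text)
    return {
        "title": title.text if title is not None else None,
        "indexes": indexes,
    }


# Invented sample record, for illustration only.
SAMPLE = """<sru:explainResponse xmlns:sru="http://www.loc.gov/zing/srw/"
                     xmlns:zr="http://explain.z3950.org/dtd/2.0/">
  <sru:record><sru:recordData>
    <zr:explain>
      <zr:databaseInfo><zr:title>Example corpus</zr:title></zr:databaseInfo>
      <zr:indexInfo>
        <zr:index><zr:title>Full text</zr:title>
          <zr:map><zr:name set="cql">serverChoice</zr:name></zr:map>
        </zr:index>
      </zr:indexInfo>
    </zr:explain>
  </sru:recordData></sru:record>
</sru:explainResponse>"""

if __name__ == "__main__":
    print(summarize_explain(SAMPLE))
```

Properties such as resource type or language coverage are not reliably derivable this way, which is why the additional metadata is needed.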

So one obvious choice would be to define a CMD-Profile for the Repositories. This profile would carry only the minimal necessary, primarily technical, information. In particular, it should not describe the data provided, but rather link to separate MDRecords of a collection or a resource, which would provide information like ResourceType, Language, available AnnotationTiers, Time/Space Coverage etc.

We also have to be aware of the large semantic/functional overlap between such a Repository-MDRecord and the SRU explain response of a given service. TODO: clarify the interaction and dependencies.

The following is a proposed sample Repository instance:

 <CMD>
   <Resources>
     <ResourceProxyList> <!-- should refer to collections or resources that are reachable by given Repository. -->
       <ResourceProxy id="mpi-subcorpus">
         <ResourceType>Metadata</ResourceType>
         <ResourceRef>{collection-handle}</ResourceRef>
       </ResourceProxy>
     </ResourceProxyList>
   </Resources>
   <Components>
     <Repository>
       <GeneralInfo>
         <ID>?</ID>
         <Name>MPI corpus</Name>
         <Description>MPI Corpora: ESF, CGN, ...</Description>
       </GeneralInfo>
       <Endpoint>
         <URL>http://cqlservlet.mpi.nl/</URL>
         <type>SRU</type>
         <Views>
           <view>text</view>
         </Views>
       </Endpoint>
       <Endpoint>
         <URL>http://corpus1.mpi.nl/ds/imdi_browser/</URL>
         <type>WebApp/User Interface</type>
       </Endpoint>
     </Repository>
   </Components>
 </CMD>
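
To show how a client might consume such a record, the sketch below reads the endpoints out of a Repository instance shaped like the proposal above. The element names come from the sample. The function names and the choice to return plain dictionaries are assumptions. The sketch also assumes the record carries no XML namespaces, as in the sample; a real CMDI record would, and the lookups would need adjusting.

```python
import xml.etree.ElementTree as ET


def list_endpoints(cmd_xml):
    """Return one dict per <Endpoint> in a Repository record."""
    root = ET.fromstring(cmd_xml)
    endpoints = []
    for endpoint in root.iter("Endpoint"):
        endpoints.append({
            "url": endpoint.findtext("URL"),
            "type": endpoint.findtext("type"),
            "views": [view.text for view in endpoint.iter("view")],
        })
    return endpoints


def sru_endpoints(cmd_xml):
    """Keep only the URLs of endpoints whose type is SRU."""
    return [e["url"] for e in list_endpoints(cmd_xml) if e["type"] == "SRU"]


# Shortened copy of the proposed sample record.
SAMPLE = """<CMD><Components><Repository>
  <Endpoint>
    <URL>http://cqlservlet.mpi.nl/</URL>
    <type>SRU</type>
    <Views><view>text</view></Views>
  </Endpoint>
  <Endpoint>
    <URL>http://corpus1.mpi.nl/ds/imdi_browser/</URL>
    <type>WebApp/User Interface</type>
  </Endpoint>
</Repository></Components></CMD>"""

if __name__ == "__main__":
    print(sru_endpoints(SAMPLE))
```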

indexdata solution

The Indexdata/Masterkey framework (from which we are examining the Pazpar2 component as an aggregator) provides IRSpy (GPL) and Torus (seems proprietary).

Further (future) Issues

Integration with MD-search
Integration with VLO
Integration with the Virtual Collection Registry
Aggregator/endpoint.

We note that all three will ideally be integrated in the same manner. For this to happen there are a few conditions that MUST be met:

All centres generate well-formed, valid CMDI
This CMDI is of the same granularity as the search constrainability in the endpoint
Everyone includes in the CMDI files a unique set identifier (e.g., the node-id/hdl.net mapping at the MPI).
The endpoints understand the unique identifiers from their corresponding CMDI files!
