wiki:FCS-Endpoints

Version 26 (modified by akislev, 12 years ago)

Repository Registry

We need a service/registry that informs about the available repositories.

Until we have that formalized, we will collect the information about individual providers here.

Candidate Services, Implementing Endpoints

Here is a preliminary list of services available (or planned) within the FederatedSearch - EDC:

Provider | DB/Collection | link | type | language | status
KNAW | mimore | http://www.meertens.knaw.nl/mimore/srucql/ | dialects Lexicon? | Dutch dialects? | online, level 0, open issues, sample query
MPI | IMDI-subset | http://cqlservlet.mpi.nl/ | spoken corpus | Dutch? | online, level 0, open issues
INL | INL? | http://ds.dev.clarin.inl.nl/cqlwebapp/cql | text corpus? | Dutch? | online, level 0, open issues, sample query
DANS | Lieffering | http://srucql.dans.knaw.nl | historical text corpus | Dutch, French | online, level 1, no known issues, sample query
ICLTT | C4 | http://corpus3.aac.ac.at/ddconsru | historical text corpus | German | online, level 0, open issues, sample query
UPF | El Pais Newspaper Corpus (Metadata) | http://gilmere.upf.edu/girona_sru | text corpus | Catalan | only metadata (date, lang), sample query?
UPF | El Pais Newspaper Corpus 2005 | http://gilmere.upf.edu/pais_sru | text corpus | Spanish | online, allows index-based queries (e.g. queries only on content ccs.content=... sample query)
Uni Tübingen | Tübingen Baumbank des Deutschen - Diachrones Corpus | http://weblicht.sfs.uni-tuebingen.de/rws/cqp-ws/cqp/tueba-ddc | text corpus | German | online, level 0, sample query
OTA | BNC ? | http://ota.oucs.ox.ac.uk/ | text corpus | English | potential
OTA | TRACTOR-archive | | text corpus? | Central and East EU | potential
BAS | Speech Corpora | | spoken corpora | German | planned
IDS | Goethe | http://clarin.ids-mannheim.de/cosmassru | text corpus | German | online, level 0, sample query
IDS | TextGrid Digital Library (Literature folder) | http://clarin.ids-mannheim.de/digibibsru | text corpus | German | online, level 0, sample query, experimental
Gothenburg Uni | Spraakbanken | http://litteraturbanken.se/ | literary texts | Swedish? | planned?
Gothenburg Uni | Spraakbanken | http://demosb.spraakdata.gu.se/korp/ | text corpus | Swedish? | planned?
ICLTT | CMDI-MDService | http://clarin.aac.ac.at/MDService2/sru | Metadata | multiple langs | online, but very sloppy, too different
Uni Leipzig | Leipzig Corpora Collection, Deutscher Wortschatz | http://clarinws.informatik.uni-leipzig.de:8080/CQL | text corpus | multiple languages | online, level 1, sample query, WIP

The following are some reference SRU services:

Provider | endpoint | info
Library of Congress | http://z3950.loc.gov:7090/voyager | ?
Oxford English Dictionary | http://www.oed.com/srupage | info about OED-SRU service
Gutenberg (metadata) provided by indexdata | http://opencontent.indexdata.com/gutenberg | project Open Content

An SRU Server Tester for checking basic protocol conformance is available at: http://alcme.oclc.org/srw/SRUServerTester.html
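
For quick manual checks against the endpoints listed above, request URLs can also be assembled by hand. The following is a minimal sketch, not part of any agreed tooling: the helper name `sru_url` is hypothetical, and SRU version 1.2 is an assumption (this page does not state which version the endpoints implement). The operation and parameter names (`explain`, `searchRetrieve`, `scan`, `query`, `maximumRecords`, `scanClause`) are the standard SRU ones.

```python
from urllib.parse import urlencode


def sru_url(base_url, operation, params=None, version="1.2"):
    """Build an SRU request URL (hypothetical helper; version 1.2 is assumed)."""
    query = {"operation": operation, "version": version}
    query.update(params or {})
    # Append with '&' if the base URL already carries a query string.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(query)


BASE = "http://cqlservlet.mpi.nl/"

explain_url = sru_url(BASE, "explain")
search_url = sru_url(BASE, "searchRetrieve", {"query": "Haus", "maximumRecords": 10})
# Scan on the cmd.collections index, as mentioned in the issue lists on this page.
scan_url = sru_url(BASE, "scan", {"scanClause": "cmd.collections"})

print(explain_url)
print(search_url)
print(scan_url)
```

The search term `Haus` is only a placeholder; whether a given endpoint accepts a bare term or requires an index (such as `ccs.content=...`) depends on the service.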

Current Issues

At the time of writing (3rd of May) there are some issues with all of the SRU/CQL implementations. We list them here per service.

http://www.meertens.knaw.nl/mimore/srucql/

  • Does not implement x-cmd-collections
  • <sru:recordData/> is an empty (self-closed) element instead of wrapping the <css:Resource>
  • How to generate link to resource containing hit?

http://cqlservlet.mpi.nl/

  • The content of the DataView element is not XML-ified. The hit should at least be wrapped in elements.

http://corpus3.aac.ac.at/ddconsru

  • Does not implement scan on cmd.collections
  • How to generate link to resource containing hit?

http://gilmere.upf.edu/girona_sru

  • Encodes the searchRetrieve response in DCU format (and not our format)
  • Does not seem to use x-cmd-collections

http://clarin.aac.ac.at/MDService2/sru

  • searchRetrieve returns a completely different XML structure.
  • scan on cmd.collections generates a NullPointerException

http://ds.dev.clarin.inl.nl/cqlwebapp/cql

  • Doesn't seem to return any results.
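
Some of the issues above can be detected mechanically. As an illustration, the sketch below flags records whose recordData element is empty, which is the problem noted for the mimore endpoint. It assumes the SRU 1.1/1.2 response namespace `http://www.loc.gov/zing/srw/`; the function name `empty_record_data` and the un-namespaced `<Resource/>` placeholder in the sample are hypothetical and do not reflect the actual response format of any listed service.

```python
import xml.etree.ElementTree as ET

SRU_NS = "http://www.loc.gov/zing/srw/"  # SRU 1.1/1.2 response namespace (assumed)


def empty_record_data(response_xml):
    """Return 1-based positions of records whose recordData has no child element."""
    root = ET.fromstring(response_xml)
    empty = []
    for position, record in enumerate(root.iter("{%s}record" % SRU_NS), start=1):
        data = record.find("{%s}recordData" % SRU_NS)
        # A missing recordData counts as empty as well.
        if data is None or len(data) == 0:
            empty.append(position)
    return empty


# Hand-written sample response, not captured from a real endpoint.
SAMPLE = """<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:records>
    <sru:record><sru:recordData/></sru:record>
    <sru:record><sru:recordData><Resource/></sru:recordData></sru:record>
  </sru:records>
</sru:searchRetrieveResponse>"""

print(empty_record_data(SAMPLE))  # [1]
```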

Requirements

Monitoring
An associated service must regularly check the availability of the services (and perhaps more).
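
A minimal sketch of such an availability check follows. It is an assumption about how monitoring could work, not an agreed design: each endpoint is probed with an SRU explain request (version 1.2 assumed), and anything other than an HTTP 200 answer counts as unavailable. The names `http_fetch` and `check_endpoints` are hypothetical. The fetch function is injectable so the logic can be tried without network access.

```python
import urllib.request


def http_fetch(url, timeout=10):
    """Return the HTTP status code of a GET request (raises on network errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def check_endpoints(urls, fetch=http_fetch):
    """Map each endpoint URL to 'up' or to a short 'down: ...' reason."""
    report = {}
    for url in urls:
        separator = "&" if "?" in url else "?"
        probe = url + separator + "operation=explain&version=1.2"
        try:
            status = fetch(probe)
            report[url] = "up" if status == 200 else "down: HTTP %d" % status
        except Exception as error:  # broad on purpose: any failure means unavailable
            report[url] = "down: %s" % error
    return report


# Offline demonstration with a stub in place of real HTTP requests.
def stub_fetch(url):
    if "broken" in url:
        raise OSError("timeout")
    return 200


demo = check_endpoints(
    ["http://ok.example/sru", "http://broken.example/sru"], fetch=stub_fetch
)
print(demo)
```

Calling `check_endpoints` without the `fetch` argument performs real requests. Running it on a schedule and recording the results is left out of this sketch.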

Ideas on implementation of the Registry

As the list above already suggests, a simple list of endpoints will not be enough. Even if we rely as much as possible on auto-configuration based on the explain record, additional information (metadata) needs to be stored and provided.

One obvious choice would be to define a CMD-Profile for the Repositories. This profile would carry only the minimal necessary, primarily technical, information. In particular, it should not describe the data provided, but rather link to separate MDRecords of a collection or a resource, which would provide information like ResourceType, Language, available AnnotationTiers, Time/Space Coverage etc.

We also have to beware of the large semantic/functional overlap between such a Repository-MDRecord and the SRU explain response of a given service. TODO: clarify the interaction and dependencies between the two.

The following is a proposed sample Repository instance:

 <CMD>
   <Resources>
     <ResourceProxyList> <!-- should refer to collections or resources that are reachable by given Repository. -->
       <ResourceProxy id="mpi-subcorpus">
         <ResourceType>Metadata</ResourceType>
         <ResourceRef>{collection-handle}</ResourceRef>
       </ResourceProxy>
     </ResourceProxyList>
   </Resources>
   <Components>
     <Repository>
       <GeneralInfo>
         <ID>?</ID>
         <Name>MPI corpus</Name>
         <Description>MPI Corpora: ESF, CGN, ...</Description>
       </GeneralInfo>
       <Endpoint>
         <URL>http://cqlservlet.mpi.nl/</URL>
         <type>SRU</type>
         <Views>
           <view>text</view>
         </Views>
       </Endpoint>
       <Endpoint>
         <URL>http://corpus1.mpi.nl/ds/imdi_browser/</URL>
         <type>WebApp/User Interface</type>
       </Endpoint>
     </Repository>
   </Components>
 </CMD>
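
To show how a client such as an aggregator might consume a record of this shape, here is a minimal sketch that extracts the endpoints. It assumes the element names of the proposal above (`Endpoint`, `URL`, `type`, `view`) and an un-namespaced document, as in the sample. A real CMDI record would carry a namespace, and the profile itself is still only a proposal. The function name `list_endpoints` is hypothetical.

```python
import xml.etree.ElementTree as ET


def list_endpoints(cmd_xml):
    """Return (url, type, views) tuples for every Endpoint in a Repository record."""
    root = ET.fromstring(cmd_xml)
    endpoints = []
    for endpoint in root.iter("Endpoint"):
        url = endpoint.findtext("URL", default="").strip()
        kind = endpoint.findtext("type", default="").strip()
        views = [view.text.strip() for view in endpoint.iter("view") if view.text]
        endpoints.append((url, kind, views))
    return endpoints


# Abbreviated copy of the sample record above.
RECORD = """<CMD>
  <Components>
    <Repository>
      <Endpoint>
        <URL>http://cqlservlet.mpi.nl/</URL>
        <type>SRU</type>
        <Views><view>text</view></Views>
      </Endpoint>
      <Endpoint>
        <URL>http://corpus1.mpi.nl/ds/imdi_browser/</URL>
        <type>WebApp/User Interface</type>
      </Endpoint>
    </Repository>
  </Components>
</CMD>"""

# Keep only the endpoints a federated search client could query.
sru_endpoints = [e for e in list_endpoints(RECORD) if e[1] == "SRU"]
print(sru_endpoints)
```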

indexdata solution

The Indexdata/Masterkey? framework (from which we are examining the Pazpar2 component as an aggregator) provides IRSpy (GPL) and Torus (seems proprietary).

Further (future) Issues

  • Integration with MD-search
  • Integration with VLO
  • Integration with the Virtual Collection Registry
  • Aggregator/endpoint.

We note that all three will ideally be integrated in the same manner. For this to happen there are a few conditions that MUST be met:

  • All centres generate nice CMDI
  • This CMDI has the same granularity as the search constraints the endpoint supports
  • Everyone includes a unique set identifier in the CMDI files (e.g., the node-id to hdl.net mapping used at the MPI).
  • The endpoints understand the unique identifiers from their corresponding CMDI files!