Repository Registry
We need a service/registry that provides information about the available repositories.
Until we have that formalized, we will collect the information about individual providers here.
Candidate Services, Implementing Endpoints
Here is a preliminary list of the services available (or planned) within the FederatedSearch - EDC:
Provider | DB/Collection | link | type | language | status |
---|---|---|---|---|---|
KNAW | mimore | http://www.meertens.knaw.nl/mimore/srucql/ | dialects Lexicon? | Dutch dialects? | online, level 0, open issues, sample query |
MPI | IMDI-subset | http://cqlservlet.mpi.nl/ | spoken corpus | Dutch? | online, level 0, open issues |
INL | INL? | http://ds.dev.clarin.inl.nl/cqlwebapp/cql | text corpus? | Dutch? | online, level 0, open issues, sample query |
DANS | Lieffering | http://srucql.dans.knaw.nl | historical text corpus | Dutch, French | Online, level 1, no known issues, sample query |
ICLTT | C4 | http://corpus3.aac.ac.at/ddconsru | historical text corpus | German | online, level 0, open issues, sample query |
UPF | El Pais Newspaper Corpus (Metadata) | http://gilmere.upf.edu/girona_sru | text corpus | Catalan | only metadata (date, lang), sample query? |
UPF | El Pais Newspaper Corpus 2005 | http://gilmere.upf.edu/pais_sru | text corpus | Spanish | online, allows index-based queries (e.g. queries only on content ccs.content=... sample query) |
Uni Tübingen | Tübingen Baumbank des Deutschen - Diachrones Corpus | http://weblicht.sfs.uni-tuebingen.de/rws/cqp-ws/cqp/tueba-ddc | text corpus | German | online, level 0, sample query |
OTA | BNC ? | http://ota.oucs.ox.ac.uk/ | text corpus | English | potential |
OTA | TRACTOR-archive | | text corpus? | Central and East EU | potential |
BAS | Speech Corpora | | spoken corpora | German | planned |
IDS | Goethe | http://clarin.ids-mannheim.de/cosmassru | text corpus | German | online, level 0, sample query |
IDS | TextGrid Digital Library (Literature folder) | http://clarin.ids-mannheim.de/digibibsru | text corpus | German | online, level 0, sample query, experimental |
Gothenburg Uni | Spraakbanken | http://litteraturbanken.se/ | literary texts | Swedish? | planned? |
Gothenburg Uni | Spraakbanken | http://demosb.spraakdata.gu.se/korp/ | text corpus | Swedish? | planned? |
ICLTT | CMDI-MDService | http://clarin.aac.ac.at/MDService2/sru | Metadata | multiple langs | online, but very sloppy, too different |
Uni Leipzig | Leipzig Corpora Collection, Deutscher Wortschatz | http://clarinws.informatik.uni-leipzig.de:8080/CQL | text corpus | multiple languages | online, level 1, sample query, WIP |
The following are some reference SRU services:
Provider | endpoint | info |
---|---|---|
Library of Congress | http://z3950.loc.gov:7090/voyager | ? |
Oxford English Dictionary | http://www.oed.com/srupage | info about OED-SRU service |
Gutenberg (metadata) provided by indexdata | http://opencontent.indexdata.com/gutenberg | project Open Content |
An SRU Server Tester for testing basic protocol conformance is available at: http://alcme.oclc.org/srw/SRUServerTester.html
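For quick manual checks against the endpoints listed above, the request URLs can be assembled by hand. The following is a minimal sketch, assuming the endpoints accept the standard SRU parameters `operation`, `version`, `query`, `maximumRecords` and `scanClause`, and assuming SRU version 1.2; the helper name `sru_url` and the query term `water` are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode


def sru_url(endpoint, operation, version="1.2", **params):
    """Build an SRU request URL (hypothetical helper; version 1.2 is an assumption)."""
    query = {"operation": operation, "version": version}
    query.update(params)
    # Append to an existing query string if the endpoint already has one.
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode(query)


# explain: ask the service to describe itself
explain_url = sru_url("http://cqlservlet.mpi.nl/", "explain")

# searchRetrieve: a simple term query ("water" is an arbitrary example term)
search_url = sru_url("http://cqlservlet.mpi.nl/", "searchRetrieve",
                     query="water", maximumRecords=5)

# scan: list the values of the cmd.collections index
scan_url = sru_url("http://corpus3.aac.ac.at/ddconsru", "scan",
                   scanClause="cmd.collections")

print(explain_url)
print(search_url)
print(scan_url)
```

The resulting URLs can be pasted into a browser or passed to any HTTP client; whether a given endpoint actually honours each operation is exactly what the issue list on this page tracks.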
Current Issues
At the time of writing (3rd of May) there are some issues with all of the SRU/CQL implementations. We list them here per service.
http://www.meertens.knaw.nl/mimore/srucql/
- Does not implement x-cmd-collections
- <sru:recordData/> is a closed element instead of wrapping around the <css:Resource>
- How to generate link to resource containing hit?
http://cqlservlet.mpi.nl/
- The content of the DataView element is not XML-ified. It should at least have wrapping elements around the hit.
http://corpus3.aac.ac.at/ddconsru
- Does not implement scan on cmd.collections
- How to generate link to resource containing hit?
http://gilmere.upf.edu/girona_sru
- Encodes the searchRetrieve response in DCU format (and not our format)
- Does not seem to use x-cmd-collections
http://clarin.aac.ac.at/MDService2/sru
- searchRetrieve returns a completely different XML structure.
- scan on cmd.collections generates a NullPointerException
http://ds.dev.clarin.inl.nl/cqlwebapp/cql
- Doesn't seem to return any results.
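Some of the issues above, such as an empty `<sru:recordData/>` or a response without any results, can be spotted mechanically. The following is a minimal sketch, assuming the response uses the SRU 1.1/1.2 namespace `http://www.loc.gov/zing/srw/`; the function name, the sample response and its un-namespaced `Resource` element are made up for illustration and do not reproduce any provider's actual output.

```python
import xml.etree.ElementTree as ET

# Namespace of SRU 1.1/1.2 responses (assumed to be what the endpoints return).
SRU_NS = "http://www.loc.gov/zing/srw/"


def empty_record_positions(response_xml):
    """Return the 1-based positions of records whose recordData is empty."""
    root = ET.fromstring(response_xml)
    empty = []
    for index, data in enumerate(root.iter("{%s}recordData" % SRU_NS), start=1):
        has_children = len(data) > 0
        has_text = bool((data.text or "").strip())
        if not (has_children or has_text):
            empty.append(index)
    return empty


# Hypothetical response: the first record shows the "closed element" problem.
SAMPLE = """<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:records>
    <sru:record><sru:recordData/></sru:record>
    <sru:record><sru:recordData><Resource>hit</Resource></sru:recordData></sru:record>
  </sru:records>
</sru:searchRetrieveResponse>"""

print(empty_record_positions(SAMPLE))
```

A response for which the function returns an empty list but which also contains no records at all would correspond to the "no results" case and needs a separate check on the number of `recordData` elements.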
Requirements
- Monitoring
- an associated service must regularly check the availability of the services (and perhaps more than just availability)
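As an illustration of the monitoring requirement, the following is a minimal sketch of an availability check, assuming that a successful `explain` request (HTTP 200) is an acceptable signal that a service is up. The function names, the status strings and the use of SRU version 1.2 are assumptions made for this example; scheduling the check regularly (cron or similar) is left out.

```python
from urllib.error import URLError
from urllib.request import urlopen


def default_fetch(url, timeout=10):
    """Fetch a URL and return the HTTP status code."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def check_endpoints(endpoints, fetch=default_fetch):
    """Probe each endpoint with an explain request and report its state.

    `fetch` is injectable so the check can be exercised without network access.
    """
    report = {}
    for url in endpoints:
        separator = "&" if "?" in url else "?"
        probe = url + separator + "operation=explain&version=1.2"
        try:
            status = fetch(probe)
            report[url] = "online" if status == 200 else "error (HTTP %s)" % status
        except (URLError, OSError) as exc:
            # Covers DNS failures, refused connections, timeouts and HTTP errors.
            report[url] = "unreachable (%s)" % exc.__class__.__name__
    return report
```

Calling `check_endpoints(["http://cqlservlet.mpi.nl/", "http://srucql.dans.knaw.nl"])` would return a dictionary mapping each endpoint to its state. A fuller monitor would also validate the explain record itself, which is presumably what "more than just availability" hints at.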
Ideas on implementation of the Registry
As the above list already suggests, a simple list of endpoints will not be enough. Even if we rely as much as possible on auto-configuration based on the explain record, additional information (metadata) needs to be stored and provided.
So one obvious choice would be to define a CMD-Profile for the Repositories.
This profile would carry only the minimal necessary, primarily technical information. In particular it shouldn't provide any information about the data provided, but rather link to separate MDRecords of a collection or a resource, which would provide information like ResourceType, Language, available AnnotationTiers, Time/Space Coverage etc.
We also have to beware of a large semantic/functional overlap between such a Repository-MDRecord and the SRU-explain-response of a given service. TODO: clarify the interaction and dependencies.
The following is a proposal for a sample Repository-instance:
    <CMD>
      <Resources>
        <ResourceProxyList>
          <!-- should refer to collections or resources that are reachable by given Repository. -->
          <ResourceProxy id="mpi-subcorpus">
            <ResourceType>Metadata</ResourceType>
            <ResourceRef>{collection-handle}</ResourceRef>
          </ResourceProxy>
        </ResourceProxyList>
      </Resources>
      <Components>
        <Repository>
          <GeneralInfo>
            <ID>?</ID>
            <Name>MPI corpus</Name>
            <Description>MPI Corpora: ESF, CGN, ...</Description>
          </GeneralInfo>
          <Endpoint>
            <URL>http://cqlservlet.mpi.nl/</URL>
            <type>SRU</type>
            <Views>
              <view>text</view>
            </Views>
          </Endpoint>
          <Endpoint>
            <URL>http://corpus1.mpi.nl/ds/imdi_browser/</URL>
            <type>WebApp/User Interface</type>
          </Endpoint>
        </Repository>
      </Components>
    </CMD>
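To show how a client such as an aggregator might consume a record like the one proposed above, here is a minimal sketch that extracts the endpoints from it. It assumes the element names of the proposal exactly as written and no XML namespaces (a real CMDI record would declare namespaces); the function name `list_endpoints` is hypothetical, and the embedded record is an abridged copy of the proposal.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the proposed Repository instance (no namespaces, as in the proposal).
RECORD = """<CMD>
  <Components>
    <Repository>
      <GeneralInfo>
        <Name>MPI corpus</Name>
      </GeneralInfo>
      <Endpoint>
        <URL>http://cqlservlet.mpi.nl/</URL>
        <type>SRU</type>
        <Views><view>text</view></Views>
      </Endpoint>
      <Endpoint>
        <URL>http://corpus1.mpi.nl/ds/imdi_browser/</URL>
        <type>WebApp/User Interface</type>
      </Endpoint>
    </Repository>
  </Components>
</CMD>"""


def list_endpoints(cmd_xml):
    """Return (repository name, [(url, type, views), ...]) from a Repository record."""
    root = ET.fromstring(cmd_xml)
    repository = root.find("./Components/Repository")
    name = repository.findtext("./GeneralInfo/Name")
    endpoints = []
    for endpoint in repository.findall("Endpoint"):
        views = [view.text for view in endpoint.findall("./Views/view")]
        endpoints.append((endpoint.findtext("URL"), endpoint.findtext("type"), views))
    return name, endpoints


name, endpoints = list_endpoints(RECORD)
# Keep only the endpoints a federated search client could query directly.
sru_endpoints = [url for url, kind, _ in endpoints if kind == "SRU"]
print(name, sru_endpoints)
```

Filtering on the `type` element is what lets the same record describe both the searchable SRU endpoint and the human-facing web application.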
indexdata solution
The Indexdata/Masterkey framework (from which we are examining the Pazpar2 component as an aggregator) provides IRSpy (GPL) and Torus (seems proprietary).
Further (future) Issues
- Integration with MD-search
- Integration with VLO
- Integration with the Virtual Collection Registry
- Aggregator/endpoint.
We note that all three will ideally be integrated in the same manner. For this to happen there are a few conditions that MUST be met:
- All centres generate nice CMDI
- This CMDI is of the same granularity as the search constrainability in the endpoint
- Everyone includes in the CMDI files a unique set identifier (e.g., the node-id mapping hdl.net thing at the MPI).
- The endpoints understand the unique identifiers from their corresponding CMDI files!