Repository Registry
We need a service/registry that informs about the available repositories.
Until we have that formalized, we will collect the information about individual providers here.
Implementing Endpoints
Here is a preliminary list of the services available (or planned) within the FederatedSearch - EDC:
Provider | DB/Collection | link | type | language | status |
---|---|---|---|---|---|
KNAW | mimore | http://www.meertens.knaw.nl/mimore/srucql/ | dialects Lexicon? | Dutch dialects | online, level 0, open issues, sample query |
MPI | IMDI-subset | http://cqlservlet.mpi.nl/ | spoken corpora | Dutch, German, English, French, Swedish | online, level 0, open issues |
INL | INL? | http://gysseling.corpus.taalbanknederlands.inl.nl/cqlwebapp/cql | text corpus | historic Dutch | online, level 0, open issues, sample query |
DANS | Lieffering | http://srucql.dans.knaw.nl | historical text corpus | Dutch, French | online, level 1, no known issues, sample query |
ICLTT | C4 | http://corpus3.aac.ac.at/ddconsru | historical text corpus | German | online, level 0, open issues, sample query |
UPF | El Pais Newspaper Corpus 2005 | http://gilmere.upf.edu/pais_sru | text corpus | Spanish | online, allows index-based queries (e.g. queries only on content: ccs.content=..., sample query) |
Uni Tübingen | Tübingen Baumbank des Deutschen - Diachrones Corpus | http://weblicht.sfs.uni-tuebingen.de/rws/cqp-ws/cqp/sru | text corpus | German | online, level 0, sample query |
IDS | Goethe | http://clarin.ids-mannheim.de/cosmassru | text corpus | German | online, level 0, sample query |
IDS | TextGrid Digital Library (Literature folder) | http://clarin.ids-mannheim.de/digibibsru | text corpus | German | online, level 0, sample query, experimental |
Uni Leipzig | Leipzig Corpora Collection, Deutscher Wortschatz | http://clarinws.informatik.uni-leipzig.de:8080/CQL | text corpus | multiple languages | online, level 1, sample query, WIP |
List of corpora per endpoint.
MPI for Psycholinguistics (cqlservlet.mpi.nl)
- CGN (metadata, design): First language speakers (Dutch), spoken, 9 mio tokens
- ESF (metadata, design): Second language learners (Dutch, German, English, French, Swedish), spoken, at least 500.000 tokens
- IFA (metadata, design): Dutch (hand-segmented), spoken, 50.000 tokens
- Childes (metadata, design): Child language of talkbank, spoken, about 43 languages, 28.000 CHAT files, millions of tokens
- Talkbank (metadata, design): spoken language, about 37 languages, 14.000 CHAT files, a few million tokens
Candidate Services
Provider | DB/Collection | link | type | language | status |
---|---|---|---|---|---|
OTA | BNC ? | http://ota.oucs.ox.ac.uk/ | text corpus | English | potential |
OTA | TRACTOR-archive | | text corpus? | Central and East EU | potential |
BAS | Speech Corpora | | spoken corpora | German | planned |
Gothenburg Uni | Spraakbanken | http://litteraturbanken.se/ | literary texts | Swedish? | planned? |
Gothenburg Uni | Spraakbanken | http://demosb.spraakdata.gu.se/korp/ | text corpus | Swedish? | planned? |
UPF | El Pais Newspaper Corpus (Metadata) | http://gilmere.upf.edu/girona_sru | text corpus | Catalan | only metadata (date, lang), sample query? |
ICLTT | CMDI-MDService | http://clarin.aac.ac.at/MDService2/sru | Metadata | multiple langs | online, but very sloppy, too different |
The following are some reference SRU services:
Provider | endpoint | info |
---|---|---|
Library of Congress | http://z3950.loc.gov:7090/voyager | ? |
Oxford English Dictionary | http://www.oed.com/srupage | info about OED-SRU service |
Gutenberg (metadata) provided by indexdata | http://opencontent.indexdata.com/gutenberg | project Open Content |
See also for DE endpoint candidates: http://www.clarin-d.de/mwiki/index.php/Corpora_f%C3%BCr_den_Federated_Content_Search
An SRU Server Tester for testing basic protocol conformance is available at: http://alcme.oclc.org/srw/SRUServerTester.html
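For quick manual checks against the endpoints listed above, the request URLs for the three SRU operations can be built programmatically. The following is a minimal sketch, not part of any existing tool: the function names are hypothetical, and it assumes the endpoints accept SRU version 1.2 and the standard parameters (operation, version, query, maximumRecords, scanClause).

```python
from urllib.parse import urlencode


def sru_url(base_url, operation, version="1.2", **params):
    """Build an SRU request URL for an endpoint (version 1.2 is an assumption)."""
    query = {"operation": operation, "version": version}
    query.update(params)
    # Append to an existing query string if the endpoint URL already has one.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(query)


def explain_url(base_url):
    """URL of the explain record, used for auto-configuration."""
    return sru_url(base_url, "explain")


def search_url(base_url, cql_query, maximum_records=10):
    """URL of a searchRetrieve request for a CQL query."""
    return sru_url(base_url, "searchRetrieve",
                   query=cql_query, maximumRecords=maximum_records)


def scan_url(base_url, scan_clause):
    """URL of a scan request, e.g. on cmd.collections."""
    return sru_url(base_url, "scan", scanClause=scan_clause)


if __name__ == "__main__":
    print(explain_url("http://cqlservlet.mpi.nl/"))
    print(search_url("http://srucql.dans.knaw.nl", "water", 5))  # "water" is a made-up query term
    print(scan_url("http://corpus3.aac.ac.at/ddconsru", "cmd.collections"))
```

The resulting URLs can be pasted into a browser or passed to any HTTP client. Whether a given endpoint answers them correctly is exactly what the conformance tester checks.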
Current Issues
At the time of writing (3rd of May) there are some issues with all of the SRU/CQL implementations. We list them here per service.
http://www.meertens.knaw.nl/mimore/srucql/
- Does not implement x-cmd-collections
- <sru:recordData/> is a closed (empty) element instead of wrapping the <css:Resource>
- How to generate link to resource containing hit?
http://cqlservlet.mpi.nl/
- The content of the DataView element is not XML-ified. The hit should at least be wrapped in elements.
http://corpus3.aac.ac.at/ddconsru
- Does not implement scan on cmd.collections
- How to generate link to resource containing hit?
http://gilmere.upf.edu/girona_sru
- Encodes the searchRetrieve response in DCU format (and not our format)
- Does not seem to use x-cmd-collections
http://clarin.aac.ac.at/MDService2/sru
- searchRetrieve returns a completely different XML structure.
- scan on cmd.collections generates a NullPointerException
http://ds.dev.clarin.inl.nl/cqlwebapp/cql
- Doesn't seem to return any results.
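Some of the issues listed in this section can be detected automatically. As an illustration, the sketch below flags records whose recordData element is empty, which is the problem noted for the mimore endpoint. It assumes the response uses the SRU 1.1/1.2 namespace http://www.loc.gov/zing/srw/. The function name and the sample response (including the hit element) are made up for the example.

```python
import xml.etree.ElementTree as ET

SRU_NS = "http://www.loc.gov/zing/srw/"  # assumed: SRU 1.1/1.2 response namespace


def empty_record_positions(response_xml):
    """Return 1-based positions of records whose recordData is missing or empty."""
    root = ET.fromstring(response_xml)
    empty = []
    records = root.iter("{%s}record" % SRU_NS)
    for position, record in enumerate(records, start=1):
        data = record.find("{%s}recordData" % SRU_NS)
        has_children = data is not None and len(data) > 0
        has_text = data is not None and bool((data.text or "").strip())
        if not (has_children or has_text):
            empty.append(position)
    return empty


# Hypothetical response: the first record shows the problem, the second does not.
SAMPLE = """<sru:searchRetrieveResponse xmlns:sru="http://www.loc.gov/zing/srw/">
  <sru:records>
    <sru:record><sru:recordData/></sru:record>
    <sru:record><sru:recordData><hit>example</hit></sru:recordData></sru:record>
  </sru:records>
</sru:searchRetrieveResponse>"""

if __name__ == "__main__":
    print(empty_record_positions(SAMPLE))
```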
Requirements
- Monitoring
- An associated service must regularly check the availability of the services (and perhaps more). A simple Nagios plugin is available.
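What such an availability check could look like is sketched below. This is not the Nagios plugin mentioned above but a hypothetical stand-alone illustration: it requests the explain record of each endpoint and treats HTTP 200 as "up". The function names, the timeout, and the use of SRU version 1.2 are assumptions.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...


def check_endpoints(endpoints, fetch_status=http_status):
    """Map each endpoint URL to "up" or "down" based on its explain request."""
    report = {}
    for url in endpoints:
        separator = "&" if "?" in url else "?"
        status = fetch_status(url + separator + "operation=explain&version=1.2")
        report[url] = "up" if status == 200 else "down"
    return report


if __name__ == "__main__":
    for url, state in check_endpoints(["http://cqlservlet.mpi.nl/"]).items():
        print(state, url)
```

The fetch_status parameter allows the network call to be replaced, for example in tests or when a different HTTP client is preferred. A real monitor would probably also validate the returned explain record rather than only the status code.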
Ideas on implementation of the Registry
As the above list already suggests, a simple list of endpoints will not be enough. Even if we rely as much as possible on auto-configuration based on the explain record, additional information (metadata) needs to be stored and provided.
So one obvious choice would be to define a CMD-Profile for the Repositories.
This profile would carry only the minimal necessary, primarily technical, information. In particular, it shouldn't provide any information about the data provided, but rather link to separate MDRecords of a collection or a resource, which would provide information like ResourceType, Language, available AnnotationTiers, Time/Space Coverage, etc.
We also have to be aware of a large semantic/functional overlap between such a Repository-MDRecord and the SRU-explain-response of a given service. TODO: clarify the interaction and dependencies.
The following is a proposal for a sample Repository-instance:
<CMD>
  <Resources>
    <ResourceProxyList>
      <!-- should refer to collections or resources that are reachable by given Repository. -->
      <ResourceProxy id="mpi-subcorpus">
        <ResourceType>Metadata</ResourceType>
        <ResourceRef>{collection-handle}</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
  <Components>
    <Repository>
      <GeneralInfo>
        <ID>?</ID>
        <Name>MPI corpus</Name>
        <Description>MPI Corpora: ESF, CGN, ...</Description>
      </GeneralInfo>
      <Endpoint>
        <URL>http://cqlservlet.mpi.nl/</URL>
        <type>SRU</type>
        <Views>
          <view>text</view>
        </Views>
      </Endpoint>
      <Endpoint>
        <URL>http://corpus1.mpi.nl/ds/imdi_browser/</URL>
        <type>WebApp/User Interface</type>
      </Endpoint>
    </Repository>
  </Components>
</CMD>
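To illustrate how an aggregator might consume such a record, here is a minimal sketch that extracts the endpoints from a Repository instance and keeps only those of type SRU. It assumes the element names of the proposal above and, like the proposal, no XML namespace (real CMDI records are namespaced, so the lookups would need adjusting). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def list_endpoints(cmd_xml):
    """Return one dict (url, type, views) per Endpoint element in the record."""
    root = ET.fromstring(cmd_xml)
    endpoints = []
    for endpoint in root.iter("Endpoint"):
        endpoints.append({
            "url": endpoint.findtext("URL", default=""),
            "type": endpoint.findtext("type", default=""),
            "views": [view.text for view in endpoint.iter("view")],
        })
    return endpoints


def sru_endpoint_urls(cmd_xml):
    """Return the URLs of all endpoints declared with type SRU."""
    return [e["url"] for e in list_endpoints(cmd_xml) if e["type"] == "SRU"]


# Abbreviated copy of the proposed record (only the Endpoint parts).
SAMPLE = """<CMD><Components><Repository>
  <Endpoint>
    <URL>http://cqlservlet.mpi.nl/</URL>
    <type>SRU</type>
    <Views><view>text</view></Views>
  </Endpoint>
  <Endpoint>
    <URL>http://corpus1.mpi.nl/ds/imdi_browser/</URL>
    <type>WebApp/User Interface</type>
  </Endpoint>
</Repository></Components></CMD>"""

if __name__ == "__main__":
    print(sru_endpoint_urls(SAMPLE))
```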
Indexdata solution
The Indexdata/Masterkey framework (from which we are examining the Pazpar2 component as an aggregator) provides IRSpy (GPL) and Torus (seems proprietary).
Further (future) Issues
- Integration with MD-search
- Integration with VLO
- Integration with the Virtual Collection Registry
- Aggregator/endpoint.
We note that all three will ideally be integrated in the same manner. For this to happen there are a few conditions that MUST be met:
- All centres generate nice CMDI
- This CMDI is of the same granularity as the search constrainability in the endpoint
- Everyone includes in the CMDI files a unique set identifier (e.g., the node-id to hdl.net mapping at the MPI).
- The endpoints understand the unique identifiers from their corresponding CMDI files!