= FCS Endpoints =

This page was started as a preliminary collection of information about individual repositories/centers/providers/endpoints within [[FederatedSearch]]. Meanwhile there is the CenterRegistry, which collects and provides the information about the available repositories (see below) and will be the sole source of information about the individual endpoints. However, since not all endpoints provide a description within the CenterRegistry yet, and also for reference, the list of endpoints below is being kept for the time being.

This page also details issues relevant to FCS-endpoints. (But see also [[FCS-specification]].)

== Announcing the endpoints ==

There are two ways of announcing an FCS-endpoint:
 1. `CenterProfile` published via the Center Registry
 1. `SearchService-ResourceProxy` in a CMD record of a resource/collection

=== !CenterProfile in the Center Registry ===

In SOA a separate registry service is required to announce the individual services/endpoints. Within CLARIN the CenterRegistry has been introduced to collect and expose organizational/administrative and technical information about individual CLARIN centers. This information is encoded as a dedicated CMD record according to the [[http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1320657629667/xsd|CenterProfile]] [`clarin.eu:cr1:p_1320657629667`]. These descriptions shall cover all aspects (especially all available services) of a given center, so the information about the FCS-endpoint is just a part of the description.

Following is a snippet of a `CenterProfile` linking to the FCS-endpoint:
{{{
#!xml
http://weblicht.sfs.uni-tuebingen.de/rws/sru/ CQL
}}}

''(Sidenote: There is a semantic/functional overlap between such a CMD record for the endpoint and the `SRU-explain` response of the given service. [[BR]]'''TODO:''' clarify the interaction and dependencies.)''

=== Discussion: replicating Metadata about resources ===

There is an ongoing discussion about how much information about the resources provided via the endpoint should be included in the endpoint description (the foremost example being the `Language`), as the resources are already described in separate CMD records. So it would be "cleaner" to just link to the separate CMD records of a collection or a resource, which would provide information like `ResourceType`, `Language`, available `AnnotationTiers`, `Time/Space Coverage` etc.

However, that would put an undue burden on the aggregator/client, which would have to crawl through and resolve multiple metadata records and also make sense of the heterogeneous structure of the CMD records to get to the required information.

Thus for now, `Language` (and possibly a few other basic fields) will be added to the endpoint description. But there are plans (and prototyping work at Meertens) to '''combine the metadata and content search''', which would allow filtering the content search by any metadata query.

=== !SearchService-!ResourceProxy ===

Another way of linking to an FCS-endpoint (bottom-up) is by adding a [[http://trac.clarin.eu/wiki/FCS-specification#ReferringtoanSRUendpointfromaCMDIfile|SearchService-ResourceProxy]] to the CMD record of any collection/resource. The obvious semantics is that the stated FCS-endpoint provides search capabilities over the given resource. However, because the !SearchService-Proxy leads just to the "nearest" endpoint, which may expose multiple resources, the endpoint has to accept the parameter `x-context` to restrict the search to the given resource. A client/aggregator invoking a search for a given resource passes the PID of the resource as the `x-context` parameter.
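
As a rough illustration of how a client might pass a PID as `x-context`, here is a minimal Python sketch. The helper name `build_search_url` and its defaults are assumptions made for this example only; the endpoint URL and the PID are taken from the Leipzig sample query in the endpoint table on this page.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(endpoint, query, context_pid=None,
                     version="1.2", maximum_records=10):
    """Build an SRU searchRetrieve URL (hypothetical helper, not part of any FCS library).

    If context_pid is given, it is passed as x-context so that the
    endpoint can restrict the search to that resource.
    """
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": query,
        "maximumRecords": str(maximum_records),
    }
    if context_pid:
        params["x-context"] = context_pid
    # Some endpoint URLs may already carry a query string.
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode(params)


url = build_search_url(
    "http://clarinws.informatik.uni-leipzig.de:8080/CQL",
    "Leipzig",
    context_pid="11858/00-229C-0000-0002-7EC9-9",
)
print(url)
```

`urlencode` percent-encodes the slashes in the PID; the endpoint receives the original value after decoding the query string.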

This method is especially important to support the metadata/content search: with the information about the respective `SearchService` inside the CMD descriptions of the resources, it will be easy to run any metadata query, extract a list of FCS-endpoints from the result set, and pass it e.g. to the FCS-Aggregator to continue searching in the content.

== Implementing Endpoints ==

Here is a preliminary list of services available (or planned) within the FederatedSearch:

||= Provider =||= DB/Collection =||= link =||= type =||= language =||= status =||
|| KNAW || mimore || http://www.meertens.knaw.nl/mimore/srucql/ || dialects Lexicon? || Dutch dialects || online, level 0, open issues, [[http://www.meertens.knaw.nl/mimore/srucql/?operation=searchRetrieve&query=koe&version=1.2| sample query]] ||
|| MPI || IMDI-subset || http://cqlservlet.mpi.nl/ || spoken corpora || Dutch, German, English, French, Swedish || online, level 0, open issues ||
|| INL || INL || http://gysseling.corpus.taalbanknederlands.inl.nl/gyssru/ || text corpus || historic Dutch || online, level 0, open issues, [[http://gysseling.corpus.taalbanknederlands.inl.nl/gyssru/?operation=searchRetrieve&query=a&version=1.2&maximumRecords=10 | sample query]] ||
|| INL || INL || http://brievenalsbuit.inl.nl/zbsru/ || text corpus || historic Dutch || online, level 0, open issues, [[http://brievenalsbuit.inl.nl/zbsru/?operation=searchRetrieve&query=schip&version=1.2&maximumRecords=10 | sample query]] ||
|| DANS || Lieffering || http://srucql.dans.knaw.nl || historical text corpus || Dutch, French || online, level 1, no known issues, [[http://srucql.dans.knaw.nl?operation=searchRetrieve&query=koe&version=1.2| sample query]] ||
|| ICLTT || C4 || http://corpus3.aac.ac.at/ddconsru || historical text corpus || German || online, level 0, open issues, [[http://corpus3.aac.ac.at/ddconsru?operation=searchRetrieve&query=Wasser | sample query]] ||
|| UPF || El Pais Newspaper Corpus 2005 || http://gilmere.upf.edu/pais_sru || text corpus || Spanish || online, allows index-based queries (e.g. queries only on content `ccs.content=...` [[http://gilmere.upf.edu/pais_sru/?version=1.1&operation=searchRetrieve&query=ccs.content=crisis&startRecord=1&maximumRecords=30&recordSchema=dc | sample query]]) ||
|| Uni Tübingen || Tübingen Baumbank des Deutschen - Diachrones Corpus || http://weblicht.sfs.uni-tuebingen.de/rws/cqp-ws/cqp/sru || text corpus || German || online, level 0, [[http://weblicht.sfs.uni-tuebingen.de/rws/cqp-ws/cqp/sru?version=1.2&operation=searchRetrieve&startRecord=1&maximumRecords=3&query=Umwelt | sample query]] ||
|| IDS || Goethe || http://clarin.ids-mannheim.de/cosmassru || text corpus || German || online, level 0 [[BR]] [[http://clarin.ids-mannheim.de/cosmassru?operation=searchRetrieve&version=1.2&query=Gott | sample query]] [[http://clarin.ids-mannheim.de/cosmassru?operation=scan&version=1.2&scanClause=fcs.resource | enumerate corpora ]] ||
|| IDS || [[http://www.textgrid.de/digitale-bibliothek.html | TextGrid Digital Library ]] (Literature folder) || http://clarin.ids-mannheim.de/digibibsru || text corpus || German || online, level 0, highly experimental [[BR]] [[http://clarin.ids-mannheim.de/digibibsru?operation=searchRetrieve&version=1.2&query=Faustus | sample query]] [[http://clarin.ids-mannheim.de/digibibsru?operation=scan&version=1.2&scanClause=fcs.resource | enumerate corpora ]] ||
|| Uni Leipzig || Leipzig Corpora Collection, Deutscher Wortschatz || http://clarinws.informatik.uni-leipzig.de:8080/CQL || text corpus || multiple languages || online, level 1, [[http://clarinws.informatik.uni-leipzig.de:8080/CQL?operation=searchRetrieve&version=1.2&query=Leipzig&x-context=11858/00-229C-0000-0002-7EC9-9 | sample query]], WIP ||
|| BBAW || DTA Corpus || http://dspin.dwds.de:8088/cgi-bin/FCS-Endpoint/DTA_SRU || historical text corpus || German || online, level 0, [[http://dspin.dwds.de:8088/DDC-Endpoint/sru?version=1.2&operation=searchRetrieve&query=Apfel | sample query]] ||
|| BBAW || Dingler Online || http://dspin.dwds.de:8088/cgi-bin/FCS-Endpoint/DTA_SRU || historical text corpus || German || online, level 0, [[http://dspin.dwds.de:8088/DDC-Dingleros/sru?version=1.2&operation=searchRetrieve&query=Apfel | sample query]] ||
|| BBAW || C4 Corpus || http://dspin.dwds.de:8088/DDC-C4/sru || reference corpus || German || online, level 0, [[http://dspin.dwds.de:8088/DDC-C4/sru?version=1.2&operation=searchRetrieve&query=Apfel | sample query]] ||
|| UdS || GRUG Treebank || http://fedora.clarin-d.uni-saarland.de/sru/ || parallel Treebank || German || online, level 0 [[http://fedora.clarin-d.uni-saarland.de/sru/?operation=searchRetrieve&query=Jazz&version=1.2 | sample query]] ||
|| UdS || Saarbrücken Cookbook Corpora (SaCoCo) || http://fedora.clarin-d.uni-saarland.de/sru2/ || diachronic corpus || German || online, level 0 [[http://fedora.clarin-d.uni-saarland.de/sru2/?operation=searchRetrieve&query=Salse&version=1.2 | sample query]] ||
|| BAS || Various || http://clarin.phonetik.uni-muenchen.de/BASSRU || speech corpora || German, English, Japanese || online, level 0 [[http://clarin.phonetik.uni-muenchen.de/BASSRU?operation=searchRetrieve&query=Jazz&version=1.2 | sample query]] ||

=== List of corpora per endpoint ===

MPI for Psycholinguistics (cqlservlet.mpi.nl)
 * [http://corpus1.mpi.nl/ds/trova/search.jsp?nodeid=MPI86949%23 CGN] ([http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI86949%23 metadata], [http://lands.let.ru.nl/cgn/doc_English/topics/design/design.htm design]): First language speakers (Dutch), spoken, 9 million tokens
 * [http://corpus1.mpi.nl/ds/trova/search.jsp?nodeid=MPI1377055%23 ESF] ([http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI1377055%23 metadata], [http://corpus1.mpi.nl/qfs1/media-archive/acqui_data/ac-ESF/Info/esf.html design]): Second language learners (Dutch, German, English, French, Swedish), spoken, at least 500,000 tokens
 * [http://corpus1.mpi.nl/ds/trova/search.jsp?nodeid=MPI214746%23 IFA] ([http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI214746%23 metadata], [http://www.fon.hum.uva.nl/IFAcorpus/ design]): Dutch (hand-segmented), spoken, 50,000 tokens
 * [http://corpus1.mpi.nl/ds/trova/search.jsp?nodeid=MPI1296694%23 Childes] ([http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI1296694%23 metadata], [http://childes.psy.cmu.edu/ design]): Child language of talkbank, spoken, about [http://catalog.clarin.eu/ds/vlo/?wicket:bookmarkablePage=:eu.clarin.cmdi.vlo.pages.ShowAllFacetValuesPage&fq=collection:childes&selectedFacet=language 43 languages], 28,000 CHAT files, millions of tokens
 * [http://corpus1.mpi.nl/ds/trova/search.jsp?nodeid=MPI1259419%23 Talkbank] ([http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI1259419%23 metadata], [http://talkbank.org/ design]): spoken language, about [http://catalog.clarin.eu/ds/vlo/?wicket:bookmarkablePage=:eu.clarin.cmdi.vlo.pages.ShowAllFacetValuesPage&fq=collection:talkbank&selectedFacet=language 37 languages], 14,000 CHAT files, a few million tokens

INL (gysseling.corpus.taalbanknederlands.inl.nl/cqlwebapp/cql)
 * [http://gysseling.corpus.taalbanknederlands.inl.nl/cqlwebapp/search.html Corpus Gysseling], collection of all thirteenth-century texts that have served as source material for the [http://vmnw.inl.nl Early Middle Dutch Dictionary] (Dutch), 1.5 million tokens

== Candidate Services ==

||= Provider =||= DB/Collection =||= link =||= type =||= language =||= status =||
|| OTA || BNC ? || http://ota.oucs.ox.ac.uk/ || text corpus || English || potential ||
|| OTA || TRACTOR-archive || || text corpus? || Central and East EU || potential ||
|| BAS || Speech Corpora || || spoken corpora || German || planned ||
|| Gothenburg Uni || Spraakbanken || http://litteraturbanken.se/ || literary texts || Swedish? || planned? ||
|| Gothenburg Uni || Spraakbanken || http://demosb.spraakdata.gu.se/korp/ || text corpus || Swedish? || planned? ||
|| UPF || El Pais Newspaper Corpus (Metadata) || http://gilmere.upf.edu/girona_sru || text corpus || Catalan || only metadata (date, lang), [[ http://gilmere.upf.edu/girona_sru/?version=1.1&operation=searchRetrieve&query=dc.coverage=2006-07-17&startRecord=1&maximumRecords=30&recordSchema=dc | sample query]] ||
|| ICLTT || CMDI-MDService || http://clarin.aac.ac.at/MDService2/sru || Metadata || multiple langs || online, but very sloppy, too different ||

Some reference SRU services follow:

|| ''Provider'' || ''endpoint'' || ''info'' ||
|| Library of Congress || http://z3950.loc.gov:7090/voyager || ? ||
|| Oxford English Dictionary || http://www.oed.com/srupage || [http://www.oed.com/public/sruservice/sru-service info about OED-SRU service] ||
|| Gutenberg (metadata) provided by indexdata || http://opencontent.indexdata.com/gutenberg || [http://www.indexdata.com/opencontent project Open Content] ||

For DE endpoint candidates see also: http://www.clarin-d.de/mwiki/index.php/Corpora_f%C3%BCr_den_Federated_Content_Search

== SRU/CQL conformance testing ==

An SRU Server Tester for testing basic protocol conformance is available at: http://alcme.oclc.org/srw/SRUServerTester.html

A preliminary CLARIN FCS SRU/CQL tester (requires authentication) is available at: http://clarin.ids-mannheim.de/srutest/

== Current Issues ==

At the time of writing (3rd of May) there are some issues with all of the SRU/CQL implementations. They are listed here per service.

=== http://www.meertens.knaw.nl/mimore/srucql/ ===
 * Does not implement x-cmd-collections
 * `` is a closed element instead of wrapping around the
 * How to generate a link to the resource containing the hit?

=== http://cqlservlet.mpi.nl/ ===
 * Content in the DataView element is not XML-ified. Should have at least wrapping of elements around the hit.

=== http://corpus3.aac.ac.at/ddconsru ===
 * Does not implement scan on cmd.collections
 * How to generate a link to the resource containing the hit?

=== http://gilmere.upf.edu/girona_sru ===
 * Encodes the searchRetrieve response in DCU format (and not our format)
 * Does not seem to use x-cmd-collections

=== http://clarin.aac.ac.at/MDService2/sru ===
 * searchRetrieve returns a completely different XML structure.
 * scan on cmd.collections generates a `NullPointerException`

== Requirements ==

 Monitoring :: an associated service must regularly check the availability of the services (and perhaps even more); a [source:monitoring/plugins/mpi/check_lat_cql_endpoint.py simple nagios plugin] is available.

== Further (future) Issues ==
 * Integration with MD-search
 * Integration with VLO
 * Integration with the Virtual Collection Registry
 * Aggregator/endpoint.

We note that all three will ideally be integrated in the same manner. For this to happen there are a few conditions that MUST be met:
 * All centres generate nice CMDI
 * This CMDI is of the same granularity as the search constrainability in the endpoint
 * Everyone includes in the CMDI files a unique set identifier (e.g., the node-id mapping hdl.net thing at the MPI).
 * The endpoints understand the unique identifiers from their corresponding CMDI files!
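
Regarding the Monitoring requirement listed under Requirements, the following Python sketch shows one way an availability check could look. It is not the nagios plugin referenced there; the function names, the status labels (loosely following the usual Nagios `OK`/`WARNING`/`CRITICAL` convention) and the choice of the `explain` operation as the probe are assumptions for illustration. The sample bodies are made up so that the classification can be tried without network access.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SRU_NS = "http://www.loc.gov/zing/srw/"  # namespace of SRU 1.1/1.2 responses


def classify_explain_response(body):
    """Map the body of an explain request to a status label (hypothetical scheme)."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return "CRITICAL"  # not XML at all
    if root.tag != "{%s}explainResponse" % SRU_NS:
        return "CRITICAL"  # XML, but not an SRU explain response
    if root.find("{%s}diagnostics" % SRU_NS) is not None:
        return "WARNING"   # endpoint answered, but reported diagnostics
    return "OK"


def check_endpoint(endpoint, timeout=10):
    """Fetch the explain response of an endpoint and classify it."""
    separator = "&" if "?" in endpoint else "?"
    url = endpoint + separator + "operation=explain&version=1.2"
    try:
        with urlopen(url, timeout=timeout) as response:
            body = response.read()
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return "CRITICAL"
    return classify_explain_response(body)


# Made-up sample bodies, used instead of live requests.
GOOD = ('<explainResponse xmlns="http://www.loc.gov/zing/srw/">'
        '<version>1.2</version></explainResponse>')
WITH_DIAGNOSTICS = ('<explainResponse xmlns="http://www.loc.gov/zing/srw/">'
                    '<version>1.2</version><diagnostics/></explainResponse>')
NOT_SRU = "<html><body>Service unavailable</body></html>"

for label, sample in [("good", GOOD), ("diagnostics", WITH_DIAGNOSTICS), ("html", NOT_SRU)]:
    print(label, classify_explain_response(sample))
```

A real check would call `check_endpoint` on each listed endpoint URL at regular intervals; what counts as a warning versus a failure is a policy decision that this sketch does not settle.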