= Towards a Specification of the FCS-API for individual Content Providers =
[[PageOutline(1-3)]]

Target Audience: ''Technical Staff of Content Providers''

Specification of the interface that Content Providers willing to join the federated search have to implement. This builds upon the [http://www.loc.gov/standards/sru/ SRU/CQL protocol], but concentrates mainly on the specific agreements on top of the protocol.

While this page discusses details of individual features, [[FCS-FeatureMatrix]] is a more compact yet (to be) complete summary of the features to be implemented.

== SRU basics ==
(We have to ensure that our specification is compatible with the established implementation and usage of the protocol.)

=== Context Sets ===
The protocol allows defining custom "context sets" (~ namespaces) to bind new indices, relations and operators to. A [http://www.loc.gov/standards/sru/resources/context-sets.html number of context sets] are already defined, at least some of which we should also support (e.g. dublincore).

We propose further context sets to accommodate our special needs:

 isocat - isocat.org/datcat :: for Data Categories defined in the [http://www.isocat.org/ ISOcat] Data Category Registry
 ccs - clarin.eu/schema/ccs-v1.0 :: CLARIN Content Search - for indices on content (Annotation Tiers)
 cmd - clarin.eu/schema/cmd-v1.0 :: CLARIN/Component Metadata - for metadata-based indices.

It needs to be elaborated further how to integrate existing context sets that provide metadata fields, like ''dublincore'', i.e. what the relation between the '''cmd''' and '''dc''' context sets would be. The problem is that CMD shall be the context set for all the indices that are conceivable/usable based on the CMD profiles, and ''dublincore'', for example, is also defined as a profile in CMD. On the other hand, ''dublincore'' is a model/format in its own right, is widely used in the established federated search world (libraries, harvesting etc.), and in particular already has its own context set in SRU/CQL.
So it seems unacceptable to force the repositories to recode this (`dc:` to `cmd:`).[[BR]]
The question seems to be whether `dc.title` and `cmd.dc.title` can be seen as equivalent. (It is clear, however, that `cmd.title` is not strictly equivalent but rather an (ambiguous) superset, because it would mean the `title` element from all profiles.)

== Explain operation ==
This basic request serves to announce the server's capabilities and should allow the client to configure itself automatically.

The explain response should, ideally, provide a list of indexes mapped to ISOcat data categories as possible search indexes. If there is no ISOcat equivalent, the CCS context set is to be used.

Example (tentative):
{{{
Part of Speech partOfSpeech
Words words
Phonetics phonetics
}}}

== searchRetrieve operation ==
The operation to send the actual query.

=== Search result ===
The common ground is the `` defined by the protocol. This goes down to the `` wrapping element. The proposal is to continue with a generic structure able to encompass "all" the various types of information. (But please also look at the draft of the schema [[source:FederatedSearch/ccsResource.xsd]]):
{{{
/* cmd-link is optional */
/* this is for metadata provided directly by the content provider
 * NOT the CMD metadata. */
{any metadata value} /* is this any useful? */
/* or rather direct metadata-fields like: */
....
Some text with keyword highlighted
}}}

 Resource:: element representing a resource, carrying the identifier. It may represent anything that has a PID (and an MDRecord). So in particular it may also be a collection, aggregating other Resources. Allowed children are: `Resource`, `ResourceFragment`, `Metadata` and `DataView`
 ResourceFragment:: A part of a resource, without its own PID, i.e. something addressable with: PID of the Resource + Fragment Identifier. The Fragment Identifier to be used depends on the resource type; it may be an XPointer, timecode, sequence offset, etc.
   Allowed children are: `Metadata` and `DataView`
 DataView:: the element carrying the typed data. Content can be anything that is in another namespace. It must be possible to provide the content either inline or by reference; this is important for images and AV files.
 Metadata:: optional element carrying metadata about the `Resource` or `ResourceFragment`. It can carry an optional parameter `cmd-link` with the PID of a CMD record. (This only makes sense for `Resource/Metadata`.) Although the original idea was to "serialize" all such metadata fields in a ``-element, I now prefer reusing existing namespaces. `` seems preferable to ``, right?

However, this nested approach does not seem directly compatible with the established SRU-based systems, which rather work on flat fields. And while this can be overcome by providing converter XSL stylesheets, the information we need seems expressible in a flat structure as well, which makes the more complex (nested) approach questionable:
{{{
{PID of the resource}
{identifier of the resource-fragment (relative to Resource-PID?)}
{PID of the CMD-record} /* optional */
{title of the resource} /* basically any metadata-fields as is standard in SRU-world */
...
Some text with keyword highlighted
}}}

=== Data Views ===
Here we propose several types of DataViews. The actual format has yet to be defined for most of them, but we should reuse existing formats where possible: we should look at existing practices and data, while avoiding overspecialization on any one format. (The CLARIN deliverable [[http://www-sk.let.uu.nl/u/D5C-3.pdf | Interoperability and Standards ]] (2.7 MB) can be used as a starting point.)

The relationship between data types and the corresponding "Viewers", i.e. means of displaying the information to the user, is discussed under [[Viewable]].

All dataviews of a specific type have to be the same in all implementations. That is, if a service presents results as `KWIC`, that should be the same KWIC in all services.
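
To illustrate what "the same KWIC in all services" could rest on, here is a minimal Python sketch that derives a keyword-in-context line as a three-part split (left context, keyword, right context). The names `KwicLine`, `kwic` and `width` are hypothetical, and whitespace tokenization is an assumption; this is not a format prescribed by this specification.

```python
from typing import List, NamedTuple


class KwicLine(NamedTuple):
    """Hypothetical container for one keyword-in-context line."""
    left: str      # context before the hit
    keyword: str   # the matched token
    right: str     # context after the hit


def kwic(tokens: List[str], hit: int, width: int = 5) -> KwicLine:
    """Return the token at index `hit` with up to `width` tokens of context per side."""
    start = max(0, hit - width)
    return KwicLine(
        left=" ".join(tokens[start:hit]),
        keyword=tokens[hit],
        right=" ".join(tokens[hit + 1:hit + 1 + width]),
    )


# Assumption: the text is already tokenized on whitespace.
line = kwic("Some text with keyword highlighted".split(), hit=3, width=2)
print(line)
```

Keeping the three parts separate means the same data can be serialized either as mixed content or with each part in its own element.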
 kwic:: Keyword in context
{{{
Junker Frauenlob , purre knix plautz - Ihr seid ein komischer Kauz - Habt ein Bärtlein von Haaren schwarz , Ziehet es aus mit einem Tropfen Harz Prrrr - ho wird das lang , Kling klang - g - a - d - e , Scheiden thut weh - der Daus
}}}
 Alternatively, to avoid mixed content, the context could be enclosed in a separate element as well:
{{{
Some text with keyword highlighted
}}}
 Or, in the extreme form, every token is wrapped in an element:
{{{
Some text with keyword highlighted
}}}
 This comes close to the way the text is encoded in '''TCF''' and would accordingly allow adding (stand-off) annotation layers (lemma, POS, but also syntactic annotations).

 If there is some associated metadata (like bibliographic information about the source of the hit), it is to be encoded in a separate element ``.
 Geographic data :: A geographic location, either as coordinates or as a named location (street, city, place). One established format is [[http://www.opengeospatial.org/standards/kml|KML]]
 Lexicon Entry :: An entry from a lexicon, dictionary or similar: something with a '''lemma''' and some information about it. There are well-established formats for lexical and terminological data like [[http://www.lexicalmarkupframework.org/ | Lexical Markup Framework (LMF) ]] or Terminological Markup Framework (TMF - ISO16642).
 List:: A list of things. These are not primary resources, but rather derived information, usually aggregations / frequency lists. This is similar to the scan operation, or in other words: the result of a scan operation is also such a list.
{{{
[]
}}}
 Example:
{{{
Haus 45
Liebe 60
}}}
 This should also accommodate nested lists.
 Matrix:: A matrix containing things. Table as a special type of matrix? Multidimensional? To be defined.
 Annotated Text :: A bunch of annotated text. We start by supporting the TCF and EAF formats, as they have existing viewers.
   For an example EAF file see: [http://corpus1.mpi.nl/qfs1/media-archive/demo/pewi/Annotations/elan-example1.eaf sample file]. For now we have `annotations/eaf` and `annotations/tcf` as types.
{{{
}}}
 Syntax tree :: A special type of annotation. There are dedicated formats for syntactic annotation (Penn Treebank, NeGra Format, SynAF). TCF can also describe syntax trees.

=== Restricting the search by collections ===
Restricting the search space shall be done via the `x-cmd-domain` (or `x-cmd-context`?) parameter (obsoleting `x-cmd-collections`). See more under [[SearchContext]].

== Scan operation ==
As an extension to normal SRU, the scan response defines a list of searchable collections/domains available at the provider. `cmd.domains` should be used as the scanClause argument.

Example (tentative):
{{{
1.2
MPI86949# 42 The CGN-Corpus (Corpus Gesproken Nederlands)
MPI556280# 42 ESF corpus
MPI214746# 42 IFA corpus
MPI1296694# 42 Childes corpus
MPI1259419# 42 Talkbank corpus
1.2 cmd.collections 42
}}}

== Configuration issues ==
Requirements for the endpoint (as detected when trying to access our endpoints via the SRU-based tool `yaz-client`):
 * `Content-Type: text/xml` for the responses
 * simple base path (everything after the domain is interpreted as the database name, and slashes are escaped)[[BR]] So this works:
{{{
http://corpus3.aac.ac.at/ddconsru
}}}
 While this does not:
{{{
http://corpus3.aac.ac.at/ddc/sru
}}}
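
As a rough illustration of how a client might address such an endpoint, the following Python sketch builds request URLs for the three operations discussed on this page. The helper `sru_url` is hypothetical; `operation`, `version`, `query` and `scanClause` are standard SRU 1.2 parameters, while the query string `dc.title = Haus` and the use of `MPI86949#` as a value for `x-cmd-domain` are assumptions made only for the example.

```python
from urllib.parse import urlencode

# Endpoint with a simple base path (the variant reported to work with yaz-client).
BASE_URL = "http://corpus3.aac.ac.at/ddconsru"


def sru_url(base_url, operation, params=None):
    """Build an SRU 1.2 request URL (hypothetical helper, not part of any library)."""
    query = {"operation": operation, "version": "1.2"}
    query.update(params or {})
    # urlencode percent-encodes values, so '#' and '=' inside them are safe.
    return base_url + "?" + urlencode(query)


explain_url = sru_url(BASE_URL, "explain")
scan_url = sru_url(BASE_URL, "scan", {"scanClause": "cmd.domains"})
# Assumption: x-cmd-domain takes a collection identifier as listed by scan.
search_url = sru_url(
    BASE_URL,
    "searchRetrieve",
    {"query": "dc.title = Haus", "x-cmd-domain": "MPI86949#"},
)

for url in (explain_url, scan_url, search_url):
    print(url)
```

Because everything after the domain is treated as the database name, the sketch appends only a query string and never adds further path segments to the base URL.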