= CMDI Metadata Services = The CmdiMetadataServices component (MDService) is the core component on the consumption side of CMDI. This page contains the abstract interface specification, there is a [wiki:MDServiceImpl special page detailing on the implementation decisions] and another one about the tightly related MetadataBrowser. {{{ #!div class="important" MDService accepts queries about metadata from MetadataBrowser (and external Applications)[[BR]] - and passes them to the [wiki:CmdiRepository Metadata Repository](ies) [[BR]] - and/or to the [CmdiVirtualCollection Virtual Collection Registry], [[BR]] - optionally applying SemanticMapping based on the information from ComponentRegistry, Data Category Registries ([http://www.isocat.org isoCAT], dublin core) and [CmdiRelationRegistry Relation Registry][[BR]] - receiving results and passing them back to the requesting node. }}} [[Image(MDServices_only.png, 500, right)]] So MDService interacts with: * MetadataBrowser[[BR]] as provider * [CmdiRepository MetadataRepository][[BR]] as consumer * ComponentRegistry[[BR]] as consumer * RelationRegistry[[BR]] as consumer * [CmdiVirtualCollection VirtualCollectionRegistry]?[[BR]] as consumer /provider? * external Applications[[BR]] as provider; but perhaps also consumer thinkable, when a potential external metadata repository provides a (XML)search interface The main competencies/tasks/functions are: serve MD-Records:: via browsing/searching[[BR]] passing the request to MDRepository provide Vocabularies:: (out of existing schemas)[[BR]] for the end-user (and user-agent)[[BR]] needed to be able to formulate a query. (select lists, catalogue, ..) translate:: from Vocabulary-based user query to XPath/XQuery for the MDRepository (and consequently communicate with MDRepository)[[BR]] '''optionally employing semantic mapping''' involving RelationRegistry Side note: We could think of subtyping MDService or adding "DataPorts" which would make MDService more "multilingual", so that it can access different MD Repositories (besides the native CMDI MDRep). So there could be a DataPort for a RDB based repository (well supported by XQuery) or even for Filesystem based repositories. (Which would come near to LocalWorkspace functionality. See below.) == Interface Definition == There was a longer development of the interface specification for MDRepository/MDService. There was: * the '''initial''' proposal by Daan (2009-10)[[BR]] with the operations `.elements(), .values(),` * 1st version ('''extension''') by Matej [[BR]] with `.listSchemas(), .listDimensions(), .listVocabularies(), .stats(), .countMetadataEntries() ...` * "'''consolidated'''" version by Matej:[[BR]]`.getCollections(), .queryModel(), .searchRetrieve(query, result_options)` [[BR]] this set of operation is currently implemented by MDRepository * based on the consolidated version a "'''convenience'''" interface of MDService evolved with:[[BR]] `.collections(), .model(), .values(), .terms(), .recordset(), .record()` Following table tries to give an overview of correspondence between these and especially also the relation to the actual standard '''SRU/CQL''', we want to conform to: || ''Daan'' || ''consolidated (MDRepo)'' || ''convenience (MDService)'' || ''SRU/CQL'' || || || `getCollections` || `collections` || (`explain`, Zeerex)? || || `elements`|| `queryModel` || `model`, `terms` || `explain` || || `values` || `scanIndex` || `values` || `scan` || || || `searchRetrieve` || `recordset`, `record` || `searchRetrieve` || With current state of implementation, the ''"consolidated"'' interface-version implemented by the MDRepository presents the factual core. MDService builds upon this providing semantically equivalent but richer interface (mainly wrt formatting), which is described in [source:MDService2/trunk/MDService2/docs/MDService2.wadl WADL] [http://trac.clarin.eu/export/1033/MDService2/trunk/MDService2/docs/MDService2.wadl.html WADL-html] The following subchapters concentrate on the description of the three operations of the ''"consolidated"''-interface, preceded by a chapter about situation and plans regarding SRU/CQL-compliance. `getCollections()`:: provides information the hierarchical structure the md-records are grouped in. `queryModel()`:: provides information about the metadata (meta-model?), ie which components/elements/values are used in the repository. As query-parameter it only needs to accept/understand `cmdIndex`. `searchRetrieve()` :: is the core operation to retrieve the actual MDRecords. In the `query`-parameter it has to accept any query on metadata. === Notes on MDService’s SRU/CQL-standard conformance === Although the interface specification was inspired by the SRU/CQL protocol, during the prototyping phase unfortunately an idiosyncratic interface evolved (see ''"consolidated"'' and ''"convenience"'' mentions above), MDRepository and MDService currently exposing their own set of operations. Thus at the moment MDService can’t claim conformance with SRU/CQL standard. However we are still committed to this goal and we have substantial partial solutions: MDService accepts the query in CQL-format (even parsing it, ensuring the syntactic validity) and returns a `` result as defined by the protocol. An example: `title contains system` in [http://clarin.aac.ac.at/MDService2/recordset/xml?q=%20%28%20%20%28%20title%20contains%20system%29%20%20%29%20&startRecord=1&maximumRecords=10&repository=1 MDService] The result comes in the standard-compliant format already from MDRepository, which however expects the query in XPath and there is also no plan for the MDRepository to accept SRU/CQL-queries, thus MDRepository is not to be expected SRU-conformant. The main missing part is the appropriate REST-interface accepting the defined parameters. As MDservice exposes this own set of operation optimized mainly for the use by the web-application (MDBrowser), the plan is to provide an additional separate interface exposing the SRU-operations, mapping these operations internally to appropriate processing. === getCollections() === Rough equivalent of OAI-PMH's `listSets`. Lists collections-tree. This information is needed especially by the other interfaces, that take `collection` as constraining parameter. See also #1. === queryModel() === This interface provides information about the metadata (meta-model?), ie which components/elements/values are used in the repository (+ the counts). For this it should merge together the information on the `CMD_Components` and `CMD_Elements` as defined in the ComponentRegistry with the real data from the MDRepository. Provided with a component path it returns applicable elements or values in given context, e.g.: {{{ /Actor/Contact => [Name,Address,Organization,...] Language => {List of Languages} }}} request-pattern:: {{{ http://mdservice/queryModel?q[collection=all|format=xml|maxdepth=1] http://mdservice?operation=queryModel&q[collection|format|maxdepth] }}} parameters:: || ''param'' || ''value'' || ''description'' || || q || `cmdIndex` as defined in [wiki:QueryLanguage#SyntaxGrammar] || provides context from where the result-tree shall start || || collection || identifier of one (or multiple) collection(s) default:=all || constrains the query to a specific selection of collections to provide the model for || || format || [list|xml|json|html?] default:=xml || format of the result|| || maxdepth || positive integer (probably restricted on the server side to some reasonable allowed maximum); default:=1 || the desired of the depth of the model subtree || ==== Result format ==== The returned values are: || ''value'' || ''xml encoding'' || ''format'' || || component/element id || `Term@cid` || uri || || component/element name || `Term@name` || componentName || || component/element path || `Term@path` || cmdIndex or XPath ? || || count element in the repository || `Term@count` || posInteger || || count element with text || `Term@count_text` || posInteger || || count distinct values in given element|| `Term@count_distinct` || posInteger || || distinct values in given element || `Term/Value` || string || || count number of uses of given values in given element || `Term/Value@count` || string || In general the result can be a '''subtree of the model of arbitrary depth'''. Depending on the `maxdepth`-parameter it returns either simple lists, or whole model subtree. In the interaction with MDBrowser this could mean: 1. '''model subtree''' in the initialization phase 1. '''simple list''' during query-building (has to be fast!) Examples for a ''simple list'' result: {{{ request: http://mdservice/queryModel?q=MDGroup.Actors.Actor (defaults: collection=all, format=xml, maxdepth=1) result: }}} {{{ request: http://mdservice/queryModel?q=Sex result: }}} Example for a ''model subtree'' : {{{ request: http://mdservice/queryModel?q=MDGroup.Actors.Actor&maxdepth=3 result: Unspecified Unknown }}} Probably a (one or multiple) '''initial model''' should be defined/constructed, a default set which balances size of the data and ''information content'' by providing the ''most relevant'' elements to have a rich, but no too big starting point for the searches. Ideas for `elements` to be included in this default set: * Language (Country) * Institute,Project * Type/Modality * Accessibility - this is problematic - it may in many cases not be determinable by MD alone (special AAI policies managed by Content Provider etc.); but nevertheless some rough indication should be available! * Scope|coverage (/Time,Space,Genre) === searchRetrieve() === Core interface retrieving the actual MDRecords based on a query. This is meant to be compatible to the [http://www.loc.gov/standards/sru/specs/search-retrieve.html searchRetrieve operation of SRU/CQL] request-pattern:: {{{ http://cmdrepository/rest/mdservice?operation=searchRetrieve&[query|version|...] }}} parameters:: || ''param'' || ''value'' || ''description'' || || query || extended CQL as defined in [wiki:QueryLanguage#SyntaxGrammar] || the whole query || || {all the params defined in SRU} || || consult SRU specification|| || (collection) || [0..*] collection-identifiers || this shall also be encoded in the query, not as separate parameter || ==== Result format ==== ''SRU'' also defines the format of the result, distinguishing: * Response Parameters * Record Parameters * Record Packing - Result Sets The parameters (encoded as elements in the xml response) handle just generic information and also provide wrappers for the actual "record data" (`recordData`, `extraRecordData`), where our CMD records should fit in. One could argue Which information (from the individual MD-Records) shall be included in the Result set has to be sorted out yet, but although it may be certain waste of bandwith to send around whole MD-Records, it saves additional processing, and allows for richer manipulation nearer to the client. (If the user wants to see more field, a full request to MDRepository would not be necessary, as the whole MD-Records would reside either in the session-cache of MDService or MDBrowser.) SRU foresees the request parameters: `recordSchema`, `stylesheet` to be used by the client to influence the format of the result. === optional/questionable Interfaces === A further experimental interface would be: {{{ searchByExample (Collection[MD-Entries]) => Collection[MD-Entries] }}} Based on provided MD-Entries try to find similar this would require sophisticated processing: 1. identifying set of relevant terms 1. possibly provide the identified terms to the user, for priority ordering 1. adaptive search based on these terms, loosening or tightening the criteria depending on the size of the result-set An optional separate interface for publishing Resources (not the OAI-PMH way): {{{ .publishResource(MD) -> MDRep.registerResource() | VCR.registerResource() }}} Relevant resources could be 1. VirtualCollections, 1. user's "private" Resources she wants to share or 1. new Resources as results of processing existing ones. It is questionable, if this should be the competency of MDService/MDBrowser in general.[[BR]] But regarding the VirtualCollections MetadataBrowser will be the natural place for creating them (as result of MD search). (see [MetadataBrowser there] for more details) == Domain Model == [[Image(MDService_model_classes.png, 515, right)]] Following a class diagram of the domain model (business objects) planned for MDService. Generally it is the entities being carried in the various XML-Streams, which rather sooner than later MDService, needs to be made aware of ('''JAXB''' - as the technology of choice for transforming between XML and Java-objects), to be able to do some sophisticated stuff with it. A few remarks on this: Term:: a special class is proposed, that would abstract from the specifics of the classes from registries: `CMD_Component`, `CMD_Element`, `DataCategory`. It is not a superclass, but rather a special class, that models the dependencies between the other classes / datatypes mentioned. Collection:: the basic hierarchical structure in which the MD-Records are organized A query without specified collection is evaluated against whole Repository (all collections) Terms vs. Collection:: In every query, or generally for any viewing of the MDRepository we have to be aware at least of the two dimensions: '''Terms''' (all the existing MD-Elements) and '''Collections'''. This is especially important for any statistics for individual Terms (`TermStats`), they always have to be provided in the defined collection-context (all collections by default?) Virtual Collection :: From MDService point of view Virtual Collection is a subclass of Collection, meaning it can be handled as Collection. The distinction is, that it may come from other source (VCR - see discussion under: MetadataBrowser#VirtualCollection), and has some other specifics (user-specific, user can create). Query:: the "basic unit" modelling one user-request. QuerySet:: a collection of Queries. History:: a subclass of QuerySet, collecting all queries submitted byt the user. == AAI == Access to the Metadata has to be open, thus no AAI necessary here. On the other hand modelling user allows for extended features, so we plan optional login, following the common approach adopted in CMDI as whole.