wiki:CmdiMetadataServices

CMDI Metadata Services

The CmdiMetadataServices component (MDService) is the core component on the consumption side of CMDI.

This page contains the abstract interface specification, there is a special page detailing on the implementation decisions and another one about the tightly related MetadataBrowser.

MDService accepts queries about metadata from MetadataBrowser (and external Applications)

So MDService interacts with:

The main competencies/tasks/functions are:

serve MD-Records
via browsing/searching
passing the request to MDRepository
provide Vocabularies
(out of existing schemas)
for the end-user (and user-agent)
needed to be able to formulate a query. (select lists, catalogue, ..)
translate
from Vocabulary-based user query to XPath/XQuery for the MDRepository (and consequently communicate with MDRepository)
optionally employing semantic mapping involving RelationRegistry

Side note: We could think of subtyping MDService or adding "DataPorts?" which would make MDService more "multilingual", so that it can access different MD Repositories (besides the native CMDI MDRep). So there could be a DataPort? for a RDB based repository (well supported by XQuery) or even for Filesystem based repositories. (Which would come near to LocalWorkspace? functionality. See below.)

Interface Definition

There was a longer development of the interface specification for MDRepository/MDService. There was:

  • the initial proposal by Daan (2009-10)
    with the operations .elements(), .values(),
  • 1st version (extension) by Matej
    with .listSchemas(), .listDimensions(), .listVocabularies(), .stats(), .countMetadataEntries() ...
  • "consolidated" version by Matej:
    .getCollections(), .queryModel(), .searchRetrieve(query, result_options)
    this set of operation is currently implemented by MDRepository
  • based on the consolidated version a "convenience" interface of MDService evolved with:
    .collections(), .model(), .values(), .terms(), .recordset(), .record()

Following table tries to give an overview of correspondence between these and especially also the relation to the actual standard SRU/CQL, we want to conform to:

Daan consolidated (MDRepo) convenience (MDService) SRU/CQL
getCollections collections (explain, Zeerex)?
elements queryModel model, terms explain
values scanIndex values scan
searchRetrieve recordset, record searchRetrieve

With current state of implementation, the "consolidated" interface-version implemented by the MDRepository presents the factual core. MDService builds upon this providing semantically equivalent but richer interface (mainly wrt formatting), which is described in WADL WADL-html

The following subchapters concentrate on the description of the three operations of the "consolidated"-interface, preceded by a chapter about situation and plans regarding SRU/CQL-compliance.

getCollections()
provides information the hierarchical structure the md-records are grouped in.
queryModel()
provides information about the metadata (meta-model?), ie which components/elements/values are used in the repository. As query-parameter it only needs to accept/understand cmdIndex.
searchRetrieve()
is the core operation to retrieve the actual MDRecords. In the query-parameter it has to accept any query on metadata.

Notes on MDService’s SRU/CQL-standard conformance

Although the interface specification was inspired by the SRU/CQL protocol, during the prototyping phase unfortunately an idiosyncratic interface evolved (see "consolidated" and "convenience" mentions above), MDRepository and MDService currently exposing their own set of operations. Thus at the moment MDService can’t claim conformance with SRU/CQL standard. However we are still committed to this goal and we have substantial partial solutions: MDService accepts the query in CQL-format (even parsing it, ensuring the syntactic validity) and returns a <searchRetrieveResponse> result as defined by the protocol.

An example: title contains system in MDService

The result comes in the standard-compliant format already from MDRepository, which however expects the query in XPath and there is also no plan for the MDRepository to accept SRU/CQL-queries, thus MDRepository is not to be expected SRU-conformant.

The main missing part is the appropriate REST-interface accepting the defined parameters. As MDservice exposes this own set of operation optimized mainly for the use by the web-application (MDBrowser), the plan is to provide an additional separate interface exposing the SRU-operations, mapping these operations internally to appropriate processing.

getCollections()

Rough equivalent of OAI-PMH's listSets. Lists collections-tree. This information is needed especially by the other interfaces, that take collection as constraining parameter. See also #1.

queryModel()

This interface provides information about the metadata (meta-model?), ie which components/elements/values are used in the repository (+ the counts). For this it should merge together the information on the CMD_Components and CMD_Elements as defined in the ComponentRegistry with the real data from the MDRepository.

Provided with a component path it returns applicable elements or values in given context, e.g.:

 /Actor/Contact => [Name,Address,Organization,...]
 Language => {List of Languages}
request-pattern
 http://mdservice/queryModel?q[collection=all|format=xml|maxdepth=1]
 http://mdservice?operation=queryModel&q[collection|format|maxdepth]
parameters
param value description
q cmdIndex as defined in QueryLanguage provides context from where the result-tree shall start
collection identifier of one (or multiple) collection(s) default:=all constrains the query to a specific selection of collections to provide the model for
format [list|xml|json|html?] default:=xml format of the result
maxdepth positive integer (probably restricted on the server side to some reasonable allowed maximum); default:=1 the desired of the depth of the model subtree

Result format

The returned values are:

value xml encoding format
component/element id Term@cid uri
component/element name Term@name componentName
component/element path Term@path cmdIndex or XPath ?
count element in the repository Term@count posInteger
count element with text Term@count_text posInteger
count distinct values in given element Term@count_distinct posInteger
distinct values in given element Term/Value string
count number of uses of given values in given element Term/Value@count string

In general the result can be a subtree of the model of arbitrary depth. Depending on the maxdepth-parameter it returns either simple lists, or whole model subtree. In the interaction with MDBrowser this could mean:

  1. model subtree in the initialization phase
  2. simple list during query-building (has to be fast!)

Examples for a simple list result:

 request:
 http://mdservice/queryModel?q=MDGroup.Actors.Actor
   (defaults: collection=all, format=xml, maxdepth=1)

 result:
  <Term id="clarin.eu:2626" name="Role" path="//MDGroup/Actors/Actor/Role" 
        count="14500" count_text="13700" count_distinct="14" />
  <Term id="clarin.eu:2627" name="Name" path="//MDGroup/Actors/Actor/Name" 
        count="14500" count_text="14200" count_distinct="14180" />
 request:
 http://mdservice/queryModel?q=Sex
 
 result:
   <Term id="" name="Sex" path="" count="14500" count_text="14500" count_distinct="2" >
     <Value name="Unspecified" count="1500" />
     <Value name="Unknown" count="13000"  />
   </Term>

Example for a model subtree :

 request:
 http://mdservice/queryModel?q=MDGroup.Actors.Actor&maxdepth=3
 
 result:
 <Term id="" name="Actor" path="//MDGroup/Actors/Actor" count="14500" count-count_distinct="9200" >
   <Term id="" name="Contact" path="//MDGroup/Actors/Actor/Contact" count="13700" count_distinct="12000" >
      <Term id="" name="Address" path="" count="300" count_textcount_distinct="300" />
   </Term>
   <Term id="" name="Role" path="" count="9900" count_text="1347" count_distinct="20" />
   <Term id="" name="Sex" path="" count="14500" count_text="14500" count_distinct="2" >
     <Value count="1500" >Unspecified</Value>
     <Value count="13000" >Unknown</Value>
   </Term>
 </Term>

Probably a (one or multiple) initial model should be defined/constructed, a default set which balances size of the data and information content by providing the most relevant elements to have a rich, but no too big starting point for the searches. Ideas for elements to be included in this default set:

  • Language (Country)
  • Institute,Project
  • Type/Modality?
  • Accessibility - this is problematic - it may in many cases not be determinable by MD alone (special AAI policies managed by Content Provider etc.); but nevertheless some rough indication should be available!
  • Scope|coverage (/Time,Space,Genre)

searchRetrieve()

Core interface retrieving the actual MDRecords based on a query. This is meant to be compatible to the searchRetrieve operation of SRU/CQL

request-pattern
 http://cmdrepository/rest/mdservice?operation=searchRetrieve&[query|version|...]
parameters
param value description
query extended CQL as defined in QueryLanguage the whole query
{all the params defined in SRU} consult SRU specification
(collection) [0..*] collection-identifiers this shall also be encoded in the query, not as separate parameter

Result format

SRU also defines the format of the result, distinguishing:

  • Response Parameters
  • Record Parameters
  • Record Packing - Result Sets

The parameters (encoded as elements in the xml response) handle just generic information and also provide wrappers for the actual "record data" (recordData, extraRecordData), where our CMD records should fit in.

One could argue Which information (from the individual MD-Records) shall be included in the Result set has to be sorted out yet, but although it may be certain waste of bandwith to send around whole MD-Records, it saves additional processing, and allows for richer manipulation nearer to the client. (If the user wants to see more field, a full request to MDRepository would not be necessary, as the whole MD-Records would reside either in the session-cache of MDService or MDBrowser.)

SRU foresees the request parameters: recordSchema, stylesheet to be used by the client to influence the format of the result.

optional/questionable Interfaces

A further experimental interface would be:

 searchByExample (Collection[MD-Entries])
     => Collection[MD-Entries] 

Based on provided MD-Entries try to find similar

this would require sophisticated processing:

  1. identifying set of relevant terms
  2. possibly provide the identified terms to the user, for priority ordering
  3. adaptive search based on these terms, loosening or tightening the criteria depending on the size of the result-set

An optional separate interface for publishing Resources (not the OAI-PMH way):

 .publishResource(MD)
   -> MDRep.registerResource() | VCR.registerResource()

Relevant resources could be

  1. VirtualCollections?,
  2. user's "private" Resources she wants to share or
  3. new Resources as results of processing existing ones.

It is questionable, if this should be the competency of MDService/MDBrowser in general.
But regarding the VirtualCollections? MetadataBrowser will be the natural place for creating them (as result of MD search). (see there for more details)

Domain Model

Class Diagram of the planned model for MDService

Following a class diagram of the domain model (business objects) planned for MDService. Generally it is the entities being carried in the various XML-Streams, which rather sooner than later MDService, needs to be made aware of (JAXB - as the technology of choice for transforming between XML and Java-objects), to be able to do some sophisticated stuff with it. A few remarks on this:

Term
a special class is proposed, that would abstract from the specifics of the classes from registries: CMD_Component, CMD_Element, DataCategory. It is not a superclass, but rather a special class, that models the dependencies between the other classes / datatypes mentioned.
Collection
the basic hierarchical structure in which the MD-Records are organized A query without specified collection is evaluated against whole Repository (all collections)
Terms vs. Collection
In every query, or generally for any viewing of the MDRepository we have to be aware at least of the two dimensions: Terms (all the existing MD-Elements) and Collections. This is especially important for any statistics for individual Terms (TermStats), they always have to be provided in the defined collection-context (all collections by default?)
Virtual Collection
From MDService point of view Virtual Collection is a subclass of Collection, meaning it can be handled as Collection. The distinction is, that it may come from other source (VCR - see discussion under: MetadataBrowser#VirtualCollection), and has some other specifics (user-specific, user can create).
Query
the "basic unit" modelling one user-request.
QuerySet?
a collection of Queries.
History
a subclass of QuerySet?, collecting all queries submitted byt the user.

AAI

Access to the Metadata has to be open, thus no AAI necessary here. On the other hand modelling user allows for extended features, so we plan optional login, following the common approach adopted in CMDI as whole.

Last modified 13 years ago Last modified on 01/15/11 13:36:12

Attachments (3)

Download all attachments as: .zip