Combined Distributed Metadata Content Search

(The related chapters in the FederatedSearch doc are 4.6 Combined Metadata and Content Query and 6.3 Combined Metadata Content Search.)

A combined metadata content search allows the context of the content search to be restricted by metadata filters. Simple examples:

 Find all occurrences where a female Actor said “Ja”. 
 Find all occurrences of “viable system” in texts where the title contains "architecture" 
    and the Organisation responsible for the collection the text is part of is a university.

Generally, it should be possible to formulate the whole query (both the metadata and the content part) in a single query string, using the same CQL syntax:

 Actor.Sex=f AND text = "Ja"
 text = "viable system" AND title = architecture AND Organisation ISA University
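As a rough illustration of how such a single query string could be separated into its metadata and content parts, here is a minimal Python sketch. It assumes a flat conjunction of clauses joined by AND with no AND inside quoted terms, and the set `METADATA_INDICES` is hypothetical: the page does not say which indices belong to which component.

```python
import re

# Hypothetical set of indices served by the metadata component; everything
# else is treated as a content index. In practice this would have to come
# from knowledge about the participating repositories.
METADATA_INDICES = {"actor.sex", "title", "organisation"}

def split_query(query):
    """Split a flat 'a AND b AND c' query into (metadata, content) clause lists."""
    clauses = [c.strip() for c in re.split(r"\s+AND\s+", query.strip())]
    metadata, content = [], []
    for clause in clauses:
        # The index name is the leading run of word characters and dots.
        index = re.match(r"[\w.]+", clause).group(0).lower()
        (metadata if index in METADATA_INDICES else content).append(clause)
    return metadata, content

md_part, content_part = split_query('Actor.Sex=f AND text = "Ja"')
```

A real implementation would use a proper CQL parser rather than a regular expression, since nested boolean expressions and quoted terms containing AND are not handled here.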

Such combined queries may also raise the issue that it is not always clearly decidable which index is handled by which component (metadata or content). This is further discussed under intensional filter:

combined metadata content query

Control Flow in a Combined Metadata Content Search

The process can be divided into two phases:

  1. metadata search – restricts the candidate collections to search in, based on the metadata part of the query.

It returns a list of candidates, which is used in:

  2. content search – the federated content search iterates through the candidates and issues the content query to each one in turn.
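The two phases can be sketched as follows. This is only an outline of the control flow: the two search components are passed in as plain callables, and their signatures as well as the stub data are assumptions made for illustration, not interfaces defined on this page.

```python
from typing import Callable, Dict, Iterable, List

def combined_search(
    md_query: str,
    content_query: str,
    metadata_search: Callable[[str], Iterable[str]],
    content_search: Callable[[str, str], List[str]],
) -> Dict[str, List[str]]:
    """Run phase 1 (metadata) and feed its candidates into phase 2 (content)."""
    results = {}
    # Phase 1: the metadata search narrows down the candidates.
    for candidate in metadata_search(md_query):
        # Phase 2: the content query is issued to each candidate in turn.
        hits = content_search(candidate, content_query)
        if hits:
            results[candidate] = hits
    return results

# Stub components standing in for the real services (illustrative data only).
def fake_metadata_search(md_query):
    return ["corpus-a", "corpus-b"]

def fake_content_search(candidate, content_query):
    return ["hit-1"] if candidate == "corpus-a" else []

found = combined_search("Actor.Sex=f", 'text = "Ja"',
                        fake_metadata_search, fake_content_search)
```

Candidates without hits are dropped from the result, which is one possible choice; reporting empty candidates explicitly would work just as well.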

However, there are (at least) two issues with respect to the "candidates" (that is, in the interaction between the metadata and content components):

  • The ability of the target repository to restrict the domain of the content search
  • Linkage between the MDRecord and the Repository that provides corresponding resource

These two are discussed in the following sections.

Restricting the search domain at target repository

For a combined MD Content query to be applicable the target repository has to provide a way to restrict the search domain. There are basically two ways of doing this:

extensional filter
The target repository accepts a list of Collection/Resource-PIDs to search in. (More in SearchContext.)
In that case the Resource-PIDs can be extracted from the result of the MD-query and passed to the target repository as a filter.
An example of a repository implementing such a filter is MPI-corpora. TODO: add sample link
intensional filter
The target repository is capable of restricting the search based on some metadata filter.
In that case the MD-query can be passed directly to the target repository. Even if metadata about the resources is available in the repository, an MD-query in the MD-Repository should still be performed to check whether it makes sense to send the query to the given repository in the first place.
An example of a repository implementing an intensional filter is the C4-corpus. TODO: add sample link

If a repository supports both types of filter, the intensional one seems preferable: it avoids resolving and sending around potentially long lists of identifiers, and it allows the target repository to optimize the combined query.

CDMDC interaction PNG

In any case, the capabilities of the target repository must be known, so that the CDMDC-logic can act upon them.
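A minimal sketch of how the CDMDC-logic might act on such capability knowledge is given below. The `Capabilities` flags and the request parameter names `query` and `context` are hypothetical; the page does not define how capabilities are declared or how a filter is transmitted.

```python
from dataclasses import dataclass

@dataclass
class Capabilities:
    # Hypothetical capability flags for one target repository.
    intensional: bool = False
    extensional: bool = False

def build_request(caps, md_query, content_query, resource_pids):
    """Return hypothetical request parameters for one target repository."""
    if caps.intensional:
        # Preferred: pass the metadata filter along with the content query.
        return {"query": f"({md_query}) AND ({content_query})"}
    if caps.extensional:
        # Fall back to an explicit list of PIDs taken from the MD-result.
        return {"query": content_query, "context": ",".join(resource_pids)}
    # Neither filter is supported: a combined query is not applicable.
    return None

request = build_request(Capabilities(extensional=True),
                        "Actor.Sex=f", 'text = "Ja"', ["pid-1", "pid-2"])
```

The intensional branch is checked first, following the preference stated above for repositories that support both filter types.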

Linking MDRecords and Content Services

Irrespective of the search-domain-restriction issue, we have to solve how to relate the metadata records to the (endpoints of the) repositories/search engines, so that it is possible to map from an MD-result to the candidate repositories to search in. The reference diagram (the blue one) simplifies two crucial points here:

  1. MD-Records in Central MD-repository pointing to (describing) the Corpora.
  2. passing of the MD-result (Metadata list) to the Query content-component.

This would only allow an MD-query on the Collection/Corpus metadata and would require every such MD-Record to carry a link to the corresponding endpoint. This is the baseline scenario, dubbed Variant 1 below.

However we need

  1. MD-search in the MD-records of individual Resources,
  2. an endpoint serving multiple collections
  3. probably additional (technical) information about the endpoint

A solution for this is proposed in Variant 2 and depicted in the second diagram.

Variant 3 brings in Virtual Collections as a public (published, persisted) carrier of the selection.

Variant 1: Collection's MDRecord points to the Content Service

A baseline solution is that the Collection-MDRecord points to the endpoint of the given search engine. This could be encoded either as a ResourceRef (though we would have to distinguish, via <ResourceType>, between the endpoint and the members of the collection that the ProxyList of a Collection-MDRecord normally consists of):

 <CMD>
   <ResourceProxyList>
       <ResourceProxy><ResourceType> ??Resource?Repository?? </ResourceType><ResourceRef>{URL to endpoint}</ResourceRef></ResourceProxy>
       <ResourceProxy><ResourceType>Metadata</ResourceType><ResourceRef>{handle of a collection-member}</ResourceRef></ResourceProxy>
       ...  
   </ResourceProxyList>
   <Components>
     <Collection> ...

or as a separate CMD-component:

 <CMD>
   <Components>
     <Collection>{info about collection}
     </Collection>
     <Repository>
       <URL>{URL to the endpoint }</URL>
       ...{more info about the Repository}
     </Repository>
   </Components>
 </CMD>
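To illustrate how a consumer could read the endpoint from the separate-component encoding, here is a small Python sketch using the standard library. The sample record and its URL are made up, and XML namespaces are ignored for brevity; a real CMD record would most likely require namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

# Made-up record following the separate-component layout shown above.
SAMPLE = """
<CMD>
  <Components>
    <Collection>sample collection</Collection>
    <Repository>
      <URL>http://example.org/endpoint</URL>
    </Repository>
  </Components>
</CMD>
"""

def endpoint_of(record_xml):
    """Return the endpoint URL from the <Repository> component, or None."""
    root = ET.fromstring(record_xml)
    url = root.find("./Components/Repository/URL")
    if url is None or not url.text:
        return None
    return url.text.strip()

endpoint = endpoint_of(SAMPLE)
```

With this encoding the lookup is a fixed path, whereas the ResourceProxy encoding would require filtering the proxies by their ResourceType first.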

Variant 2: Separate MDRecord Repository

As a (compatible) continuation of the idea of a dedicated CMD-component, we define a separate CMD-Profile Repository (see RepositoryRegistry) that carries:

  1. the URL of the Content Service endpoint
  2. references to the MDRecords (<ResourceRef>) of Collections and/or Resources that are searchable via this Service
  3. any further technical information about the service
    Use TMDC (Technical Metadata Component for describing services)?
    + examine how much can be derived from the explain-record, see FCS-FeatureMatrix

It shouldn't carry any information about the content. That should be maintained separately in the MDRecords for the referenced Collections and Resources.

The MDSearch would resolve the (transitive) IsPartOf-relation for individual MD-Records and deliver the MD-Records in the MD-result together with the endpoint of the corresponding Repository (or even with the full Repository MD-Record).
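The transitive resolution could look roughly like the following sketch. It assumes that each record has at most one parent and that the references held by Repository records have already been indexed into a lookup from MD-Record identifier to endpoint; both mappings and all identifiers are illustrative.

```python
def resolve_endpoint(record_id, is_part_of, repository_of):
    """Walk IsPartOf upwards until a record referenced by a Repository is found."""
    seen = set()
    current = record_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against cyclic IsPartOf data
        if current in repository_of:
            return repository_of[current]
        current = is_part_of.get(current)
    return None

# Illustrative data: a resource inside a sub-collection inside a collection.
is_part_of = {"res-1": "subcoll-1", "subcoll-1": "coll-1"}
repository_of = {"coll-1": "http://example.org/endpoint"}

endpoint = resolve_endpoint("res-1", is_part_of, repository_of)
```

Because the record itself is checked before its ancestors, a Repository that references an individual Resource directly takes precedence over one that references the enclosing Collection.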

Variant 3: Use Virtual Collection

The result of the MD-search could be packaged as a virtual collection (VC), with only a reference to it being sent to the content search. The overhead of creating a VC seems justified mainly when reuse and sharing of the selection/query is intended.

Querying metadata indices

section being written

We have metadata fields in the MD-Records. These can be searched in the MDRepository. And we have metadata fields that the content search allows to be queried directly.

Basic example: we want to search in Resources of a specific language. This can simply be encoded in the Collection-MDRecord on behalf of all the member resources, so the MD-result would be the Collection-MDRecord (with the appropriate Repository attached).

Another repository could serve resources in different languages and expose language as an MD-filter. Then we would pass the language parameter to the repository. The Repository-MDRecord (or the explain-record) would indicate that it exposes the given index.

How to avoid confusion? Probably by handling this transparently for the user, meaning that she or he does not have to care which index applies where. That means that the CDMDC-component has to do the work: routing the MD-query correctly and mapping between equivalent indices.
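One way to picture this routing is the sketch below. It assumes the query has already been parsed into (index, relation, term) triples, that the set of indices exposed by a repository is known (for instance from its Repository-MDRecord or explain-record), and that a table of equivalent index names exists. The names `exposed` and `equivalents` and the sample indices are hypothetical.

```python
def route_clauses(clauses, exposed, equivalents):
    """Split clauses into those passed to the repository and those that
    stay with the MD-repository."""
    to_repository, to_md_repository = [], []
    for index, relation, term in clauses:
        # Map the index to the repository's own name for it, if one is known.
        local = equivalents.get(index, index)
        if local in exposed:
            to_repository.append((local, relation, term))
        else:
            to_md_repository.append((index, relation, term))
    return to_repository, to_md_repository

clauses = [("language", "=", "de"), ("title", "=", "architecture")]
exposed = {"lang"}                    # indices this repository exposes
equivalents = {"language": "lang"}    # mapping between equivalent indices

passed, kept = route_clauses(clauses, exposed, equivalents)
```

Here the language clause is renamed and passed on to the repository, while the title clause remains to be evaluated against the MD-Records.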

Last modified on 06/06/11 15:09:43
