= Combined Distributed Metadata Content Search =

(The related chapters in the [[source:FederatedSearch/docs/FederatedSearch.docx |FederatedSearch doc]] are ''4.6 Combined Metadata and Content Query'' and ''6.3 Combined Metadata Content Search''.)

A combined metadata content search allows the context of the content search to be restricted by metadata filters. Simple examples:
{{{
Find all occurrences where a female Actor said "Ja".
Find all occurrences of "viable system" in texts where the title contains "architecture" and the Organisation responsible for the collection the text is part of is a university.
}}}

Generally it should be possible to formulate the whole query (both the metadata and the content part) in one query string, using the same CQL syntax:
{{{
Actor.Sex=f AND text = "Ja"
text = "viable system" AND title = architecture AND Organisation ISA University.
}}}

Such combined queries raise one issue: it is not always clearly decidable which index is handled by which component (metadata or content). This is discussed further under ''intensional filter''.

[[Image(EDC_combinedquery.png,650)]]
[[Image(edc_interaction_mks.png, 600, right)]]

The process can basically be divided into two phases:
 1. '''metadata search''' – restricts the candidate collections to search in, based on the metadata part of the query. It returns a list of candidates, which is used in:
 2. '''content search''' – the federated content search iterates through the candidates and issues the content query to each one in turn.
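The two phases above can be sketched roughly as follows. This is only an illustration under assumptions: `Candidate`, `combined_search` and the two stand-in functions are hypothetical names, not part of any existing component, and the split of the query string into a metadata part and a content part is taken as already done.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Candidate:
    """One hit of the metadata search (hypothetical shape):
    a collection and the endpoint that serves it."""
    collection_pid: str
    endpoint: str


def combined_search(
    md_search: Callable[[str], List[Candidate]],
    content_search: Callable[[Candidate, str], List[str]],
    md_query: str,
    content_query: str,
) -> Dict[str, List[str]]:
    """Run the metadata search first, then the content query per candidate."""
    # Phase 1: the metadata search restricts the candidate collections.
    candidates = md_search(md_query)
    # Phase 2: the content search iterates over the candidates in turn.
    results: Dict[str, List[str]] = {}
    for candidate in candidates:
        results[candidate.collection_pid] = content_search(candidate, content_query)
    return results


# Hypothetical in-memory stand-ins for the two components.
def fake_md_search(md_query: str) -> List[Candidate]:
    return [Candidate("hdl:demo/coll-1", "http://example.org/sru")]


def fake_content_search(candidate: Candidate, content_query: str) -> List[str]:
    return [f"{candidate.endpoint}?query={content_query}"]


hits = combined_search(fake_md_search, fake_content_search, "Actor.Sex=f", 'text = "Ja"')
```

Passing the two components in as callables keeps the sketch independent of how the MD-repository and the content endpoints are actually accessed.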
However, there are (at least) two issues with respect to the "candidates" (that is, in the interaction between the metadata and content components):
 * The ability of the target repository to restrict the domain of the content search
 * Linkage between the MDRecord and the Repository that provides the corresponding resource

These two are discussed in the following sections.

== Restricting the search domain at target repository ==

For a combined MD Content query to be applicable, the target repository has to provide a way to restrict the search domain. There are basically two ways of doing this:

 extensional filter ::
   Target repository accepts a list of Collection/Resource-PIDs to search in. (More in [[SearchContext]].) [[BR]] In that case the Resource-PIDs can be extracted from the result of the MD-query and passed to the target repository as a filter. [[BR]] An example of a repository implementing such a filter is MPI-corpora. '''''TODO:''' add sample link''
 intensional filter ::
   Target repository is capable of restricting the search based on some metadata filter. [[BR]] In that case the MD-query can be passed directly to the target repository. Even if metadata about the resources is available in the repository, an MD-query in the MD-Repository should still be performed to check whether it makes sense to send the query to the given repository in the first place. [[BR]] An example of a repository implementing an intensional filter is the C4-corpus. '''''TODO:''' add sample link''

If a repository supports both types of filter, the intensional one seems preferable, as it avoids resolving and sending around potentially long lists of identifiers and allows the combined query to be optimized at the target repository.

[[Image(CDMDC_interaction.png,right,600)]]

In any case, the CDMDC-logic needs knowledge about the capabilities of the target repository that it can act upon.
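How the CDMDC-logic might act on such capability knowledge can be sketched as below. Everything here is an assumption made for illustration: the `RepositoryCapabilities` record, the `build_request` helper and the request keys `md_filter` and `pids` are hypothetical and do not describe the interface of any existing repository.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RepositoryCapabilities:
    """Hypothetical capability record for one target repository."""
    supports_intensional: bool
    supports_extensional: bool


def build_request(
    caps: RepositoryCapabilities,
    content_query: str,
    md_query: str,
    resource_pids: List[str],
) -> Dict[str, object]:
    """Choose the filter type for one repository, preferring the intensional one."""
    if caps.supports_intensional:
        # Pass the MD-query through; no identifier list has to be sent.
        return {"query": content_query, "md_filter": md_query}
    if caps.supports_extensional:
        # Fall back to the PIDs extracted from the MD-result.
        return {"query": content_query, "pids": list(resource_pids)}
    raise ValueError("repository cannot restrict the search domain")


both = RepositoryCapabilities(supports_intensional=True, supports_extensional=True)
pids_only = RepositoryCapabilities(supports_intensional=False, supports_extensional=True)

req_a = build_request(both, 'text = "Ja"', "Actor.Sex=f", ["hdl:demo/1"])
req_b = build_request(pids_only, 'text = "Ja"', "Actor.Sex=f", ["hdl:demo/1"])
```

A repository that supports neither filter raises an error in this sketch, since a combined query is not applicable to it.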
== Linking MDRecords and Content Services ==

Irrespective of the search-domain-restriction issue, we have to solve '''how to relate the metadata records with the (endpoints of the) repositories/search engines''', so that it is possible to map from an MD-result to the candidate repositories to search in.

The reference diagram (the blue one) simplifies here on two crucial points:
 1. MD-Records in the Central MD-repository pointing to (describing) the Corpora.
 2. passing of the MD-result (Metadata list) to the `Query content`-component.

This would only allow an MD-query on the Collection/Corpus-metadata and would require every such MD-Record to carry a link to the corresponding endpoint. This is a baseline scenario, dubbed ''Variant 1'' below.

However, we need
 a. MD-search in the MD-records of individual Resources,
 a. an endpoint serving multiple collections
 a. probably additional (technical) information about the endpoint

A solution for this is proposed in ''Variant 2'' and depicted in the second diagram. ''Variant 3'' brings in Virtual Collections as a public (published, persisted) carrier of the selection.

=== Variant 1: Collection's MDRecord points to the Content Service ===

A baseline solution is that the Collection-MDRecord points to the endpoint of the given search engine. This could be encoded either as a ResourceRef (though we would have to distinguish (via ``?) between the endpoint and the members of the collection, which the ProxyList of a Collection-MDRecord normally consists of):
{{{
??Resource?Repository?? {URL to endpoint}
Metadata {handle of a collection-member}
...
...
}}}
or as a separate CMD-component:
{{{
{info about collection}
{URL to the endpoint }
...{more info about the Repository}
}}}

=== Variant 2: Separate MDRecord `Repository` ===

As a (compatible) continuation of the idea of a dedicated CMD-component, we define a separate CMD-Profile `Repository` (see [[RepositoryRegistry#IdeasonimplementationoftheRegistry|RepositoryRegistry]]), that carries
 1. the URL of the Content Service endpoint
 1. references to the MDRecords (``) of Collections and/or Resources that are searchable via this Service
 1. any further technical information about the service[[BR]] Use '''TMDC''' (Technical Metadata Component for describing services)?[[BR]] + examine how much can be derived from the `explain`-record, see [[FCS-FeatureMatrix]]

It shouldn't carry any information about the content. That should be maintained separately in the MDRecords for the referenced Collections and Resources.

The MDSearch would resolve the (transitive) `IsPartOf`-relation for individual MD-Records and deliver the MD-Records in the MD-result together with the endpoint of the corresponding Repository (or even with the full Repository MD-Record).

=== Variant 3: Use Virtual Collection ===

The MD-Search could be packed as a virtual collection, sending only a reference to it to the content search. The overhead of creating a VC seems justified mainly when reuse and sharing of the selection/query is intended.

== Querying metadata indices ==

''section being written''

We have metadata fields in the MD-Records. These can be searched in the MDRepository. And we have metadata fields that the content search allows to be queried directly.

Basic example: wanting to search in Resources of a specific language. This can simply be encoded in the Collection-MDRecord for all the member resources, so the MD-Result would be the Collection-MDRecord (with the appropriate Repository attached).

Another repository could serve resources of different languages and expose `language` as an MD-filter. Then we would pass the language parameter to the repository. The Repository-MDRecord (or the `explain`-record) would state that it exposes the given index.

How to avoid confusion? Probably by handling this transparently for the user, meaning that they do not have to care which index applies where.
That means that the CDMDC-component has to do the work (routing the MD-query correctly, mapping between equivalent indices).
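A minimal sketch of what this routing could look like, assuming the metadata part of the query has already been parsed into simple (index, value) clauses. The function `route_clauses`, the `equivalents` mapping and the index name `lang` are hypothetical; in practice the set of exposed indices would come from the Repository-MDRecord or the `explain`-record.

```python
from typing import Dict, List, Set, Tuple

Clause = Tuple[str, str]  # (index name, value), e.g. ("language", "de")


def route_clauses(
    clauses: List[Clause],
    exposed_indices: Set[str],
    equivalents: Dict[str, str],
) -> Tuple[List[Clause], List[Clause]]:
    """Split metadata clauses into (for the target repository, for the MD-repository)."""
    to_target: List[Clause] = []
    to_md_repo: List[Clause] = []
    for index, value in clauses:
        # Map the index to the name the target repository uses, if one is known.
        local_name = equivalents.get(index, index)
        if local_name in exposed_indices:
            # The repository exposes this index: pass the clause through.
            to_target.append((local_name, value))
        else:
            # Otherwise the clause has to be resolved by the MD-search.
            to_md_repo.append((index, value))
    return to_target, to_md_repo


to_target, to_md_repo = route_clauses(
    [("language", "de"), ("Organisation", "University")],
    exposed_indices={"lang"},
    equivalents={"language": "lang"},
)
```

In this example the `language` clause is mapped to the repository's own index and passed through, while the `Organisation` clause stays with the MD-search.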