Opened 8 years ago

Last modified 8 years ago

#872 assigned enhancement

vlo-beta: Combined search (search term + facet) does not order the search results correctly

Reported by: Jörg Knappen
Owned by: teckart@informatik.uni-leipzig.de
Priority: minor
Milestone: VLO-4.2 or later
Component: VLO web app
Version:
Keywords: VLO 3.4-beta 2
Cc: teckart@informatik.uni-leipzig.de

Description

Compare the following two searches:

  1. Leave the search field empty, browse all, and narrow the facet "collection" to "Universität des Saarlandes CLARIN-D-Zentrum, Saarbrücken". This search gives the expected result, with the subcorpora of PolDiLemma ranked lower than the singleton resources or the head of PolDiLemma.
  2. Enter "Saarlandes" in the search field, search, and narrow the facet "collection" to "Universität des Saarlandes CLARIN-D-Zentrum, Saarbrücken". This search does NOT give the expected order of results (the result set is the same): some singleton resources and even the head of PolDiLemma are buried somewhere among the PolDiLemma subcorpora.

Change History (8)

comment:1 Changed 8 years ago by DefaultCC Plugin

Cc: teckart@informatik.uni-leipzig.de added

comment:2 Changed 8 years ago by teckart@informatik.uni-leipzig.de

Priority: major → minor

The explanation for this behaviour is that Solr's "lengthNorm" ("measure of the importance of a term according to the total number of terms in the field") prefers short documents (the subcorpora) over long documents (the main corpus) for the same number of query hits.
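The effect can be illustrated with a minimal sketch, assuming the classic Lucene TF-IDF similarity, where the length norm is 1/sqrt(number of terms in the field). The term counts below are made up for illustration and are not taken from the VLO index; index-time boosts and the lossy norm encoding are ignored.

```python
import math

def length_norm(num_terms: int) -> float:
    """Length norm of the classic TF-IDF similarity: 1 / sqrt(#terms in field)."""
    return 1.0 / math.sqrt(num_terms)

def field_score(term_freq: int, num_terms: int, idf: float = 1.0) -> float:
    """Simplified per-field score: sqrt(tf) * idf * lengthNorm."""
    return math.sqrt(term_freq) * idf * length_norm(num_terms)

# Hypothetical documents: both contain the query term exactly once.
short_doc = field_score(term_freq=1, num_terms=25)   # e.g. a subcorpus record
long_doc = field_score(term_freq=1, num_terms=400)   # e.g. the main corpus record

print(f"short: {short_doc:.3f}, long: {long_doc:.3f}")
```

With the same number of hits, the shorter record scores higher purely because of its length.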

As this is the expected behaviour in most cases, I don't see a simple solution. We could boost _hierarchyWeight even more strongly, but for this specific query it would have to compensate for almost half the weight of the full-text search (0.069 vs. 0.039). Such a strong modification of the weighting strategy may lead to many problematic result rankings.

comment:3 Changed 8 years ago by teckart@informatik.uni-leipzig.de

Milestone: VLO-4.0 or later
Type: defect → enhancement

comment:4 Changed 8 years ago by Jörg Knappen

Is that really the expected behaviour? We are still talking about the harvested metadata, not the corpus data themselves. It implies that resources with more detailed metadata are ranked below resources with only minimal metadata in the simple search.

comment:5 Changed 8 years ago by teckart@informatik.uni-leipzig.de

I see that point, but many (most?) IR strategies prefer documents with a higher relative frequency of the search terms. In this case this may be a problem, but for many other queries it may be crucial to keep metadata files with very verbose content (which therefore often contain a queried term) from appearing at the top of the result list, even if more specific (smaller) files exist.

A quick look at the Solr documentation showed that the omitNorms attribute in the field's definition may be worth checking. I'll leave this ticket open until a test import with omitNorms enabled can be done and inspected.
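As a rough sketch of what such a test might show, the following toy ranking compares results with and without length normalization. It assumes the simplified classic TF-IDF formula and a hypothetical additive hierarchy boost; the document names, term counts, and boost values are invented and do not reflect the actual VLO scoring configuration.

```python
import math

def text_score(term_freq: int, num_terms: int, omit_norms: bool) -> float:
    """Simplified full-text score; the length norm is dropped when omit_norms is set."""
    norm = 1.0 if omit_norms else 1.0 / math.sqrt(num_terms)
    return math.sqrt(term_freq) * norm

# Hypothetical records: same number of hits, different lengths and hierarchy boosts.
DOCS = {
    "head": {"tf": 1, "terms": 400, "boost": 0.02},
    "subcorpus": {"tf": 1, "terms": 25, "boost": 0.0},
}

def rank(omit_norms: bool) -> list[str]:
    """Return document names ordered by descending total score."""
    scored = {
        name: text_score(d["tf"], d["terms"], omit_norms) + d["boost"]
        for name, d in DOCS.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

print("with norms:   ", rank(omit_norms=False))
print("without norms:", rank(omit_norms=True))
```

In this toy setup the small boost only decides the order once the length norm no longer dominates the full-text score; whether the same holds for the real index would have to be checked with the test import.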

comment:6 Changed 8 years ago by teckart@informatik.uni-leipzig.de

Owner: changed from Twan Goosen to teckart@informatik.uni-leipzig.de
Status: new → assigned

comment:7 Changed 8 years ago by Twan Goosen

Milestone: VLO-4.0 or later → VLO-4.1 or later

Milestone renamed

comment:8 Changed 8 years ago by Twan Goosen

Milestone: VLO-4.1 or later → VLO-4.2 or later

Milestone renamed
