Opened 9 years ago
Last modified 8 years ago
#872 assigned enhancement
vlo-beta: Combined search (search term + facet) does not order the search results correctly
Reported by: | Jörg Knappen | Owned by: | teckart@informatik.uni-leipzig.de |
---|---|---|---|
Priority: | minor | Milestone: | VLO-4.2 or later |
Component: | VLO web app | Version: | |
Keywords: | VLO 3.4-beta 2 | Cc: | teckart@informatik.uni-leipzig.de |
Description
Compare the following two searches:
- Leave the search box empty, browse all, then narrow the facet "collection" to "Universität des Saarlandes CLARIN-D-Zentrum, Saarbrücken". This search gives the expected result, with the subcorpora of PolDiLemma ranked lower than the singleton resources or the head of PolDiLemma.
- Enter "Saarlandes" in the search box, search, then narrow the facet "collection" to "Universität des Saarlandes CLARIN-D-Zentrum, Saarbrücken". This search does NOT give the expected order of results (the result set is the same): some singleton resources and even the head of PolDiLemma are buried somewhere among the PolDiLemma subcorpora.
Change History (8)
comment:1 Changed 9 years ago by
Cc: | teckart@informatik.uni-leipzig.de added |
---|
comment:2 Changed 9 years ago by
Priority: | major → minor |
---|
comment:3 Changed 9 years ago by
Milestone: | → VLO-4.0 or later |
---|---|
Type: | defect → enhancement |
comment:4 Changed 9 years ago by
Is that really the expected behaviour? We are still talking about the harvested metadata, not the corpus data themselves. It implies that resources with more detailed metadata are ranked below resources with only minimal metadata in the simple search.
comment:5 Changed 9 years ago by
I see that point, but many (most?) IR strategies have some "document frequency" preference for documents with a higher (relative) frequency of the search terms. In this case that may be a problem, but for many other queries it is crucial: without it, very verbose metadata files (which often contain a queried term simply because of their length) would end up at the top of the result list even when more specific (smaller) files exist.
A quick look at the Solr documentation showed that the omitNorms attribute in the field's definition may be worth checking. I'll leave this ticket open until a test import with omitNorms enabled can be done and inspected.
comment:6 Changed 9 years ago by
Owner: | changed from Twan Goosen to teckart@informatik.uni-leipzig.de |
---|---|
Status: | new → assigned |
The explanation for this behaviour is that Solr's "lengthNorm" ("measure of the importance of a term according to the total number of terms in the field") prefers short documents (the subcorpora) over long documents (the main corpus) for the same number of query hits.
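To make the effect concrete, here is a minimal sketch of how a length norm shifts scores. It assumes Lucene's classic TF-IDF similarity, where the term-frequency factor is the square root of the hit count and the length norm is 1/sqrt(number of terms in the field). The document lengths are hypothetical, idf is held constant, and the lossy encoding Lucene applies when storing norms is ignored.

```python
import math


def length_norm(num_terms: int) -> float:
    """Classic TF-IDF length norm: shorter fields get a larger factor."""
    return 1.0 / math.sqrt(num_terms)


def field_score(term_freq: int, num_terms: int, use_norms: bool = True) -> float:
    """Simplified per-field score: sqrt(tf) * lengthNorm (idf held constant)."""
    tf = math.sqrt(term_freq)
    norm = length_norm(num_terms) if use_norms else 1.0
    return tf * norm


# Hypothetical field lengths: a short subcorpus record vs. a long head record,
# each containing the query term exactly once.
short_doc = field_score(term_freq=1, num_terms=25)
long_doc = field_score(term_freq=1, num_terms=400)
print(short_doc, long_doc)  # the short record scores higher

# With norms disabled (the effect omitNorms is meant to have), length no longer matters.
short_no_norm = field_score(term_freq=1, num_terms=25, use_norms=False)
long_no_norm = field_score(term_freq=1, num_terms=400, use_norms=False)
print(short_no_norm == long_no_norm)
```

In this simplified model the two records tie once norms are switched off, so any remaining ordering would have to come from other score components such as the hierarchy boost.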
As this is the expected behaviour in most cases, I don't see a simple solution. We could boost _hierarchyWeight even more strongly, but for this specific query it would have to compensate for almost half the weight of the full-text search (0.069 vs. 0.039). Such a strong modification of the weighting strategy may lead to many problematic result rankings.
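The size of the required compensation can be checked with a little arithmetic on the two scores quoted above. This is only a sketch: the variable names are illustrative, and the ticket does not state which record each score belongs to, so the code only compares the higher value with the lower one.

```python
# The two full-text scores quoted in the comment above.
higher_score = 0.069
lower_score = 0.039

# Absolute gap an additive boost would have to close.
gap = higher_score - lower_score

# The same gap as a share of the higher score.
relative_gap = gap / higher_score

# Factor a multiplicative boost on the lower score would need to reach parity.
required_factor = higher_score / lower_score

print(f"gap={gap:.3f}, relative={relative_gap:.1%}, factor={required_factor:.2f}")
```

The gap is a bit over 40% of the higher score, which matches the "almost half" estimate and illustrates why a boost of that size would probably disturb rankings for other queries.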