Opened 8 years ago
Closed 8 years ago
#906 closed defect (fixed)
Performance issues due to boosting
Reported by: | Twan Goosen | Owned by: | teckart@informatik.uni-leipzig.de |
---|---|---|---|
Priority: | critical | Milestone: | |
Component: | VLO web app | Version: | |
Keywords: | | Cc: | teckart@informatik.uni-leipzig.de |
Description
The boosts recently added in VLO 3.4.0 (in solrconfig.xml), based on availability values, combined with the previously introduced boosts (hierarchy weight, recency, presence of name, description, parts/children), result in poor query performance. Replace them with a boost score calculated at import/index time, either through the built-in facilities or via a custom field. The Solr documentation seems to suggest that the latter is preferable:
Using Field and/or Document boosts has been supported since the very early days of Lucene, but is some what limiting and antiquated at this point, and instead people should strongly consider indexing their boost values as numeric fields instead (see next section)
In any case, SolrJ has methods to set a 'boost' value on documents or fields, which can be utilised.
Change History (9)
comment:1 Changed 8 years ago by
Cc: | teckart@informatik.uni-leipzig.de added |
---|
comment:2 Changed 8 years ago by
comment:3 Changed 8 years ago by
Leaving out log(add(1,min(50,_hasPartCount)))^.2 also seems to have a noticeable impact, although not as much as recip(ms(NOW,_lastSeen),3.16e-11,1,1)^.2.
The former can be calculated at import time, which should improve the runtime performance (boost should be based on a simple, indexable numerical value). For the latter we can probably come up with an alternative that performs well e.g. by removing or reducing the complexity of the calculation.
comment:4 follow-up: 5 Changed 8 years ago by
I replaced (in a local branch) the _hasPartCount boost with a new field _hasPartCountWeight, that is filled during import time (Boosting via _hasPartCountWeight^.2).
The second one is the "official" approach to boost more recent documents (Date Boosting). We could use a reduced precision (like recip(ms(NOW/DAY,_lastSeen),3.16e-11,1,1)) as proposed in the Solr Wiki. Alternatively, something like log(MAX_DAYS_IN_SOLR_IN_MILLISECONDS - ms(NOW, _lastSeen))^.2 may also be sufficient.
Remarks or other ideas?
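To make the effect of the reduced precision concrete, here is a minimal Java sketch of what recip(ms(NOW/DAY,_lastSeen),3.16e-11,1,1) evaluates to. It relies on Solr's definition recip(x,m,a,b) = a/(m*x+b) and on 3.16e-11 being roughly the inverse of the number of milliseconds in a year. The class and method names are hypothetical, and rounding NOW down to the start of the UTC day is an assumption about how NOW/DAY is resolved when no time zone is configured.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, not part of the VLO code base.
final class RecencyBoost {

    // Solr's recip(x, m, a, b) evaluates to a / (m * x + b).
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    // Mirrors recip(ms(NOW/DAY,_lastSeen),3.16e-11,1,1).
    // Assumption: NOW/DAY rounds down to the start of the current UTC day.
    static double boost(Instant now, Instant lastSeen) {
        Instant nowRoundedToDay = now.truncatedTo(ChronoUnit.DAYS);
        long ageMs = nowRoundedToDay.toEpochMilli() - lastSeen.toEpochMilli();
        return recip(ageMs, 3.16e-11, 1, 1);
    }

    public static void main(String[] args) {
        Instant lastSeen = Instant.parse("2016-01-01T00:00:00Z");
        // Two queries on the same day see the same rounded NOW,
        // so they compute the same boost value for a given record.
        System.out.println(boost(Instant.parse("2017-01-01T08:00:00Z"), lastSeen));
        System.out.println(boost(Instant.parse("2017-01-01T20:00:00Z"), lastSeen));
    }
}
```

With these constants a record last seen about a year ago scores close to 0.5 and a record seen today scores close to 1. The rounding does not make the per-document calculation cheaper; it only keeps the function value stable over a day, which is what lets repeated queries benefit from caching.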
comment:5 follow-up: 6 Changed 8 years ago by
Replying to teckart@…:
I replaced (in a local branch) the _hasPartCount boost with a new field _hasPartCountWeight, that is filled during import time (Boosting via _hasPartCountWeight^.2).
That sounds like a good solution. I assume that _hasPartCountWeight = log(add(1,min(50,_hasPartCount)))?
The second one is the "official" approach to boost more recent documents (Date Boosting). We could use a reduced precision (like recip(ms(NOW/DAY,_lastSeen),3.16e-11,1,1)) as proposed in the Solr Wiki. Alternatively, something like log(MAX_DAYS_IN_SOLR_IN_MILLISECONDS - ms(NOW, _lastSeen))^.2 may also be sufficient.
Remarks or other ideas?
Since we don't really care about an exact ordering based on how old records are but just want to prevent older records from showing up first (the default behaviour), maybe we can make a boolean field that gets set depending on whether the file was seen during the last harvest and boost on the basis of that?
comment:6 follow-up: 7 Changed 8 years ago by
Replying to twan.goosen@…:
That sounds like a good solution. I assume that _hasPartCountWeight = log(add(1,min(50,_hasPartCount)))?
Yes (Math.log10(1 + Math.min(50, incomingVertexNames.size())) to be specific).
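As a self-contained illustration of that import-time calculation, the sketch below wraps the expression quoted above in a small helper. The class name, the method name and the constant are hypothetical; partCount stands in for incomingVertexNames.size().

```java
// Hypothetical helper illustrating the import-time weight calculation.
final class PartCountWeight {

    // Part counts above this cap no longer increase the weight.
    static final int MAX_COUNTED_PARTS = 50;

    // Same value as the Solr function log(add(1,min(50,_hasPartCount))),
    // whose log() is base 10, but computed once per record at import time.
    static double weight(int partCount) {
        return Math.log10(1 + Math.min(MAX_COUNTED_PARTS, partCount));
    }

    public static void main(String[] args) {
        for (int parts : new int[] {0, 9, 50, 5000}) {
            System.out.println(parts + " parts -> weight " + weight(parts));
        }
    }
}
```

The stored value can then be boosted on directly (_hasPartCountWeight^.2), so the query no longer has to evaluate log, add and min for every matching document.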
The second one is the "official" approach to boost more recent documents (Date Boosting). We could use a reduced precision (like recip(ms(NOW/DAY,_lastSeen),3.16e-11,1,1)) as proposed in the Solr Wiki. Alternatively, something like log(MAX_DAYS_IN_SOLR_IN_MILLISECONDS - ms(NOW, _lastSeen))^.2 may also be sufficient.
Remarks or other ideas?
Since we don't really care about an exact ordering based on how old records are but just want to prevent older records from showing up first (the default behaviour), maybe we can make a boolean field that gets set depending on whether the file was seen during the last harvest and boost on the basis of that?
We can do that, but it means that we need a complete iteration over all Solr records to modify old ones (set them to wasSeenAtLastImport=false). No problem to implement that, but it is an additional procedure before the normal import takes place. Nonetheless I would like to implement it (and have a look at the import time) as it's probably the safest solution and may spare us a lot of time evaluating the impact of different boosting functions in Solr.
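The two-step procedure could look roughly like the following sketch. It uses an in-memory map as a stand-in for the Solr index (record id mapped to the wasSeenAtLastImport flag), so it only illustrates the order of operations; the class and method names are hypothetical and no SolrJ calls are shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the map stands in for the Solr index.
final class SeenFlagImport {

    static void runImport(Map<String, Boolean> index, Set<String> harvestedIds) {
        // Step 1: the additional pass over all existing records,
        // resetting the flag before the normal import starts.
        index.replaceAll((id, seen) -> Boolean.FALSE);

        // Step 2: the normal import; every harvested record is
        // (re)written with the flag set to true.
        for (String id : harvestedIds) {
            index.put(id, Boolean.TRUE);
        }
    }

    public static void main(String[] args) {
        Map<String, Boolean> index = new LinkedHashMap<>();
        index.put("record-a", Boolean.TRUE);
        index.put("record-b", Boolean.TRUE);

        // record-a is missing from this harvest, record-c is new.
        runImport(index, new HashSet<>(Arrays.asList("record-b", "record-c")));
        System.out.println(index);
    }
}
```

After the run only the records present in the latest harvest carry the flag, so a constant boost on that field is enough to keep stale records from ranking first.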
comment:7 Changed 8 years ago by
Replying to teckart@…:
Replying to twan.goosen@…:
Since we don't really care about an exact ordering based on how old records are but just want to prevent older records from showing up first (the default behaviour), maybe we can make a boolean field that gets set depending on whether the file was seen during the last harvest and boost on the basis of that?
We can do that, but it means that we need a complete iteration over all Solr records to modify old ones (set them to wasSeenAtLastImport=false). No problem to implement that, but it is an additional procedure before the normal import takes place. Nonetheless I would like to implement it (and have a look at the import time) as it's probably the safest solution and may spare us a lot of time evaluating the impact of different boosting functions in Solr.
I agree.
To make it a bit more sophisticated, or at least closer to what we have now, instead of a boolean field, we could have an integer field with the number of days since 'last seen' at import time and boost on the basis of the inverse (rord) of that. I assume there will be some performance difference between the two solutions, but probably small.
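A rough sketch of the integer variant, assuming the import knows each record's 'last seen' date. The names are hypothetical, and the plain reciprocal 1/(1+days) is used here only as one possible "inverse"; it is not Solr's rord function, which ranks by reverse ordinal instead.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper for an integer "days since last seen" field.
final class DaysSinceLastSeen {

    // Whole days between the last-seen date and the import date,
    // clamped at 0 in case of clock skew.
    static long daysSince(LocalDate lastSeen, LocalDate importDate) {
        return Math.max(0, ChronoUnit.DAYS.between(lastSeen, importDate));
    }

    // Assumption: a simple reciprocal as the inverse; 1.0 for records
    // seen on the import day, decreasing as the record gets older.
    static double inverseBoost(long days) {
        return 1.0 / (1 + days);
    }

    public static void main(String[] args) {
        LocalDate importDate = LocalDate.of(2016, 3, 11);
        long days = daysSince(LocalDate.of(2016, 3, 1), importDate);
        System.out.println(days + " days -> boost " + inverseBoost(days));
    }
}
```

Because the day count is fixed at import time, the value goes stale between imports, which is the trade-off against computing ms(NOW,_lastSeen) on every query.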
comment:8 Changed 8 years ago by
Owner: | changed from Twan Goosen to teckart@informatik.uni-leipzig.de |
---|---|
Status: | new → assigned |
Implemented and pushed to Github as branch ticket906.
comment:9 Changed 8 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Integrated (with some adaptations) into the release-3.4.1 branch, soon to be deployed in beta for testing.
It turns out that the last part of
is the main offender - apparently the calculation is rather expensive for hundreds of thousands of records. We should try to think of an alternative.