Opened 8 years ago

Closed 8 years ago

#906 closed defect (fixed)

Performance issues due to boosting

Reported by: Twan Goosen
Owned by: teckart@informatik.uni-leipzig.de
Priority: critical
Milestone:
Component: VLO web app
Version:
Keywords:
Cc: teckart@informatik.uni-leipzig.de

Description

The boosts recently added in VLO 3.4.0 (in solrconfig.xml) based on availability values, combined with the previously introduced boosts (hierarchy weight, recency, presence of name, description, parts/children), result in poor query performance. Replace them with a boost score calculated at import/index time, either through the built-in facilities or via a custom field. The Solr documentation seems to suggest that the latter is preferable:

Using Field and/or Document boosts has been supported since the very early days of Lucene, but is some what limiting and antiquated at this point, and instead people should strongly consider indexing their boost values as numeric fields instead (see next section)

In any case, SolrJ has methods to set a 'boost' value on documents or fields, which can be utilised.

Change History (9)

comment:1 Changed 8 years ago by DefaultCC Plugin

Cc: teckart@informatik.uni-leipzig.de added

comment:2 Changed 8 years ago by Twan Goosen

It turns out that the last part of

        <str name="bf">
            rord(_hierarchyWeight) log(add(1,min(50,_hasPartCount)))^.2 recip(ms(NOW,_lastSeen),3.16e-11,1,1)^.2
        </str>

is the main offender - apparently the calculation is rather expensive for hundreds of thousands of records. We should try to think of an alternative.

comment:3 Changed 8 years ago by Twan Goosen

Leaving out log(add(1,min(50,_hasPartCount)))^.2 also seems to have a noticeable impact, although not as much as recip(ms(NOW,_lastSeen),3.16e-11,1,1)^.2.

The former can be calculated at import time, which should improve the runtime performance (boost should be based on a simple, indexable numerical value). For the latter we can probably come up with an alternative that performs well e.g. by removing or reducing the complexity of the calculation.

comment:4 Changed 8 years ago by teckart@informatik.uni-leipzig.de

I replaced (in a local branch) the _hasPartCount boost with a new field _hasPartCountWeight, that is filled during import time (Boosting via _hasPartCountWeight^.2).

The second one is the "official" approach to boost more recent documents (Date Boosting). We could use a reduced precision (like recip(ms(NOW/DAY,_lastSeen),3.16e-11,1,1)) as proposed in the Solr Wiki. Alternatively, something like log(MAX_DAYS_IN_SOLR_IN_MILLISECONDS - ms(NOW, _lastSeen))^.2 may also be sufficient.

Remarks or other ideas?
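For reference, Solr's recip(x,m,a,b) evaluates a / (m*x + b), and 3.16e-11 is roughly 1 divided by the number of milliseconds in a year, so a record last seen a year ago scores about half of one seen today. The following is a minimal sketch of that arithmetic in plain Java, not the Solr implementation; the class and method names are made up for illustration. Truncating "now" to the start of the UTC day is meant to mimic NOW/DAY, which keeps the function input constant for a whole day.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Illustrative only: reproduces the arithmetic of the function query
// recip(ms(NOW/DAY,_lastSeen),3.16e-11,1,1) outside of Solr.
final class RecencyBoostSketch {

    /** Same formula as Solr's recip(x, m, a, b): a / (m * x + b). */
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    /** Recency score with "now" truncated to the start of the UTC day (like NOW/DAY). */
    static double recencyBoost(Instant now, Instant lastSeen) {
        Instant nowRounded = now.truncatedTo(ChronoUnit.DAYS);
        double ageMs = nowRounded.toEpochMilli() - lastSeen.toEpochMilli();
        return recip(ageMs, 3.16e-11, 1, 1);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Instant oneYearAgo = now.minus(365, ChronoUnit.DAYS);
        System.out.println("one year old: " + recencyBoost(now, oneYearAgo));
    }
}
```

Because every query issued on the same day feeds the same rounded value into the function, the reduced-precision variant should be friendlier to caching than one based on a millisecond-exact NOW.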

comment:5 in reply to:  4 ; Changed 8 years ago by Twan Goosen

Replying to teckart@…:

I replaced (in a local branch) the _hasPartCount boost with a new field _hasPartCountWeight, that is filled during import time (Boosting via _hasPartCountWeight^.2).

That sounds like a good solution. I assume that _hasPartCountWeight = log(add(1,min(50,_hasPartCount)))?

The second one is the "official" approach to boost more recent documents (Date Boosting). We could use a reduced precision (like recip(ms(NOW/DAY,_lastSeen),3.16e-11,1,1)) as proposed in the Solr Wiki. Alternatively, something like log(MAX_DAYS_IN_SOLR_IN_MILLISECONDS - ms(NOW, _lastSeen))^.2 may also be sufficient.

Remarks or other ideas?

Since we don't really care about an exact ordering based on how old records are but just want to prevent older records from showing up first (the default behaviour), maybe we can make a boolean field that gets set depending on whether the file was seen during the last harvest and boost on the basis of that?

comment:6 in reply to:  5 ; Changed 8 years ago by teckart@informatik.uni-leipzig.de

Replying to twan.goosen@…:

That sounds like a good solution. I assume that _hasPartCountWeight = log(add(1,min(50,_hasPartCount)))?

Yes (Math.log10(1 + Math.min(50, incomingVertexNames.size())), to be specific).
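As a minimal sketch of the import-time calculation confirmed above: the weight is the base-10 logarithm of one plus the part count, with the count capped at 50. The class, method, and constant names below are hypothetical; only the formula itself comes from this ticket.

```java
// Illustrative helper; names are assumptions, the formula is the one quoted in this ticket.
final class PartCountWeightSketch {

    /** Cap taken from min(50, _hasPartCount) in the original function query. */
    static final int MAX_COUNTED_PARTS = 50;

    /** Value to store in _hasPartCountWeight at import time. */
    static double hasPartCountWeight(int partCount) {
        return Math.log10(1 + Math.min(MAX_COUNTED_PARTS, partCount));
    }

    public static void main(String[] args) {
        for (int count : new int[] {0, 1, 9, 50, 5000}) {
            System.out.println(count + " parts -> " + hasPartCountWeight(count));
        }
    }
}
```

Storing this value once per record means the query only has to read a plain numeric field instead of evaluating nested functions for every matching document.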

The second one is the "official" approach to boost more recent documents (Date Boosting). We could use a reduced precision (like recip(ms(NOW/DAY,_lastSeen),3.16e-11,1,1)) as proposed in the Solr Wiki. Alternatively, something like log(MAX_DAYS_IN_SOLR_IN_MILLISECONDS - ms(NOW, _lastSeen))^.2 may also be sufficient.

Remarks or other ideas?

Since we don't really care about an exact ordering based on how old records are but just want to prevent older records from showing up first (the default behaviour), maybe we can make a boolean field that gets set depending on whether the file was seen during the last harvest and boost on the basis of that?

We can do that, but that means that we need a complete iteration over all Solr records to modify old ones (set them to wasSeenAtLastImport=false). No problem to implement that, but it is an additional procedure before the normal import takes place. Nonetheless, I would like to implement it (and have a look at the import time) as it's probably the safest solution and may spare us a lot of time evaluating the impact of different boosting functions in Solr.
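The two-step procedure described here (reset the flag on every existing record, then run the normal import) can be sketched as follows. This is only an in-memory model: a Map stands in for the Solr index, and the class and method names are hypothetical. A real implementation would have to issue updates against Solr instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// In-memory model of the flag handling; the Map is a stand-in for the Solr index.
final class SeenFlagSketch {

    /**
     * Returns a new "index" (record id -> wasSeenAtLastImport) after one import run.
     * Records missing from the current harvest are kept but flagged false.
     */
    static Map<String, Boolean> applyImport(Map<String, Boolean> index, Set<String> harvestedIds) {
        Map<String, Boolean> updated = new HashMap<>();
        // Step 1: the extra pass over all existing records, resetting the flag.
        for (String id : index.keySet()) {
            updated.put(id, Boolean.FALSE);
        }
        // Step 2: the normal import, marking every record seen in this harvest.
        for (String id : harvestedIds) {
            updated.put(id, Boolean.TRUE);
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, Boolean> index = new HashMap<>();
        index.put("record-a", Boolean.TRUE);
        index.put("record-b", Boolean.TRUE);
        Set<String> harvested = new java.util.HashSet<>(java.util.Arrays.asList("record-b", "record-c"));
        System.out.println(applyImport(index, harvested));
    }
}
```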

comment:7 in reply to:  6 Changed 8 years ago by Twan Goosen

Replying to teckart@…:

Replying to twan.goosen@…:

Since we don't really care about an exact ordering based on how old records are but just want to prevent older records from showing up first (the default behaviour), maybe we can make a boolean field that gets set depending on whether the file was seen during the last harvest and boost on the basis of that?

We can do that, but that means that we need a complete iteration over all Solr records to modify old ones (set them to wasSeenAtLastImport=false). No problem to implement that, but it is an additional procedure before the normal import takes place. Nonetheless, I would like to implement it (and have a look at the import time) as it's probably the safest solution and may spare us a lot of time evaluating the impact of different boosting functions in Solr.

I agree.

To make it a bit more sophisticated, or at least closer to what we have now, instead of a boolean field, we could have an integer field with the number of days since 'last seen' at import time and boost on the basis of the inverse (rord) of that. I assume there will be some performance difference between the two solutions, but probably a small one.
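A minimal sketch of how such an integer value could be derived at import time, assuming the 'last seen' date and the import date are both available as calendar dates. The class and method names are hypothetical, and clamping negative values to zero is an added assumption to guard against clock skew.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Illustrative helper; names are assumptions, not part of the VLO code base.
final class DaysSinceLastSeenSketch {

    /** Whole days between 'last seen' and the import date, never negative. */
    static int daysSinceLastSeen(LocalDate lastSeen, LocalDate importDate) {
        long days = ChronoUnit.DAYS.between(lastSeen, importDate);
        return (int) Math.max(0L, days);
    }

    public static void main(String[] args) {
        LocalDate importDate = LocalDate.now();
        System.out.println(daysSinceLastSeen(importDate.minusDays(14), importDate));
    }
}
```

Like the boolean flag, this value goes stale between imports, so it would need to be refreshed for all records on every import run.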

comment:8 Changed 8 years ago by teckart@informatik.uni-leipzig.de

Owner: changed from Twan Goosen to teckart@informatik.uni-leipzig.de
Status: changed from new to assigned

Implemented and pushed to Github as branch ticket906.

comment:9 Changed 8 years ago by Twan Goosen

Resolution: fixed
Status: changed from assigned to closed

Integrated (with some adaptations) into the release-3.4.1 branch, soon to be deployed in beta for testing.
