Opened 10 years ago
Last modified 7 years ago
#545 assigned defect
The search function should not strip diacritics
Reported by: | Jörg Knappen | Owned by: | Twan Goosen |
---|---|---|---|
Priority: | minor | Milestone: | VLO-4.2 or later |
Component: | VLO web app | Version: | |
Keywords: | Cc: |
Description
Currently, searching for "ḍād" and "dad" gives the same number of hits. This is not the behaviour I expect: when I enter diacritics, I expect only results whose text contains those diacritics (or at least to see the results with diacritics ranked first).
Interestingly, this behaviour is not consistent at all: "ol" and "öl" give different hit lists.
Next test: "étude" and "etude" give the same hit list.
Change History (14)
comment:1 Changed 10 years ago by
Component: | AAI → VLO web app |
---|
comment:2 Changed 9 years ago by
Could be implemented as an option in the advanced search form (see #555)
comment:3 Changed 9 years ago by
Summary: | [vlo 3.0 beta] The search function should not stripe diacritics → The search function should not strip diacritics |
---|
cleaned up ticket summary
comment:4 Changed 9 years ago by
Milestone: | → VLO-3.3 |
---|
Moved a number of existing VLO tickets to 3.3 milestone
comment:5 Changed 9 years ago by
Priority: | major → minor |
---|
Looks like we have the same requirement as the topic starter of this discussion.
Unfortunately I have not been able to find any straightforward way of modulating the behaviour at query time (e.g. telling Solr to fold or not fold characters), only at index time. The best that can be achieved relatively easily seems to be to index both the folded and the unfolded values, so that a search with diacritics does not match the folded values, while a search without diacritics still matches values that contain diacritics.
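To make the "index both" idea concrete, here is a minimal sketch in plain Python. It is not Solr configuration and not the VLO implementation: the names `fold`, `has_diacritics` and `DualIndex` are hypothetical, and the folding rule (strip combining marks after NFD decomposition) is an assumption that only approximates what a Solr folding filter does.

```python
import unicodedata
from collections import defaultdict


def fold(text):
    """Lowercase and strip combining marks (assumed folding rule)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


def has_diacritics(text):
    """True if folding changes the term, i.e. it carries diacritics."""
    return fold(text) != text.lower()


class DualIndex:
    """Hypothetical toy index that keeps an exact and a folded posting list."""

    def __init__(self):
        self.exact = defaultdict(set)   # unfolded tokens -> document ids
        self.folded = defaultdict(set)  # folded tokens -> document ids

    def add(self, doc_id, text):
        for token in text.split():
            exact_key = unicodedata.normalize("NFC", token).lower()
            self.exact[exact_key].add(doc_id)
            self.folded[fold(token)].add(doc_id)

    def search(self, term):
        # A query with diacritics is looked up in the exact index only;
        # a plain query is looked up in the folded index and so matches both.
        if has_diacritics(term):
            key = unicodedata.normalize("NFC", term).lower()
            return set(self.exact.get(key, set()))
        return set(self.folded.get(fold(term), set()))
```

With documents containing "ḍād" and "dad" added to such an index, a search for "ḍād" would return only the first, while a search for "dad" would return both. The decision is made per query term, which is why no query-time switch in the search engine is needed.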
For an overview of the options at index time, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.
Here's a related and possibly useful Unicode report I stumbled upon.
comment:6 Changed 9 years ago by
Milestone: | VLO-3.3 → VLO-3.4 |
---|
comment:7 Changed 9 years ago by
Owner: | set to Twan Goosen |
---|---|
Status: | new → assigned |
comment:8 Changed 9 years ago by
I think this is quite a tricky area of business.
Even Google seems to implement complex configurations, depending on the user interface (browser language) and the location (Google.com, Google.de, etc.). (See what the webmaster said: http://googlewebmastercentral.blogspot.co.at/2006/08/how-search-results-may-differ-based-on.html)
As every user has different expectations for diacritics, I would suggest not trying to solve this in the simple search alone, but providing advanced search options. These should also include searching in specific fields, etc. As users do not care about the algorithm behind the simple search engine, this is the most understandable solution.
Some operators in the simple search (e.g. +, -, AND, OR) might also be helpful, but that is just an optional extra.
comment:9 Changed 9 years ago by
Thanks for the pointer to Google's report on this. I agree that this cannot be solved easily in simple search, and that we could possibly have a special syntax or option to do a 'strict' search. However, this also needs to be supported by Solr, as it has implications for both the indexing and the querying machinery.
Advanced search has been implemented in the upcoming 3.3 version of the VLO (currently in beta at http://beta-vlo.clarin.eu) as described in #762.
comment:10 Changed 9 years ago by
Milestone: | VLO-3.4 → VLO-3.5 or later |
---|
Moved to the '3.5 or later' milestone, as decided in the developer video conference
comment:14 Changed 7 years ago by
Owner: | changed from Twan Goosen to Twan Goosen |
---|
changing component: VLO instead of AAI