Opened 10 years ago

Last modified 7 years ago

#545 assigned defect

The search function should not strip diacritics

Reported by: Jörg Knappen
Owned by: Twan Goosen
Priority: minor
Milestone: VLO-4.2 or later
Component: VLO web app
Version:
Keywords:
Cc:

Description

Currently, searching for "ḍād" and "dad" gives the same number of hits. This is not the behaviour I expect: when I include the diacritics, I expect only results for text with diacritics (or at least the relevant results with diacritics first).

Interestingly, this behaviour is not consistent at all: "ol" and "öl" give different hit lists.

Next test: "étude" and "etude" give the same hit lists.
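To illustrate why such queries can collapse into one, here is a minimal Python sketch of diacritic folding. It is only an assumption about the general technique (decompose, then drop combining marks); the function name `fold_diacritics` is hypothetical and this is not the VLO's or Solr's actual analyzer. Note that this simple folding also maps "öl" to "ol", so it does not reproduce the inconsistency reported above, which presumably depends on the real index configuration.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip diacritics: decompose to NFD, then drop combining marks.

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so that the result is in a single, predictable form.
    return unicodedata.normalize("NFC", stripped)


# Compare the query pairs mentioned in the ticket description.
for with_marks, plain in [("ḍād", "dad"), ("étude", "etude"), ("öl", "ol")]:
    print(with_marks, "->", fold_diacritics(with_marks),
          "| same as plain query:", fold_diacritics(with_marks) == plain)
```

If both the indexed text and the query pass through a function like this, the two spellings become indistinguishable, which matches the identical hit counts described above.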

Change History (14)

comment:1 Changed 10 years ago by Dieter Van Uytvanck

Component: AAI → VLO web app

changing component: VLO instead of AAI

comment:2 Changed 9 years ago by Twan Goosen

Could be implemented as an option in the advanced search form (see #555)

comment:3 Changed 9 years ago by Twan Goosen

Summary: [vlo 3.0 beta] The search function should not stripe diacritics → The search function should not strip diacritics

cleaned up ticket summary

comment:4 Changed 9 years ago by Twan Goosen

Milestone: VLO-3.3

Moved a number of existing VLO tickets to 3.3 milestone

comment:5 Changed 9 years ago by Twan Goosen

Priority: major → minor

Looks like we have the same requirement as the topic starter of this discussion.

Unfortunately, I have not been able to find any straightforward way of modulating the behaviour at query time (e.g. telling Solr to fold or not fold characters), only at index time. The best that can be achieved relatively easily seems to be to index both the folded and the unfolded values, so that a search with diacritics does not match the folded values, while a search without diacritics still matches values that contain them.

For an overview of the options at index time, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

Here's a related and possibly useful Unicode report I stumbled upon.
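The dual-indexing idea from this comment can be sketched as a toy in-memory index in Python. This is a hedged illustration, not the VLO or Solr implementation: the class `DualIndex`, its methods and the helper `fold` are hypothetical names, and the rule "a query containing diacritics goes to the exact index, any other query goes to the folded index" is one assumed reading of the proposal. In Solr, the rough equivalent would presumably be two fields, only one of which applies a folding filter at index time.

```python
import unicodedata
from collections import defaultdict


def fold(text: str) -> str:
    """Strip combining marks (hypothetical stand-in for index-time folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


class DualIndex:
    """Toy index that stores every token both as written and folded."""

    def __init__(self) -> None:
        self.exact = defaultdict(set)   # token as written -> document ids
        self.folded = defaultdict(set)  # folded token -> document ids

    def add(self, doc_id: int, text: str) -> None:
        for token in text.lower().split():
            token = unicodedata.normalize("NFC", token)
            self.exact[token].add(doc_id)
            self.folded[fold(token)].add(doc_id)

    def search(self, term: str) -> set:
        term = unicodedata.normalize("NFC", term.lower())
        if term != fold(term):
            # The query carries diacritics: match unfolded values only.
            return set(self.exact.get(term, set()))
        # Plain query: match everything that folds to this term.
        return set(self.folded.get(term, set()))


index = DualIndex()
index.add(1, "ḍād")
index.add(2, "dad")

print(sorted(index.search("dad")))  # plain query matches both documents
print(sorted(index.search("ḍād")))  # query with diacritics matches only document 1
```

The cost of this design is a larger index, since every value is stored twice; in exchange, no query-time switch in the analyzer is needed.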

comment:6 Changed 9 years ago by Twan Goosen

Milestone: VLO-3.3 → VLO-3.4

comment:7 Changed 9 years ago by Twan Goosen

Owner: set to Twan Goosen
Status: new → assigned

comment:8 Changed 9 years ago by go.sugimoto@oeaw.ac.at

I think this is quite a tricky area of business.
Even Google seems to implement complex configurations, depending on the user interface (browser language) and the location (Google.com, Google.de etc.). (See what the webmaster said: http://googlewebmastercentral.blogspot.co.at/2006/08/how-search-results-may-differ-based-on.html)

Since users have different expectations for diacritics, I would suggest not trying to solve this in the simple search alone, but providing advanced search options. These should also include searching in specific fields, etc. As users do not care about the algorithm behind the simple search engine, this is the most understandable solution.

Perhaps some operators in the simple search (e.g. +, -, AND, OR, etc.) would be helpful, but that is just an option.

comment:9 Changed 9 years ago by Twan Goosen

Thanks for the pointer to Google's report on this. I agree that this cannot be solved easily in simple search, and that we could possibly have a special syntax or option to do a 'strict' search. However, this also needs to be supported by Solr, as it has implications for both the indexing and the querying machinery.

Advanced search has been implemented in the upcoming 3.3 version of the VLO (currently in beta at http://beta-vlo.clarin.eu) as described in #762.

comment:10 Changed 9 years ago by Twan Goosen

Milestone: VLO-3.4 → VLO-3.5 or later

Decided in the developer video conference to put this on the '3.5 or later' milestone

comment:11 Changed 8 years ago by Twan Goosen

Milestone: VLO-3.5 or later → VLO-4.0 or later

Milestone renamed

comment:12 Changed 8 years ago by Twan Goosen

Milestone: VLO-4.0 or later → VLO-4.1 or later

Milestone renamed

comment:13 Changed 8 years ago by Twan Goosen

Milestone: VLO-4.1 or later → VLO-4.2 or later

Milestone renamed

comment:14 Changed 7 years ago by Twan Goosen

Owner: changed from Twan Goosen to Twan Goosen