Ticket #12110 (closed PLIP: fixed)
Plain text searches ignore accents
| Reported by: | thomasdesvenain | Owned by: | |
| --- | --- | --- | --- |
| Priority: | minor | Milestone: | 4.3 |
| Component: | General | Version: | |
| Keywords: | lexicon, catalog, internationalization | Cc: | vincentfretin, terapyon, mikerhodes, ggozad, shh |
Description (last modified by thomasdesvenain)
Proposer: Thomas Desvenain
Seconder: Vincent Fretin
Motivation
Most users want the search to ignore accents.
This is a matter of convenience: a search on 'econometrie' should find documents containing the term "économétrie".
It would also fix a real issue, since most users don't type accents on upper-case characters:
For example, a search on 'économétrie' doesn't find a document titled "Econométrie".
Assumptions
We will improve the Plone lexicon so that it normalizes indexed and searched terms in plain-text indexes (ZCTextIndex). A document containing the words 'Econométrie' or 'économétrie' will be indexed under the term 'econometrie', and a search for 'économétrie' or 'econometrie' will look up the index value 'econometrie'.
The normalization will follow the same model as document id generation in Plone.
To avoid performance issues and changes at a lower level than Plone, normalization will be independent of the site language.
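A minimal sketch of the intended behaviour, assuming plone.i18n's baseNormalize does the accent folding (the normalize_term helper is illustrative, not part of the shipped code):

```python
# Sketch only: fold accents and case so indexing and querying agree on one
# canonical spelling, using plone.i18n's language-independent mapping table.
from plone.i18n.normalizer.base import baseNormalize


def normalize_term(word):
    return baseNormalize(word).lower()


# Every spelling collapses onto the same indexed term:
for word in (u'Econométrie', u'économétrie', u'Économétrie'):
    assert normalize_term(word) == u'econometrie'
```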
Proposal & Implementation
We have to add a new Case Normalizer named 'I18n Case Normalizer'. This normalizer will use plone.i18n tools to generate an ASCII string from any word it is asked to normalize.
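A sketch of what the new pipeline element could look like, assuming the standard ZCTextIndex pipeline protocol (each element exposes a process() method that maps a list of words to a list of words) and the usual element_factory registration already used by Products.CMFPlone.UnicodeSplitter:

```python
# Sketch of the proposed 'I18n Case Normalizer' pipeline element.
from Products.ZCTextIndex.PipelineFactory import element_factory
from plone.i18n.normalizer.base import baseNormalize


class I18NCaseNormalizer(object):
    """Lower-case and accent-fold every word passing through the pipeline."""

    def process(self, words):
        return [baseNormalize(word).lower() for word in words]


element_factory.registerFactory(
    'Case Normalizer', 'I18n Case Normalizer', I18NCaseNormalizer)
```

Because baseNormalize only maps characters present in its table, '*' and '?' should pass through untouched, which keeps glob searches working.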
Deliverables
- Code
  - Add a new class in Products.CMFPlone.UnicodeSplitter and register it as 'I18n Case Normalizer'.
  - Make plone_lexicon use it.
- Upgrades
  - Upgrade plone_lexicon with this normalizer.
  - Reindex the ZCTextIndex indexes.
- Unit tests
  - Test that a search with the criterion 'econometrie' or 'économétrie' finds documents containing the words 'économétrie', 'Économétrie', or 'Econométrie'.
  - Test with glob criteria such as 'econometr*' or 'econom?trie'.
  - Equivalent unit tests with an eastern language (a sketch of the Latin-script case follows this list).
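A sketch of the Latin-script test, assuming the classic Products.PloneTestCase fixture used by the existing Products.CMFPlone tests (class and test names here are only illustrative):

```python
# Illustrative test sketch, not the shipped test module.
from Products.PloneTestCase.PloneTestCase import PloneTestCase


class TestI18NCaseNormalizer(PloneTestCase):

    def afterSetUp(self):
        # A document whose title carries accented, capitalized words.
        self.folder.invokeFactory('Document', 'doc', title=u'Économétrie')
        self.catalog = self.portal.portal_catalog

    def test_unaccented_query_finds_accented_title(self):
        for query in ('econometrie', u'économétrie'):
            brains = self.catalog(SearchableText=query)
            self.assertEqual([brain.getId for brain in brains], ['doc'])

    def test_glob_queries(self):
        for query in ('econometr*', 'econom?trie'):
            self.assertTrue(self.catalog(SearchableText=query))
```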
Risks
The main risks are:
- it has to work with all languages, including eastern languages.
- the consequences for overall performance have to be checked.
- indexes have to be updated for backward compatibility.
- it has to work with * and ? (glob) searches.
Participants
Thomas Desvenain, Manabu Terada
Progress
https://github.com/tdesvenain/Products.CMFPlone
- Unit tests written
- Code written (Works with a new installation)
TODO
Upgrades
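The remaining upgrade work amounts to switching plone_lexicon over to the new normalizer and reindexing the affected indexes. A rough sketch of the reindexing half, under the assumption that every ZCTextIndex in portal_catalog has to be rebuilt (this is not the actual plone.app.upgrade step):

```python
# Sketch only: rebuild every ZCTextIndex after plone_lexicon has been
# switched to the 'I18n Case Normalizer', so existing entries and new
# queries agree on the normalized terms.
def reindex_text_indexes(portal):
    catalog = portal.portal_catalog
    text_indexes = [
        name for name in catalog.indexes()
        if catalog._catalog.getIndex(name).meta_type == 'ZCTextIndex'
    ]
    if text_indexes:
        catalog.manage_reindexIndex(ids=text_indexes)
```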
Change History
comment:2 Changed 5 years ago by thomasdesvenain
I had a look at this. I think we should reuse some of the Products.UnicodeLexicon code (especially the tests and install) if the PLIP is accepted. For normalization, the idea is to use plone.i18n's algorithms and mapping databases, to get a language-independent normalizer that also works for non-Latin languages. Products.UnicodeLexicon has its own.
I had a quick look, and I don't understand the difference between the current unicode word splitter and the one implemented in Products.UnicodeLexicon.
comment:3 Changed 5 years ago by thomasdesvenain
- Description modified
I added the related packages to the 'Progress' section.
comment:4 Changed 5 years ago by eleddy
approved for 4.3 - please let us know when this is ready for review!
comment:5 Changed 5 years ago by ggozad
- Cc ggozad, shh added
Hey! I am going to be the PLIP champion for this one, so let me know if you need anything...
comment:7 follow-up: ↓ 8 Changed 4 years ago by thomasdesvenain
I have implemented most of it, but a question remains:
I just use plone.i18n's baseNormalize to normalize the string, avoiding the other transforms, which are strictly unnecessary here (and which, by the way, broke */? searches).
The question is: should I apply locale-specific transformation mappings (declared as named IIDNormalizer utilities)? This raises further questions:
- how do I determine the language (user's preferred language? site default language? etc.)
- should I apply transforms for every language of the site, or for only one?
- performance issues (this adds language detection and a utility lookup, plus the mapping of locale-specific characters, of course)
For now I simply use baseNormalize. In my language (French), the only issue is that if I search for 'œuf' I won't find 'oeuf' or 'OEUF', and vice versa, which is a VERY rare use case. But for other languages it may limit the benefits of this PLIP.
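A minimal sketch of the difference, assuming a hypothetical extra French mapping (the real locale mappings are the ones declared by the named IIDNormalizer utilities mentioned above):

```python
# Sketch only: locale-specific folding on top of plone.i18n's baseNormalize.
from plone.i18n.normalizer.base import baseNormalize

# Hypothetical French additions: per the comment above, baseNormalize alone
# does not expand the oe ligature, so 'œuf' and 'oeuf' index as different terms.
FR_EXTRA = {u'\u0153': u'oe', u'\u0152': u'OE'}  # oe / OE ligatures


def normalize_fr(word):
    for char, replacement in FR_EXTRA.items():
        word = word.replace(char, replacement)
    return baseNormalize(word).lower()


assert normalize_fr(u'\u0152UF') == normalize_fr(u'oeuf') == u'oeuf'
```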
comment:8 in reply to: ↑ 7 Changed 4 years ago by ggozad
Replying to thomasdesvenain:
> I just use plone.i18n's baseNormalize to normalize the string, avoiding the other transforms, which are strictly unnecessary here (and which, by the way, broke */? searches).

baseNormalize will wipe anything you pass to it that is non-Latin (say Greek, or I guess any Asian language).

> The question is: should I apply locale-specific transformation mappings (declared as named IIDNormalizer utilities)?

I guess you have to. In any case, it will have to work with any language, not just Latin-based ones.

> - how do I determine the language (user's preferred language? site default language? etc.)

In my opinion you would have to take into account the content/site language and LinguaPlone.

> - performance issues (this adds language detection and a utility lookup, plus the mapping of locale-specific characters, of course)

It would be great if you could come up with simple performance tests for "small" as well as "large" texts.

> For now I simply use baseNormalize. In my language (French), the only issue is that if I search for 'œuf' I won't find 'oeuf' or 'OEUF', and vice versa, which is a VERY rare use case. But for other languages it may limit the benefits of this PLIP.

I guess this really depends on the language. There are languages (say Norwegian) where 'ae' is really not a good substitute for 'æ' ;). If these mappings could be implemented as some kind of per-language adapter, it would be better.
In any case, if I may suggest, it would be great to involve people who use and work with more exotic languages as much as possible.
comment:9 Changed 4 years ago by eleddy
ggozad is taking a break for a month and I (eleddy) will be taking the lead on this for now. Please let me know if you have more questions. Looking to see this ready for review the first week in January!
comment:10 Changed 4 years ago by eleddy
Is this PLIP ready for review, by chance?
comment:11 Changed 4 years ago by thomasdesvenain
Yes, you can review this PLIP! (I was waiting for a review by a colleague before notifying you, but he can't do it now.)
Thanks !
comment:12 Changed 4 years ago by giacomos
- severity set to Untriaged
comment:13 Changed 4 years ago by thomasdesvenain
- Keywords lexiconcataloginternationalization added; lexicon catalog internationalization removed
- Status changed from new to closed
- Resolution set to fixed
Hi Giacomo,
I have made a commit in plone.app.upgrade that fixes the 2 test errors. I checked that the v3.upgradeToI18NCaseNormalizer upgrade step is actually covered by tests (test_upgrades.testDoUpgrades).
The errors in the testNormalizeLatin1 and testProcessLatin1 tests are not related to my code. When I revert my code, I still get them. (I have tried to fix those errors by the way, but without success.)
PS: your review has disappeared from the buildout.coredev repository.
comment:14 Changed 4 years ago by thomasdesvenain
- Status changed from closed to confirmed
I also added a specific test for this upgrade step (testUpgradeToI18NCaseNormalizer)
comment:15 Changed 4 years ago by thomasdesvenain
Hi Giacomo, I have fixed the test. Indeed, it didn't pass with an up-to-date Products.CMFPlone. Thanks!
comment:16 Changed 4 years ago by eleddy
Thanks for the work here - feel free to merge into the new branch.
comment:17 Changed 4 years ago by davisagli
In case it wasn't clear from Liz's comment, the framework team approved this PLIP to be merged for Plone 4.3 in our meeting on April 24. Do you know when you'll be able to work on that? Please feel free to ask me questions if you aren't sure how to go about doing that.
comment:18 Changed 4 years ago by eleddy
Hey guys - we are looking to see this merged by June 30th. If you have time and can merge, that would be great. Otherwise maybe one of the FWT members will step in and merge.
comment:19 Changed 4 years ago by kleist
- Keywords lexicon, catalog, internationalization added; lexiconcataloginternationalization removed
- Component changed from Infrastructure to General
Please note that Products.UnicodeLexicon implements all of this already. And it works just fine with Plone.
http://pypi.python.org/pypi/Products.UnicodeLexicon