Ticket #12110 (closed PLIP: fixed)

Opened 5 years ago

Last modified 4 years ago

Plain text searches ignore accents

Reported by: thomasdesvenain Owned by:
Priority: minor Milestone: 4.3
Component: General Version:
Keywords: lexicon, catalog, internationalization Cc: vincentfretin, terapyon, mikerhodes, ggozad, shh

Description (last modified by thomasdesvenain) (diff)

Proposer: Thomas Desvenain Seconder: Vincent Fretin

Motivation

Most users want the search to ignore accents.

This is a matter of convenience: a search for 'econometrie' should find documents containing the term "économétrie".

It would also fix a real issue, since most users don't type accents on upper-case characters:

For example, a search for 'économétrie' does not find a document titled "Econométrie".

Assumptions

We will improve the Plone lexicon so that it normalizes both indexed and searched terms in plain text indexes (ZCTextIndex). A document containing the words 'Econométrie' or 'économétrie' will be indexed under the term 'econometrie', and a search for 'économétrie' or 'econometrie' will query the 'econometrie' index value.

The normalization will follow the same model as document id generation in Plone.

To avoid performance issues and changes at a lower level than Plone, the normalization will be independent of the site language.
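As an illustration of the intended behaviour only (the actual implementation will rely on plone.i18n, see below), this kind of accent folding can be sketched with the standard library:

    import unicodedata

    def fold(word):
        """Lower-case ``word`` and drop its combining accent marks."""
        decomposed = unicodedata.normalize('NFKD', word)
        stripped = u''.join(c for c in decomposed
                            if not unicodedata.combining(c))
        return stripped.lower()

    # All three spellings collapse to the same indexed term:
    assert fold(u'économétrie') == u'econometrie'
    assert fold(u'Économétrie') == u'econometrie'
    assert fold(u'Econométrie') == u'econometrie'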

Proposal & Implementation

We will add a new case normalizer named 'I18n Case Normalizer'. This normalizer will use plone.i18n tools to generate an ASCII string from any word it normalizes.
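A minimal sketch of what such a pipeline element could look like, assuming the ZCTextIndex pipeline contract (process()/processGlob() take and return a sequence of words) and registration through ZCTextIndex's element_factory; the Python class name is illustrative:

    from plone.i18n.normalizer.base import baseNormalize
    from Products.ZCTextIndex.PipelineFactory import element_factory

    class I18NNormalizer(object):
        """Case normalizer that also folds accents via plone.i18n."""

        def process(self, words):
            # 'Économétrie', 'économétrie' and 'Econométrie' all become
            # 'econometrie', for indexing as well as for queries.
            return [baseNormalize(w).lower() for w in words]

        def processGlob(self, words):
            # Same treatment for glob queries, so 'econometr*' and
            # 'econom?trie' keep working after normalization.
            return [baseNormalize(w).lower() for w in words]

    element_factory.registerFactory('Case Normalizer',
                                    'I18n Case Normalizer', I18NNormalizer)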

Deliverables

  • Code
      ◦ Add a new class in Products.CMFPlone.UnicodeSplitter and register it as 'I18n Case Normalizer'.
      ◦ plone_lexicon will use this normalizer.
  • Upgrades
      ◦ Upgrade plone_lexicon with this normalizer.
      ◦ Reindex the ZCTextIndex indexes.
  • Unit tests
      ◦ Test that documents containing the words 'économétrie', 'Économétrie' or 'Econométrie' are found with the criterion 'econometrie' or 'économétrie' (see the sketch after this list).
      ◦ Test with glob criteria such as 'econometr*' or 'econom?trie'.
      ◦ Equivalent unit tests with an Eastern language.
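A rough sketch of the first unit test above, assuming a plone.app.testing integration layer; class and test names are illustrative, not the actual test code:

    import unittest

    from plone.app.testing import PLONE_INTEGRATION_TESTING
    from plone.app.testing import TEST_USER_ID, setRoles

    class TestAccentInsensitiveSearch(unittest.TestCase):

        layer = PLONE_INTEGRATION_TESTING

        def test_queries_find_accented_title(self):
            portal = self.layer['portal']
            setRoles(portal, TEST_USER_ID, ['Manager'])
            portal.invokeFactory('Document', 'doc', title=u'Économétrie')
            catalog = portal.portal_catalog
            # Plain, accented and glob criteria should all match the document.
            for query in (u'econometrie', u'économétrie',
                          u'econometr*', u'econom?trie'):
                results = catalog(SearchableText=query)
                self.assertEqual(len(results), 1, query)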

Risks

The main risks are:

  • it has to work with all languages, including Eastern languages;
  • the impact on overall performance has to be checked;
  • existing indexes have to be updated for backward compatibility;
  • it has to work with * and ? searches.

Participants

Thomas Desvenain, Manabu Terada

Progress

 https://github.com/tdesvenain/Products.CMFPlone

  • Unit tests written
  • Code written (works with a new installation)

TODO

Upgrades

Change History

comment:1 Changed 5 years ago by shh

Please note that Products.UnicodeLexicon implements all of this already. And it works just fine with Plone.

 http://pypi.python.org/pypi/Products.UnicodeLexicon

comment:2 Changed 5 years ago by thomasdesvenain

I had a look at this. I think we should reuse some of the Products.UnicodeLexicon code (especially the tests and install code) if the PLIP is accepted. For normalization, the idea is to use the plone.i18n algorithms and data to get a language-independent normalizer that works for non-Latin languages; Products.UnicodeLexicon has its own.

I had a quick look, and I don't understand the difference between the current Unicode word splitter and the one implemented in Products.UnicodeLexicon.

comment:3 Changed 5 years ago by thomasdesvenain

  • Description modified (diff)

I added the related packages in the 'Progress' section.

comment:4 Changed 5 years ago by eleddy

approved for 4.3 - please let us know when this is ready for review!

comment:5 Changed 5 years ago by ggozad

  • Cc ggozad, shh added

Hey! I am going to be the PLIP champion for this one, so let me know if you need anything...

comment:6 Changed 4 years ago by thomasdesvenain

  • Description modified (diff)

comment:7 follow-up: ↓ 8 Changed 4 years ago by thomasdesvenain

I have implemented most of it, but a question remains:

I just use plone.i18n's baseNormalize to normalize the string, avoiding the other transforms, which are strictly useless here (and which, by the way, broke */? searches).

The question is: should I apply locale-specific transformation mappings (declared by named IIDNormalizer utilities)? This raises further questions:

  • how do I determine the language (the user's preferred language? the site's default language? etc.)
  • should I apply the transforms for every language of the site, or only one?
  • performance issues (this adds language computation and a utility lookup, plus the mapping of local characters of course)

For now I simply use baseNormalize. In my language (French), the only issue is that if I search for 'œuf' I won't find 'oeuf' or 'OEUF', and vice versa, which is a VERY rare use case. But for other languages this may limit the benefits of this PLIP?
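For reference, a quick way to check what baseNormalize actually does with ligatures and non-Latin input before deciding whether extra mappings are needed; the sample words are arbitrary and the output is deliberately not asserted:

    from plone.i18n.normalizer.base import baseNormalize

    for word in (u'œuf', u'OEuf', u'Économétrie', u'Ωμέγα'):
        print('%r -> %r' % (word, baseNormalize(word).lower()))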

comment:8 in reply to: ↑ 7 Changed 4 years ago by ggozad

Replying to thomasdesvenain:

I just use plone.i18n's baseNormalize to normalize the string, avoiding the other transforms, which are strictly useless here (and which, by the way, broke */? searches).

baseNormalize will wipe anything you pass to it that is non-Latin (say Greek, or I guess any Asian language).

The question is: should I apply locale-specific transformation mappings (declared by named IIDNormalizer utilities)?

I guess you have to. In any case, it will have to work with any language, not just Latin-based ones.

  • how do I determine the language (the user's preferred language? the site's default language? etc.)

In my opinion you would have to take into account the content/site language and LinguaPlone.

  • performance issues (this adds language computation and a utility lookup, plus the mapping of local characters of course)

It would be great if you could come up with simple performance tests for "small" as well as "large" texts.
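For example (just a sketch, with placeholder corpora), even a quick timeit comparison against plain lower-casing would give an idea of the overhead:

    import timeit

    from plone.i18n.normalizer.base import baseNormalize

    small = u"Recherche sur l'économétrie appliquée".split()
    large = small * 5000  # stand-in for a long document body

    def lower_only(words):
        return [w.lower() for w in words]

    def normalized(words):
        return [baseNormalize(w).lower() for w in words]

    for label, words in (('small', small), ('large', large)):
        for func in (lower_only, normalized):
            seconds = timeit.timeit(lambda: func(words), number=100)
            print('%-6s %-12s %.4fs' % (label, func.__name__, seconds))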

For now I simply use baseNormalize. In my language (French), the only issue is that if I search for 'œuf' I won't find 'oeuf' or 'OEUF', and vice versa, which is a VERY rare use case. But for other languages this may limit the benefits of this PLIP?

I guess this really depends on the language. There are languages (say Norwegian) where 'ae' is really not a good substitute for 'æ' ;). If these could be implemented as some kind of per-language adapter, it would be better.

In any case, if I may suggest, it would be great to involve people who use and work with more exotic languages as much as possible.

comment:9 Changed 4 years ago by eleddy

ggozad is taking a break for a month and I (eleddy) will be taking the lead on this for now. Please let me know if you have more questions. Looking to see this ready for review in the first week of January!

comment:10 Changed 4 years ago by eleddy

Is this PLIP ready for review, by chance?

comment:11 Changed 4 years ago by thomasdesvenain

Yes, you can review this PLIP! (I was waiting for a review by a colleague before notifying you, but he can't do it now.)

Thanks!

comment:12 Changed 4 years ago by giacomos

  • severity set to Untriaged

comment:13 Changed 4 years ago by thomasdesvenain

  • Keywords lexiconcataloginternationalization added; lexicon catalog internationalization removed
  • Status changed from new to closed
  • Resolution set to fixed

Hi Giacomo,

I have made a commit in plone.app.upgrade that fixes the two test errors. I checked that the v3.upgradeToI18NCaseNormalizer upgrade step is actually covered by tests (test_upgrades.testDoUpgrades).

The errors in the testNormalizeLatin1 and testProcessLatin1 tests are not related to my code; I still get them when I revert my changes. (I have tried to fix those errors, by the way, but without success...)

PS: your review has disappeared from the buildout.coredev repository.

comment:14 Changed 4 years ago by thomasdesvenain

  • Status changed from closed to confirmed

I also added a specific test for this upgrade step (testUpgradeToI18NCaseNormalizer).

comment:15 Changed 4 years ago by thomasdesvenain

Hi Giacomo, I have fixed the test. Indeed, it didn't pass with an up-to-date Products.CMFPlone. Thanks!

comment:16 Changed 4 years ago by eleddy

Thanks for the work here - feel free to merge into the new branch.

comment:17 Changed 4 years ago by davisagli

In case it wasn't clear from Liz's comment, the framework team approved this PLIP to be merged for Plone 4.3 in our meeting on April 24. Do you know when you'll be able to work on that? Please feel free to ask me questions if you aren't sure how to go about doing that.

comment:18 Changed 4 years ago by eleddy

Hey guys - we are looking to see this merged by June 30th. If you have time and can merge, that would be great. Otherwise maybe one of the FWT members will step in and merge.

comment:19 Changed 4 years ago by kleist

  • Keywords lexicon, catalog, internationalization added; lexiconcataloginternationalization removed
  • Component changed from Infrastructure to General

comment:20 Changed 4 years ago by esteele

  • Status changed from confirmed to closed

Merged.
