Ticket #9309 (closed PLIP: fixed)

Opened 7 years ago

Last modified 6 years ago

Better search for East Asian (multi-byte) languages.

Reported by: terapyon
Owned by: terapyon
Priority: minor
Milestone: 4.0
Component: Unknown
Version:
Keywords: search splitter CJK
Cc: terada@…, plip-advisories@…

Description

Summary: A default search mechanism that supports East Asian languages (CJK: Chinese, Japanese and Korean). Plone's out-of-the-box search gives very poor (unacceptable) results for CJK languages. This low search quality severely undermines the claim on plone.org that "Plone handles Chinese (and) Japanese." CJK languages have no spaces between words, so an additional algorithm, the N-gram split mechanism, is required to break text up into words.

Detail: We will extend the CMFPlone/UnicodeSplitter.py module and, if needed, patch the ZCTextIndex module. The new splitter mechanism will handle Unicode strings. Languages that use spaces to separate words will be routed to the existing splitter, and East Asian languages will be routed to the new N-gram (bi-gram) splitter and then indexed. For East Asian languages, search box input will also be split into N-grams and matched against the index.

Plan: The Japanese regional committer group will manage the core development process and build the development team. We have been working on a multi-byte splitter for a while and have core code for our proposal. Under our plan, the new splitter code will be ready for worldwide testing by September 2009.

Change History

comment:1 Changed 7 years ago by alecm

This is a vitally important task. Are there risks? How is the determination of language made by the splitter? Please provide more detail, if possible.

comment:2 Changed 7 years ago by tyam

This was attempted in a Google Summer of Code project last year, but it did not reach its goal. Development has continued, and the code is being tested and tuned on Plone 3 in Japan. Since non-CJK characters will still be handled by the current logic, only CJK users will be exposed to the risk of bad indexing and searching. But Plone's current out-of-the-box indexing and searching does not work well with CJK characters anyway. We want as much participation as possible from the Chinese- and Korean-speaking communities, especially for testing. The splitting does not try to divide the text into words. Bi-gram splitting produces sets (combinations) of two adjacent characters. These combinations are used both when the text is indexed and when it is searched.

comment:3 Changed 7 years ago by erikrose

Clearing Owner field of 4.0 PLIPs so we can use it to mean "implementor". (Many of these owners were automatically assigned from choosing a Component that had a default owner.)

comment:4 Changed 7 years ago by smcmahon

  • Cc plip-advisories@… added

comment:5 Changed 7 years ago by terapyon

More details as follows:

Plone currently does not produce good search results for Chinese, Japanese, and Korean (CJK). Text in those languages is not divided into words using white space, which means that Plone's standard white space-based text indexing does not work.

PROPOSAL

We propose giving Plone better CJK search out-of-the-box by implementing a different indexing logic for CJK text. There are three main elements to this:

  1. CJK text detection. The indexing code recognizes CJK text by checking the character code of each character being indexed. This will add a small overhead to all text indexing. It switches the indexing logic only when the code is in the CJK range; non-CJK text will thus not be affected. When non-CJK characters follow CJK characters, indexing reverts to the standard white-space method.
  2. Splitting. CJK text is then split into index keys using the bi-gram method. There are two main methods of splitting CJK text: bi-grams and morphological analysis. Morphological analysis can yield better results, but requires maintenance of a dictionary. We propose using the bi-gram method because it does not require dictionary maintenance and because its results are good enough for text search purposes. The bi-gram method splits a text consisting of N characters into (N-1) bi-grams, or pairs of adjacent characters. Each bi-gram is then registered as an index key.
  3. Searching. When searching, the search term is also split into a set of bi-grams that are used as search keys. These search keys are compared with the index keys: a hit requires that every search key matches and that the matching index keys are adjacent and in the same order. This should work like searching in English text for a double-quoted search term such as "I like Plone".
  • Consider this example; each roman character here represents a CJK character.
  1. The text "content" is indexed. This generates the following 6 index keys:

"co", "on", "nt", "te", "en", "nt"

  2. The search term "cont" is entered. This generates 3 search terms:

"co", "on", "nt"

  3. All 3 generated search terms will be matched with the first 3 keys adjacently. So, we can say "cont" is matched to "content".
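
The example above can be reproduced with a few lines of Python. This is only a minimal sketch of bi-gram splitting and adjacent matching; the function names `bigrams` and `matches` are made up for illustration and are not the names used in UnicodeSplitter.py, and a real index would look keys up rather than scan a list.

```python
def bigrams(text):
    """Split text into overlapping pairs of adjacent characters."""
    if len(text) < 2:
        # A single character cannot form a pair; keep it as its own key.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches(document, query):
    """Return True if the query's bi-grams occur consecutively,
    in the same order, among the document's bi-grams."""
    doc_keys = bigrams(document)
    query_keys = bigrams(query)
    if not query_keys:
        return False
    span = len(query_keys)
    return any(doc_keys[i:i + span] == query_keys
               for i in range(len(doc_keys) - span + 1))


print(bigrams("content"))          # ['co', 'on', 'nt', 'te', 'en', 'nt']
print(bigrams("cont"))             # ['co', 'on', 'nt']
print(matches("content", "cont"))  # True
print(matches("content", "cnot"))  # False
```

A text of N characters yields N-1 keys, which is why "content" (7 characters) produces 6 index keys.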

RISKS

Low

  • Since the first step of this process checks the character code and switches the logic only when its code is in the range of CJK, non-CJK should not be affected. This approach also allows indexing methods for non-white-spaced languages other than CJK languages to be added in future.
  • Code range checking will be a small, but we believe acceptable, overhead per character for all users regardless of languages.
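
The range check and the switch between the two splitting methods might look roughly like the sketch below. The code point ranges listed are a small illustrative subset (Hiragana and Katakana, CJK Unified Ideographs, Hangul Syllables), and `is_cjk` and `split_mixed` are hypothetical names; the ranges and structure of the actual splitter may differ.

```python
from itertools import groupby

# Illustrative subset of Unicode blocks; the real splitter may cover more.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(char):
    """Return True if the character's code point is in a listed range."""
    code = ord(char)
    return any(low <= code <= high for low, high in CJK_RANGES)


def split_mixed(text):
    """Split on white space, then bi-gram any run of CJK characters."""
    keys = []
    for word in text.split():
        # groupby yields consecutive runs of CJK / non-CJK characters.
        for cjk, chars in groupby(word, key=is_cjk):
            run = "".join(chars)
            if cjk and len(run) > 1:
                keys.extend(run[i:i + 2] for i in range(len(run) - 1))
            else:
                keys.append(run)
    return keys


print(split_mixed("Plone 日本語検索 search"))
# ['Plone', '日本', '本語', '語検', '検索', 'search']
```

Non-CJK words pass through unchanged, so the only cost they pay is the per-character range comparison.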

PARTICIPANTS

Manabu Terada - Leader

Mikio Hokari - Programmer

Takeshi Yamamoto - Documentation and Testing

Naotaka Hotta - Support and Documentation

Jonathan Lewis - Moral support and documentation

PROGRESS

A start on CJK indexing was made as a GSoC project in 2008. That project didn't deliver as much as hoped, but the mentor group has continued development and the East Asian Plone community has helped to test the code. The major remaining tasks are refining the code and repackaging it as a built-in component.

comment:6 Changed 7 years ago by MatthewWilkes

FWT Vote: +1

comment:7 Changed 7 years ago by rossp

FWT vote is +1, so long as this has an owner.

comment:8 Changed 7 years ago by davisagli

FWT vote: +1. Very interesting and important work.

When you submit your code for review, please include info on the measured overhead of the code range checking. If it is significant, please consider adding an environment variable to allow turning this off for sites that don't require indexing of CJK text.
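
One way to gather such numbers is sketched below. It times a plain whitespace split against the same split preceded by a per-character range check, and gates the check behind an environment variable. The variable name `PLONE_DISABLE_CJK_SPLITTER`, the helper names, and the single code point range are all assumptions made for illustration; none of them come from the submitted code.

```python
import os
import timeit

# Hypothetical variable name; no such setting is defined by the PLIP.
ENV_FLAG = "PLONE_DISABLE_CJK_SPLITTER"


def cjk_check_enabled(environ=None):
    """Return False when the (hypothetical) opt-out variable is set."""
    environ = os.environ if environ is None else environ
    return environ.get(ENV_FLAG, "").lower() not in ("1", "true", "yes")


def contains_cjk(text):
    """Per-character range check; one illustrative range only."""
    return any(0x4E00 <= ord(char) <= 0x9FFF for char in text)


def split_with_check(text, environ=None):
    """Whitespace split, optionally preceded by the range check."""
    if cjk_check_enabled(environ):
        contains_cjk(text)  # result unused; only its cost matters here
    return text.split()


if __name__ == "__main__":
    sample = "Plone handles plain whitespace separated text " * 200
    plain = timeit.timeit(lambda: sample.split(), number=200)
    checked = timeit.timeit(lambda: split_with_check(sample, {}), number=200)
    print("plain: %.4fs  with range check: %.4fs" % (plain, checked))
```

The absolute timings depend on the machine and Python version, so only the ratio between the two figures is meaningful.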

comment:9 Changed 7 years ago by raphael

FWT vote is +1 and please think about how we should test your implementation once submitted.

comment:10 Changed 7 years ago by calvinhp

FWT Vote: +1 with the updated PLIP to include more detail

comment:11 Changed 7 years ago by esteele

Approved by FWT vote.

comment:12 Changed 7 years ago by esteele

  • Owner set to terapyon

comment:13 Changed 7 years ago by terapyon

comment:14 Changed 7 years ago by tyam

(In [28851]) plip9309 cfg and txt added refs #9309

comment:15 Changed 7 years ago by tyam

(In [29231]) Important sitecustomize.py info added. refs #9309

comment:16 Changed 7 years ago by tyam

(In [29594]) sitecustomize.py is NOT required anymore. refs #9309

comment:17 Changed 7 years ago by terapyon

  • Status changed from new to assigned

comment:18 Changed 7 years ago by vincentfretin

I forgot to include "#refs" in my commit comment, here is my review: https://dev.plone.org/plone/changeset/29684

comment:19 Changed 7 years ago by matthewwilkes

(In [29720]) PLIP review refs #9309 - FWT Vote -1

comment:20 Changed 7 years ago by esteele

Your PLIP has passed the Framework team's initial review. Feel free to discuss any suggested changes either here in the PLIP ticket or on the mailing lists. Final deadline for this PLIP is set for September 30.

comment:21 Changed 7 years ago by terapyon

Thank you for reviewing my code.

1) Don't use monkey patch. OK, I will create a new branch on both Zope and Plone.

2) Broken livesearch. I think that problem is not caused by our code. It was already broken in Plone 3.3.0 and became functional again in Plone 3.3.1, so some other code must be responsible.

3) UnicodeDecodeError and not using re.LOCALE. I will modify my code to use "re.LOCALE", but I am not sure when and how to use it.

4) Performance. As I see it, performance comes after functionality, doesn't it?

5) Function naming and missing docstrings. I will rename the functions and write docstrings.

6) Use of functions and arguments. I will modify it.

7) Test coverage. OK, I will write additional tests after I have made the required changes.

8) Don't use functions. I will check it.

9) Last. My development team and I should finish revising the old code by the end of September, and I am going to join the "21st Plone Tune-Up Day" from Japan.

comment:22 Changed 6 years ago by terapyon

(In [30028]) create branches/plip9309-unicodesplitter refs #9309

comment:23 Changed 6 years ago by terapyon

(In [30029]) adding branches/plip9309-unicodesplitter refs #9309

comment:24 Changed 6 years ago by terapyon

(In [30031]) adding branches/plip9309-unicodesplitter refs #9309

comment:25 Changed 6 years ago by terapyon

(In [30038]) replace UnicodeSplitter refs #9309

comment:26 Changed 6 years ago by terapyon

(In [30044]) modify CaseNormalizer using enc to getSiteEncoding refs #9309

comment:27 Changed 6 years ago by terapyon

(In [30046]) UnicodeSplitter disable monkey.py refs #9309

comment:28 Changed 6 years ago by terapyon

(In [30047]) using modified ZCTextIndex refs #9309

comment:29 Changed 6 years ago by terapyon

(In [30051]) using modified ZCTextIndex refs #9309

comment:30 Changed 6 years ago by tyam

(In [30073]) update to apply branched Plone and Zope2. refs #9309

comment:31 Changed 6 years ago by tyam

(In [30092]) fixing source description for cfg. refs #9309

comment:32 Changed 6 years ago by tyam

(In [30112]) fixed cfg file to update Zope. refs #9309

comment:33 Changed 6 years ago by terapyon

(In [30113]) UnicodeSplitter fixed cfg file, and some code done refs #9309

comment:34 Changed 6 years ago by terapyon

(In [30117]) UnicodeSplitter fixed don't use monkey, some code done refs #9309

comment:35 Changed 6 years ago by tyam

(In [30128]) Modification complete. Now READY for 2nd review. refs #9309

comment:36 Changed 6 years ago by vincentfretin

(In [30143]) Update review (refs #9309)

comment:37 Changed 6 years ago by davisagli

Vincent's latest review says there are 8 failures when running the CMFPlone tests. I'm only seeing 3, and one of them is due to the mail host changes in Zope 2.12.0. The other two failures are in tests introduced by this PLIP, and while they should certainly be fixed, that's less worrisome than if this PLIP's changes were causing failures in existing tests.

comment:38 Changed 6 years ago by davisagli

Never mind that last message; my buildout wasn't entirely up to date. Now it is, and I'm only seeing the mailhost-related test failure.

comment:39 follow-up: ↓ 41 Changed 6 years ago by vincentfretin

I removed src/Plone, did buildout again and I don't have errors anymore, only the 3 existing failures (two in testUnicodeSplitter.py, and 1 in mails.txt)

comment:40 Changed 6 years ago by vincentfretin

(In [30172]) I don't have errors anymore, strange. (refs #9309)

comment:41 in reply to: ↑ 39 Changed 6 years ago by terapyon

Replying to vincentfretin:

I removed src/Plone, did buildout again and I don't have errors anymore, only the 3 existing failures (two in testUnicodeSplitter.py, and 1 in mails.txt)

Please let me know detail of errors.

comment:42 follow-up: ↓ 43 Changed 6 years ago by vincentfretin

The 3 failures exist in the Plone/branches/4.0 too.

Failure in test testProcessLatin1 (Products.CMFPlone.tests.testUnicodeSplitter.TestSplitter)
Traceback (most recent call last):
  File "/usr/lib/python2.6/unittest.py", line 279, in run
    testMethod()
  File "/home/vincentfretin/svn/plone-coredev/branches/4.0/src/Plone/Products/CMFPlone/tests/testUnicodeSplitter.py", line 90, in testProcessLatin1
    self.assertEqual(self.process(input), output)
  File "/usr/lib/python2.6/unittest.py", line 350, in failUnlessEqual
    (msg or '%r != %r' % (first, second))
AssertionError: ['ffin', 'foo'] != ['\xc4ffin', 'foo']

...

Failure in test testNormalizeLatin1 (Products.CMFPlone.tests.testUnicodeSplitter.TestCaseNormalizer)
Traceback (most recent call last):
  File "/usr/lib/python2.6/unittest.py", line 279, in run
    testMethod()
  File "/home/vincentfretin/svn/plone-coredev/branches/4.0/src/Plone/Products/CMFPlone/tests/testUnicodeSplitter.py", line 123, in testNormalizeLatin1
    self.assertEqual(self.process(input), output)
  File "/usr/lib/python2.6/unittest.py", line 350, in failUnlessEqual
    (msg or '%r != %r' % (first, second))
AssertionError: ['\xc4ffin'] != ['\xe4ffin']

.............................................................................................................................................................................................

Failure in test /home/vincentfretin/svn/plone-coredev/branches/4.0/src/Plone/Products/CMFPlone/tests/mails.txt
Failed doctest test for mails.txt
  File "/home/vincentfretin/svn/plone-coredev/branches/4.0/src/Plone/Products/CMFPlone/tests/mails.txt", line 0

----------------------------------------------------------------------
File "/home/vincentfretin/svn/plone-coredev/branches/4.0/src/Plone/Products/CMFPlone/tests/mails.txt", line 57, in mails.txt
Failed example:
    b64decode(msg.message.get_payload())
Expected:
    '...You are receiving this mail because T\xc3\xa4st user\ntest@plone.test...is sending feedback about the site administered by you at...The message sent was:...Another t\xc3\xa4st message...
Got:
    'b\x8b\x9a\xad\xea\xdeq\xe8\xaf\x8ax-\x86+&j)[y\xc6\xae\xb1\xe4'


  Ran 1067 tests with 3 failures and 0 errors in 4 minutes 15.573 seconds.
Running zope.testing.testrunner.layer.UnitTests tests:
  Tear down Products.PloneTestCase.layer.PloneSite in 0.545 seconds.
  Tear down Products.PloneTestCase.layer.ZCML in 0.027 seconds.
  Tear down Testing.ZopeTestCase.layer.ZopeLite in 0.000 seconds.
  Set up zope.testing.testrunner.layer.UnitTests in 0.000 seconds.
  Running:
..................................
  Ran 34 tests with 0 failures and 0 errors in 0.050 seconds.
Tearing down left over layers:
  Tear down zope.testing.testrunner.layer.UnitTests in 0.000 seconds.

Tests with failures:
   testProcessLatin1 (Products.CMFPlone.tests.testUnicodeSplitter.TestSplitter)
   testNormalizeLatin1 (Products.CMFPlone.tests.testUnicodeSplitter.TestCaseNormalizer)
   /home/vincentfretin/svn/plone-coredev/branches/4.0/src/Plone/Products/CMFPlone/tests/mails.txt
Total: 1239 tests, 3 failures, 0 errors in 4 minutes 36.844 seconds.

comment:43 in reply to: ↑ 42 Changed 6 years ago by terapyon

Replying to vincentfretin:

The 3 failures exist in the Plone/branches/4.0 too.

Thank you for sending me error logs.

I feel that the test code covers a very rare scenario, and it always fails in certain language environments.

In my program, I use the site encoding to convert text to Unicode. So that test code causes a problem, because it passes Latin-1 data in a str, which only works when locale.setlocale is enabled.
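
The dependence on the assumed encoding can be shown with the byte string from the failing tests. The sketch below uses Python 3 `bytes` to stand in for a Python 2 `str`, and `normalize` is a hypothetical helper, not the CaseNormalizer from the branch.

```python
def normalize(raw, encoding):
    """Decode bytes with the given encoding, lowercase, and re-encode."""
    return raw.decode(encoding).lower().encode(encoding)


raw = b"\xc4ffin"  # "Äffin" encoded as Latin-1, as in the failing tests

# With the matching encoding, case normalization gives the expected result.
print(normalize(raw, "latin-1"))  # b'\xe4ffin'

# The same bytes are not valid UTF-8: 0xC4 starts a two-byte sequence
# and the following "f" is not a valid continuation byte.
try:
    normalize(raw, "utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode as UTF-8:", exc.reason)
```

So a test that feeds Latin-1 bytes to the splitter can only pass when the encoding used for decoding is also Latin-1.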

comment:44 Changed 6 years ago by matthewwilkes

(In [30234]) Merge review for Unicode Splitter plip, refs #9309. FWT Vote is now +1

comment:45 Changed 6 years ago by terapyon

(In [30242]) UnicodeSplitter rewrite created and removing copyright refs #9309

comment:46 Changed 6 years ago by rossp

Seems like the issues have been addressed.

FWT vote: +1 for merge

comment:47 Changed 6 years ago by esteele

This PLIP has been accepted for merging into Plone 4.0

The final vote was: Alec Mitchell +1, David Glick +1, Erik Rose -, Laurence Rowe +1, Matthew Wilkes +1, Ross Patterson +1

Please merge your branches into the Plone 4.0 head by end-of-day Friday Oct 16. If you need assistance with merging, please contact me.

We'll be assigning a documentation ticket to this PLIP shortly. Please assist the docs team in documenting the changes and new features that this PLIP introduces.

comment:48 Changed 6 years ago by esteele

Please assist the doc team in creating/updating documentation relating to this PLIP. See #9613.

comment:49 Changed 6 years ago by tyam

(In [30502]) Changed cfg since Zope code merged to 2.12. refs #9309

comment:50 Changed 6 years ago by limi

Which means that we'll need a Zope 2.12.1 release. Code should already be in the repository.

comment:51 Changed 6 years ago by esteele

(In [30540]) Use the Zope 2.12 branch. PLIP #9309 will need it. This auto-checkout can be removed once we have a 2.12.1 available. Refs #9309.

comment:52 Changed 6 years ago by terapyon

(In [30565]) merging PLIP9309 refs #9309

comment:53 Changed 6 years ago by tyam

(In [30617]) Changed cfg for aligning with merged branch. refs #9309

comment:54 Changed 6 years ago by esteele

  • Status changed from assigned to closed
  • Resolution set to fixed