Ticket #12557 (confirmed Bug)
Products.CMFPlone.UnicodeSplitter.splitter can crash at unicodedata.normalize
Reported by: | tnagai@… | Owned by: | vincentfretin |
---|---|---|---|
Priority: | major | Milestone: | 4.x |
Component: | Backend (Python) | Version: | 4.0 |
Keywords: | Cc: |
Description
We use Plone 4.0.5 to develop our new university site. One of our staff found that she couldn't upload a particular Japanese PDF document. When she uploaded the PDF, the Plone process crashed and automatically restarted.
I confirmed that Plone crashes in the Products.CMFPlone.UnicodeSplitter.splitter module. The call to unicodedata.normalize in that module crashes the Python interpreter itself. The PDF happens to contain a particular sequence of Unicode characters that triggers the crash.
This problem seems to be already known in the Python community: http://bugs.python.org/issue10254. However, Plone 4.0.x uses Python 2.6, where this problem is not fixed.
To solve this problem, I replaced unicodedata.normalize with the Normalizer2 class from PyICU, and it works fine (although it requires installing PyICU).
$ diff splitter.py.org splitter.py.fixed
8a9
> from icu import *
16a18
> normalizerNFKC = Normalizer2.getInstance(None, "nfkc",UNormalizationMode2.COMPOSE)
89c91
< normalized = unicodedata.normalize('NFKC', uni)
---
> normalized = unicode(normalizerNFKC.normalize(uni))
104c106
< normalized = unicodedata.normalize('NFKC', uni)
---
> normalized = unicode(normalizerNFKC.normalize(uni))
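
For anyone who wants to try the same idea without making PyICU a hard dependency, here is a minimal sketch. It is not the actual splitter.py code: the helper name normalize_nfkc and the fallback to unicodedata when PyICU is missing are my own assumptions. The Normalizer2.getInstance call is the one from the diff above. Note that the fallback path is only safe on an interpreter where the Python bug is already fixed, because a crash of the interpreter itself cannot be caught with try/except.

```python
import unicodedata

# Text type that works on both Python 2 (unicode) and Python 3 (str).
text_type = type(u"")

try:
    # Same PyICU call as in the diff above, but with explicit imports.
    from icu import Normalizer2, UNormalizationMode2
    _icu_nfkc = Normalizer2.getInstance(None, "nfkc", UNormalizationMode2.COMPOSE)
except ImportError:
    # Assumption: PyICU is optional; without it we fall back to unicodedata.
    _icu_nfkc = None


def normalize_nfkc(uni):
    """Return the NFKC form of a Unicode string (hypothetical helper)."""
    if _icu_nfkc is not None:
        # Wrap the result in the text type, as the diff does with unicode(...).
        return text_type(_icu_nfkc.normalize(uni))
    # Only safe where unicodedata.normalize does not crash the interpreter.
    return unicodedata.normalize("NFKC", uni)
```

The two lines patched in the diff would then become `normalized = normalize_nfkc(uni)`, which keeps the choice of normalizer in one place.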