Ticket #8268 (reopened Bug)
Indexing/Transforming issue with Word files for the wvWare transform
Reported by: | wohnlice | Owned by: | nouri |
---|---|---|---|
Priority: | minor | Milestone: | 4.x |
Component: | General | Version: | 4.3 |
Keywords: | | Cc: | grahamperrin, micecchi, keul |
Description
We're running Plone 3.0.6 on Zope 2.10.5, with wvWare. When posting any .doc file larger than about 1 MB (not very big) as either a File or a PloneExFile (I get the same behavior with that product), the page will generally time out. Worse, the whole Zope server slows down: other users cannot navigate any portal on that server until the file finally finishes loading. I've saved the text of this file as a plain text file and as a PDF, and had no problems with either.
I was able to narrow the problem down to line 35 of PortalTransforms.transforms.office_wvware.py, "html = scrubHTML(html)", which I believe is there to remove any malicious scripts/tags from the HTML. If I comment out this line, I have no problems at all uploading/indexing .doc files of a reasonable size, and other users can navigate the site normally while the file is uploading.
This is about as far as I can go, though: I have no experience with SGML parsers and no real idea what the problem may be.
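For anyone who wants to check whether the scrubbing step is really the bottleneck, a rough timing sketch along these lines might help. It is only an illustration: `stand_in_scrub`, `make_html` and `time_scrub` are hypothetical names invented here, the stand-in is a trivial regex and not the real scrubHTML, and the synthetic HTML does not resemble actual wvWare output.

```python
import re
import time

# Hypothetical stand-in for scrubHTML: it only strips <script> elements.
# The real function does more work; swap it in to get meaningful numbers.
SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)


def stand_in_scrub(html):
    """Placeholder scrubber (assumption, not the PortalTransforms code)."""
    return SCRIPT_RE.sub("", html)


def make_html(size_bytes):
    """Build synthetic HTML that is at least size_bytes characters long."""
    paragraph = "<p>Lorem ipsum dolor sit amet.</p><script>alert(1)</script>\n"
    repeats = size_bytes // len(paragraph) + 1
    return "<html><body>" + paragraph * repeats + "</body></html>"


def time_scrub(scrub, html):
    """Call scrub(html) once and return (elapsed seconds, scrubbed output)."""
    start = time.time()
    result = scrub(html)
    return time.time() - start, result


if __name__ == "__main__":
    # Grow the input tenfold each step to see how the run time scales.
    for size in (10000, 100000, 1000000):
        elapsed, _ = time_scrub(stand_in_scrub, make_html(size))
        print("%8d bytes: %.3f s" % (size, elapsed))
```

To test the real thing, pass the actual scrubHTML (imported the same way office_wvware.py imports it) to `time_scrub`, ideally with the HTML that wvWare produces for the problem document. If the time grows much faster than the input size, that would support the observation above.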
Change History
comment:3 Changed 7 years ago by hannosch
- Owner set to nouri
- Component changed from Transforms to Archetypes
comment:4 Changed 4 years ago by kleist
- Status changed from new to closed
- Version set to 4.1
- Resolution set to wontfix
Plone 3 no longer supported. Please re-open if still an issue in Plone 4.
comment:5 Changed 2 years ago by micecchi
- Status changed from closed to reopened
- Cc micecchi, keul added
- Component changed from Archetypes to General
- Version changed from 4.1 to 4.3
- Milestone changed from 3.3.x to 4.x
- Resolution wontfix deleted
I have the same problem on a Plone 4.3.1 instance.
A 1 MB *.doc file takes a long time to save.
The same text saved as a PDF is saved in much, much less time.
I don't have a clue about how to approach this. Someone else needs to take a look.