PDF text extraction returns incorrect result

asked on July 13, 2017

We are starting a process that will rely on the Workflow activity "Retrieve Document Text" to map text to the relevant metadata fields. In initial testing I found that, for the PDF documents that will be imported into Laserfiche, the extracted text is just an Adobe message rather than the document's contents.

I have seen this same message when using certain web browsers to open specific types of PDF files.

The message is:

Please wait...
If this message is not eventually replaced by the proper contents of the document, your PDF
viewer may not be able to display this type of document.
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by
visiting http://www.adobe.com/go/reader_download.
For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader.
Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark
of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other
countries.

The problem seems to occur at import, during the text extraction that takes place there, so Workflow retrieves this same message rather than the actual text of the PDF.

If I Snapshot the document and generate OCR text from the image, I get a better result, but because of the dark areas not all of the information seems to be OCR'd.

Is there a known mechanism to correct the text extraction on the electronic file so that we can avoid having to Snapshot these documents? The process will need to cater for receiving these documents electronically, possibly in bulk, and we would prefer to simply automate the upload into the repository and then have Workflow read the correctly extracted text and map the various metadata fields.

If I store a different PDF file, the extraction works correctly, so it does seem to affect only the specific type of PDF we will be receiving from the supplier.
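
As a stopgap, the affected files could at least be flagged before any metadata is mapped. The following is only a minimal sketch, assuming the extracted text is available as a plain string. The names `PLACEHOLDER_MARKERS`, `normalize` and `is_placeholder_text` are hypothetical and not part of any Laserfiche API; the marker phrases are taken from the message quoted above.

```python
import re

# Phrases copied from the Adobe placeholder message quoted in this post.
# Assumption: the extracted text contains them more or less verbatim.
PLACEHOLDER_MARKERS = (
    "please wait...",
    "if this message is not eventually replaced by the proper contents of the document",
    "your pdf viewer may not be able to display this type of document",
)


def normalize(text: str) -> str:
    """Lowercase, unify ellipsis characters and collapse whitespace,
    so that line breaks in the extracted text do not matter."""
    text = text.replace("\u2026", "...")
    return re.sub(r"\s+", " ", text).strip().lower()


def is_placeholder_text(extracted: str, min_hits: int = 2) -> bool:
    """Return True when the extracted text looks like the Adobe
    placeholder page rather than real document content."""
    norm = normalize(extracted)
    hits = sum(1 for marker in PLACEHOLDER_MARKERS if marker in norm)
    return hits >= min_hits
```

Documents for which `is_placeholder_text` returns True could then be routed to the Snapshot/OCR path or held for review, while all others go through the normal text mapping. Requiring two matching phrases is a guess meant to avoid flagging a document that merely contains "Please wait...".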

Any ideas on how we can get around this?
