Question

PDF text extraction returns incorrect result

asked on July 13, 2017

We are starting a process that will require the Workflow activity "Retrieve Document Text" in order to map text to the relevant metadata fields. In initial testing I found that the PDF documents that will be imported into Laserfiche have an Adobe message as their extracted text instead of the actual document content.

I have seen this same message when using certain web browsers to open specific types of PDF files.

The message is:

Please wait... 
If this message is not eventually replaced by the proper contents of the document, your PDF 
viewer may not be able to display this type of document. 
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by 
visiting http://www.adobe.com/go/reader_download. 
For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader. 
Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark 
of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other 
countries.

The problem seems to occur at import, during the text extraction that takes place there, so Workflow retrieves this same message and not the actual text of the PDF.

I find that if I Snapshot the document and generate OCR text from the image, I get a better result, but because of the dark areas not all of the information seems to be OCR'd.

I would like to know if there is a known mechanism to correct the text extraction on the electronic file so that we can avoid having to Snapshot these documents. The process will need to cater for receiving these documents electronically, possibly in bulk, and we would prefer to just automate the upload of these documents into the repository and then have Workflow read the correctly extracted text and map the various metadata fields.

If I store a different PDF file, the extraction works correctly, so it does seem to affect only the specific type of PDF we will be receiving from the supplier.

Any ideas on how we can get around this?


Replies

replied on July 13, 2017

This usually happens for PDFs with dynamic fields (aka XFA PDFs). We cannot extract pages or text from them as we do for regular PDFs. They need to be Snapshot and OCRed instead.
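
For a bulk process, it may help to flag these documents automatically so that only the affected ones are routed to Snapshot and OCR. Below is a minimal sketch in Python, assuming the extracted text is already available as a string. The function name `looks_like_xfa_placeholder` and the marker list are hypothetical and not part of any Laserfiche API; the check simply looks for the Adobe placeholder wording quoted in the question.

```python
import re

# Distinctive phrases from the Adobe placeholder page quoted in the question.
# Assumption: the placeholder wording is the same across the supplier's files.
_PLACEHOLDER_MARKERS = (
    "please wait...",
    "if this message is not eventually replaced by the proper contents of the document",
)


def looks_like_xfa_placeholder(extracted_text: str) -> bool:
    """Return True if the text appears to be the Adobe 'Please wait...' placeholder."""
    # Collapse whitespace so line wrapping in the extracted text does not matter.
    normalized = re.sub(r"\s+", " ", extracted_text).strip().lower()
    # Require every marker, to reduce false positives on ordinary documents.
    return all(marker in normalized for marker in _PLACEHOLDER_MARKERS)
```

A document whose extracted text matches could then be sent down the Snapshot and OCR path, while all other PDFs keep using the normally extracted text.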

replied on July 14, 2017

Thank you for the feedback, Miruna.

We will then use a snapshot mechanism in WF to do this.
