
Discussion

Quick Fields & PDF Text Extraction

posted on September 1, 2017

I need to process some PDF documents that have to be split, which means I have to generate images for those PDFs. The PDFs also contain a text stream, so I want to run pattern matches against that rather than rely on OCR.

I seem to be getting mixed results across different sessions despite seemingly identical settings. The sessions are configured to retrieve documents from a network folder, generate images from the PDFs, and extract the text. When the PDFs are processed, I can see text/OCR errors in the text pane, but I want it to use the text stream instead.

I've tried adding text extraction in pre-classification, but that has no effect, and neither does adding OCR.


In cases where I am using zonal OCR, I have configured it to use the existing text, but I'm not sure whether it is using the text stream or the OCR'd text.

So my goal is to process the PDFs, generate image pages, split those documents, and read the data using pattern matching against the text stream from the PDF. Is that possible?
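To make the splitting logic concrete, here is a minimal sketch in plain Python. It is not Quick Fields or Workflow configuration, just an illustration of the idea. It assumes the per-page text has already been pulled from the PDF's text stream into a list of strings, and the "Invoice No" pattern and the `split_by_pattern` helper are hypothetical names chosen for the example.

```python
import re

# Hypothetical separator: a page that carries an invoice number starts a new document.
FIRST_PAGE_PATTERN = re.compile(r"Invoice\s+No\.?\s*:?\s*(\d+)", re.IGNORECASE)


def split_by_pattern(page_texts, pattern=FIRST_PAGE_PATTERN):
    """Group page indices into documents.

    A page whose text matches the pattern starts a new document; pages
    without a match are appended to the current document.
    """
    documents = []
    for index, text in enumerate(page_texts):
        match = pattern.search(text)
        # Start a new document on a match, or for the very first page.
        if match or not documents:
            documents.append({"key": match.group(1) if match else None, "pages": []})
        documents[-1]["pages"].append(index)
    return documents


# Example page text, standing in for text taken from the PDF's text stream.
pages = [
    "Invoice No: 1001\nAcme Ltd",
    "Page 2 of invoice, line items",
    "Invoice No. 1002\nGlobex",
]

for doc in split_by_pattern(pages):
    print(doc["key"], doc["pages"])
```

Because the matching runs against the embedded text rather than OCR output, the pattern only has to cope with the formatting variations actually present in the source documents, not with recognition errors.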



 

replied on September 1, 2017

I would approach it slightly differently: use Import Agent to bring the PDFs into Laserfiche and generate images, then use Workflow to split the documents and do pattern matching.
