I need to process some PDF documents that have to be split, which means I have to generate images for those PDFs. The PDFs also contain a text stream, so I want to run pattern matches against that rather than rely on OCR.
I am getting mixed results across different sessions even though they appear to have the same settings. The sessions are configured to retrieve documents from a network folder, generate images from the PDFs, and extract the text. When the PDFs are processed, I can see text/OCR errors in the text pane, but I want it to use the text stream instead.
I've tried adding text extraction in pre-classification, but that has no effect, and neither does adding OCR.
Where I am using zonal OCR, I have configured it to use the existing text, but I am not sure whether it is using the text stream or the OCR'd text.
So my goal is to process the PDFs, generate image pages, split the documents, and read the data by pattern matching against the text stream from the PDF rather than OCR output. Is that possible?
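
To make the goal concrete, here is a rough sketch of the splitting logic I have in mind, written in plain Python rather than in the product itself. It assumes the text of each page has already been pulled from the PDF text stream (not OCR) into a list of strings. The names `SPLIT_PATTERN` and `split_by_pattern` and the "Invoice No" marker are made up for illustration; the real pattern would depend on the documents.

```python
import re
from typing import List

# Hypothetical marker for the first page of a document; purely illustrative.
SPLIT_PATTERN = re.compile(r"Invoice\s+No\.?\s*:?\s*(\d+)", re.IGNORECASE)


def split_by_pattern(page_texts: List[str],
                     pattern: re.Pattern = SPLIT_PATTERN) -> List[List[int]]:
    """Group 0-based page indexes into documents.

    A page whose text matches the pattern starts a new document.
    Pages before the first match are kept together as their own document.
    """
    documents: List[List[int]] = []
    for index, text in enumerate(page_texts):
        if pattern.search(text) or not documents:
            documents.append([index])        # start a new document here
        else:
            documents[-1].append(index)      # continuation page
    return documents


# Assumed input: one string per page, taken from the PDF text stream.
pages = [
    "Invoice No: 1001\nTotal due ...",
    "continued from previous page",
    "Invoice No. 1002\nTotal due ...",
    "page 2 of 3",
    "page 3 of 3",
]
groups = split_by_pattern(pages)
```

With the sample pages above, the first two pages form one document and the last three form another. What I am trying to achieve is the same behaviour inside the sessions, with the page text coming from the text stream.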