
Question

OCR'ing PDFs

asked on February 6

One frustration I have is having to generate pages in order to OCR scanned PDF content. That is costly on storage, confuses staff who have to deal with multiple file types, and leaves the PDF itself unsearchable if it is exported.

Has anyone implemented a process to OCR PDFs prior to ingestion? Do you use third-party software like contentCrawler, and have you found it to be worth the cost?

Thanks.


Replies

replied on February 6

Can you describe your capture process? Why are the documents being scanned as PDF in the first place?

Ideally you can use Laserfiche Scanning to scan documents as native images. You can also drop your PDFs into a folder and use Import Agent to import and OCR them.
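
For the "OCR before ingestion" idea in the question, here is a rough sketch of a pre-processing step that could sit in front of a watched import folder. It is not a Laserfiche feature, and it is not what contentCrawler does. It assumes the open-source OCRmyPDF command-line tool is installed, which nobody in this thread has mentioned using, and the `scans/inbox` and `scans/ready` folder names are made up for illustration. The script adds a text layer to each scanned PDF and writes a searchable copy (PDF/A by default) to the folder the importer would watch.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_ocr_command(src: Path, dst: Path, pdfa: bool = True) -> list[str]:
    """Build an OCRmyPDF command line that adds a text layer to a scanned PDF."""
    # --skip-text leaves pages that already contain text untouched.
    cmd = ["ocrmypdf", "--skip-text"]
    # --output-type selects PDF/A ("pdfa") or plain PDF ("pdf") output.
    cmd += ["--output-type", "pdfa" if pdfa else "pdf"]
    cmd += [str(src), str(dst)]
    return cmd


def ocr_folder(inbox: Path, outbox: Path) -> list[Path]:
    """OCR every *.pdf in inbox and write the searchable copy to outbox."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    outbox.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = outbox / src.name
        # check=True raises CalledProcessError if OCR fails for a file.
        subprocess.run(build_ocr_command(src, dst), check=True)
        done.append(dst)
    return done


if __name__ == "__main__":
    # Hypothetical folders: scans land in "inbox", the importer watches "ready".
    for path in ocr_folder(Path("scans/inbox"), Path("scans/ready")):
        print(f"searchable copy written: {path}")
```

Whether the repository then indexes the embedded text layer on import, or still needs generated pages for full-text search, is something to test in your own environment.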

replied on February 6

PDFs are far smaller (we scan in colour). Will Import Agent OCR the PDFs directly, or does it convert them to TIFF in order to OCR them?

replied on February 6

You can have it keep the PDF. The entry will also have generated pages, which adds size. If you deleted the pages, you would probably lose the text as well.

Sorry that I can't help more. We don't keep PDFs in our repo since we lose too much functionality, and the default compression inside of a PDF, while smaller, doesn't meet retention laws.

replied on February 6

Yeah, I'm still thinking about possibly using PDF/A as well. I wish I knew more.
