OCR'ing PDF

replied on September 30, 2024

In short, yes.

In Laserfiche, though, there are several ways to accomplish this.

If the PDF is already in the repository:

1. Generate pages: This step converts the PDF into image-based pages that Laserfiche can process. Right-click on the PDF document and select "Generate Pages."

2. OCR the generated pages: Once the pages are generated, you can then run OCR on these image-based pages. This process extracts the text from the images and makes it searchable.

This approach is appropriate because PDFs can contain various types of content, including text that's already searchable, images, and scanned content. By generating pages, Laserfiche ensures it has a consistent image-based format to work with for OCR.

It's worth noting that this process can increase storage requirements, as you'll now have both the original PDF and the generated pages. However, it allows for more comprehensive text extraction and searchability within the Laserfiche system.

If you're dealing with a large number of PDFs, you might want to consider automating this process. Laserfiche offers tools like Import Agent that can be configured to generate pages and perform OCR during the import process automatically.

For optimal results, ensure your PDFs are of good quality and properly oriented. If you're experiencing issues with certain PDFs, it might be related to factors like file size, font size, or document orientation.

Additionally, I would suggest looking at your document import settings to automate page creation.

replied on September 30, 2024

Thank you so much. This is what I needed to confirm.

replied on September 30, 2024

Do note that if the PDF already has a "text layer", Laserfiche can extract the text without needing to generate image pages and OCR them.

Text layers are typically present in digitally created PDFs, such as those exported from a Word document.

If you know the PDFs you'll be dealing with always have text layers, you don't need to generate image pages for OCR purposes.
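
As a rough sketch of that decision for a batch of PDFs (plain Python, not Laserfiche code): the `PdfInfo` record and its `has_text_layer` flag are hypothetical names used only for illustration, and how you actually detect a text layer depends on your own tooling.

```python
from dataclasses import dataclass
from enum import Enum


class TextPlan(Enum):
    EXTRACT_TEXT = "extract the existing text layer"
    GENERATE_PAGES_AND_OCR = "generate pages, then OCR them"


@dataclass
class PdfInfo:
    # Hypothetical record; has_text_layer must be filled in by your own check.
    name: str
    has_text_layer: bool


def plan_for(pdf: PdfInfo) -> TextPlan:
    """Pick the cheaper path when a text layer is already present."""
    if pdf.has_text_layer:
        return TextPlan.EXTRACT_TEXT
    return TextPlan.GENERATE_PAGES_AND_OCR


def group_by_plan(pdfs):
    """Group document names by the processing path each one needs."""
    groups = {plan: [] for plan in TextPlan}
    for pdf in pdfs:
        groups[plan_for(pdf)].append(pdf.name)
    return groups
```

Grouping a mixed batch this way gives you a list of documents that only need text extraction and a separate list that needs page generation plus OCR, which is where the extra storage and processing time goes.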

replied on October 1, 2024

We have implemented Laserfiche’s federated search tool. We are investigating the need to create text. The searches need to go beyond the metadata fields. Our environment is a mix of the following:
• Repositories that contain OCRed TIFF images
• PDF documents with no text
• PDF documents submitted via form, with text as they were received
The goal is to set a Department-wide standard on OCR'ing documents. Below are my thoughts and questions.
• Do PDFs need to have text generated? If so, how do we do it, and what is the most efficient method to process these?
• Should there also be a standard that says all TIFF images need to be OCR’ed for repositories that store them?

replied on October 1, 2024

Do PDFs need to have text generated? If so, how do we do it, and what is the most efficient method to process these?
Should there also be a standard that says all TIFF images need to be OCR’ed for repositories that store them?
