Question

OCR'ing PDF

asked on September 30

I am doing some research on OCR'ing PDFs in Laserfiche, and I would like to understand the process.

If you have a PDF document and you want to OCR all of it, do you need to generate pages and then OCR the document, or can you OCR the PDF (electronic document) itself?


Replies

replied on September 30

In short, yes: generate pages first, then OCR them.

However, in Laserfiche there are a number of ways of accomplishing this.

If the PDF is already in the repository:

1. Generate pages: This step converts the PDF into image-based pages that Laserfiche can process. Right-click on the PDF document and select "Generate Pages."

2. OCR the generated pages: Once the pages are generated, you can then run OCR on these image-based pages. This process extracts the text from the images and makes it searchable.

This approach is appropriate because PDFs can contain various types of content, including text that's already searchable, images, and scanned content. By generating pages, Laserfiche ensures it has a consistent image-based format to work with for OCR.

It's worth noting that this process can increase storage requirements, as you'll now have both the original PDF and the generated pages. However, it allows for more comprehensive text extraction and searchability within the Laserfiche system.

If you're dealing with a large number of PDFs, you might want to consider automating this process. Laserfiche offers tools like Import Agent that can be configured to generate pages and perform OCR during the import process automatically.

For optimal results, ensure your PDFs are of good quality and properly oriented. If you're experiencing issues with certain PDFs, it might be related to factors like file size, font size, or document orientation.

Additionally, I would suggest looking at your document import settings to automate the page creation.

replied on September 30

Thank you so much. This is what I needed to confirm.

replied on September 30

Do note that if the PDF already has a "text layer", Laserfiche can extract the text without needing to generate image pages and OCR them.

Text layers are typically present in digitally created PDFs, such as a Word document exported as a PDF.

If you know the PDFs you'll be dealing with always have text layers, you don't need to generate image pages for OCR purposes.
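
As a rough illustration of that check, here is a minimal Python sketch. It assumes you have already pulled the text of each page out of the PDF with some PDF library and passed it in as a list of strings. The function names and the 20-character threshold are made up for the example and are not Laserfiche settings.

```python
def has_text_layer(page_texts, min_chars_per_page=20):
    """Return True if every page already carries a usable amount of text.

    page_texts: one string per page, as extracted by a PDF library (assumed).
    min_chars_per_page: hypothetical threshold, tune it for your documents.
    """
    if not page_texts:
        # No pages at all: nothing to rely on, so treat it as "no text layer".
        return False
    return all(len(text.strip()) >= min_chars_per_page for text in page_texts)


def pages_needing_ocr(page_texts, min_chars_per_page=20):
    """Return the 1-based numbers of pages whose text is too short to trust."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars_per_page
    ]
```

A mixed PDF (some exported pages, some scanned inserts) would fail `has_text_layer`, and `pages_needing_ocr` would tell you which pages are the scanned ones.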

replied on October 1

We have implemented Laserfiche’s federated search tool, and we are investigating the need to create text, because the searches need to go beyond the metadata fields. Our environment is a mix of the following:

  • Repositories that contain OCRed TIFF images
  • PDF documents with no text
  • PDF documents submitted via form, with text as they were received

The goal is to set a department-wide standard for OCR'ing documents. Below are my thoughts and questions.

  • Do PDFs need to have text generated? If so, how do we do it, and what is the most efficient method to process these?
  • Should there also be a standard that says all TIFF images need to be OCR’ed for repositories that store them?
replied on October 1

1. The Generate Pages option just creates a TIFF image of each page of your PDF document.

2. The Generate Searchable Text option creates a "txt" file of the OCR information.


Tip: when you generate pages from a PDF file, you still keep the PDF and create a second copy of the document as TIFF images.
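
If it helps when writing a standard, the options discussed in this thread can be sketched as a small decision function. This is only an illustration in Python under my own assumptions: the `Document` fields and the step names are hypothetical labels for the manual actions described above, not Laserfiche API calls.

```python
from dataclasses import dataclass


@dataclass
class Document:
    name: str
    kind: str                  # "pdf" or "tiff" (hypothetical labels)
    has_searchable_text: bool  # text already exists for this document
    has_text_layer: bool = False  # only meaningful for PDFs


def plan_processing(doc):
    """Return the list of steps a document still needs to become searchable."""
    if doc.has_searchable_text:
        # e.g. TIFF images that were already OCRed
        return []
    if doc.kind == "pdf" and doc.has_text_layer:
        # Digitally created PDF: text can be extracted without image pages
        return ["extract text from PDF text layer"]
    if doc.kind == "pdf":
        # Scanned PDF with no text: generate pages first, then OCR them
        return ["generate pages", "OCR generated pages"]
    if doc.kind == "tiff":
        return ["OCR image pages"]
    raise ValueError(f"Unsupported document kind: {doc.kind!r}")
```

For example, `plan_processing(Document("scan.pdf", "pdf", False))` returns the two-step generate-then-OCR route, while a form-submitted PDF with a text layer only needs the text extracted.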

 

OCR.png (66.14 KB)