Question

Cannot OCR certain large PDF documents

asked on April 30, 2020

Hello! 

I have a customer who runs OCR on every document when it is created in Laserfiche.

Everything works well, except that some documents are not OCRed.

I checked some of these PDFs, and it turns out that most of them are rotated and use a very small font size.

Now my question is: what factors might keep Laserfiche from OCRing a given document?

  1. Large file size? If yes, what is the maximum file size?
  2. Small font size? If yes, any idea what the minimum font size is?
  3. Page orientation?

 

Also, is there a workaround for this issue? The number of non-OCRed PDFs being imported into Laserfiche is increasing daily, since most of their PDFs have the same characteristics.

Thank you in advance.

 


Replies

replied on April 30, 2020

I don't believe there is a file size limit; rather, OCR has a time limit after which it will give up.

Distributed Computing Cluster is a good solution for catching up on non-OCR documents.

replied on April 30, 2020

There is a size limit too, but you're right about the time limit.

Maximum image sizes

  1. The height or width of black-and-white images must not exceed 28 inches (71 cm) or 8400 pixels.
  2. The height or width of grayscale and color images must not exceed 28 inches (71 cm), and is also subject to the following pixel limits:
     • Resolution from 75 to 150 dpi: 4200 pixels
     • Resolution from 151 to 200 dpi: 5600 pixels
     • Resolution from 201 to 600 dpi: 8400 pixels

 

Minimum image sizes

The width and height of page images must be at least 50 pixels.

Resolution

The resolution of page images must be between 75 and 600 dpi.

 

300 dpi is the preferred resolution. The image needs to be right side up for OCR if you don't have auto-rotate on.
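
If it helps to screen page images before import, the limits above can be written as a quick check. This is only a rough sketch in Python, not anything Laserfiche ships: `PageImage`, `max_pixels`, and `ocr_limit_problems` are hypothetical names, and it assumes you already know each page's pixel dimensions, dpi, and whether it is black-and-white.

```python
from dataclasses import dataclass
from typing import List

# Limits quoted in this reply; the names are illustrative, not a Laserfiche API.
MAX_INCHES = 28.0
MIN_PIXELS = 50
MIN_DPI, MAX_DPI = 75, 600


@dataclass
class PageImage:
    width_px: int
    height_px: int
    dpi: int
    bitonal: bool  # True for black-and-white, False for grayscale/color


def max_pixels(dpi: int, bitonal: bool) -> int:
    """Pixel cap per side for a given resolution and color mode."""
    if bitonal:
        return 8400
    if dpi <= 150:
        return 4200
    if dpi <= 200:
        return 5600
    return 8400


def ocr_limit_problems(page: PageImage) -> List[str]:
    """Return the quoted limits this page violates (empty list if none)."""
    if not MIN_DPI <= page.dpi <= MAX_DPI:
        # The pixel caps are only defined for 75-600 dpi, so stop here.
        return [f"resolution {page.dpi} dpi is outside {MIN_DPI}-{MAX_DPI} dpi"]

    problems = []
    cap = max_pixels(page.dpi, page.bitonal)
    for name, px in (("width", page.width_px), ("height", page.height_px)):
        if px < MIN_PIXELS:
            problems.append(f"{name} {px} px is below {MIN_PIXELS} pixels")
        if px > cap:
            problems.append(f"{name} {px} px exceeds {cap} pixels")
        if px / page.dpi > MAX_INCHES:
            problems.append(f"{name} {px / page.dpi:.1f} in exceeds {MAX_INCHES} inches")
    return problems
```

For example, a color Letter-size scan at 300 dpi (`PageImage(2550, 3300, 300, False)`) comes back with no problems, while a 4500-pixel-wide color page at 150 dpi is flagged for both the pixel cap and the 28-inch limit. It does not check rotation or the OCR timeout.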

 

The timeout is documented in this KB article.

 

 

replied on May 5, 2020

Hello Kiruba, 

The thing is, I tried to OCR these documents manually after the import was done, with no success. That's when I came here for guidance.

I already took Miruna's advice and opened a case with Laserfiche. I personally think a version upgrade might fix it: I took six of the documents that threw errors and tried them on my machine, where Laserfiche Client 10.4.2.236 is installed, and all of them were OCRed successfully.

Thank you anyway for your insights.

 

replied on April 30, 2020

Thank you, Erik and Miruna, for your answers.

Distributed Computing Cluster is not an option in my case.

Allow me to elaborate so you can get better insight into the situation.

The thing is, I am trying to bulk import a fairly large number of PDF documents at once (around 2000 PDFs or more), and I have configured the Laserfiche Client to OCR these documents automatically.

Of these 2000 PDFs, around 100 are not being OCRed. The operation does not stop; instead, it throws two errors on each page:

  1. error [6408] - Error Reading File
  2. error [404] - Error Preparing page for OCR 

The import process then continues normally and throws the same errors on the pages of other documents.

Please also note that the document is saved normally in the system. When I try to OCR it manually, the same errors are thrown.

I appreciate your attention.

In the meantime, I will check the maximum and minimum requirements Miruna sent for successful OCR.

replied on April 30, 2020

Then you'll want to open a case with Tech Support and attach a few sample PDFs so we can take a look at them.

OCR is independent from import, so the document being saved in the repository is expected behavior. Getting OCR errors on one document wouldn't prevent subsequent documents from being imported.

replied on May 1, 2020

Alright.

Thank you for your help Miruna!

 

replied on May 5, 2020

Hi Joseph,

 

I would suggest not running OCR while importing a large number of files together. In this case, import the files without OCR, then OCR all the documents once they are imported.

The OCR engine consumes a lot of RAM while OCRing larger documents.

Can you open and view the files that throw OCR errors in Laserfiche? You can also try the Generate Pages option, if possible.

 
