
Question

OCR does not pick up all the text, or picks it up wrong

asked on November 23, 2018

Have you ever seen the OCR process read some lines of text correctly and others incorrectly? In my case this always happens on the same Workflow-created document, which WF turns into a PDF and then stores in LF. It gets the same lines right and the same lines wrong each time. The wrong lines are interesting: every character it picks up is shifted up by one, to the next number or letter.

Example: 

  • The image shows 780-384-1014
  • The OCR shows 891.496.2125 (notice how the 7 became an 8 and the 8 became a 9)

And for the letters:

  • The image shows Fire Department
  • The OCR shows Gjsf!Efqbsunfou (notice how the F becomes a G, which is the next letter in the alphabet, the i in Fire becomes a j, and so on)

 

What would cause this? Note that the OCR process also doesn't pick up the first inch at the top of the page.
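
The letter example is consistent with every character code being shifted up by exactly one. A minimal Python sketch, assuming a uniform +1 shift (the `unshift` helper is made up for illustration and is not part of Laserfiche), reverses it:

```python
def unshift(text: str, offset: int = 1) -> str:
    """Shift every character back by `offset` code points."""
    return "".join(chr(ord(ch) - offset) for ch in text)


garbled = "Gjsf!Efqbsunfou"  # OCR text quoted in the post
print(unshift(garbled))      # prints "Fire Department"
```

Even the space fits the pattern: it came out as `!`, the next ASCII character after a space, just as the hyphens in the phone number came out as periods. A shift this regular suggests a problem with how the characters are encoded in the file rather than with image recognition, although that is only an inference at this point.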

 


Replies

replied on November 27, 2018

How are the PDFs generated? Are they scanned, or natively generated somewhere else? I've found that different PDF generators can write PDFs that interfere with Laserfiche's ability to extract text.

For example, if I build a report in SSRS and add a field before I add the field label, SSRS renders the field "before" the label in the structure of the PDF. Since PDF was originally designed as a typesetting format, a writer can place elements in any order. The positioning is defined for each element. The result is that when I use Laserfiche's native text extraction for PDF, my field label actually shows up in the text after the field data itself. This is just one example of some of the weirdness that can happen when extracting text from PDFs.
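
To illustrate the ordering problem, here is a small Python sketch. The fragment data and the `reading_order` helper are hypothetical, and this is not how Laserfiche extracts text. It only shows how the order in which elements are written to the file can differ from the order in which they appear on the page, and how sorting by position recovers the visual order.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float  # distance from the left edge of the page, in points
    y: float  # distance from the bottom edge (default PDF origin is bottom-left)
    text: str


# Hypothetical fragments in the order a generator wrote them:
# the first field value was added to the report before its label.
stream_order = [
    Fragment(x=150, y=700, text="2018-11-27"),
    Fragment(x=72, y=700, text="Permit date:"),
    Fragment(x=72, y=680, text="Issued by:"),
    Fragment(x=150, y=680, text="Fire Department"),
]


def reading_order(fragments):
    """Sort top-to-bottom, then left-to-right within a line."""
    return sorted(fragments, key=lambda f: (-f.y, f.x))


naive = " ".join(f.text for f in stream_order)
visual = " ".join(f.text for f in reading_order(stream_order))
print(naive)   # value appears before its label
print(visual)  # label appears before its value
```

Comparing exact `y` values is a simplification. Text on a real page would need a tolerance to group fragments into lines.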

If the PDF is an image, I don't have a ready explanation for the behavior that you are seeing. However, if it's generated, you might try changing your settings for generating text from PDFs and see if it makes a difference. How are you bringing the documents in? Are the users importing them directly into the client, or is it Import Agent?

Here are the settings for the desktop client, and they are similar in Import Agent:

 

As far as the certificate goes, we've found that the engraving around the edges of certificates tends to mess with things like orientation detection and OCR. You either need to clean them up manually, or run them through Quick Fields so that you can rotate and Zone OCR based just on the center portion.

replied on November 28, 2018

Thanks, Devon

The first example (the fire permit) is a PDF form with fields that was saved on our shared drive. Workflow uses it to fill in the blanks and then saves the result to Laserfiche. So that one might be due to the PDF fields. Some fields are reading right, but some are not. Maybe I messed up the fields when I created the blank form?

For the certificates, they are scanned at the photocopier and then brought into Laserfiche via Import Agent as TIFFs. So, no PDF involved.

replied on November 23, 2018

We also noticed this one today.  In this example it missed the first inch of the form, but it also missed the very middle, where it says Rating 98% and the line that starts with Restrictions:

This page came in sideways, was rotated, and then the OCR process was manually redone.

replied on November 28, 2018

Devon, I found this one today, and we are now testing the settings you pointed out to see if they make a difference, since it turns out this was a PDF that was dragged and dropped into Laserfiche:

replied on November 28, 2018

That's an odd one. Are you using the accuracy OCR mode?

replied on November 28, 2018

Yes, I check everyone's settings when they start here, and whenever I have to set up their Snapshot Printer, I make sure the OCR process in their settings is set to accuracy.

replied on November 29, 2018

Is it permissible to post some samples of the PDFs that you are having trouble with? I'd like to test in a different environment.

replied on November 29, 2018

I just attached one of the approved fire permits, which was only a test document. This is the type of document mentioned in my first post.

replied on November 29, 2018


On the left is the result of generating text with the "Use native text extraction" option set. On the right is the result with "Use an alternative method to generate text". It looks like the data that was entered is on a separate layer, so it gets extracted separately from the body text. By changing the handling of PDF text, Laserfiche runs a more traditional OCR process and is able to get correct location data.

Interestingly, if I use a PDF form that I have on hand, Laserfiche doesn't extract the field data at all unless I use the advanced OCR option. If I fill out the form and save as a static PDF, I get the same behavior that you are seeing. The text I filled out is saved on another layer.
