
Question

Improving OmniPage Zone OCR Accuracy in Quick Fields

asked on April 29

I’m configuring a Quick Fields session that relies on OmniPage Zone OCR to extract key data from legislative documents. Here's the general setup:

  1. First Page Identification: Using Zone OCR to locate the words ORDINANCE or RESOLUTION.

  2. Document Type Extraction: Using pattern matching to determine the document type (e.g., Resolution or Ordinance).

  3. Document Number Extraction: Using pattern matching to capture numbers like 25-101.
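
For anyone reading along, the two pattern-matching steps above can be sketched in Python roughly as follows. This is only an illustration: the regular expressions and the `extract_fields` helper are assumptions, not the actual expressions configured in the Quick Fields session, and the document number format is inferred from the single example 25-101.

```python
import re

# Hypothetical patterns; the post does not show the real session expressions.
DOC_TYPE = re.compile(r"\b(ORDINANCE|RESOLUTION)\b", re.IGNORECASE)
# Assumes numbers always look like two digits, a hyphen, three digits (e.g. 25-101).
DOC_NUMBER = re.compile(r"\b(\d{2}-\d{3})\b")


def extract_fields(ocr_text):
    """Return the document type and number found in OCR text, or None for each miss."""
    type_match = DOC_TYPE.search(ocr_text)
    number_match = DOC_NUMBER.search(ocr_text)
    return {
        "type": type_match.group(1).capitalize() if type_match else None,
        "number": number_match.group(1) if number_match else None,
    }


print(extract_fields("RESOLUTION NO. 25-101"))
print(extract_fields("RES0LUTI0N N0. 2S-1O1"))  # typical OCR confusions: 0/O, 5/S
```

The second call shows why OCR quality matters so much here: a single misread character is enough for a strict pattern to miss the value entirely.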

However, the OCR output during the session is inaccurate or inconsistent. I've experimented with:

  • Character Preferences: Prioritizing letters or numbers.

  • Optimization Styles: Switching between Balanced and Accuracy.

While some settings are slightly better, the improvement is minimal.

Interestingly:

  • The text displays correctly in the repository viewer.

  • When I insert the same page as a sample and OCR it with Accuracy optimization, the result is very accurate.

Question:
Is there a way to improve OCR accuracy within the session runtime to match the quality seen when OCR'ing a sample file manually?

 

Below is the OCR behavior comparison:

Figure 1: Shows the document and how cleanly Laserfiche Client extracts the text.

 

Figure 2: Shows the OCR result when the same page is added as a sample document and OCR’ed in Quick Fields using Accuracy optimization; the result is excellent.

 

Figure 3: Shows the Zone OCR output during actual session runs, which is significantly less accurate and sometimes unreadable.


Answer

SELECTED ANSWER
replied on June 20

Thank you @Zonghong Zheng!
I tested the “Use JPEG compression” setting. Setting the quality level below 100 (I tried 80, 50, and other values) did improve the OCR results. However, the output is still not ideal, because all six of our field values rely on accurate OCR for both data extraction and document separation.

During a Workflow class at Empower, I discussed our scenario with instructor Hilario, who offered some helpful suggestions. With support from our solution provider MCCi (shout out to Tom Borynski), we now have an automated Workflow in place for capturing our legislative documents:

(1) Users scan batch files into a monitored folder using a separator sheet.

(2) Import Agent OCRs the batch and sends the documents to the repository.

(3) A Laserfiche Workflow automatically triggers, separates the documents, extracts metadata fields, and files them into their proper destination folders.

(4) Certain fields are automatically validated for data format and accuracy. If potential issues are detected, the document is routed to a staging folder for human review.
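
As a rough illustration of step (4), the validate-then-route logic could look something like the Python sketch below. The field names, the allowed values, the number format, and the "staging"/"filed" labels are all assumptions made for the example; the real checks live in Laserfiche Workflow activities, not in Python.

```python
import re

# Assumed rules for illustration only; the real validation rules are not shown in this thread.
NUMBER_FORMAT = re.compile(r"\d{2}-\d{3}")
VALID_TYPES = {"Ordinance", "Resolution"}


def route(fields):
    """Return (destination, problems); any failed check sends the document to staging."""
    problems = []
    if fields.get("type") not in VALID_TYPES:
        problems.append("type")
    # fullmatch rejects values with extra or misread characters around the number.
    if not NUMBER_FORMAT.fullmatch(fields.get("number") or ""):
        problems.append("number")
    destination = "staging" if problems else "filed"
    return destination, problems


print(route({"type": "Resolution", "number": "25-101"}))
print(route({"type": "Resolution", "number": "2S-1O1"}))
```

Returning the list of failed fields alongside the destination makes it easier for a reviewer to see what needs attention in the staging folder.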

Note: In our testing, scanning at 400 DPI in black and white (B/W) produced the most reliable OCR results.

This capture process is now nearly 100% automated using current “dumb fields” technology. We’re excited for the upcoming release of Smart Fields for self-hosted environments later this fall!

I'd be happy to share the workflow or assist anyone who's interested!


Replies

replied on May 9

Hi Margaret,

It is possible that the inaccuracy is caused by image compression at runtime. There is a setting under Tools -> Options -> Quick Fields -> General -> Use JPEG compression. Have you enabled this option and set the JPEG quality level lower than 100?
