
Discussion


Quick Fields: Is it possible to skip OCR if a very small font is detected?

posted on July 20, 2023

We are scanning employee hire packets that include forms such as the W4 and State Tax Withholding. Page 2 of each of those forms is printed in an incredibly small typeface, and QF really struggles to OCR these pages in order to identify them. The process slows down so much that it adds another 20 minutes to the scanning time compared to when we remove these pages ahead of time. We'd like to avoid removing these pages manually and are hoping there's a way to configure QF to just skip them. Is that possible?

 

replied on July 25, 2023

Do those forms have a set page length? If so, you can configure the document class to create documents with a specific number of pages. That way, QF wouldn't attempt to OCR those pages to identify them; it would just append them to the current document once it finds the first page of one of those forms.

replied on July 25, 2023

Well... I haven't found a way to skip pages with small text, so I ended up putting my OCR task in Pre-Classification Processing. That way, OCR runs once for each page instead of once for each page in each document class. The OCR data can then be checked using Token Identification Conditions within each document class, which makes the whole process much faster.
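To put rough numbers on why this helps, here is a minimal Python sketch of the worst-case OCR pass count. The 67-page packet size comes from this thread; the number of document classes (10) is a made-up figure for illustration, and the per-class count is an upper bound that assumes every class runs its own OCR identification on every page. This is plain arithmetic, not anything Quick Fields exposes.

```python
def ocr_passes_per_class(pages: int, doc_classes: int) -> int:
    """Upper bound when each document class OCRs each page to identify it."""
    return pages * doc_classes


def ocr_passes_pre_classification(pages: int) -> int:
    """OCR runs once per page, and the result is reused by every class."""
    return pages


pages = 67        # packet size mentioned in this thread
doc_classes = 10  # hypothetical count, for illustration only

print("per-class OCR passes (worst case):", ocr_passes_per_class(pages, doc_classes))
print("pre-classification OCR passes:", ocr_passes_pre_classification(pages))
```

The gap grows linearly with the number of document classes, which is consistent with the speedup described above.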

replied on July 21, 2023

You can skip whole pages if you want. Are you OCRing the whole document and storing it in your repository, or are you just looking for specific information?

replied on July 21, 2023

The hire packet has 67 pages in it, with some single-page and some multi-page forms.  Each form has its own document classification with an OCR first page identification condition.  The zones vary for each doc class, but are typically in the top third of the page. The first page of each packet has a bar code with the employee number in it, which is stored in a token collector so it can be applied to each doc class with a token retriever.

The goal is to speed up the scanning process by skipping pages that have very small text, such as the secondary pages on the W4 and the State Tax Withholding forms.

 

 

replied on July 26, 2023

I'm not sure what your identification conditions are, so this might not be relevant, but if you put the OCR process in Page Processing, you can configure which pages will be OCRed (and which will be skipped). There are two ways to do this:

1) Use the Page Range property

2) Use a Conditional process

Since it sounds like you always want to skip Page 2 of certain document classes, you can probably just use the Page Range property.  It allows somewhat complex patterns, which are documented here.  So, for example, if you always want to skip Page 2, you could use the pattern: 1, 3-1000 (1000 is just an arbitrarily large number to make sure it doesn't unintentionally skip pages before the end of the document).
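To make the effect of that pattern concrete, here is a small Python sketch that expands a pattern such as `1, 3-1000` into the pages it selects. It assumes the pattern is a comma-separated list of single pages and inclusive ranges; the real Page Range syntax in Quick Fields may support more than this, and `expand_page_range` is a hypothetical helper, not part of any Laserfiche API.

```python
def expand_page_range(pattern: str, page_count: int) -> list[int]:
    """Return the 1-based page numbers a pattern like "1, 3-1000" selects."""
    selected = set()
    for part in pattern.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Clamp to the real document length, so an oversized upper
        # bound such as 1000 is harmless.
        for page in range(max(start, 1), min(end, page_count) + 1):
            selected.add(page)
    return sorted(selected)


# Pages that would be OCRed in a 4-page form: page 2 is left out.
print(expand_page_range("1, 3-1000", 4))
```

Because the range is clamped to the document length, the arbitrarily large upper bound simply means "through the last page".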

If you had a more complex situation than just skipping a specific page (e.g., Page 2), you could use a Conditional process with the OCR process inside of it.

replied on July 26, 2023

I agree with Jacob; I would do the same thing. That way you are not capturing a bunch of redundant information that is going to eat up your memory or slow down your PPM.

replied on July 26, 2023

That sounds like it's worth a try... although the users who put together the packets sometimes include page 2 of the W4 and sometimes don't; it's not exactly consistent. Thanks for that info though!

 

replied on August 3, 2023

You could use the page removal option as well: a conditional IF statement, then remove the page.
