Capturing one block of unique text on a standardized form

replied on January 24, 2022

For the zone alignment issues, is this a standardized form? If so, you could try using the Form Alignment process to shift the form contents to a normalized position before using Zone OCR. If it's a form scaling/resolution issue, you could try positioning the zone using percentages instead of pixels.
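To illustrate why percentages survive a resolution change where pixels do not, here is a minimal arithmetic sketch. The `Zone` class, the helper names, and the page sizes are all made up for this example; none of it is a Quick Fields API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A rectangle on a page; units depend on context (pixels or fractions)."""
    left: float
    top: float
    width: float
    height: float


def to_fractions(zone_px: Zone, page_w: int, page_h: int) -> Zone:
    """Express a pixel zone as fractions of the page width and height."""
    return Zone(zone_px.left / page_w, zone_px.top / page_h,
                zone_px.width / page_w, zone_px.height / page_h)


def to_pixels(zone_frac: Zone, page_w: int, page_h: int) -> Zone:
    """Turn a fractional zone back into whole pixels for a given page size."""
    return Zone(round(zone_frac.left * page_w), round(zone_frac.top * page_h),
                round(zone_frac.width * page_w), round(zone_frac.height * page_h))


# Hypothetical zone drawn on a 2550 x 3300 pixel scan (letter size at 300 DPI).
zone_300dpi = Zone(left=255, top=330, width=1275, height=165)
zone_frac = to_fractions(zone_300dpi, 2550, 3300)

# The same form scanned at 200 DPI is 1700 x 2200 pixels.
zone_200dpi = to_pixels(zone_frac, 1700, 2200)
```

The fractional zone covers the same region of the form at either resolution, whereas the original pixel coordinates would land in the wrong place on the smaller scan.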

For pattern matching to work, there has to be a pattern. I don't know much about your business domain, but it seems like most schools would have one of several keywords in their name (e.g., school, university, college) that could be used as a pattern (e.g., something like ".*\s+(school|college)"). However, that might not be enough of a pattern to determine where to start and stop capturing; it would depend on the context of the information within the form and whether there are other markers to use.
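As a rough sketch of the keyword idea, the snippet below runs that kind of pattern through Python's `re` module. It is only an approximation of what Quick Fields would do, and the keyword list and sample lines are invented for illustration.

```python
import re

# Hypothetical keyword list; extend it to suit the names on your forms.
SCHOOL_PATTERN = re.compile(r".*\s+(school|college|university)", re.IGNORECASE)

# Made-up lines standing in for OCR output.
lines = [
    "Maple Ridge Elementary School",
    "Northfield College",
    "Date of birth: 2001-04-12",
]

# Keep only the lines that contain one of the keywords after some whitespace.
matches = [line for line in lines if SCHOOL_PATTERN.search(line)]
```

A name with none of the keywords would be skipped entirely, which is the limitation mentioned above.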

Yes, pattern matching can work with a block of text (if, by that, you mean multiple lines of text). And it's often useful when combined with Zone OCR (which has an option to extract multiple lines of text) to get the text from one part of the page and then narrow it down to just the data you want with pattern matching.

As for your question about grabbing information under specific text... it depends on the context and what else is around it, but generally, no, this isn't something Quick Fields is good at. However, if you have access to Capture Profiles (via a Cloud account or the upcoming "hybrid" functionality for self-hosted installations), then it's trivial: there's a zone positioning option "relative to specific text" (also called "anchored zones") that you would use.


replied on January 24, 2022

Unfortunately, the names don't all have a specific word in them that would work well with pattern recognition.

But I appreciate your "Form Alignment" suggestion. I'll give that a try.


replied on January 24, 2022

I'll second using hybrid Capture Profiles with anchored zones once they're available (Very Soon™).


replied on January 25, 2022

The hybrid Capture Profiles do sound like exactly what I need; hopefully they're available soon.

Alternatively, I'm experimenting with Pattern Matching to capture the line after a line that contains:

LEGAL FAMILY NAME LEGAL FIRST NAME LEGAL MIDDLE NAMES

But I'm really not great with Regular Expressions, so I'm not getting the result I expect.

I've tried this, which, reading through it, I think is correct:

LEGAL FAMILY NAME\sLEGAL FIRST NAME\sLEGAL MIDDLE NAMES\n[A-Za-z '-]*[A-Za-z'-]

I did have some luck capturing the line of text when I used a line number, but I've found that my electronic documents are getting random characters added at the top that mess up the line numbering. I have no idea why a letter "t" would be added, as the documents are pristine.


replied on January 25, 2022

You're missing a few things, I think:
1) You probably want quantifiers on those \s since there could be multiple whitespace characters between the name columns
2) There are actually two invisible line break characters: new line (\n) and carriage return (\r). Sometimes, but not always, you need both (I don't recall specifically for the OCR text returned by Quick Fields).
3) If you want to capture the entire next line, you might use a character class like [^\r\n]*, which means "match any character except a line break"
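The three points above can be sketched together as follows, using Python's `re` module as a stand-in for the Quick Fields pattern engine (which may differ in details). The sample text, including the stray "t" line and the name, is made up.

```python
import re

HEADER_THEN_NEXT_LINE = re.compile(
    # 1) \s+ tolerates one or more whitespace characters between words
    r"LEGAL\s+FAMILY\s+NAME\s+LEGAL\s+FIRST\s+NAME\s+LEGAL\s+MIDDLE\s+NAMES"
    # 2) the line break may be \r\n or just \n, so \r is optional
    r"[ \t]*\r?\n"
    # 3) capture everything up to the next line break
    r"([^\r\n]*)"
)

# Invented OCR text with a stray character on the first line.
sample_text = (
    "t\r\n"
    "LEGAL FAMILY NAME   LEGAL FIRST NAME  LEGAL MIDDLE NAMES\r\n"
    "O'Brien Mary-Anne Louise\r\n"
    "DATE OF BIRTH\r\n"
)

match = HEADER_THEN_NEXT_LINE.search(sample_text)
captured = match.group(1) if match else None
```

Making `\r` optional is one way to avoid guessing which line-break style the OCR text uses; whether that is needed in Quick Fields is something to confirm against the actual text.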


replied on February 1, 2022

Thanks - this works perfectly in test, but not in the actual document. I added \s+ between the words, just in case the OCR was adding spaces.

Here's what I'm using:

LEGAL\s+FAMILY\s+NAME\s+LEGAL\s+FIRST\s+NAME\s+LEGAL\s+MIDDLE\s+NAMES\r\n([^\r\n]+)


replied on February 1, 2022

You'll need to inspect the generated text of the actual documents (in the Text pane of Quick Fields) and see what unexpected characters OCR introduced and then account for those in your pattern.

Also, my gut reaction looking at your pattern is that it's actually too specific--the odds of at least one of those dozens of characters being misread by OCR are high, and a single misread means no match. Your pattern only needs to be specific enough that it doesn't match anywhere else on the page (e.g., "MIDDLE\s+NAMES\r\n..." might be enough whereas "NAMES" probably wouldn't be).
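A small sketch of the "too specific" problem, again in Python's `re` rather than Quick Fields itself. The OCR text is invented, with "FIRST" deliberately misread as "F1RST".

```python
import re

# The long pattern spells out the entire header line.
FULL_HEADER = re.compile(
    r"LEGAL\s+FAMILY\s+NAME\s+LEGAL\s+FIRST\s+NAME\s+LEGAL\s+MIDDLE\s+NAMES"
    r"\r?\n([^\r\n]+)"
)
# The short pattern anchors only on the end of the header line.
SHORT_ANCHOR = re.compile(r"MIDDLE\s+NAMES\r?\n([^\r\n]+)")

# Made-up OCR output containing a single misread character.
ocr_text = (
    "LEGAL FAMILY NAME LEGAL F1RST NAME LEGAL MIDDLE NAMES\r\n"
    "Smith John Paul\r\n"
)

full_match = FULL_HEADER.search(ocr_text)    # fails because of "F1RST"
short_match = SHORT_ANCHOR.search(ocr_text)  # still finds the next line
```

One misread character anywhere in the long pattern is enough to lose the match, while the shorter anchor is exposed to fewer characters.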


replied on February 2, 2022

Played around with it, and I've got it working with just the word NAMES.

NAMES\r\n([^\r\n]+)

The fascinating thing is - I don't have to tell it which lines to look on, it works across the entire document.
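Because the pattern scans the whole text, it is worth checking how many places it matches. The sketch below uses Python's `re` with invented text; the "PREVIOUS NAMES" heading is hypothetical and only there to show what a second hit would look like.

```python
import re

ANCHOR = re.compile(r"NAMES\r\n([^\r\n]+)")

# Made-up document text with a stray first line and a second "NAMES" heading.
text = (
    "t\r\n"
    "STUDENT RECORD\r\n"
    "LEGAL FAMILY NAME LEGAL FIRST NAME LEGAL MIDDLE NAMES\r\n"
    "Smith John Paul\r\n"
    "PREVIOUS NAMES\r\n"
    "None\r\n"
)

# Collect every line that follows a line ending in "NAMES".
captures = [m.group(1) for m in ANCHOR.finditer(text)]
```

The stray first line does not affect the result, but any other line ending in "NAMES" would produce an extra capture, so a slightly longer anchor may be safer on forms that have one.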

Thank you for this - you got me on the right path!
