Question

Capturing one block of unique text on a standardized form

asked on January 24, 2022

Thank you to everyone who posts such great questions. Reading them has helped me work through and resolve many of my own.

I've got one issue that I still can't solve, and I'm hoping someone can help.  I'm scanning student transcripts, and the transcripts come from multiple schools.  The school field has been problematic to scan.  I've used OmniPage Zone OCR to grab the area, because I haven't had any luck with Pattern Matching to get this information.  But if the scan is a little out of alignment, the zone misses the information in that area.

Unfortunately, on this document the schools are listed by their name without anything simple like "school" attached.  Pattern matching has worked great for student numbers, DoB, student names, and grad dates.  Is there any way to do pattern matching with a block of text like a school name?

Replies

replied on January 24, 2022

For the zone alignment issues, is this a standardized form? If so, you could try using the Form Alignment process to shift the form contents to a normalized position before using Zone OCR.  If it's a form scaling/resolution issue, you could try positioning the zone using percentages instead of pixels.
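To illustrate why percentage positioning survives a resolution change, here is a minimal Python sketch of the arithmetic only. It is not how Quick Fields is configured (that is a setting in the zone's properties); the function names and the page and zone sizes are made up for the example.

```python
def zone_to_percent(left, top, width, height, page_width, page_height):
    """Express a pixel zone as percentages of the page dimensions."""
    return (
        100.0 * left / page_width,
        100.0 * top / page_height,
        100.0 * width / page_width,
        100.0 * height / page_height,
    )

def percent_to_zone(percent_zone, page_width, page_height):
    """Convert a percentage zone back to whole pixels for a given page size."""
    left, top, width, height = percent_zone
    return (
        round(left * page_width / 100.0),
        round(top * page_height / 100.0),
        round(width * page_width / 100.0),
        round(height * page_height / 100.0),
    )

# Hypothetical zone drawn on a 2550 x 3300 pixel scan (letter size at 300 dpi).
pct = zone_to_percent(300, 150, 600, 75, 2550, 3300)

# The same percentages applied to a 1700 x 2200 pixel scan (200 dpi).
zone_200dpi = percent_to_zone(pct, 1700, 2200)
print(zone_200dpi)
```

A zone defined in pixels would land in the wrong place on the lower-resolution scan, while the percentage zone scales with the page.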

For pattern matching to work, there has to be a pattern.  I don't know much about your business domain, but it seems like most schools would have one of several keywords in their name (e.g., school, university, college, etc.) that could be used as a pattern (e.g., something like ".*\s+(school|college)" ).  However, that might not be enough of a pattern to determine where to start capturing and where to end--it'd depend on the context of the information within the form and whether there are other markers to use.
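As a rough illustration of the keyword idea, here is a minimal sketch using Python's re module as a stand-in for the Quick Fields pattern engine, which may behave differently. The keyword list and the sample text are assumptions made up for the example.

```python
import re

# Hypothetical keyword list; extend it to suit your own data.
SCHOOL_KEYWORDS = re.compile(
    r"^.*\b(school|college|university|academy)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)

def find_school_lines(text):
    """Return every line that contains one of the school keywords."""
    return [m.group(0).strip() for m in SCHOOL_KEYWORDS.finditer(text)]

# Made-up OCR text, not taken from a real transcript.
sample = "STUDENT NO 12345\nRiverside Secondary School\nGRAD DATE 2021-06-30\n"
print(find_school_lines(sample))
```

A name with none of the keywords in it simply produces no match, which is the limitation described above.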

Yes, pattern matching can work with a block of text (if, by that, you mean multiple lines of text).  And it's often useful when combined with Zone OCR (which has an option to extract multiple lines of text) to get the text from one part of the page and then narrow it down to just the data you want with pattern matching.

 

As for your question about grabbing information under specific text... it depends on the context and what else is around it, but generally, no, this isn't something Quick Fields is good at.  However, if you have access to Capture Profiles (via a Cloud account or the upcoming "hybrid" functionality for self-hosted installations), then it's trivial: there's a zone positioning option "relative to specific text" (also called "anchored zones") that you would use.

replied on January 24, 2022

Unfortunately, the names don't all have a specific word in them that would work well with pattern recognition.

But I appreciate your "Form Alignment" suggestion.  I'll give that a try.  

replied on January 24, 2022

I'll second using hybrid Capture Profiles with anchored zones once they're available (Very Soon™).

replied on January 25, 2022

The hybrid Capture Profiles feature does sound like exactly what I need. Hopefully it will be available soon.

 

Alternatively, I'm experimenting with Pattern Matching to capture the line after the one that contains:

LEGAL FAMILY NAME          LEGAL FIRST NAME             LEGAL MIDDLE NAMES
But I'm really not great with Regular Expressions, so I'm not getting the result I expect.

I've tried this, which I think is correct after reading through it:

LEGAL FAMILY NAME\sLEGAL FIRST NAME\sLEGAL MIDDLE NAMES\n[A-Za-z '-]*[A-Za-z'-]

 

I did have some luck capturing the line of text when I used a line number, but I've found that my electronic documents are getting random characters added at the top that mess up the line numbering.  I have no idea why a letter "t" would be added, as the documents are pristine.

replied on January 25, 2022

You're missing a few things, I think:
1) You probably want quantifiers on those \s since there could be multiple whitespace characters between the name columns
2) There are actually two invisible line break characters: new line (\n) and carriage return (\r).  Sometimes, but not always, you need both (I don't recall specifically for the OCR text returned by Quick Fields).
3) If you want to capture the entire next line, you might use a character class like this: [^\r\n]*, which means "match any character except a line break"
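Putting those three points together, a minimal sketch might look like the following. It uses Python's re module for illustration, so the Quick Fields engine may differ in details; the function name and the sample names are made up.

```python
import re

# \s+ tolerates any run of whitespace between the header words,
# \r?\n accepts either line-ending style, and ([^\r\n]*) captures
# everything on the following line up to the next line break.
HEADER_THEN_LINE = re.compile(
    r"LEGAL\s+FAMILY\s+NAME\s+LEGAL\s+FIRST\s+NAME\s+LEGAL\s+MIDDLE\s+NAMES"
    r"[ \t]*\r?\n([^\r\n]*)"
)

def line_after_header(text):
    """Return the line that follows the header row, or None if not found."""
    m = HEADER_THEN_LINE.search(text)
    return m.group(1).strip() if m else None

# Made-up sample text with Windows-style (\r\n) line endings.
windows_text = (
    "LEGAL FAMILY NAME          LEGAL FIRST NAME             LEGAL MIDDLE NAMES\r\n"
    "SMITH          JANE          ANN\r\n"
)
print(line_after_header(windows_text))
```

Writing the line break as \r?\n avoids having to know in advance which line-ending style the OCR text uses.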

replied on February 1, 2022

Thanks - this works perfectly in test, but not in the actual document.  I added \s+ between the words, just in case the OCR was adding spaces. 

 

Here's what I'm using:

LEGAL\s+FAMILY\s+NAME\s+LEGAL\s+FIRST\s+NAME\s+LEGAL\s+MIDDLE\s+NAMES\r\n([^\r\n]+)

replied on February 1, 2022

You'll need to inspect the generated text of the actual documents (in the Text pane of Quick Fields) and see what unexpected characters OCR introduced and then account for those in your pattern.

 

Also, my gut reaction looking at your pattern is that it's actually too specific--the odds of at least one of those dozens of characters being misread by OCR are high, and a single misread means no match.  Your pattern only needs to be specific enough that it doesn't match anywhere else on the page (e.g., "MIDDLE\s+NAMES\r\n..." might be enough whereas "NAMES" probably wouldn't be).
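To show the effect of an over-specific pattern, here is a small sketch in Python (again only a stand-in for the Quick Fields engine). The OCR text is invented, with "FAMILY" deliberately misread as "FAM1LY".

```python
import re

# The full header pattern: every word must be read correctly by OCR.
LONG = re.compile(
    r"LEGAL\s+FAMILY\s+NAME\s+LEGAL\s+FIRST\s+NAME\s+LEGAL\s+MIDDLE\s+NAMES"
    r"\r?\n([^\r\n]+)"
)
# A shorter anchor: only the last two header words must be read correctly.
SHORT = re.compile(r"MIDDLE\s+NAMES\r?\n([^\r\n]+)")

# Made-up OCR output containing a single misread character.
ocr_text = (
    "LEGAL FAM1LY NAME   LEGAL FIRST NAME   LEGAL MIDDLE NAMES\n"
    "SMITH JANE ANN\n"
)

long_match = LONG.search(ocr_text)
short_match = SHORT.search(ocr_text)

print(long_match)            # None: one bad character defeats the long pattern
print(short_match.group(1))  # the line after the header
```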

replied on February 2, 2022

Played around with it, and I've got it working with just the word NAMES.

 

NAMES\r\n([^\r\n]+)

 

The fascinating thing is that I don't have to tell it which lines to look on; it works across the entire document.
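That behavior is what a regular expression search normally does: it scans the whole text for the first place the pattern fits, so stray characters at the top no longer matter. A minimal Python sketch of the idea, with made-up text and a hypothetical helper named capture:

```python
import re

NEXT_LINE = re.compile(r"NAMES\r\n([^\r\n]+)")

def capture(text):
    """Return the line following the first 'NAMES' line ending, or None."""
    m = NEXT_LINE.search(text)
    return m.group(1) if m else None

# Made-up document text with \r\n line endings.
clean = "TRANSCRIPT\r\nLEGAL MIDDLE NAMES\r\nSMITH JANE ANN\r\n"
# The same text with a stray "t" line added at the top, shifting line numbers.
noisy = "t\r\n" + clean

print(capture(clean))
print(capture(noisy))
```

Both calls return the same line, whereas a capture based on line number would be off by one for the second text.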

 

Thank you for this - you got me on the right path!

replied on January 24, 2022

Is there a way to capture a name that is in a box with "LEGAL FAMILY NAME" as the header, and the name as the second line?
