Question

Correcting minor OCR issue using Pattern Matching

asked on January 24, 2022

I'm happy with most of my Pattern Matching solution for capturing data from a scanned form. However, I have a couple of records that the OCR is recognizing with an extra space after the surname. I believe this happens because the student's name ends in the letter "I", which makes it look as though there is more space between the characters.

For example: SURNAME , FIRSTNAME

where the correct version is:

SURNAME, FIRSTNAME

This throws the pattern matching off and gives me an incorrect result, because the pattern then matches text in the address instead.

Is there a way I can account for these rare occasions in the coding?  

Replies

replied on January 24, 2022

Sure, instead of a pattern like [A-Z]+, you could break it up and capture multiple parts and remove the space(s) in the process: ([A-Z]+)\s+(, [A-Z]+)

Note that my regular expression of [A-Z]+ is not sufficient for names.  Names are notoriously difficult to capture because they can have spaces, apostrophes, suffixes, and more (e.g., Billy d'Bob III, Jr.).  But you can certainly handle "most" cases pretty well and flag the others for manual review.
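To make the grouping idea concrete, here is a minimal sketch using Python's re module. Python is only an assumption for illustration, and the function name fix_extra_space is hypothetical; the pattern matching tool in this thread may expose captures differently. The idea is the same: capture the surname and the ", FIRSTNAME" part as separate groups and join them, which discards the stray spaces.

```python
import re

# Pattern from the reply above: surname, one or more stray spaces,
# then the comma, a space, and the first name.
pattern = re.compile(r"([A-Z]+)\s+(, [A-Z]+)")

def fix_extra_space(text: str) -> str:
    """Drop the whitespace the OCR inserted between surname and comma."""
    # Joining group 1 and group 2 leaves out whatever \s+ matched.
    return pattern.sub(r"\1\2", text)

fixed = fix_extra_space("SURNAME , FIRSTNAME")
unchanged = fix_extra_space("SURNAME, FIRSTNAME")
print(fixed)
print(unchanged)
```

Text that is already correct passes through untouched, because \s+ requires at least one space before the comma.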

replied on January 25, 2022

Thanks.  This is really helpful.  I know part of my issue is getting a handle on the regular expressions, but as I see more examples I think I am starting to get it.

Can I allow for alternative characters in the pattern matching?  For example, another of my scans OCRs the comma between the names as a period.  

Bunny. Bugs instead of Bunny, Bugs

Could I do something like this?  ([A-Z]+)\s+([,.]+[A-Z]+)

replied on January 25, 2022

Yes, though I suggest you don't have the + on [,.]+ unless you're having issues with OCR reading the comma as multiple commas/periods.  And you'll probably still want the space (or \s+) between that and the first name.

I.e., ([A-Z]+)\s+([,.] [A-Z]+)
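A quick way to see the difference the + makes is to run both versions against a few samples. This is a sketch in Python's re module (an assumption for illustration, not the tool used in this thread), and the sample strings are made up.

```python
import re

# [,.] accepts exactly one separator character (comma or period).
single = re.compile(r"([A-Z]+)\s+([,.] [A-Z]+)")
# [,.]+ accepts a run of them, which is only useful if the OCR
# sometimes reads the comma as several marks.
repeated = re.compile(r"([A-Z]+)\s+([,.]+ [A-Z]+)")

samples = ["BUNNY , BUGS", "BUNNY . BUGS", "BUNNY .., BUGS"]

single_hits = [bool(single.search(s)) for s in samples]
repeated_hits = [bool(repeated.search(s)) for s in samples]

for sample, one, many in zip(samples, single_hits, repeated_hits):
    print(f"{sample!r}: single={one} repeated={many}")
```

Only the third sample, with several separator marks in a row, needs the + version.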

replied on January 25, 2022

Also, let's build the pattern to allow for:

  • both upper and lower case
  • an optional space between the last letter of the surname and the comma
  • an optional space between the comma and the first name

([A-Za-z]+)\s*([,\.]\s*[A-Za-z]+)
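For anyone who wants to try this pattern outside the capture tool, here is a rough sketch in Python's re module (an assumption for illustration; normalize_name is a hypothetical helper, not part of any product). It uses the pattern exactly as posted and then cleans up the second group, since that group still contains the separator and any spaces.

```python
import re
from typing import Optional

# Pattern as posted: optional spaces on either side of a comma or period.
pattern = re.compile(r"([A-Za-z]+)\s*([,\.]\s*[A-Za-z]+)")

def normalize_name(text: str) -> Optional[str]:
    """Return 'Surname, Firstname' for the first match, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    surname = match.group(1)
    # Group 2 starts with the separator; strip it and any spaces.
    first_name = match.group(2).lstrip(",.").strip()
    return f"{surname}, {first_name}"

for sample in ["SURNAME , FIRSTNAME", "Bunny. Bugs", "Bunny,Bugs"]:
    print(normalize_name(sample))
```

Keep in mind that the looser the pattern, the more likely it is to match unrelated text such as an abbreviation followed by a word in an address, so it is worth anchoring it to nearby form text where possible.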

replied on January 25, 2022

Thanks guys!  I'll give this a try right now and let you know how it goes.
