Correcting minor OCR issue using Pattern Matching

replied on January 24, 2022

Sure, instead of a pattern like [A-Z]+, you could break it up and capture multiple parts and remove the space(s) in the process: ([A-Z]+)\s+(, [A-Z]+)

Note that my regular expression of [A-Z]+ is not sufficient for names. Names are notoriously difficult to capture because they can have spaces, apostrophes, suffixes, and more (e.g., Billy d'Bob III, Jr.). But you can certainly handle "most" cases pretty well and flag the others for manual review.
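As a rough illustration of the capture-and-rejoin idea, here is a minimal sketch using Python's re module. The thread does not say which tool runs the pattern, so the choice of Python, the function name fix_stray_space, and the sample input are all assumptions made for the example.

```python
import re

# Pattern from the post: surname, one or more stray spaces, then ", FIRSTNAME".
PATTERN = re.compile(r"([A-Z]+)\s+(, [A-Z]+)")

def fix_stray_space(text: str) -> str:
    """Rejoin the two captured groups, dropping the whitespace between them.

    Hypothetical helper for illustration only.
    """
    return PATTERN.sub(r"\1\2", text)

print(fix_stray_space("BUNNY , BUGS"))  # prints: BUNNY, BUGS
```

Text that does not contain the stray space is left unchanged, because \s+ requires at least one whitespace character before the comma.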

replied on January 25, 2022

Thanks. This is really helpful. I know part of my issue is getting a handle on the regular expressions, but as I see more examples I think I am starting to get it.

Can I allow more than one alternative character at a given position in the pattern? For example, another of my scans OCRs the comma between the names as a period.

Bunny. Bugs instead of Bunny, Bugs

Would something like ([A-Z]+)\s+([,.]+[A-Z]+) work?


replied on January 25, 2022

Yes, though I suggest dropping the + from [,.]+ unless OCR is reading the comma as multiple commas/periods. You'll probably still want the space (or \s+) between the punctuation and the first name.

I.e., ([A-Z]+)\s+([,.] [A-Z]+)
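To see what the character class [,.] does and does not accept, here is a small sketch in Python's re module. Python is an assumption (the thread does not name the engine), and the helper name matches_name and the sample strings are made up for illustration.

```python
import re

# Pattern from the reply: [,.] accepts exactly one comma OR one period.
PATTERN = re.compile(r"([A-Z]+)\s+([,.] [A-Z]+)")

def matches_name(text: str) -> bool:
    """Return True if the whole string fits the pattern (hypothetical helper)."""
    return PATTERN.fullmatch(text) is not None

for sample in ["BUNNY , BUGS", "BUNNY . BUGS", "BUNNY ,, BUGS"]:
    print(sample, "->", matches_name(sample))
```

The first two samples match; the third does not, because without the + the class allows only a single punctuation character.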


replied on January 25, 2022

Also, let's build the pattern to allow for:

both upper- and lowercase letters
an optional space between the last letter of the surname and the comma
an optional space between the comma and the first name

([A-Za-z]+)\s*([,\.]\s*[A-Za-z]+)
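Here is a sketch of how a pattern along these lines might be used to normalize a name field and flag anything it cannot handle for manual review. It is written in Python as an assumption, and it is a variation on the pattern above rather than the exact expression: the separator is moved outside the capture groups so that a misread period can be rewritten as a comma. The function name normalize_name is hypothetical.

```python
import re
from typing import Optional

# Variation on the pattern above: the comma/period sits outside the groups
# so the replacement can always emit ", " regardless of what OCR produced.
NAME = re.compile(r"([A-Za-z]+)\s*[,.]\s*([A-Za-z]+)")

def normalize_name(field: str) -> Optional[str]:
    """Return 'Surname, FirstName', or None to flag the field for manual review."""
    m = NAME.fullmatch(field.strip())
    if m is None:
        return None  # e.g. apostrophes, suffixes, multi-word names
    return f"{m.group(1)}, {m.group(2)}"

print(normalize_name("Bunny. Bugs"))      # prints: Bunny, Bugs
print(normalize_name("Bunny ,Bugs"))      # prints: Bunny, Bugs
print(normalize_name("d'Bob III, Billy")) # prints: None
```

Using fullmatch on the isolated name field, rather than searching free text, keeps the pattern from rewriting ordinary sentence punctuation such as "end. Next".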


replied on January 25, 2022

Thanks guys! I'll give this a try right now and let you know how it goes.

