
Question


How to exclude characters misread by OCR using a regular expression

asked on December 28, 2016

Hello,

I have been researching this and am having trouble getting it to work. I have QF sessions that read amounts off invoices, and sometimes OCR picks up characters that I don't want. Right now I use this pattern to match amounts:

(\d+\S+\D\d{2})

It worked really well until today, when I noticed OCR was reading a "," as a ")":

13)559.30

I'm trying to use [^\)]+ in the token editor, but then it only matches the first digits and not the rest.

 

How do I get it to just exclude the ) and keep the rest? Thanks. 



 


Answer

SELECTED ANSWER
replied on December 29, 2016

Since you may get other non-numeric characters instead of ")", the pattern below allows such characters at up to three places in the string and still produces the correct output:

(\d*)\D*?(\d*)?\D*?(\d*)?\D*?(\d*)(\.\d{2})
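To try this outside the token editor, here is a minimal sketch using Python's re module. Python is an assumption here, and the token editor's regex engine may behave differently. The helper name clean_amount is hypothetical. It applies the pattern above and joins the capture groups, which drops whatever non-digit characters sat between them.

```python
import re

# Pattern from the answer above: up to four digit runs separated by
# optional non-digit runs, followed by a decimal point and two digits.
AMOUNT = re.compile(r"(\d*)\D*?(\d*)?\D*?(\d*)?\D*?(\d*)(\.\d{2})")


def clean_amount(text):
    """Return the amount with stray non-digit characters removed, or None."""
    match = AMOUNT.search(text)
    if match is None:
        return None
    # Joining the groups keeps only the captured digits and the decimal part;
    # a group that did not participate is treated as an empty string.
    return "".join(group or "" for group in match.groups())


print(clean_amount("13)559.30"))
print(clean_amount("13,559.30"))
```

Because the groups are joined rather than read one by one, it does not matter which of the four digit groups ends up holding which part of the number.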

replied on December 29, 2016

That's a great one. It's always hard to predict what OCR will see on a scanned document, so having a pattern that can exclude the stuff we don't want is great. I have also had problems using multiple capture groups, but maybe QF handles them better now. Thanks for the help.


Replies

replied on December 28, 2016

Hi Lucas,

Would the following expression work for your needs? It is meant to grab everything before a ")" or ",", and everything after it.

(.*)\)?,?(.*)

You could make it explicit to only accept digits like this:

(\d{0,3})\)?,?(\d{0,3}\.\d{2})
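A quick way to compare the two patterns is a small Python sketch. Python's re module is an assumption here, and the token editor may use a different engine. It suggests why the first pattern can report a match without removing anything: the greedy first (.*) takes the whole string, so the optional \)? and ,? match nothing. The second pattern only allows digits inside its groups, so the separator falls between them. The names LOOSE and STRICT are just labels for this example.

```python
import re

LOOSE = re.compile(r"(.*)\)?,?(.*)")
STRICT = re.compile(r"(\d{0,3})\)?,?(\d{0,3}\.\d{2})")

sample = "13)559.30"

loose = LOOSE.fullmatch(sample)
# The greedy first group swallows everything, including the ")".
print(loose.groups())   # ('13)559.30', '')

strict = STRICT.fullmatch(sample)
# Only digits are allowed inside the groups, so the ")" is left out of both.
print(strict.groups())  # ('13', '559.30')
print("".join(strict.groups()))  # 13559.30
```

Note that the explicit pattern allows only one separator, so an amount such as 1)234,567.89 would not match it as a whole.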

replied on December 28, 2016

Thanks, Tom.

The first pattern worked on regex101.com but not in the test area of the token editor; the second pattern worked great. Thanks. While researching, I got focused on using ^ (a negated character class) to try to exclude the character.
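On the ^ approach: a negated class such as [^\)]+ matches a run of characters that are not ")", so a single match stops at the first ")" it meets. A minimal Python sketch (Python's re module is an assumption; the token editor may differ) shows that behavior and one way to stitch the pieces back together.

```python
import re

sample = "13)559.30"

# A single match ends where the excluded character begins.
first = re.search(r"[^\)]+", sample).group()
print(first)   # 13

# Collecting every run and joining them drops the ")".
pieces = re.findall(r"[^\)]+", sample)
print(pieces)  # ['13', '559.30']
print("".join(pieces))  # 13559.30
```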

 

 
