You are viewing limited content. For full access, please sign in.

Question

Question

Date issues in Quick Fields. Need to validate and increase accuracy.

asked on November 17, 2020

Hello Everyone! I have a massive QF session going right now for forms that are scanned in. I'm getting really good results for the most part but I'm having trouble with this date situation. Pretty much the dates are being OCR'ed and for the most part, they come through as 10 22 20 10 22 10. This is what it looks like on the form

I don't care which date I get, either the From or To date. It's usually the same. I have all kinds of image enhancements to try and clean these up because they come from multiple sources and sometimes the color is darker, sometimes the alignment is lower, sometimes they use a 4 digit year, sometimes they are smashed together. So I'm trying to account for all this. The pattern I'm using grabs 2 digits, then 2 digits, then either 2 or 4 digits, and I get back a string like this "102220". I then create a date by making a token and add in the slashes after the first 2 digits and then the next 2 to get "10/22/20". The problem comes in when I get the correct amount of digits but a 1 turns into a 9(or any other combination of digits being misread) which then gives me 911020 or something. this then turns into 91/10/20 which is not a valid date.

How do I check for valid dates in QF?

I can't find a way to do it. I was hoping that Formatting a token with MM/dd/yyyy would clear it out or something but it doesn't. There is no date calculator to get me some message I could do something with. I even tried doing a text length function to make sure I have 10 or 8 characters just incase a digit gets dropped. In pattern matching, I can say use no values if a match is not found. I would like something like that. If not a valid date, clear it out. This is all run by QF agent so I'll route these to a manual review folder if the date is missing. I would rather that than it getting stuck in QF. 

My wish list would be something like the first 2 digits need to be between 01 and 12, the second set is 01 to 31 and third set should be 12 to 40 or 2012 to 2040. That way for the first set is OCR'ed wrong, it could then look at the next set of digits. I have worked with other programs that I could give it values to expect, not just \d or something. 

Thank you for your help. 

0 0

Replies

replied on November 20, 2020 Show version history

Where in the process are you wanting to trap invalid dates? If your template field is set to be a date, it would prevent you from saving invalid dates.

That being said, give this a try:

(0[1-9]|1[012])(?:\s+)(0[1-9]|[12][0-9]|3[01])(?:\s+)([1-4][0-9])

It basically tries to enforce the basic rules of dates like you were describing. One thing it doesn't do is check to make sure you don't end up with 2/31 or something like that. Once you start trying to validate each month's days, the length of the regex will balloon. Then, how would you deal with leap years? 😵 It would be at least five times longer. Another hard part is the year range at the end.

[1-4][0-9]

Regex doesn't do ranges of integers beyond single digits. The one I added basically will check from 10-49. Any more detail would result in more complexity.

If this works for you, then cool. Otherwise, post back here with any refinements you'd like help with. Keep in mind that certain characters are more likely to be mistaken for each other. So, just by restricting the range a little, you will get more accuracy. You might also try applying a local color removal to your OCR process. That will result in better recognition. Finally, which engine optimization are you using for your OCR? It takes a lot of volume before you start to see issues with using the "Accuracy" setting.

0 0
You are not allowed to follow up in this post.

Sign in to reply to this post.