Looking for a way to OCR a document and generate field data from the OCR.

SELECTED ANSWER

replied on September 22, 2020

Hi Timothy,

If we eliminate Quick Fields from this process, something you could try is using the "Retrieve Document Text" activity in Workflow combined with the "Pattern Matching" activity to try to extract the relevant data out of the text. You would definitely need to take a close look at the text that gets generated on these documents to see if this would be viable. Similar to Zone OCR in Quick Fields, you're going to need the document text to be very similar on each document so you can pinpoint the data you're looking for with the Pattern Matching activity.

You can configure some settings on these users so that text is generated on all new documents that they import, or you can use the Distributed Computing Cluster to schedule OCR on documents.

Either way, if you verify that the generated text is something that Workflow can work with, then using the data to name and route the documents within your repository should be very doable.

Let me know if I should clarify anything there.

2 0

replied on September 23, 2020

That sounds great. I've never done Pattern Matching, so I'll need to learn regular expressions, but maybe you could save me some time. I'd be looking for the word Influenza, as well as a date on the same row. Is it possible in Pattern Matching to have it find the word Influenza and then look at characters to the left until it sees a date?

Edit: This is mostly because I cannot guarantee that the document will have the rows in the same order, but the rows themselves should always be filled with the same set of information: Type of Vaccine, Manufacturer, Lot #, Who Gave The Injection, Date of Injection, and Date of Expiration. It might not be in that exact order, but the columns will always be in the same place.

0 0

replied on September 23, 2020

I'm not an expert with regular expressions, but I did a little testing and the following seems to work:

(\d+/\d+/\d+).*influenza

What that is saying (to the best of my knowledge) is to look for 3 sets of digits separated by slashes (a date format of 01/01/2020, for example), and capture that (assuming that's what your dates look like on these documents).

I'm 99% sure there is a better way to handle the rest of the regular expression, though (.*influenza). A period refers to any character (as opposed to \d which refers to a digit character), and an asterisk means zero or more matches. So what I'm trying to do is basically say "look for a date and capture it if it's followed by more characters, with the word "influenza" in there too."

I tested this with 3 lines of characters. Each line had a date, but only the 2nd line had the word "influenza" in it, too. This regular expression grabbed the correct date, but there is probably a better expression that you can use - I'm just not knowledgeable enough to know what it would be.
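
For anyone who wants to reproduce that kind of test outside of Workflow, here is a minimal sketch in Python. It assumes Python's re module treats this simple pattern the same way the Pattern Matching activity does, and the three sample lines are made up for illustration.

```python
import re

# Made-up sample text: three lines with dates, only the second mentions influenza.
sample = (
    "01/15/2020 tetanus booster\n"
    "02/20/2020 influenza quad\n"
    "03/25/2020 hepatitis b\n"
)

# Capture a date only if "influenza" appears later on the same line.
# The dot does not match a newline, so the match cannot spill onto another line.
pattern = re.compile(r"(\d+/\d+/\d+).*influenza")

match = pattern.search(sample)
date_before_influenza = match.group(1) if match else None
print(date_before_influenza)  # 02/20/2020
```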

Does that help?

1 0

replied on September 23, 2020

That's amazing, thanks!

0 0

replied on September 23, 2020

Maybe I am doing something wrong. The text shows "4.Influenza Quad (Iiv4), pf,
0.5mL, 6m and up
01/24/2020 0.5 mL Right Deltoid 5RS7Z 19515-0906-
41
GlaxoSmithKline 06/30/2020 "

So I modified what you sent to be influenza.*(\d+/\d+/\d+) so that it should find influenza and then look past it for any set of three numbers separated by two slashes, but I get no results back.

However, I know the expression works because it works on "influenza sjdkfasdkjf 01/01/2020" and returns 1/01/2020.

Am I missing something or is there something in the generated text that breaks it?

0 0

replied on September 23, 2020

Heck, maybe I am missing something basic. When I test it with "influenza GlaxoSmithKline 06/30/2020 01/01/2020" I still get a date result of 1/01/2020 where it should be 06/30/2020.

0 0

replied on September 23, 2020

That's what I'm getting, too - the 2nd date rather than the first one that comes after "influenza." This is probably a simple fix for someone who knows regular expressions better than me but I need to get better at them so I'll see what I can come up with!
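
A short Python sketch of why the second date comes back, and why it loses its leading zero. This uses Python's re module as a stand-in for the Workflow test window, which is an assumption, but the result matches what is reported above.

```python
import re

text = "influenza GlaxoSmithKline 06/30/2020 01/01/2020"

# Greedy .* grabs as much as it can, then backs off one character at a time
# until the date group can still match. The latest starting point that works
# is the "1" in "01", so the first date is skipped and the leading zero is lost.
greedy = re.search(r"influenza.*(\d+/\d+/\d+)", text)
greedy_date = greedy.group(1)
print(greedy_date)  # 1/01/2020
```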

0 0

replied on September 23, 2020

Awesome, I am going to play around with it and see what I can find. I think part of it may be the test function in general: when I use the whole block of text from the document it doesn't work, but if I start deleting things between the date I want and Influenza it will randomly work properly, and then if I add in the last character and remove it, it doesn't work. So I do not fully trust the Test function, but I'll play around with it as well. If I find something before I see a reply, I'll share it here as well.

0 0

replied on September 23, 2020

Hi Timothy

By default patterns catch only the first result. You can change the drop down here to say "All results as multi-value token". Once you do this in your test, you will see all dates found.

A good pattern for dates would be \d\d?/\d\d?/\d\d\d?\d?

Once you have all the values in a multi-value token, you can select the index you would like to use when referencing the token
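
As a rough illustration of the "all results" idea, here is a Python sketch that collects every date and then picks one by position. The list stands in for the multi-value token, and note that Python counts from 0, which may not be how Workflow numbers token indexes.

```python
import re

text = "influenza GlaxoSmithKline 06/30/2020 01/01/2020"

# Collect every date-like value instead of stopping at the first match.
all_dates = re.findall(r"\d\d?/\d\d?/\d\d\d?\d?", text)
print(all_dates)  # ['06/30/2020', '01/01/2020']

# Pick the value you want by its position in the list.
first_date = all_dates[0]
```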

1 0

replied on September 23, 2020

This might work, or at least help:

influenza.*?(\d+/\d+/\d+)

The question mark after the asterisk is telling the * (zero or more matches) to be non-greedy - to match the minimum number of times as opposed to the maximum number of times (the default).

Line breaks and carriage returns could also throw a wrench in this, and those can be hard to differentiate in generated text. But I'm curious if adding this question mark helps at all.
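
Here is a small Python sketch of both points, assuming Python's re module behaves like the Workflow engine here. The second sample string is a made-up two-line example.

```python
import re

pattern = re.compile(r"influenza.*?(\d+/\d+/\d+)")

# Non-greedy .*? stops at the first place the date group can match.
one_line = "influenza GlaxoSmithKline 06/30/2020 01/01/2020"
lazy_date = pattern.search(one_line).group(1)
print(lazy_date)  # 06/30/2020

# The dot does not match a newline, so a date on the next line is never reached.
two_lines = "influenza quad\n01/24/2020 0.5 mL"
across_lines = pattern.search(two_lines)
print(across_lines)  # None
```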

0 0

replied on September 23, 2020

Using both suggestions helped with some of the smaller tests but when having it check the original group of text it still fails for some reason. Any idea?

Using this : influenza.*?(\d\d?/\d\d?/\d\d\d?\d?)

Tests with stuff like this : influenza asg uhasdf54 asdfi9kjbn nu adnlf/QW234 01/01/2020 Works just fine.

However, if I take the original bit of text and type it in manually, it works fine. I think there are line or page breaks in the generated text, and that may be the issue, like Jacob Hlas suggested.

Jacob, as for the inclusion of the ? before the date, it works great: it makes the date show as 01/01/2020 rather than 1/01/2020, so that was an excellent recommendation, thanks!

0 0

replied on September 24, 2020

@████████ has a great comment in this thread. Might help you out - his detailed explanation has some stuff I'm going to refer back to quite a bit in the future, I'm sure.

0 0

replied on September 25, 2020

If there were line breaks in your copied text, they should show up visually as new lines when you paste it into the window, so you should be able to catch them.

You can also use the program Notepad++ to look for any hidden characters in your text.
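
If Notepad++ is not handy, a quick Python check can expose the same hidden characters. This is only a sketch: the sample string is a shortened copy of the text quoted earlier, and the \r\n line endings in it are an assumption about what the generated text contains.

```python
# Shortened sample; the \r\n line endings are assumed, not confirmed.
pasted = "4.Influenza Quad (Iiv4), pf,\r\n0.5mL, 6m and up\r\n01/24/2020 0.5 mL"

# repr() shows control characters as escape sequences instead of rendering them.
visible = repr(pasted)
print(visible)

# Count how many line breaks sit between the pieces of text.
line_break_count = pasted.count("\r\n")
print(line_break_count)  # 2
```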

0 0

replied on September 28, 2020

Chad, I see the line breaks, how do I work with them?

This is what the test looks like

0 0

replied on September 28, 2020

The line break in regex is \r\n (for Windows-style line endings).

It is actually 2 characters, where \r returns to the start of the line and \n moves down to the next line.

Since a Windows-style new line always contains both a \r and a \n, you could actually look for either, but it is best to include both so that you don't capture part of it in your result.
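
A minimal Python sketch of what "capturing part of it" looks like, assuming the text uses \r\n line endings and that Python's re module is a fair stand-in for the Workflow engine.

```python
import re

text = "4.Influenza Quad (Iiv4), pf,\r\n0.5mL, 6m and up\r\n01/24/2020 0.5 mL Right Deltoid"

# Looking only for \n leaves the \r at the end of the captured value.
loose = re.search(r"Influenza(.*)\n", text).group(1)
print(repr(loose))  # ' Quad (Iiv4), pf,\r'

# Including both characters keeps the capture clean.
strict = re.search(r"Influenza(.*)\r\n", text).group(1)
print(repr(strict))  # ' Quad (Iiv4), pf,'
```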

1 0

replied on September 28, 2020

Sorry, I am not 100% certain how to go about doing that. How would I use \r and \n in the regular expression?

0 0

replied on September 28, 2020

I hijacked that thread mentioned above and was able to get help solving it.

(?:Influenza.*\n.*\n)(\d\d\/\d\d\/\d\d\d\d) works perfectly.
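
For reference, here is a Python sketch that runs this pattern against the text quoted earlier in the thread. The line endings are an assumption, so both plain \n and \r\n are tried. The helper name injection_date is made up for illustration.

```python
import re

pattern = re.compile(r"(?:Influenza.*\n.*\n)(\d\d\/\d\d\/\d\d\d\d)")

lines = [
    "4.Influenza Quad (Iiv4), pf,",
    "0.5mL, 6m and up",
    "01/24/2020 0.5 mL Right Deltoid 5RS7Z 19515-0906-",
    "41",
    "GlaxoSmithKline 06/30/2020 ",
]

def injection_date(text):
    """Return the date that starts the second line after 'Influenza', if any."""
    m = pattern.search(text)
    return m.group(1) if m else None

unix_result = injection_date("\n".join(lines))
windows_result = injection_date("\r\n".join(lines))
print(unix_result, windows_result)  # 01/24/2020 01/24/2020
```

Note that this only works when the date sits exactly two line breaks after the word Influenza, so documents where the OCR splits the row differently would need a different pattern.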

1 0

replied on September 28, 2020

OK, it looks like . includes \r if this works, so by specifying .*\n you're saying anything, including \r, followed by \n. I am not sure why . would include \r and not include \n though.
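
This can be checked directly. By default the dot matches any character except \n, which is the documented behavior in both Python's re module and .NET regular expressions. That Workflow uses the same default is an assumption, though it fits the results in this thread.

```python
import re

# By default the dot matches a carriage return but not a line feed.
dot_matches_cr = re.fullmatch(r".", "\r") is not None
dot_matches_lf = re.fullmatch(r".", "\n") is not None

# With the "dot matches all" option (re.DOTALL in Python) it matches \n too.
dot_matches_lf_dotall = re.fullmatch(r".", "\n", re.DOTALL) is not None

print(dot_matches_cr, dot_matches_lf, dot_matches_lf_dotall)  # True False True
```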

0 0

replied on September 28, 2020

No clue, but it seems to work in practice. I'll need more employee docs before I can try this in production, but for now that is looking good.

Though you may be able to help with this:

So I need to find the employee first and last name to fill in their respective fields

Text looks like this:

Record generated by eClinicalWorks EMR/PM Software (www.eclinicalworks.com)
DOE, JANE H, F, 01/01/1975

I can grab the Last Name with (?:Record generated.*\n)(\w*) thanks to what I learned from Chris but I am having issues grabbing Jane for the first name.

I've tried :

(?:%(PatternMatching_Employee Last Name).*)(\w*)

and

%(PatternMatching_Employee Last Name).*?(\w*)

Neither seems to work. Is there a better way to grab "JANE"?
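
A possible explanation, sketched in Python: (\w*) is allowed to match nothing, so both attempts can succeed with an empty capture. This assumes the token is substituted as the literal text DOE and that Python's re module behaves like the Workflow engine. The last pattern is just one hypothetical alternative.

```python
import re

line = "DOE, JANE H, F, 01/01/1975"
last_name = "DOE"  # stands in for the %(PatternMatching_Employee Last Name) token

# Greedy .* swallows the rest of the line, leaving nothing for (\w*).
greedy_attempt = re.search(r"(?:" + re.escape(last_name) + r".*)(\w*)", line).group(1)

# Lazy .*? matches nothing, and (\w*) then matches nothing at the comma.
lazy_attempt = re.search(re.escape(last_name) + r".*?(\w*)", line).group(1)

# Spelling out the comma and whitespace, and requiring at least one word
# character with \w+, forces the capture onto the next word.
anchored = re.search(re.escape(last_name) + r",\s*(\w+)", line).group(1)

print(repr(greedy_attempt), repr(lazy_attempt), anchored)  # '' '' JANE
```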

0 0

replied on September 28, 2020

Once you have the last name, it is easy to get the first name using the comma delimiter.

Last Name: (?:Record generated.*\n)([^,]+)

First Name: (?:Record generated.*\n)[^,]+,([^,]+)

[^,] means a character that is not a comma

For example [^k] would mean a character that is not a k

You can also include more than one character

[^abc] would be any character that is not a, b, or c
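
Here is a Python sketch of the two comma-based patterns run against the sample text from the question. One pattern captures everything before the first comma and the other captures everything between the first and second commas. The \r\n line ending and the cleanup step at the end are assumptions added for illustration.

```python
import re

text = (
    "Record generated by eClinicalWorks EMR/PM Software (www.eclinicalworks.com)\r\n"
    "DOE, JANE H, F, 01/01/1975"
)

# Everything on the next line up to the first comma.
before_first_comma = re.search(r"(?:Record generated.*\n)([^,]+)", text).group(1)

# Everything between the first and second commas.
between_commas = re.search(r"(?:Record generated.*\n)[^,]+,([^,]+)", text).group(1)

# The second capture keeps the space after the comma and the middle initial,
# so trim it and keep only the first word (this cleanup step is an addition).
first_word = between_commas.strip().split()[0]

print(repr(before_first_comma), repr(between_commas), first_word)
# 'DOE' ' JANE H' JANE
```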

1 0
