
Question

Looking for a way to OCR a document and generate field data from the OCR.

asked on September 22, 2020

Constraints: We have a single Quick Fields license, and I am not the employee who will be scanning these documents; multiple others will be, so I don't think Quick Fields will work for this.

 

What I am looking for: I'd like users to be able to drop PDFs into a certain folder in the repository and then have them OCR'd, fields applied, the fields filled with information from the OCR, and finally the documents sorted into the proper folders based on whose name is on the document (the Employee Name fields).

 

My first thought is to have a Workflow run when a document is added to folder "x" and OCR it, but I don't know how to use that OCR text to generate tokens to fill fields, or how to then move the documents to the right folders.

 

Any and all help would be greatly appreciated. 

0 0

Answer

SELECTED ANSWER
replied on September 22, 2020

Hi Timothy,

If we eliminate Quick Fields from this process, something you could try is using the "Retrieve Document Text" activity in Workflow combined with the "Pattern Matching" activity to try to extract the relevant data out of the text. You would definitely need to take a close look at the text that gets generated on these documents to see if this would be viable. Similar to Zone OCR in Quick Fields, you're going to need the document text to be very similar on each document so you can pinpoint the data you're looking for with the Pattern Matching activity.

You can configure some settings on these users so that text is generated on all new documents that they import, or you can use the Distributed Computing Cluster to schedule OCR on documents.

Either way, if you verify that the generated text is something that Workflow can work with, then using the data to name and route the documents within your repository should be very doable.

Let me know if I should clarify anything there. 
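
The flow described above (get the text, pull values out with patterns, then name and route) can be sketched outside of Workflow as follows. This is only an illustration in Python, not Workflow configuration: the sample text, the two patterns, the folder names, and the functions `extract_fields` and `route_document` are all made up for this example, and Python's `re` module stands in for the Pattern Matching activity.

```python
import re

# Hypothetical generated text; real OCR output will look different.
SAMPLE_TEXT = "Employee Name: Jane Doe\nVaccine: Influenza\nDate: 01/24/2020\n"

# Assumed patterns, one per field, each with a single capture group.
FIELD_PATTERNS = {
    "Employee Name": r"Employee Name:\s*([A-Za-z]+ [A-Za-z]+)",
    "Date of Injection": r"Date:\s*(\d\d?/\d\d?/\d{4})",
}


def extract_fields(text):
    """Return {field: value} for every pattern that matches the text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            fields[name] = match.group(1)
    return fields


def route_document(fields, root="Employee Records"):
    """Pick a destination folder; anything without a name goes to review."""
    if "Employee Name" not in fields:
        return f"{root}/Needs Review"
    return f"{root}/{fields['Employee Name']}"


fields = extract_fields(SAMPLE_TEXT)
destination = route_document(fields)
print(fields, destination)
```

Whether this approach is viable depends on how consistent the generated text is from one document to the next, as noted above.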

2 0
replied on September 23, 2020

That sounds great. I've never done Pattern Matching, so I'll need to learn regular expressions, but maybe you could save me some time. I'd be looking for the word Influenza, as well as a date on the same row. Is it possible in Pattern Matching to have it find the word Influenza and then look at the characters to the left until it sees a date?

 

Edit: This is mostly because I cannot guarantee that the rows will be in the same order on every document, but the rows themselves should always be filled with the same set of information: Type of Vaccine, Manufacturer, Lot #, Who Gave The Injection, Date of Injection, and Date of Expiration. It might not be in that exact order, but the columns will always be in the same place.

0 0
replied on September 23, 2020

I'm not an expert with regular expressions, but I did a little testing and the following seems to work:

(\d+/\d+/\d+).*influenza

What that is saying (to the best of my knowledge) is to look for 3 sets of digits separated by slashes (a date format of 01/01/2020, for example), and capture that (assuming that's what your dates look like on these documents).

I'm 99% sure there is a better way to handle the rest of the regular expression, though (.*influenza). A period refers to any character (as opposed to \d which refers to a digit character), and an asterisk means zero or more matches. So what I'm trying to do is basically say "look for a date and capture it if it's followed by more characters, with the word "influenza" in there too." 

I tested this with 3 lines of characters. Each line had a date, but only the 2nd line had the word "influenza" in it, too. This regular expression grabbed the correct date, but there is probably a better expression that you can use - I'm just not knowledgeable enough to know what it would be.

Does that help?
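
The three-line test described above can be reproduced roughly like this. It is a sketch in Python rather than in the Workflow test dialog: the three lines are invented, and the `IGNORECASE` flag is an assumption added so that "Influenza" with a capital I would also match.

```python
import re

# Three made-up lines: each has a date, only the second mentions influenza.
lines = [
    "01/02/2020 tetanus",
    "03/04/2020 influenza",
    "05/06/2020 hepatitis b",
]
text = "\n".join(lines)

# IGNORECASE is an assumption so that "Influenza" would also match.
pattern = re.compile(r"(\d+/\d+/\d+).*influenza", re.IGNORECASE)

match = pattern.search(text)
found = match.group(1) if match else None
print(found)  # 03/04/2020
```

The dates on the first and third lines are skipped because `.` does not match a line feed, so the date and the word have to sit on the same line for this pattern to match.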

8.3.4
1 0
replied on September 23, 2020

That's amazing, thanks!

0 0
replied on September 23, 2020

So maybe I am doing something wrong, so the text shows "4.Influenza Quad (Iiv4), pf,
0.5mL, 6m and up 
01/24/2020 0.5 mL Right Deltoid 5RS7Z 19515-0906-
41 
GlaxoSmithKline 06/30/2020 "

so I modified what you sent to be influenza.*(\d+/\d+/\d+) so that it should find influenza then look past it for any set of three numbers broken up by two /s but I get no results back.

However I know the expression works because it works on "influenza sjdkfasdkjf 01/01/2020" and returns 1/01/2020

Am I missing something or is there something in the generated text that breaks it?

 

0 0
replied on September 23, 2020

Heck maybe I am missing something basic, when I test it with "influenza GlaxoSmithKline 06/30/2020 01/01/2020" I still get a date result of 1/01/2020 where it should be 06/30/2020

0 0
replied on September 23, 2020

That's what I'm getting, too - the 2nd date rather than the first one that comes after "influenza." This is probably a simple fix for someone who knows regular expressions better than me but I need to get better at them so I'll see what I can come up with!

0 0
replied on September 23, 2020

Awesome, I am going to play around with it and see what I can find. I think part of it may be the test function in general: when I use the whole block of text from the document it doesn't work, but if I start deleting things between the date I want and Influenza it will randomly work properly, and then if I add in the last character and remove it, it stops working. So I do not fully trust the Test function, but I'll play around with it as well. If I do not see a reply and I find something, I'll share it here.

 

0 0
replied on September 23, 2020

Hi Timothy

By default patterns catch only the first result. You can change the drop down here to say "All results as multi-value token". Once you do this in your test, you will see all dates found.

A good pattern for dates would be \d\d?/\d\d?/\d\d\d?\d?

Once you have all the values in a multi-value token, you can select the index you would like to use when referencing the token
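
As a rough illustration of the "all results" idea, here is the same date pattern collecting every match in Python. The text is the block quoted earlier in the thread, with `\r\n` line breaks assumed. Python lists are zero-based; how Workflow numbers the values of a multi-value token is not shown in this sketch.

```python
import re

# Text as quoted earlier in the thread, with line breaks assumed to be \r\n.
text = (
    "4.Influenza Quad (Iiv4), pf,\r\n"
    "0.5mL, 6m and up \r\n"
    "01/24/2020 0.5 mL Right Deltoid 5RS7Z 19515-0906-\r\n"
    "41 \r\n"
    "GlaxoSmithKline 06/30/2020 "
)

date_pattern = re.compile(r"\d\d?/\d\d?/\d\d\d?\d?")

all_dates = date_pattern.findall(text)  # every match, like a multi-value token
first_date = all_dates[0]               # Python lists are zero-based
print(all_dates, first_date)
```

Picking a value by position only works if the dates always appear in the same order in the generated text.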

1 0
replied on September 23, 2020

This might work, or at least help:


influenza.*?(\d+/\d+/\d+)

The question mark after the asterisk is telling the * (zero or more matches) to be non-greedy - to match the minimum number of times as opposed to the maximum number of times (the default). 

Line breaks and carriage returns could also throw a wrench in this, and those can be hard to differentiate in generated text. But I'm curious if adding this question mark helps at all. 
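
The difference between the greedy and non-greedy versions can be seen on the single-line test string from earlier in the thread. This is a Python sketch; Workflow's Pattern Matching test may display results differently.

```python
import re

text = "influenza GlaxoSmithKline 06/30/2020 01/01/2020"

# Greedy: .* runs to the end of the line, then backs up only as far as needed.
greedy = re.search(r"influenza.*(\d+/\d+/\d+)", text).group(1)

# Non-greedy: .*? stops at the first place where a date can start.
lazy = re.search(r"influenza.*?(\d+/\d+/\d+)", text).group(1)

print(greedy)  # 1/01/2020
print(lazy)    # 06/30/2020
```

The greedy result also explains the missing leading zero: `.*` gives back as few characters as it can, so `\d+` only gets the single `1` before the first slash of the last date.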

0 0
replied on September 23, 2020

Using both suggestions helped with some of the smaller tests but when having it check the original group of text it still fails for some reason. Any idea?

 

Using this : influenza.*?(\d\d?/\d\d?/\d\d\d?\d?)

Tests with stuff like this : influenza asg uhasdf54 asdfi9kjbn  nu adnlf/QW234 01/01/2020     Works just fine.

However, if I take the original bit of text and type it in manually, it works fine. I think there are line or page breaks in the generated text, and that may be the issue, like Jacob Hlas suggested.

As for the inclusion of the ? before the date, Jacob, it works great: it makes the date show as 01/01/2020 rather than 1/01/2020, so that was an excellent recommendation, thanks!

0 0
replied on September 24, 2020

@████████ has a great comment in this thread. Might help you out - his detailed explanation has some stuff I'm going to refer back to quite a bit in the future, I'm sure.

0 0
replied on September 25, 2020

If there were line breaks in your copied text, they should show up visually as new lines when you paste it into the window, so you should be able to catch them.

You can also use the program Notepad++ to look for any hidden characters in your text

0 0
replied on September 28, 2020

Chad, I see the line breaks, how do I work with them?

This is what the test looks like

0 0
replied on September 28, 2020

The line break in regex is \r\n

It is actually 2 characters: \r (carriage return) returns to the start of the line and \n (line feed) moves down to the next line

Since a Windows-style line break contains both a \r and a \n, you could look for either one, but it is best to include both so that you don't capture part of the line break in your result
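
One way to put the line break into the pattern is to spell it out at each point where the text moves down a line. The sketch below is Python, and the pattern is an illustration rather than something taken from Workflow: `[^\r\n]*` means "the rest of this line", and `\r?\n` accepts a line break with or without the `\r`.

```python
import re

text = (
    "4.Influenza Quad (Iiv4), pf,\r\n"
    "0.5mL, 6m and up \r\n"
    "01/24/2020 0.5 mL Right Deltoid 5RS7Z 19515-0906-\r\n"
)

# repr() makes the otherwise invisible \r and \n characters visible.
print(repr(text))

# The single-line pattern fails because "." never matches \n.
single_line = re.search(r"Influenza.*?(\d\d?/\d\d?/\d\d\d?\d?)", text)

# Spelling out each line break lets the pattern step down two lines.
multi_line = re.search(
    r"Influenza[^\r\n]*\r?\n[^\r\n]*\r?\n(\d\d?/\d\d?/\d\d\d?\d?)", text
)
print(single_line, multi_line.group(1))
```

This version assumes the date always sits exactly two lines below the word Influenza, which matches the quoted sample but may not hold for every document.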

1 0
replied on September 28, 2020

Sorry, I am not 100% certain how to go about doing that. How would I use \r and \n in the regular expression?

 

0 0
replied on September 28, 2020

I hijacked that thread mentioned above and was able to get help solving it. 

(?:Influenza.*\n.*\n)(\d\d\/\d\d\/\d\d\d\d) works perfectly.

1 0
replied on September 28, 2020

OK, if this works, it looks like . includes \r, so by specifying .*\n you're saying anything, including \r, followed by \n. I am not sure why . would include \r and not include \n, though.
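
That reading can be checked directly. In Python's `re` module, and in .NET regular expressions by default, `.` matches any character except `\n`, so it does match `\r`. The sketch below assumes Workflow's Pattern Matching behaves the same way and reuses the quoted text with `\r\n` line breaks.

```python
import re

# "." matches a carriage return but not a line feed (default options).
dot_matches_cr = re.fullmatch(r".", "\r") is not None
dot_matches_lf = re.fullmatch(r".", "\n") is not None

text = (
    "4.Influenza Quad (Iiv4), pf,\r\n"
    "0.5mL, 6m and up \r\n"
    "01/24/2020 0.5 mL Right Deltoid 5RS7Z 19515-0906-\r\n"
)

# The pattern from the post above: each ".*" swallows the trailing \r,
# and each "\n" then consumes the line feed that "." cannot match.
pattern = r"(?:Influenza.*\n.*\n)(\d\d\/\d\d\/\d\d\d\d)"
date = re.search(pattern, text).group(1)

print(dot_matches_cr, dot_matches_lf, date)
```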

0 0
replied on September 28, 2020

No clue, but it seems to work in practice, I'll need more employee docs before I can try this in production but for now that is looking good.

Though you may be able to help with this;

So I need to find the employee first and last name to fill in their respective fields

Text looks like this:

Record generated by eClinicalWorks EMR/PM Software (www.eclinicalworks.com)
DOE, JANE H, F, 01/01/1975

I can grab the Last Name with (?:Record generated.*\n)(\w*) thanks to what I learned from Chris but I am having issues grabbing Jane for the first name.

I've tried :

(?:%(PatternMatching_Employee Last Name).*)(\w*)

and 

%(PatternMatching_Employee Last Name).*?(\w*)

Neither seem to work, is there a better way to grab "JANE"?

0 0
replied on September 28, 2020

Once you have the last name, it is easy to get the first name using the comma delimiter

Last Name: (?:Record generated.*\n)([^,]+)

First Name: (?:Record generated.*\n)[^,]+,([^,]+)

 

[^,] means a character that is not a comma

For example [^k] would mean a character that is not a k

You can also include more than one character

[^abc] would be any character that is not a, b, or c
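
Run against the sample header line from the question, the two comma-delimited patterns behave as shown below. This is a Python sketch with `\r\n` line breaks assumed. The last two patterns are hypothetical tighter variants, not part of the answer above: they skip the space after the comma and stop at the first non-word character, so the middle initial is left out.

```python
import re

text = (
    "Record generated by eClinicalWorks EMR/PM Software (www.eclinicalworks.com)\r\n"
    "DOE, JANE H, F, 01/01/1975\r\n"
)

# The two patterns from the post, in the order they are listed.
first_field = re.search(r"(?:Record generated.*\n)([^,]+)", text).group(1)
second_field = re.search(r"(?:Record generated.*\n)[^,]+,([^,]+)", text).group(1)

# Hypothetical tighter variants that capture single words only.
last_name = re.search(r"Record generated.*\n(\w+),", text).group(1)
first_name = re.search(r"Record generated.*\n\w+,\s*(\w+)", text).group(1)

print(repr(first_field), repr(second_field), last_name, first_name)
```

The first pattern returns `DOE`, and the second returns ` JANE H` with a leading space and the middle initial, so the value would need trimming before it goes into a field.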

1 0

Replies

replied on September 29, 2020

Hi Timothy,

A couple of comments, backing things up a bit here. I honestly think the best option is to budget for more Quick Fields licenses with Zone OCR. I have trust issues with OCR in general. Unless your users are quality-controlling every single document, you will eventually have some inaccuracies with the OCR. So using Quick Fields is the safest bet, imo.

Next best thing (if possible) is to avoid the OCR altogether. Does the data on the document exist somewhere else in a database? If it does, have the entry name include a unique ID like a record number, patient number, etc. Extract the unique ID with Workflow and run a query against the database for what you need. Cut OCR out of the conversation entirely. Actually, this should be the first thing you try, lol. Rely on QF as the next best option. Less margin for error.
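
A minimal sketch of that lookup idea, in Python with an in-memory SQLite database standing in for whatever system actually holds the data. The entry name, the `EMP` plus five digits ID format, the table layout, and the function `lookup_employee` are all assumptions invented for this example.

```python
import re
import sqlite3

# Hypothetical entry name that carries a unique employee ID.
entry_name = "Flu Shot Record - EMP00123.pdf"

# Stand-in database; a real lookup would query the existing system instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_id TEXT PRIMARY KEY, first TEXT, last TEXT)"
)
conn.execute("INSERT INTO employees VALUES ('EMP00123', 'Jane', 'Doe')")


def lookup_employee(name):
    """Pull the ID out of the entry name and fetch the matching row, if any."""
    match = re.search(r"EMP\d{5}", name)
    if match is None:
        return None
    return conn.execute(
        "SELECT first, last FROM employees WHERE emp_id = ?", (match.group(0),)
    ).fetchone()


employee = lookup_employee(entry_name)
print(employee)
```

Because the ID comes from the entry name and not from generated text, OCR quality does not affect the result.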

The last thing to consider: if you have to use OCR, be sure to bake into the RegEx any exceptions you could possibly encounter. What if the OCR doesn't read things correctly? How will Workflow handle that? Maybe move all entries that yield no results to a separate folder where users can run a more enhanced setting for Generate Text. Such docs would need to be manually reviewed to learn what went wrong. Use any errors you come across to improve your RegEx. This should be an ongoing process where you keep enhancing your RegEx or process. You will need all the help you can get with OCR.

Just keep in mind, it's not 'if' something goes wrong with the OCR, it's 'when' something goes wrong with it. Try to prepare for that and have a mitigation plan in place.
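
One possible shape for such a mitigation plan, sketched in Python: treat both "no match" and "matched something that is not a real date" as failures and send those documents to a review folder. The folder names and the functions `extract_injection_date` and `choose_folder` are hypothetical, and the sample strings are shortened stand-ins for real generated text.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"(?:Influenza.*\n.*\n)(\d\d/\d\d/\d\d\d\d)")


def extract_injection_date(text):
    """Return the date as a datetime.date, or None if missing or implausible."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%m/%d/%Y").date()
    except ValueError:
        # OCR can produce digits that look like a date but are not one.
        return None


def choose_folder(text):
    """Send anything that fails extraction to a hypothetical review folder."""
    return "Processed" if extract_injection_date(text) else "Needs Review"


good = "4.Influenza Quad\r\n0.5mL\r\n01/24/2020 0.5 mL"
misread = "4.Influenza Quad\r\n0.5mL\r\n91/24/2020 0.5 mL"
print(choose_folder(good), choose_folder(misread))
```

Validating the captured value catches some misreads that the pattern alone would accept, such as a month of 91, but it cannot catch a misread that still forms a valid date.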

 

 

0 0