Question

Need Workflow to find token value (specific numbers in document text) and retrieve next word in same line (Reg Expression)

asked on August 12, 2021

I have a workflow that successfully finds a regular expression in a document and copies the matched text to the parent folder. What I need it to do instead is find the line that starts with a token value and copy the matching expression from that same line. Can this be done?

Replies

replied on August 12, 2021

You'll likely need a Pattern Matching activity. Inline regular expressions don't resolve token values in the pattern.

replied on August 13, 2021

Hi Miruna:  I have been working on this workflow after adding the pattern matching and I am so close I can taste success!!

My pattern matching worked on one test and the workflow achieved what I wanted. On this second test, however, it does not. In troubleshooting, it seems that it is not working because the "Retrieve Document Text" activity is reading down the columns and not across the lines horizontally!

The pattern matching is supposed to be grabbing the property description found immediately after the roll number, but instead it is grabbing the roll number immediately below the target roll number:  %(Roll Number)\s*(.*)[\r]

Can workflow be made to remove lines like in Quick Fields?

Reading this way: 

Instead of this way: 

replied on August 16, 2021

That's because (.*) tries to match as much as possible, so it matches everything from 370100 all the way to the final \r.

You need to make it less greedy by excluding newlines from it. Something like [^\r\n] might get you closer ("anything but a newline character").
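
To illustrate the difference, here is a minimal sketch in Python. It uses Python's `re` module as a stand-in for the Workflow regex engine (the two may differ in details), a made-up two-row sample separated by bare carriage returns, and a literal roll number where the %(Roll Number) token would go. The third-column value is only illustrative.

```python
import re

# Hypothetical sample text: two rows separated by bare carriage returns.
text = "370100 SW-01-044-13-4 163.0\r365700 SE-28-043-13-4\r"
roll_number = "370100"  # stands in for the %(Roll Number) token

# Original style: (.*) runs as far as it can, then backs up to the last \r.
greedy = re.search(re.escape(roll_number) + r"\s*(.*)[\r]", text)

# Bounded style: the capture stops at the first newline character.
bounded = re.search(re.escape(roll_number) + r"\s*([^\r\n]*)", text)

print(repr(greedy.group(1)))   # 'SW-01-044-13-4 163.0\r365700 SE-28-043-13-4'
print(repr(bounded.group(1)))  # 'SW-01-044-13-4 163.0'
```

The bounded capture still includes the third column, so it only solves the "runs past the end of the line" part of the problem.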

replied on August 18, 2021

I have tried that and a number of different options now, but I'm still running into what looks like this: the text retrieval read the columns first, so the pattern matching literally cannot see what came after each roll number. So, I guess my question is now, "How can I make the Retrieve Text activity read across each line instead of down each column before moving on to the next column?"

The SW-01-044-13-4 address you see in the test value is actually the address that shows up on the same line as the 370100 number on this particular test page.

replied on August 18, 2021

Retrieve Text does not do any processing; it just reads the text page as-is (as you'd see it in the web/desktop client). In this case, assuming your test value came from the page, it looks like when the image was OCRed, it was done with de-columnize on, so instead of reading the page line by line, the OCR engine was instructed to read it by table columns.

Workflow can't fix that, you'll have to re-OCR the image to fix the text.
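
A small Python sketch of why the pattern cannot recover from column-ordered text. This is an illustration only: the two rows use the values mentioned in this thread, the row-wise and column-wise layouts are reconstructed by hand (not taken from the actual OCR output), and Python's `re` stands in for the Workflow regex engine.

```python
import re

# Roll number / description pairs mentioned in this thread.
rows = [("370100", "SW-01-044-13-4"), ("365700", "SE-28-043-13-4")]

# Row-wise text: each roll number shares a line with its description.
by_rows = "".join(f"{roll} {desc}\r\n" for roll, desc in rows)

# Column-wise text: every roll number first, then every description.
by_columns = (
    "".join(f"{roll}\r\n" for roll, _ in rows)
    + "".join(f"{desc}\r\n" for _, desc in rows)
)

pattern = re.compile(r"370100\s*([^\r\n]*)")

print(pattern.search(by_rows).group(1))     # SW-01-044-13-4
print(pattern.search(by_columns).group(1))  # 365700
```

In the column-wise text, `\s*` consumes the line break after 370100, so the capture lands on the next roll number. No pattern can pair a roll number with its description by position once the line structure is gone.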

replied on August 18, 2021

That was it, Miruna!  That is why it was gathering the text the wrong way.  Thanks!  Now I need to see why they were OCR'ing that way and see if I can safely make that change on all the targeted records! 

Also, I need to find a way to make it stop at the end of the address I want collected.  I don't want the 163.0 that is in the third column.

replied on August 19, 2021

You're still using (.*) which will try to get as much as possible, so it will need some more narrowing down. If those addresses don't have spaces in them, try something like ([^\r\n\s]+)  instead of (.*) ("at least one char, but not a newline or a space"). Or if they always follow that format of 5 character groups delimited by a dash, we can work with that.
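
Both ideas can be sketched in Python as follows. The sample line is assembled from values mentioned in this thread, the dash-delimited pattern is a hypothetical guess at "5 character groups delimited by a dash" (letters and digits only), and Python's `re` is a stand-in for the Workflow regex engine.

```python
import re

# Hypothetical line as it might appear after a row-wise OCR.
line = "370100 SW-01-044-13-4 163.0\r\n"

# Option 1: one or more characters that are neither newlines nor whitespace.
no_spaces = re.search(r"370100\s*([^\r\n\s]+)", line)

# Option 2 (assumed format): five alphanumeric groups joined by dashes.
five_groups = re.search(r"370100\s*([A-Za-z0-9]+(?:-[A-Za-z0-9]+){4})", line)

print(no_spaces.group(1))    # SW-01-044-13-4
print(five_groups.group(1))  # SW-01-044-13-4
```

Option 1 breaks if an address ever contains a space. Option 2 is stricter and would simply fail to match a line whose address does not follow the five-group format.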

replied on August 19, 2021

Your new combination works, as well as \s*(.*)[^\d?\.\r\n]

But: when testing the regex, I'm getting the right result. When testing the workflow, I'm getting either a blank result or the roll number (365700) instead of the address (SE-28-043-13-4).

 

When actually running the workflow:

replied on August 20, 2021

Right, because you're using it on the Roll Number value. You have 365700 and are trying to get any character or any character that's not a newline or a space. So that matches all digits there.

So we need to look at why your Roll Number token returns just the number and not the rest of the line. How is that token defined?
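
Here is a rough Python sketch of that effect, assuming the pattern input is the bare token value rather than the page text. The page line is made up from values in this thread, and Python's `re` stands in for the Workflow regex engine.

```python
import re

token_value = "365700"                      # what the Roll Number token holds
page_line = "365700 SE-28-043-13-4 163.0"   # hypothetical line of page text

# Pattern without the roll number in front, run against the token value:
# the only non-space characters available are the digits themselves.
capture_only = re.compile(r"\s*([^\r\n\s]+)")
print(capture_only.search(token_value).group(1))  # 365700

# Pattern with the roll number in front, run against the token value:
# nothing is left after the digits, so there is no match at all.
anchored = re.compile(re.escape(token_value) + r"\s*([^\r\n\s]+)")
print(anchored.search(token_value))  # None

# The same anchored pattern, run against the page text, finds the address.
print(anchored.search(page_line).group(1))  # SE-28-043-13-4
```

That would explain seeing either the roll number or a blank result: the pattern is fine, but the input it runs on does not contain the rest of the line.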

replied on August 20, 2021

Here is a look at the Roll # config. What I need it to do is take the roll # from the parent folder name, whose naming convention requires 8 digits (because of how another workflow creates the folders for me). For testing, I'm telling it to remove two zeros so I get the 6-digit test number.
