
Question


Quickfields Pattern Matching regex help

asked on August 25, 2017

Hi all,

 

I have a process I'm working on in Quickfields (and I'm somewhat of a newb) and I need help with part of it. I have PDF files that are being brought in that are certificates of analysis for chemicals we have sold in the past. I need to search the contents of the documents and return back certain information to be applied as Metadata. The screenshot attached is from the relevant portion of the files. I want to pull out from these documents the text directly after the fields: Customer, Product, Batch Number, and PO Number. I am having trouble with the regex's I've tried and to be honest I'm really shooting in the dark on these.

 

Thanks in advance.

text identification.png

Replies

replied on August 25, 2017

Hi Preston - 

Regex is its own little mini programming language, and you have to use it on a fairly regular basis to get good at it.  There are two good RegEx editors that can help quite a bit:  RegexBuddy and Expresso.  You can also use the testing tool in the Workflow Designer for the Pattern Matching task.

Anyway, this one is pretty easy due to the labels on the left.  

The first expression would be:

^Customer:\s\w+  

The hat (^) means "starts with": it anchors the match to the start of the line (or of the whole input, depending on the engine's multiline setting).  "Customer:" is the literal label you are looking for, followed by a single whitespace character (\s).  Then \w+ matches one or more "word" characters, as many as it can find (the + is greedy).  A word character can be a letter, a digit, or an underscore. That is the value you are looking for.  Once you find it, you need to get the value and not the label, and this is done with the parens around \w+

So the full expression should be:  

^Customer:\s(\w+)

In plain English, find this phrase and then grab the last part.  Miruna is the master of RegEx, BTW.
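To see what the parentheses change, here is a minimal sketch in Python's re module. This is only for illustration: Quick Fields uses its own pattern matching engine, which may behave a little differently, and the sample text and variable names are made up.

```python
import re

# Hypothetical OCR text; the real certificates will differ.
text = "Customer: Acme\nProduct: Toluene\n"

# re.MULTILINE lets ^ match at the start of every line, not only at the start of the text.
pattern = re.compile(r"^Customer:\s(\w+)", re.MULTILINE)

match = pattern.search(text)
label_and_value = match.group(0)  # the whole match, label included
value = match.group(1)            # only what the parentheses captured

print(label_and_value)
print(value)
```

Group 0 is the whole match including the label, while group 1 is just the captured value, which is the part you would want as metadata.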

 

 

replied on August 25, 2017

Thank you! Very helpful, however, bit of a problem. This works for the format Customer: Customer Name. Our documents are apparently set up to look more like this:

 

Customer:            Customer Name

 

Note that there are two spaces and two tabs from what I can tell. I tried adding in a bunch of \s's to accommodate the spaces, and that doesn't work. Could you point me in the right direction?

replied on August 25, 2017

Try using ^Customer:\s+(\w+)

The + sign means one or more of whatever comes right before it, so \s+ matches any run of spaces and tabs, however long.
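A quick way to check the difference, sketched in Python's re module (an assumption for illustration only; the sample line with two spaces and two tabs is made up to mimic the layout described above):

```python
import re

# Hypothetical line: label, two spaces, two tabs, then the value.
line = "Customer:  \t\tAcme"

# A single \s consumes one space; the next character is another space, not a word character.
single = re.search(r"^Customer:\s(\w+)", line)

# \s+ consumes the whole run of spaces and tabs before the value.
flexible = re.search(r"^Customer:\s+(\w+)", line)

print(single)             # no match object is returned here
print(flexible.group(1))  # the captured value
```

Note that \w+ still stops at the first space, apostrophe, or period, so a multi-word name would be cut short with this pattern.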

replied on August 25, 2017

To simplify, I've looked at all the options in the OCR process, and if I decolumnize the OCR'd text, I can just write a regex to grab everything on the line below where the pattern text is found. This will help because we have customers that have ' and . etc. in their names, and in testing, the original regex won't work for them. How could I write something that grabs the line below my pattern?

replied on August 25, 2017

Hi Preston - To clarify, the problem is that you need to grab all of the text to the right of the label, including words separated by spaces?  If so, there is a way to do that...and it is far easier than looking at lines below.  I am not even sure RegExs can do that.

replied on August 26, 2017

OK, back for round 2.  This was challenging in a way that I did not expect.  First, I forgot to mention https://regex101.com, an awesome web-based resource.  There are two questions here: 1) How to get everything on a line, including variable numbers of words and spaces, and 2) How to get everything on the next line.  I would use one OCR zone per line, as I think the results will be more predictable.  With that, this RegEx would work:

^Customer:\s+([^\r\n]+)

Which reads: Find "Customer:", then one or more whitespace characters (\s+), then capture one or more characters that are not a Carriage Return or New Line ([^\r\n]+), which stops the capture at the end of the line.  In other words, extract everything between the spaces and the end of the line. That will give you "ABC", "ABC Co", or "ABC Co Inc".  However, the OCR engine may not give you an end-of-line character.  In that case, your logic would be: capture everything between the spaces and some character (or group of characters) that you know will never appear.  If you use the pipe character, your regex would be:

^Customer:\s+([^|]+)

But you could add more improbable characters and it would be a little better qualified.  As in:

^Customer:\s+([^*|]+)
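Here is a rough sketch of the rest-of-line approach, written in Python's re module for illustration (Quick Fields' own engine may differ). The helper name field, the sample text, and the company name are all made up.

```python
import re

# Hypothetical OCR text; real output will differ.
text = "Customer:  \t\tO'Brien Chemical Co. Inc\nProduct:\tToluene 99%\n"

def field(label, text):
    """Return the text to the right of 'label:' on its line, or None if the label is absent."""
    # re.escape guards labels that contain regex metacharacters.
    pattern = re.compile(r"^" + re.escape(label) + r":\s+([^\r\n]+)", re.MULTILINE)
    match = pattern.search(text)
    return match.group(1).strip() if match else None

customer = field("Customer", text)
product = field("Product", text)
missing = field("PO Number", text)  # label not present in the sample text

# The pipe variant has no end-of-line stop, so on multi-line text it
# keeps capturing until it meets a pipe or the end of the input.
runaway = re.search(r"^Customer:\s+([^|]+)", text, re.MULTILINE).group(1)

print(customer)
print(product)
print(missing)
```

Because [^\r\n]+ accepts anything except line breaks, names with apostrophes and periods come through intact. The runaway variable shows why one OCR zone per line matters for the pipe version: without a line break to stop on, it also swallows the lines that follow. One caveat: \s+ matches line breaks too, so a label with nothing after it would pick up the following line instead.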

And it turns out you can also get the next line of text using a similar approach with the \r\n.  Here are some links to articles about that:

https://stackoverflow.com/questions/37526216/select-the-next-line-after-match-regex

https://stackoverflow.com/questions/6656215/regular-expression-to-capture-multiple-lines

But these get even more complex and I think simple is better.
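For reference, a simple version of the next-line capture might look like the sketch below. It uses Python's re module for illustration and assumes the decolumnized text puts each value on the line directly under its label; the sample text is made up and Quick Fields' engine may behave differently.

```python
import re

# Hypothetical decolumnized OCR text: each value sits on the line under its label.
text = "Customer:\nO'Brien Chemical Co. Inc\nProduct:\nToluene 99%\n"

# Label, optional trailing spaces or tabs, one line break (\r\n or \n),
# then capture everything on the following line.
pattern = re.compile(r"^Customer:[ \t]*\r?\n([^\r\n]+)", re.MULTILINE)

match = pattern.search(text)
next_line = match.group(1) if match else None

print(next_line)
```

Using [ \t]* instead of \s+ before the line break keeps the pattern from skipping over blank lines and grabbing a line further down.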

 
