
Question

Pattern matching on data with dots and dashes

asked on September 10, 2014

 

Hi

We are trying to extract information with pattern matching from fields that can contain dots and dashes. When we use the pattern REF:\s*((\w*\s*)+)((\w*\.\s*)+), it sometimes works with names that have several words and dots, but on other occasions it retrieves only one or two words and leaves out the rest. On occasion it also duplicates the last word. We also have documents that contain dashes; how can we do the same thing with dashes?

 

Thanks for the help.

 

Ricardo Cairo

sample.pdf (27.05 KB)


Replies

replied on September 10, 2014

The attached PDF is not useful since the data is redacted. Can you post a few sample strings where pattern matching does not return what you want?

replied on September 10, 2014

Hi Miruna

A few samples:

 

REF: THE MARK TRAVEL - GRUPOS

REF: EXOTIC DIAL & EXCHANGE MEMBER - EXDEM

REF: THE MARK TRVL. CORP. FUNJET VAC.

REF: MAR REAL ESTATES S.L.U.

 

 

 

 

replied on September 10, 2014

Have you tried 

 

REF:\s*(.+)

replied on September 10, 2014

Yes, but it takes the whole line and more. How can I stop the retrieval at the word HOTEL?

And it won't retrieve the dots and dashes.

 

replied on September 10, 2014

REF:\s*([^\n]+\w)\s+HOTEL  

 

The [^\n] regular expression covers any character except for newline. It should take care of matching any periods or dashes that you may have. The \w ensures you're not picking up any additional spaces that may appear in between the value and the word "HOTEL".
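
To make the difference concrete, here is a minimal sketch using Python's `re` module, chosen only for illustration since the thread does not say which regex engine the product uses. The text after the REF value ("HOTEL: EXAMPLE RESORT") is a hypothetical placeholder, because the real document was redacted.

```python
import re

# Hypothetical OCR lines: only the REF values come from the thread,
# the "HOTEL: EXAMPLE RESORT" tail is invented for illustration.
line = "REF: THE MARK TRAVEL - GRUPOS   HOTEL: EXAMPLE RESORT"
dotted = "REF: THE MARK TRVL. CORP. FUNJET VAC.   HOTEL: EXAMPLE RESORT"

# Unbounded pattern: the capture runs to the end of the line.
greedy = re.search(r"REF:\s*(.+)", line)

# Bounded pattern: the capture must end on a word character (\w)
# that is followed by whitespace and the literal word HOTEL.
bounded = re.search(r"REF:\s*([^\n]+\w)\s+HOTEL", line)

# Same bounded pattern on a value that ends in a period.
bounded_dotted = re.search(r"REF:\s*([^\n]+\w)\s+HOTEL", dotted)

print(greedy.group(1))
print(bounded.group(1))
print(bounded_dotted)
```

In this sketch the bounded pattern keeps the dash inside the value, but it finds no match when the value ends in a period, because `\w` does not match ".".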

replied on September 11, 2014

Thanks Miruna, this solution works for picking up all the information, but it doesn't retrieve the dots and dashes. And on this one, REF: THE MARK TRVL. CORP. FUNJET VAC., it does not pick up anything.

 

 

replied on September 11, 2014

Miruna,
Another thing: we have another field that picks up the invoice number, which we use to name the document. When we put the pattern REF:\s*([^\n]+\w)\s+HOTEL: on the field TTOO, the first one doesn't pick up anything.
 

replied on September 11, 2014

My mistake on the dots and dashes: we were using the pattern \w+ to remove the spaces, but it also removes the dashes.

But we found another thing: the pattern REF:\s*([^\n]+\w)\s+HOTEL: removes all the spaces in other places, not only the trailing spaces.

For example, EXOTIC DIAL&EXCHANGE MEMBER-EXDEM should be EXOTIC DIAL & EXCHANGE MEMBER - EXDEM.

SELECTED ANSWER
replied on September 11, 2014

REF:\s*([^\n]+[^\s])\s*HOTEL is a more permissive version. It should return the expected values on "REF: THE MARK TRVL. CORP. FUNJET VAC.".

 

I'm not sure what you mean about the pattern not working on the TTOO field. Did you mean that you used it as-is in the TO field? It's not expected to work since you don't have the "REF:" starting value or the "HOTEL" at the end.

 

I can't reproduce the part about the missing spaces. Are you sure they exist in your OCRed text?
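
As a rough check of the permissive pattern against the sample values posted earlier in the thread, here is a short sketch in Python's `re` module (an assumption made for illustration; the engine behind the product may differ in details). The helper name `extract_ref` and the "HOTEL: EXAMPLE" tail are hypothetical.

```python
import re

# Permissive pattern from the reply above: the capture may end on any
# non-whitespace character, so trailing periods are kept.
PATTERN = re.compile(r"REF:\s*([^\n]+[^\s])\s*HOTEL")

# REF values posted in this thread.
samples = [
    "THE MARK TRAVEL - GRUPOS",
    "EXOTIC DIAL & EXCHANGE MEMBER - EXDEM",
    "THE MARK TRVL. CORP. FUNJET VAC.",
    "MAR REAL ESTATES S.L.U.",
]


def extract_ref(line):
    """Return the text between 'REF:' and 'HOTEL', or None if nothing matches."""
    match = PATTERN.search(line)
    return match.group(1) if match else None


# The "   HOTEL: EXAMPLE" tail is a made-up stand-in for the redacted text.
results = [extract_ref(f"REF: {value}   HOTEL: EXAMPLE") for value in samples]

for value in results:
    print(value)
```

Under these assumptions each value comes back with its dots, dashes, and inner spaces intact, and without the trailing spaces before HOTEL. If spaces are still missing in the real output, that points at the OCR text and not at the pattern.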

replied on September 11, 2014

No, the OCR text does not have the spaces. Is this an issue with the OCR engine?

TTOO is the field where I store the result of the pattern. When I use this pattern, I get several invoices where the invoice number is not retrieved (I use another pattern to retrieve the invoice number).

replied on September 17, 2014

I have found that OCRing at "Balanced" tends to remove spaces where the kerning of the font is very tight (i.e. the spaces are very small), while "Accurate" doesn't remove as many.

Of course, "Accurate" is quite a bit slower.

replied on October 6, 2014

 

Hi


We are still having trouble with this session. The session separates the invoices, but as empty documents with no data in the fields. Out of 8 invoices, it identifies only 2. We have tried different settings and it is still not working.

 
