
Question

Help with Regular Expression string required to extract data to a token

asked on October 24, 2019

I am trying to use a pattern-matching regular expression to extract the 9 characters in the middle of the file names below to a token. In the 1st example below, I need "9Z0958097" extracted to a token, and in the 2nd example I need "190188322". The string I need is always 9 characters (either only numbers or a mix of numbers and letters).

If it helps, the file name always ends with the same string of characters - _PUY_PUM_NOT

Any help with how to frame the regular expression would be appreciated.  Thank you.

Example File Names Below.

JOHN_D_DOE_A_9Z0958097_PUY_PUM_NOT.PDF

JOHN_DOE_190188322_PUY_PUM_NOT.PDF

JOHN_D_DOE_C00072108_PUY_PUM_NOT.PDF
 


Replies

replied on October 25, 2019

On the assumption that the sequence of 9 letters and numbers is always preceded by an underscore and followed by an underscore, the following pattern should do the trick:

_([\w\d]{9})_

\w = any word character
\d = any digit character (0 - 9)
Parentheses indicate that only the characters within them should be extracted
{9} = Finds a sequence of 9 characters that match the \w\d pattern

replied on October 25, 2019

Yes, that worked, thank you, but for the following file it throws the following result. I am confused as to why that is the case. Any ideas? It should work correctly based on the expression you provided.

e.g. E_C_COMPANY_190186961_PUY_PUM_NOT

Result = C_COMPANY

replied on October 25, 2019

This is the original one I used before shortening it. The \w class treats the underscore as a word character, which is why the shorter pattern matched C_COMPANY - welcome to the quirky world of regular expressions! Trial and error is key!

_([a-zA-Z0-9]{9})_
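
To see why the shorter pattern returned C_COMPANY, the two patterns can be compared side by side. This is a minimal sketch that uses Python's re module purely for illustration; the regex engine inside the workflow tool may differ in small ways, and the helper name first_group is made up for this example.

```python
import re

def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

name = "E_C_COMPANY_190186961_PUY_PUM_NOT"

# \w matches letters, digits AND the underscore,
# so "C_COMPANY" counts as 9 word characters between two underscores.
with_word_class = first_group(r"_([\w\d]{9})_", name)

# An explicit class leaves the underscore out,
# so the 9 characters can only be letters or digits.
with_explicit_class = first_group(r"_([a-zA-Z0-9]{9})_", name)

print(with_word_class)      # C_COMPANY
print(with_explicit_class)  # 190186961
```

The engine returns the leftmost match it can find, so the earlier C_COMPANY wins over the case number whenever the character class lets it through.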

replied on October 25, 2019

This works with all the examples so far. I will test it out further. Thanks again for your quick response. Much appreciated!!

replied on October 31, 2019

Nigel, For some reason, it doesn't work in the following scenario. I cannot figure out a way to file this, any ideas?

BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF

replied on November 4, 2019

Hi Amit - in that case it is returning "NATHANIEL" because it matches the sequence of 9 letters or numbers within underscore characters. 
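
The behaviour described above can be reproduced with a short sketch. It assumes Python's re module as a stand-in for the workflow tool's regex engine, which may not behave identically.

```python
import re

pattern = re.compile(r"_([a-zA-Z0-9]{9})_")
name = "BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF"

# search() stops at the first (leftmost) match, which is the 9-letter name.
first = pattern.search(name).group(1)

# findall() lists every non-overlapping match, showing that the
# case number is also a candidate, just not the first one.
candidates = pattern.findall(name)

print(first)       # NATHANIEL
print(candidates)  # ['NATHANIEL', '9Z0519242']
```

Both substrings satisfy the pattern, so the expression needs an extra constraint to tell the name apart from the case number.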

Are there any other patterns in there we can take for granted? For example, does "PUY_PUM_NOT" always appear in the filename? Or maybe just "PUY"? Or does the string you want to return always end with a number?

replied on November 4, 2019

Can you assume there are at least 2 underscores before the data to extract?

[^_]+_[^_]+_([^_]{9})_.*

It avoids the false hit on BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF, but it cannot eliminate false hits completely: BROOKS_LYNN_NATHANIEL_9Z0519242_PUY_PUM_NOT.PDF would still return NATHANIEL.
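
A quick check of that expression against both file names, sketched with Python's re module as an assumed stand-in for the workflow tool's engine (the function name extract is only for illustration):

```python
import re

# Two underscore-delimited fields must come before the 9-character capture.
pattern = re.compile(r"[^_]+_[^_]+_([^_]{9})_.*")

def extract(name):
    """Return the captured 9 characters, or None when nothing matches."""
    match = pattern.search(name)
    return match.group(1) if match else None

good = extract("BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF")
false_hit = extract("BROOKS_LYNN_NATHANIEL_9Z0519242_PUY_PUM_NOT.PDF")

print(good)       # 9Z0519242
print(false_hit)  # NATHANIEL
```

In the second name the 9-letter field happens to be the third field, so it satisfies the pattern before the case number is ever reached.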

 

Can you assume that there will always be only 3 underscores following the desired text?

_([^_]{9})_[^_]+_[^_]+_[^_]*$
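
Anchoring to the end of the name can be tried out the same way. The sketch below assumes Python's re module in place of the workflow tool's engine and uses file names taken from this thread.

```python
import re

# After the capture there must be exactly three more underscores
# before the end of the string ($).
pattern = re.compile(r"_([^_]{9})_[^_]+_[^_]+_[^_]*$")

names = [
    "BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF",
    "BROOKS_LYNN_NATHANIEL_9Z0519242_PUY_PUM_NOT.PDF",
    "E_C_COMPANY_190186961_PUY_PUM_NOT",
]

results = [pattern.search(name).group(1) for name in names]
print(results)  # ['9Z0519242', '9Z0519242', '190186961']
```

Because the count is taken from the end of the name, a 9-letter name earlier in the string no longer matches: too many underscores follow it.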

 

replied on November 4, 2019

Yes, PUY_PUM_NOT always appears at the very end of the filename.

And the string always ends with a number.

There is only 1 underscore between the names and case numbers.

Is it not possible to look for _PUY and then extract the 9 letters or numbers before this string?

BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF

 

If you can help me figure out the above for the workflow, then the 2nd workflow I am working on may or may not be easier. This one does not have any underscores (only spaces) between the names and always ends with PUY PUM.pdf. This expression is independent of the one above.

The case number, as before, is 9 characters: either all numbers or a mix of numbers and letters. I have listed the 3 formats below.

BROOKS NATHANIEL LYNN C00519242 PUY PUM.PDF 

BROOKS NATHANIEL LYNN 9Z0519242 PUY PUM.PDF 

BROOKS NATHANIEL LYNN 900519242 PUY PUM.PDF 

 

Thanks for the help, I really appreciate it.

replied on November 4, 2019

I got this to work using the following - (\S{9})_PUY_PUM_NOT

and for the 2nd one -  (\S{9})\sPUY\sPUM
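
For anyone who wants to verify these two expressions outside the workflow, here is a minimal sketch. It assumes Python's re module, which may differ slightly from the workflow tool's engine; the variable names are illustrative only.

```python
import re

# The literal suffix pins the match position, so the capture is simply
# the 9 non-whitespace characters right before it.
underscore_pattern = re.compile(r"(\S{9})_PUY_PUM_NOT")
space_pattern = re.compile(r"(\S{9})\sPUY\sPUM")

underscore_names = [
    "JOHN_D_DOE_A_9Z0958097_PUY_PUM_NOT.PDF",
    "E_C_COMPANY_190186961_PUY_PUM_NOT",
    "BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF",
]
space_names = [
    "BROOKS NATHANIEL LYNN C00519242 PUY PUM.PDF",
    "BROOKS NATHANIEL LYNN 9Z0519242 PUY PUM.PDF",
    "BROOKS NATHANIEL LYNN 900519242 PUY PUM.PDF",
]

underscore_results = [underscore_pattern.search(n).group(1) for n in underscore_names]
space_results = [space_pattern.search(n).group(1) for n in space_names]

print(underscore_results)  # ['9Z0958097', '190186961', '9Z0519242']
print(space_results)       # ['C00519242', '9Z0519242', '900519242']
```

Note that \S also matches an underscore, so these expressions rely on the case number always being exactly 9 characters long.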

All is well.  Thanks for your help guys!!

replied on November 5, 2019

If the desired data is guaranteed to always be 9 characters long and always end in a number, then an expression like this should work:

[\s_]([^\s_]{8}\d)[\s_]

[\s_] - matches one whitespace character (such as a space) or an underscore

Then it starts the capture group

[^\s_]{8} - matches 8 characters that are neither whitespace nor an underscore

\d - matches any single numeric character 0 - 9

Then it ends the capture group and is followed by

[\s_] - matches one whitespace character (such as a space) or an underscore
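
The breakdown above can be checked against both file-name styles from this thread. This is a minimal sketch assuming Python's re module rather than the workflow tool's own engine; the function name case_number is hypothetical.

```python
import re

# Delimiter, then 8 non-delimiter characters plus a final digit, then delimiter.
pattern = re.compile(r"[\s_]([^\s_]{8}\d)[\s_]")

def case_number(name):
    """Return the 9-character case number, or None when nothing matches."""
    match = pattern.search(name)
    return match.group(1) if match else None

names = [
    "BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF",
    "JOHN_D_DOE_C00072108_PUY_PUM_NOT.PDF",
    "E_C_COMPANY_190186961_PUY_PUM_NOT",
    "BROOKS NATHANIEL LYNN C00519242 PUY PUM.PDF",
    "BROOKS NATHANIEL LYNN 900519242 PUY PUM.PDF",
]

results = [case_number(n) for n in names]
print(results)
# ['9Z0519242', 'C00072108', '190186961', 'C00519242', '900519242']
```

NATHANIEL is skipped here because its ninth character is a letter, not a digit.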

 

With regular expressions there are many ways to get the same result, so you always need to provide as many constants as possible. Those constants are the patterns the regular expression can be built on.

replied on November 5, 2019

Hi Bert, yes, it's always 9 characters and ends in a number.

I tested your expression and this works for both cases. Very helpful!!

Thanks a lot. This has been a great learning exercise.

~Amit
