Help with Regular Expression string required to extract data to a token

replied on October 25, 2019

Assuming the sequence of 9 letters and numbers is always preceded and followed by an underscore, the following pattern should do the trick:

_([\w\d]{9})_

\w = any word character
\d = any digit
Parentheses mark the capture group, so only the characters within them are extracted
{9} = matches a sequence of exactly 9 characters from the [\w\d] class
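To illustrate the capture group described above, here is a minimal sketch using Python's `re` module. The thread does not say which regex engine is in use, so details may differ there, and the filename below is a hypothetical example rather than one taken from the thread.

```python
import re

# Pattern from the reply above.
PATTERN = re.compile(r"_([\w\d]{9})_")

# Hypothetical filename, made up for illustration only.
filename = "SMITH_190186961_PUY"

match = PATTERN.search(filename)
whole_match = match.group(0)  # full match, including the surrounding underscores
token = match.group(1)        # only the text inside the parentheses

print(whole_match)
print(token)
```

The underscores are part of the match but sit outside the parentheses, so they are not part of the extracted value.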

replied on October 25, 2019

Yes, that worked, thank you, but for the following filename it returns the result shown below. I am confused as to why that is the case. Any ideas? I expected it to work based on the expression you provided.

e.g. E_C_COMPANY_190186961_PUY_PUM_NOT

Result = C_COMPANY

replied on October 25, 2019

This is the original one I used, but I later shortened it. \w must treat the underscore as a word character. Welcome to the quirky world of regular expressions! Trial and error is key!

_([a-zA-Z0-9]{9})_
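The difference between the two patterns can be checked with a short sketch. It uses Python's `re` module as a stand-in, which is an assumption, since the thread does not name its regex engine; the filename is the one quoted earlier in the thread.

```python
import re

filename = "E_C_COMPANY_190186961_PUY_PUM_NOT"

# \w covers letters, digits, and the underscore itself.
underscore_is_word_char = re.fullmatch(r"\w", "_") is not None

# Shortened pattern: [\w\d] can swallow underscores, so the first
# 9-character run between two underscores wins.
short_result = re.search(r"_([\w\d]{9})_", filename).group(1)

# Explicit class: letters and digits only, so underscores stay delimiters.
explicit_result = re.search(r"_([a-zA-Z0-9]{9})_", filename).group(1)

print(underscore_is_word_char, short_result, explicit_result)
```

With the shortened pattern, "C_COMPANY" qualifies as nine word characters because the inner underscore counts as one of them.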

replied on October 25, 2019

This works with all the examples so far. I will test it out further. Thanks again for your quick response. Much appreciated!!

replied on October 31, 2019

Nigel, for some reason it doesn't work in the following scenario. I cannot figure out a way to file this. Any ideas?

BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF

replied on November 4, 2019

Hi Amit, in that case it is returning "NATHANIEL" because that is the first sequence of 9 letters or numbers between underscore characters.

Are there any other patterns in there we can take for granted? For example, does "PUY_PUM_NOT" always appear in the filename? Or maybe just "PUY"? Or does the string you want to return always end with a number?
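The false match on "NATHANIEL" can be reproduced with a small sketch, again assuming Python's `re` module as a stand-in for whatever engine the workflow uses.

```python
import re

filename = "BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF"
pattern = re.compile(r"_([a-zA-Z0-9]{9})_")

# search() returns the leftmost match, which is the 9-letter name.
first_hit = pattern.search(filename).group(1)

# findall() shows that the case number also matches, just later in the string.
all_hits = pattern.findall(filename)

print(first_hit)
print(all_hits)
```

Both candidates satisfy the pattern, so an extra constraint is needed to tell the name apart from the case number.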

replied on November 4, 2019

Can you assume there are at least 2 underscores before the data to extract?

[^_]+_[^_]+_([^_]{9})_.*

It avoids false hits such as BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF, but cannot eliminate them completely: BROOKS_LYNN_NATHANIEL_9Z0519242_PUY_PUM_NOT.PDF would still produce a false hit.

Can you assume that there will always be only 3 underscores following the desired text?

_([^_]{9})_[^_]+_[^_]+_[^_]*$
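A quick comparison of the two patterns above on both filenames, sketched with Python's `re` module (an assumption; the thread's engine is not stated). The helper `extract` is a hypothetical name introduced only for this example.

```python
import re

# Requires at least two underscore-separated parts before the token.
AT_LEAST_TWO_BEFORE = re.compile(r"[^_]+_[^_]+_([^_]{9})_.*")
# Requires exactly three underscores from the end of the token to the end of the name.
THREE_AFTER = re.compile(r"_([^_]{9})_[^_]+_[^_]+_[^_]*$")

names = [
    "BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF",
    "BROOKS_LYNN_NATHANIEL_9Z0519242_PUY_PUM_NOT.PDF",
]


def extract(pattern, name):
    """Return the first capture group, or None when nothing matches."""
    m = pattern.search(name)
    return m.group(1) if m else None


before_results = [extract(AT_LEAST_TWO_BEFORE, n) for n in names]
after_results = [extract(THREE_AFTER, n) for n in names]

print(before_results)
print(after_results)
```

Anchoring to the end of the filename with `$` is what lets the second pattern ignore a 9-letter name earlier in the string.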

replied on November 4, 2019

Yes, PUY_PUM_NOT always appears at the very end of the filename.

And the string always ends with a number.

There is only 1 underscore between the names and case numbers.

Is it not possible to look for _PUY and then extract the 9 letters or numbers before this string?

BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF

If you can help me figure out the above for the workflow, then the 2nd workflow I am working on may be easier too. This one does not have any underscores (only spaces) between the names and always ends with PUY PUM.pdf. This expression is independent of the one above.

The case number, as before, is 9 numbers or letters: it could be all numbers or a mix of numbers and letters. I have listed the 3 formats below.

BROOKS NATHANIEL LYNN C00519242 PUY PUM.PDF

BROOKS NATHANIEL LYNN 9Z0519242 PUY PUM.PDF

BROOKS NATHANIEL LYNN 900519242 PUY PUM.PDF

Thanks for the help, I so appreciate it.

replied on November 4, 2019

I got this to work using the following: (\S{9})_PUY_PUM_NOT

and for the 2nd one: (\S{9})\sPUY\sPUM

All is well. Thanks for your help guys!!
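As a rough check, the two expressions can be run against the filenames from the thread. This sketch assumes Python's `re` module, which may not behave identically to the engine used in the workflow.

```python
import re

UNDERSCORE_FORM = re.compile(r"(\S{9})_PUY_PUM_NOT")
SPACE_FORM = re.compile(r"(\S{9})\sPUY\sPUM")

underscore_name = "BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF"
space_names = [
    "BROOKS NATHANIEL LYNN C00519242 PUY PUM.PDF",
    "BROOKS NATHANIEL LYNN 9Z0519242 PUY PUM.PDF",
    "BROOKS NATHANIEL LYNN 900519242 PUY PUM.PDF",
]

# The fixed suffix pins the 9 captured characters to the case number.
underscore_result = UNDERSCORE_FORM.search(underscore_name).group(1)
space_results = [SPACE_FORM.search(n).group(1) for n in space_names]

print(underscore_result)
print(space_results)
```

Note that \S accepts any non-whitespace character, including underscores, so these expressions rely entirely on the fixed suffix and on the case number always being exactly 9 characters long.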

replied on November 5, 2019

If the desired data is guaranteed to always be 9 characters long and always end in a number, then an expression like this should work:

[\s_]([^\s_]{8}\d)[\s_]

[\s_] - matches a whitespace character (such as a space) or an underscore

Then it starts the capture group

[^\s_]{8} - matches 8 characters that are neither whitespace nor an underscore

\d - any single numeric character 0 - 9

Then it ends the capture group and is followed by

[\s_] - matches a whitespace character (such as a space) or an underscore
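The breakdown above can be tried on all four filename formats from the thread. This is a minimal sketch using Python's `re` module, which is an assumption since the thread does not state its regex engine.

```python
import re

# Delimiter, then 8 non-delimiter characters plus a final digit, then delimiter.
PATTERN = re.compile(r"[\s_]([^\s_]{8}\d)[\s_]")

names = [
    "BROOKS_NATHANIEL_LYNN_9Z0519242_PUY_PUM_NOT.PDF",
    "BROOKS NATHANIEL LYNN C00519242 PUY PUM.PDF",
    "BROOKS NATHANIEL LYNN 9Z0519242 PUY PUM.PDF",
    "BROOKS NATHANIEL LYNN 900519242 PUY PUM.PDF",
]

# "NATHANIEL" is 9 characters long but ends in a letter, so it is skipped.
results = [PATTERN.search(n).group(1) for n in names]

print(results)
```

Because the pattern needs a delimiter on both sides, it would not match a case number sitting at the very start or very end of a filename.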

With Regular Expressions, there are many ways to get the same result, so always provide as many constants as possible. Those fixed patterns are what the Regular Expression can be built on.

replied on November 5, 2019

Hi Bert, yes, it's always 9 characters and ends in a number.

I tested your expression and this works for both cases. Very helpful!!

Thanks a lot. This has been a great learning exercise.

~Amit
