Question

Pattern Matching question about filtering out multiple parts of file name

asked on November 9, 2017

I'm writing a workflow to populate a template with info based on the name of the file.  

The files will be formatted like this example:  MISC_RTI_123016  

The template has 3 fields.

I need to populate one field with MISC, another field with RTI, and another field with 123016.

The length and the name of the file vary, but it will always be TEXT_TEXT_NUMBER.

What would be the correct regex to use with pattern matching in order to isolate each section of the file name?

Answer

SELECTED ANSWER
replied on November 9, 2017

There are a lot of different regular expressions that would work:

TEXT_TEXT_NUMBER

For the first TEXT, you could use: ^([A-Za-z]+) or [A-Za-z]+ (the second one is not anchored to the start, so it will also match the second text string; if you use it, specify that you want only the first match)

For the second TEXT, you could use: _([A-Za-z]+)_

For the NUMBER, you could use: _(\d+) or just \d+
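
To see what each of those expressions pulls out, here is a minimal sketch that uses Python's re module as a stand-in. That is an assumption: the workflow's own pattern matching may differ in small details, and the variable names are only for illustration.

```python
import re

# Example file name from the question.
name = "MISC_RTI_123016"

# ^ anchors to the start, so this can only match the first text section.
first = re.search(r"^([A-Za-z]+)", name).group(1)

# Letters that sit between two underscores: the second text section.
second = re.search(r"_([A-Za-z]+)_", name).group(1)

# Digits that directly follow an underscore: the number section.
number = re.search(r"_(\d+)", name).group(1)

print(first, second, number)  # MISC RTI 123016
```

The parentheses mark the part that is returned; the underscores are matched but left out of the result.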

replied on November 9, 2017

Perfect!  Thank you for your quick reply.

replied on November 9, 2017

One problem.  I didn't notice that some files are text_text-text_number.

The files with the hyphen in the middle text are coming back with that field blank in the template.  What should I add in the regex to accommodate the hyphen?

replied on November 9, 2017

Sure. One thing to note is that square brackets "[...]" define the set of characters you want to match. If you want to account for the hyphen, you just need to add it inside the brackets. Putting it last keeps it a literal hyphen rather than part of a range like A-Z. For you that would look like: _([A-Za-z-]+)_

This finds all letters A-Z, a-z, and hyphens that are in between your underscores.
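
A quick way to check the difference is the sketch below, again assuming Python's re module as a stand-in for the workflow's pattern matching. The file name MISC_RTI-ABC_123016 is a made-up example of the text_text-text_number shape.

```python
import re

# Hypothetical file name with a hyphen in the middle section.
name = "MISC_RTI-ABC_123016"

# Without the hyphen in the brackets nothing matches, which is why
# the field came back blank.
without_hyphen = re.search(r"_([A-Za-z]+)_", name)

# With the hyphen added to the character set, the whole middle section matches.
with_hyphen = re.search(r"_([A-Za-z-]+)_", name)

print(without_hyphen)        # None
print(with_hyphen.group(1))  # RTI-ABC
```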

 

replied on November 9, 2017

Thank you!  That helps a lot.  I'll add that to my notes.

replied on November 9, 2017

Now I have another group of files that are named ABCD_12345_67890

The regex I was using no longer works because \d+ isolates the first set of numbers.  How do I isolate the last set of numbers?

replied on November 9, 2017

\d+$

$ denotes the end of a string. Conversely, ^ denotes the start of a string. 
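
As a rough illustration of the anchor, the sketch below compares the two expressions on the file name from the question. It assumes Python's re module as a stand-in for the workflow's pattern matching.

```python
import re

# File name from the question, with two groups of digits.
name = "ABCD_12345_67890"

# Unanchored: the search stops at the first run of digits it finds.
first_digits = re.search(r"\d+", name).group()

# Anchored with $: only a run of digits that ends the string can match.
last_digits = re.search(r"\d+$", name).group()

print(first_digits)  # 12345
print(last_digits)   # 67890
```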

replied on November 9, 2017

You've been a big help.  Thanks for the helpful link.  That will come in handy!

replied on November 10, 2017

You could also use a base expression like this:

[^_]+_[^_]+_[^_]+

The [^_] means any character other than the underscore, and the + after it means to match as many consecutive instances as it finds.

Then you wrap the section that you want returned in parentheses, like this: ([^_]+), and you create 3 tokens with the expressions shown below:

First Token = ([^_]+)_[^_]+_[^_]+

Second Token = [^_]+_([^_]+)_[^_]+

Third Token = [^_]+_[^_]+_([^_]+)
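
The three token expressions can be tried against the file names from earlier in the thread with a sketch like the one below. It assumes Python's re module as a stand-in for the workflow's pattern matching; split_name and TOKEN_PATTERNS are hypothetical names used only for this example.

```python
import re

# The three token expressions from the post, one capture group each.
TOKEN_PATTERNS = (
    r"([^_]+)_[^_]+_[^_]+",
    r"[^_]+_([^_]+)_[^_]+",
    r"[^_]+_[^_]+_([^_]+)",
)


def split_name(name):
    """Return the section captured by each token expression (hypothetical helper)."""
    return tuple(re.search(p, name).group(1) for p in TOKEN_PATTERNS)


tokens_a = split_name("MISC_RTI-ABC_123016")
tokens_b = split_name("ABCD_12345_67890")

print(tokens_a)  # ('MISC', 'RTI-ABC', '123016')
print(tokens_b)  # ('ABCD', '12345', '67890')
```

Because [^_] only excludes the underscore, the same three expressions handle the hyphenated names and the names with two groups of digits without any changes.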
