Question

Need a Pattern Match for a specific format

asked on January 29, 2015

Not sure why I'm having so much trouble with this, but I'd like for someone to help me with pattern matching this particular format:

  • It's for the name of a document, which will have 4 sections. I want to grab the last section of the name. Each section will always be separated by a "space dash space" like this: " - ".
  • The "space dash space" will always occur 3 times in the name. So a sample name would be like this: "Test - Annual Budget - 2015/1/29 - Spreadsheet".
  • The first three sections of the name might have anything in there, including alphanumeric, spaces, and symbols. It can also have a dash, but there will be no spaces with the dash. An example would be "Test - Semi-Annual Budget - 2015/1/29 - Spreadsheet".
  • I want my pattern matching to disregard the first three sections, and extract the last section, which is after the third "space dash space". So in the above example, the word "Spreadsheet" would be extracted.
  • Originally I was using the format .* - .* - .* - (.*)
  • Here's the problem: that last section can literally have anything in it, including letters, numbers, symbols, spaces, and even another "space dash space." So if we have the example "Test - Semi-Annual Budget - 2015/1/29 - Spreadsheet - New Document", I would want to extract "Spreadsheet - New Document", as all of that occurs after the third "space dash space".
  • But instead, it will only extract "New Document". It is pulling any text after the last "space dash space". I want it to grab everything after the third "space dash space", including any additional "space dash spaces". (I hope that makes sense to everyone).
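The behavior described in the list above can be reproduced with a short sketch. It uses Python's `re` module purely for illustration, on the assumption that the pattern-matching tool in question treats `.*` the same way; the variable names are made up for this example.

```python
import re

# The original pattern from the question: every .* is greedy.
GREEDY = re.compile(r".* - .* - .* - (.*)")

name = "Test - Semi-Annual Budget - 2015/1/29 - Spreadsheet - New Document"

# The first .* grabs as much as it can, so the three literal " - "
# separators end up matching the LAST three separators in the name.
greedy_result = GREEDY.match(name).group(1)
print(greedy_result)  # New Document
```

The three leading `.*` groups swallow "Spreadsheet" along with everything before it, which is why only the text after the final separator is left for the capture group.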

 

I have included a screenshot of what I have so far and the behavior described above. I know there must be a way to accomplish this. Please help. Thanks.

Pattern Match Problem.jpg

Answer

SELECTED ANSWER
replied on January 29, 2015

Is the date always there? Can you match the date with the dash, and then do your (.*) after it?

replied on January 29, 2015

I'm hijacking John's awesome idea to utilize the date as a marker that your desired text is approaching.  I used the following pattern to retrieve the text:

 

/\d{2}\s-\s(.*)

 

 

You may require a slightly different pattern if the format of the input string ever gets modified.  Let us know if you have any other questions!
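As a rough illustration of that caveat, here is a sketch in Python (an assumption; the thread does not say which regex engine the tool uses). The helper name `text_after_date` is hypothetical. The pattern relies on a slash followed by exactly two digits right before the separator, so a one-digit day is enough to break it.

```python
import re

# Pattern from the reply above: "/", two digits, " - ", then capture the rest.
DATE_TAIL = re.compile(r"/\d{2}\s-\s(.*)")

def text_after_date(name):
    """Return the text after the date marker, or None if the marker is absent."""
    m = DATE_TAIL.search(name)
    return m.group(1) if m else None

two_digit_day = "Test - Semi-Annual Budget - 2015/1/29 - Spreadsheet - New Document"
one_digit_day = "Test - Semi-Annual Budget - 2015/1/5 - Spreadsheet - New Document"

print(text_after_date(two_digit_day))  # Spreadsheet - New Document
print(text_after_date(one_digit_day))  # None
```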

replied on January 30, 2015

Thanks for the input from everyone. Yes the third section is always a date, and so I went with that. I also used a part of Miruna's suggestion of "digits and slashes", which was the [0-9/]+. This will allow me to match just about any date formatting, regardless of one or two-digit month, two or four-digit year, etc. I finally settled on the following expression:

[0-9/]+ - (.+)

I'm pulling from a pre-formatted name, so I always know how each section is formatted. In my testing, this expression will always match the first section of dates that it finds, which will always occur in the 3rd section of the document's name. Even if I were to put dates in the 4th and subsequent sections, it will still grab everything after the 3rd section. The 1st and 2nd sections will never have formatting like this, so this expression looks like the winner.
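For anyone who wants to check this outside the pattern-matching tool, a minimal sketch in Python follows (Python is an assumption here, and `after_first_date` plus the sample names are invented for the example). It also shows the one case to watch for: a digit directly before the first or second separator would be picked up as the "date".

```python
import re

# Final expression from the post above: digits/slashes, " - ", then the rest.
DATE_THEN_REST = re.compile(r"[0-9/]+ - (.+)")

def after_first_date(name):
    """Return everything after the first digits-and-slashes run followed by ' - '."""
    m = DATE_THEN_REST.search(name)
    return m.group(1) if m else None

basic = "Test - Semi-Annual Budget - 2015/1/29 - Spreadsheet - New Document"
date_in_tail = "Test - Annual Budget - 1/29/15 - Copy 2015/2/1 - Final"
digit_in_first = "Report 7 - Budget - 2015/1/29 - X"

print(after_first_date(basic))           # Spreadsheet - New Document
print(after_first_date(date_in_tail))    # Copy 2015/2/1 - Final
# A section ending in a digit is matched first, so the result is too long:
print(after_first_date(digit_in_first))  # Budget - 2015/1/29 - X
```

As long as the first two sections never end in a digit or slash, which is the situation described above, the leftmost match is always the date in the third section.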

Thanks again to everyone.


Replies

replied on January 29, 2015

You should make your quantifiers for the first three groups lazy (technical term), so that they will not aggressively match "extra" space-hyphen-space sequences that may occur in the fourth group.  You can do this by using the "*?" quantifier instead of "*", like so:

.*? - .*? - .*? - (.*)

I have not extensively tested this, but it seems to work for your given example.
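Here is a small sketch of the lazy version, written in Python on the assumption that the tool's `*?` behaves like Python's. The function name `fourth_section` is hypothetical.

```python
import re

# Lazy quantifiers: each .*? stops at the first " - " it can, so the first
# three sections consume exactly three separators and (.*) keeps the rest.
LAZY = re.compile(r".*? - .*? - .*? - (.*)")

def fourth_section(name):
    """Return everything after the third ' - ', or None if there are fewer than three."""
    m = LAZY.match(name)
    return m.group(1) if m else None

print(fourth_section("Test - Annual Budget - 2015/1/29 - Spreadsheet"))
# Spreadsheet
print(fourth_section("Test - Semi-Annual Budget - 2015/1/29 - Spreadsheet - New Document"))
# Spreadsheet - New Document
```

The dash in "Semi-Annual" is not a problem because the separator in the pattern requires a space on both sides of the dash.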

replied on January 30, 2015

To expand on this concept more, I recommend reading this stackoverflow post which cites this tutorial.

The key takeaway is that the * quantifier will try to match as much as possible, but adding the ? after the * tells it to match as little as possible.

replied on January 30, 2015

That's because you're using ".*". That covers any characters and matches as much as possible, so the 4th group gets rolled into one of the previous ones. In your example, you also have dashes in the middle of what you're expecting to be the second group, so that makes it more complicated because it now has even more possibilities for matching a group. Is the 3rd group always a date?

If yes, you can try something like this: ^[^-]+ - (?:[^-]+-?[^-]+) - [0-9/]+ - (.+)$

(^ at the beginning and $ at the end tell pattern matching to look at the whole input value to match the pattern. The first group is "anything but a dash", the second one is "at most one dash in the middle of 2 words", the third one is "digits and slashes" to account for variations on the date format, but you could refine further if you know the format is always the same. Then the last is "whatever is left all the way to the end")
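A quick way to see what this stricter pattern accepts and rejects is the sketch below. It is in Python, which is an assumption about the engine, and `strict_last_section` and the third sample name are invented for the example.

```python
import re

# Pattern from the reply above, anchored to the whole input.
STRICT = re.compile(r"^[^-]+ - (?:[^-]+-?[^-]+) - [0-9/]+ - (.+)$")

def strict_last_section(name):
    """Return the fourth section if the name fits the expected layout, else None."""
    m = STRICT.match(name)
    return m.group(1) if m else None

print(strict_last_section("Test - Annual Budget - 2015/1/29 - Spreadsheet"))
# Spreadsheet
print(strict_last_section("Test - Semi-Annual Budget - 2015/1/29 - Spreadsheet - New Document"))
# Spreadsheet - New Document
# Two dashes in the second section exceed the "at most one dash" rule:
print(strict_last_section("Test - Year-End-Review - 2015/1/29 - X"))
# None
```

The trade-off is that names which break the assumed layout produce no match at all, rather than a wrong extraction.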

replied on February 13, 2015

I'm a bit late to this party but I'd like to join in on the fun ~

 

You can also resolve this problem by "starting from the back". This pattern works for me, and is a bit more straightforward (in my opinion):

[^-]*-[^-]*$

This will work even if your input does not contain the date, or if the date is formatted differently (such as with hyphens). The only constraint here is that it uses the last two hyphens as delimiters, starting from the end. That is, it assumes that the text you want is found by locating the second-to-last hyphen and starting there (and going until the end). It also means that if the second-to-last word is "Spread-sheet", or if the document name is "New-Document" or something, this won't work. But you have to define *some* kind of rule for the patterns to work anyways =).

Oh, it also assumes that this is the whole input. That is, you're not finding this in a bunch of text, but rather you are given this input and you are running pattern matching on this whole Test Value as though it were the start-and-end of the input. You could always do a pattern match to capture this particular text and then perform these steps.

Basically what this pattern says is "Get me anything except a dash, then a dash, then anything except a dash, and then make absolute sure that we are at the end of the input right after that". The last [^-]* is important because, unless you are specific about what you want to match, the regex will try to get as big a match as possible. Regexes are both hungry and greedy. Specifically, the last [^-]* ensures that this part of the match really sits between the last hyphen and the end of the input. If you instead used .* for that part, then it would "prefer" to start that part of the match at the character right after the first hyphen, which would result in a valid match and a much bigger one at the same time.

You might notice this also captures an extra space at the start. You can get rid of that by instead using this:

[^\s-]*\s-[^-]*$

That is, "everything except a space or a hyphen, then a space, then a hyphen, then everything except a hyphen, then the end of the input".
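To make the "from the back" behavior concrete, here is a sketch in Python (an assumption; the tool's engine may differ in details). The first pattern, `[^-]*-[^-]*$`, is written from the verbal description above ("anything except a dash, then a dash, then anything except a dash, then the end of the input"); the helper `tail` and the sample names are made up for illustration.

```python
import re

# Written from the verbal description: keeps the space after the earlier hyphen.
WITH_SPACE = re.compile(r"[^-]*-[^-]*$")
# Variant given above: the text before " -" may not contain spaces or hyphens.
NO_SPACE = re.compile(r"[^\s-]*\s-[^-]*$")

def tail(pattern, name):
    """Return the whole match at the end of the name, or None."""
    m = pattern.search(name)
    return m.group(0) if m else None

five = "Test - Semi-Annual Budget - 2015/1/29 - Spreadsheet - New Document"
four = "Test - Annual Budget - 2015/1/29 - Spreadsheet"
hyphenated = "Test - Semi-Annual Budget - 2015/1/29 - Spreadsheet - New-Document"

print(repr(tail(WITH_SPACE, five)))  # ' Spreadsheet - New Document'
print(repr(tail(NO_SPACE, five)))    # 'Spreadsheet - New Document'
# With only four sections, the date comes along too:
print(repr(tail(NO_SPACE, four)))    # '2015/1/29 - Spreadsheet'
# A hyphen without a space before it in the last part breaks the match:
print(repr(tail(NO_SPACE, hyphenated)))  # None
```

Note that the `NO_SPACE` variant stops at the first space going backwards, so it only returns a single word before the final separator. Both patterns always return the last two hyphen-delimited pieces, which matches the constraint stated earlier and is only what the asker wants when the fourth section contains exactly one extra separator.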

If the spaces are "optional", then you can do this:

\s*([^-]*-[^-]*)$

That is, there are zero or more spaces before the first part of the match, and then we only capture the part that doesn't include the space.

 

If you prefer to "visualize" this as capturing part of the whole match, then you could instead look at it this way:

^.*-\s([^-]*-[^-]*)$

That is, "Start of the input, then a bunch of stuff, BUT make sure you also match a hyphen, a space, everything except a hyphen, then a hyphen, then everything except a hyphen, and then the end of the input. Oh, and make sure you capture everything starting right after that first hyphen+space, all the way to the end of the input".

Different way to visualize the same solution, but it matches your entire input and only captures a specific part.
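The two capturing variants can be compared side by side with a short sketch. It is written in Python on the assumption that the tool's engine handles these constructs the same way; the constant names are invented for the example.

```python
import re

# Variant with optional leading spaces kept outside the capture group.
OPTIONAL_SPACES = re.compile(r"\s*([^-]*-[^-]*)$")
# Variant that matches the entire input and captures only the tail.
WHOLE_INPUT = re.compile(r"^.*-\s([^-]*-[^-]*)$")

name = "Test - Semi-Annual Budget - 2015/1/29 - Spreadsheet - New Document"

captured_a = OPTIONAL_SPACES.search(name).group(1)
captured_b = WHOLE_INPUT.match(name).group(1)

print(captured_a)  # Spreadsheet - New Document
print(captured_b)  # Spreadsheet - New Document

# The second variant's full match is the entire name, not just the tail.
print(WHOLE_INPUT.match(name).group(0) == name)  # True
```

Which one to use mostly depends on whether the tool returns the whole match or the first capture group.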

 

Regexes are fun ~

 

 
