Question

pattern match to capture what is after a string, in the middle of a string...

asked on August 6, 2015

Hi:

How can I get pattern matching to find a long string in a large area of data, and return only what is on the line directly after the string?

I can get the following address by using this: 

But now I want it to also grab the line immediately after it (which will sometimes be Grader Shed and sometimes Administration Building, or something like that) and nothing after those two words.  Either of the two following options would be acceptable for the multi-value field:

     1)  Can you show me how to get it to grab the two lines together and ignore everything that follows (there is a lot that follows in what it has to capture), or

     2)  Can you show me how to get it to grab the second line in a second pattern match (see above where I have tried two pattern match configurations in order to get the two lines)?

I thought something like this would work, but obviously not:

Answer

SELECTED ANSWER
replied on August 13, 2015
 
 

Hmmm,

 

I wonder if there are multiple line breaks or other whitespace characters tripping it up that don't show in the test when we cut and paste. Try this; it's a little more robust.

\d\d\d\d \d\d \w\w\w \w+ \w\w \w\w\w \w\w\w[\s|\n|\r|\v|\f|\t]*([^\n]+)
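
To illustrate the guess above, here is a minimal Python sketch rather than anything taken from the actual document. The address and zone text are invented (the real OCR output is not shown in this thread), and the blank line between the address and the next line is only an assumption about what the OCR might be producing.

```python
import re

# The address part of the pattern, exactly as posted in the thread.
ADDRESS = r"\d\d\d\d \d\d \w\w\w \w+ \w\w \w\w\w \w\w\w"

# Earlier attempt: exactly one line-break character, then the capture.
single_break = re.compile(ADDRESS + r"[\n|\r]([^\n]+)")

# More robust version: any run of whitespace (including none), then the capture.
any_whitespace = re.compile(ADDRESS + r"[\s|\n|\r|\v|\f|\t]*([^\n]+)")

# Made-up sample with an extra blank line after the address (an assumption).
sample = "1234 56 Ave Anytown AB T0B 4C0\n\nGrader Shed\nOther text"

strict = single_break.search(sample)    # no match: a second \n follows the first
robust = any_whitespace.search(sample)  # skips both line breaks, then captures

print(strict)
print(repr(robust.group(1)))
```

With this sample the earlier pattern finds nothing, because `[^\n]+` needs at least one non-newline character right after the single line break, while the more robust pattern skips the extra line break and captures only the next line.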

 

 
Replies

replied on August 6, 2015

Lines are delimited by so-called "newline" or "line feed" characters. In regular expressions they're represented by \n. If you combine that with the "not in range" construct, written [^RANGE_HERE], you get [^\n]+, which matches all characters that are not a newline character. So, where (.*) will get all characters until the end of the input data, ([^\n]+) will stop when it finds the first newline character.
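
A small Python sketch of the difference described above. The sample text is made up for illustration, and `re.DOTALL` is used on the assumption that the engine lets `.` match newline characters, which is the behavior this reply describes; engines differ on that by default.

```python
import re

# Hypothetical zone text; the real OCR output is not shown in the thread.
text = "HEADER\nGrader Shed\nlots of other text\nmore text"

# (.*) with DOTALL: the dot also matches newlines, so the capture
# runs all the way to the end of the input.
greedy = re.search(r"HEADER\n(.*)", text, re.DOTALL).group(1)

# ([^\n]+): one or more non-newline characters, so the capture
# stops at the first newline it meets.
one_line = re.search(r"HEADER\n([^\n]+)", text).group(1)

print(repr(greedy))
print(repr(one_line))
```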

replied on August 6, 2015

Hi,

If I understand, you want to return the 2nd line after a matched pattern.

 

To do this you will need to account for a line break or carriage return which is considered a character.

 

Adding in [\n|\r](.*) to the end of your pattern will grab only the information on the second line after the matched pattern.

 

 

Cheers,

Carl

 

replied on August 6, 2015

Awesome, Carl, thanks!  That works perfectly in the Testing spot on the token.

For some reason, the capture result includes everything after the word Grader Shed.  I shortened up the Zone so it doesn't grab as much, but I'm worried about making the zone too short and potentially missing the target. 

I double checked the field to make sure it wasn't filling in the zone OCR result and not the pattern match, but that wasn't it.  I also removed some extra brackets that I thought might be confusing things.  Still not stopping after Grader Shed.

replied on August 7, 2015

Oh, OK. We can be more specific and fix that by using what Miruna posted, so it only grabs up to the end of the line.

 

 

 

replied on August 7, 2015
\d\d\d\d \d\d \w\w\w \w+ \w\w \w\w\w \w\w\w[\n|\r]([^\n]+)

 

replied on August 10, 2015

Thanks, Carl.  I did try that last week and it works in the testing spot, but in this case, when running it on the actual document, the pattern match is pulling up nothing and the field is populating with everything the Zone OCR is capturing.

I tried it again this morning with your exact code, in case I had not translated Miruna's correctly, and got the same result.

This is the Zone area it is capturing from and I've tried increasing the range and making the range tighter, with no change:

Thx, Connie

replied on August 14, 2015

Hey, that worked!  Awesome!

So, the first grouping is telling it, there could be a space, new line, carriage return, different tabs or form feed characters and to allow those and move on.

Then the asterisk, telling it...  grab everything here?

Then the last grouping which, if I understand correctly is telling it to stop at the next new-line character.  Right?

Awesome; thanks so much Carl!

Connie Prendergast, Flagstaff County

replied on August 14, 2015

Yep you got it! Here's a bit more detail.

 

1. When the following match applies:
         \d\d\d\d \d\d \w\w\w \w+ \w\w \w\w\w \w\w\w
     - followed by any one character from the set in [ ], which lists:
           - white space  - \s
           - new lines  - \n
           - carriage returns  - \r
           - vertical tabs  - \v
           - form feeds  - \f
           - tabs  - \t
     - in any combination (inside [ ] the | is just a literal pipe character, not an "or"; the brackets already mean "any one of these", and \s on its own covers the rest of the list)
     - and this set can repeat 0 or more times  - *
2. THEN RETURN  -- whatever is matched between ( )
      - get characters which ARE NOT a \n (new line)  -- [^\n]
      - and grab one or more of them in a row, stopping at the next new line  -- +
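
The breakdown above can be checked with a short Python sketch. The sample address and text are invented for illustration, and the shorter `\s*` variant is not from the thread; it is included on the understanding that `\s` already matches new lines, carriage returns, tabs, vertical tabs, and form feeds.

```python
import re

# The address part of the pattern, exactly as posted in the thread.
ADDRESS = r"\d\d\d\d \d\d \w\w\w \w+ \w\w \w\w\w \w\w\w"

# Pattern as posted, and a shorter variant (an assumption, not from the thread).
as_posted = re.compile(ADDRESS + r"[\s|\n|\r|\v|\f|\t]*([^\n]+)")
shorter = re.compile(ADDRESS + r"\s*([^\n]+)")

# Made-up sample: trailing space and tab, then two line breaks, then the target line.
sample = "1234 56 Ave Anytown AB T0B 4C0 \t\n\nAdministration Building\nOther text"

# Step 1 skips the run of whitespace; step 2 returns the next line only.
first = as_posted.search(sample).group(1)
second = shorter.search(sample).group(1)

print(repr(first))
print(repr(second))
```

Both patterns return the same capture for this sample; they would only differ if a literal | character appeared right after the address.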

 

Here's a site I use. It's a less friendly but completely free version of www.regexbuddy.com that spells these out in English because, let's face it, these can look simple but are not easy to read for those of us who only dip our toes in.

 

Cheers!

Carl

 

 

 
