
Question


Pattern Matching Inconsistent Results

asked on June 3, 2015

I'm trying to use pattern matching to parse out the project name from the entire page of text. Here is an example of what I'm trying to pull out (see highlighted text). 

For the Pattern Matching, I have it search the text and capture everything between "project known as" and "in the City of Shafter".

See Pattern Matching Setup Below

Out of 32 documents it only worked on 16 and I can't seem to determine why. 

Here is one where it worked, along with what was captured in the output:

Created document of class Notice of Completion.
Project Name Token : "Tract 5610 Unit 
Two" 
Document Number : inunuimnu Fees.
.
Recorded Date : ATTI
ages: I
4/19/1999
Added page 1.
Added page 2.

And one that is very similar that didn't work:

Created document of class Notice of Completion.
Project Name Token : 
Document Number : INIYYIIIIAAN� ._.
Recorded Date : 3 ATT I
)ages: I
Added page 1.
Added page 2.
Added page 3.

 

I would greatly appreciate any help!


Answer

SELECTED ANSWER
replied on June 4, 2015

I concur with this. However I like to use \s* because sometimes the OCR engine will remove the space as well. 

 

In fact, because certain characters will be mis-OCR'd I'd use the following:

pr.ject\s*kn.wn\s*a.\s*(.*)\s*.n\s*the\s*C.ty\s*.f\s*.hafter

 

I try not to match the characters that are most often mis-OCR'd: the lowercase "i", "l", "o", "s", the digit "1", and the capitals "B", "D", "O", "S". If possible, I avoid them, or I use something like [S5] or [B8] or [il1] in place of them in my matches.

This doesn't work as well if you are trying to capture those words - but if you are just using them to help locate the text you actually want it doesn't matter if those words are mis-OCR'd. 
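
To make the idea concrete, here is a minimal sketch of the tolerant pattern in Python's `re` module. This is an assumption for illustration only: the pattern matching engine in the product may treat `.` and line breaks differently, so the `re.DOTALL` flag, the lazy `(.*?)` group, the helper name `extract_project_name`, and the mis-OCR'd sample text are all hypothetical.

```python
import re

# "." stands in for letters that OCR often misreads, \s* tolerates missing
# or extra whitespace (including line breaks), and (.*?) captures the name.
# re.DOTALL lets the captured text itself span a line break (Python-specific).
PATTERN = re.compile(
    r"pr.ject\s*kn.wn\s*a.\s*(.*?)\s*.n\s*the\s*C.ty\s*.f\s*.hafter",
    re.DOTALL,
)

def extract_project_name(ocr_text):
    """Return the text between the two anchor phrases, or None if not found."""
    match = PATTERN.search(ocr_text)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces inside the captured name.
    return " ".join(match.group(1).split())

# Hypothetical OCR output with a dropped space and several misread characters.
sample = "pr0ject knownas Tract 5610 Unit\nTwo 1n the C1ty of 5hafter"
print(extract_project_name(sample))
```

The anchors here only locate the name, so it does not matter that "project", "City" and "Shafter" were misread; the captured name itself is returned exactly as OCR'd, apart from whitespace cleanup.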

 


Replies

replied on June 3, 2015

Do all the cases where it fails have a line break somewhere within the "project known as" or "in the City of Shafter" text?

replied on June 3, 2015

Can you post the OCR'd text snippet from one that doesn't work? You can find it by turning on the text view once the document has been processed. My guess is that one of the spaces in your regular expression phrase is actually registering as two spaces once OCR'd. I would try replacing your spaces with \s+ and see if it works better. Without seeing the actual OCR text it is hard to say, though.

Basically, change "project known as" to the following, and do the same for every other space you have in the regex:

project\s+known\s+as
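
As a rough illustration of the difference, the sketch below compares a literal-space pattern with the `\s+` version in Python's `re` module. The sample text is made up, and the `re.DOTALL` flag and lazy `(.*?)` capture are Python-specific assumptions, not necessarily how the pattern matching activity behaves.

```python
import re

# Hypothetical OCR output: a line break splits "known as" and the name,
# and a double space appears inside "City  of".
ocr_text = "the project known\nas Tract 5610 Unit\nTwo in the City  of Shafter"

# Literal single spaces: fails as soon as OCR emits a newline or extra space.
strict = re.compile(r"project known as (.*?) in the City of Shafter", re.DOTALL)

# \s+ accepts any run of whitespace (spaces, tabs, line breaks) between words.
tolerant = re.compile(
    r"project\s+known\s+as\s+(.*?)\s+in\s+the\s+City\s+of\s+Shafter",
    re.DOTALL,
)

strict_match = strict.search(ocr_text)
tolerant_match = tolerant.search(ocr_text)

print(strict_match)  # None
# Collapse the line break inside the captured name into a single space.
project_name = " ".join(tolerant_match.group(1).split())
print(project_name)  # Tract 5610 Unit Two
```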

 

replied on June 3, 2015

I'm not sure what you mean by "not getting the text to work with". I expect that in all the failed cases there is some character (such as a line break or an extra space) between words in the OCRed text for "project known as" and "in the City of Shafter". Since your pattern looks for those exact phrases with only a single space between words, any extra character between any of the words will prevent a match from being found.

To fix this, you'll want to follow John's advice, and change the pattern so that it matches even if there are line breaks or extra spaces between the words. 
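
If there are many phrases to loosen, the substitution can be done mechanically. The following is only a sketch in Python; the helper names `flexible` and `between` are hypothetical, and the regex dialect of the pattern matching activity may differ from Python's `re`.

```python
import re

def flexible(phrase):
    """Build a regex for a literal phrase that tolerates any whitespace
    (extra spaces, tabs, line breaks) between its words."""
    return r"\s+".join(re.escape(word) for word in phrase.split())

def between(start_phrase, end_phrase):
    """Compile a pattern capturing whatever lies between two anchor phrases."""
    return re.compile(
        flexible(start_phrase) + r"\s*(.*?)\s*" + flexible(end_phrase),
        re.DOTALL,  # let the captured text span line breaks
    )

pattern = between("project known as", "in the City of Shafter")

# Hypothetical OCR text with a double space and two line breaks in the anchors.
text = "project  known\nas Tract 5610 Unit\nTwo in the\nCity of Shafter"
found = pattern.search(text)
captured = " ".join(found.group(1).split()) if found else None
print(captured)
```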

replied on June 4, 2015

Thank you all for your help. 

In case this is helpful to anyone else down the line, it was the line break issue. The regex I ended up using had \s* between all the words, and it worked. Thank you!

replied on June 3, 2015

@Tessa Adair, 14 of the 16 failed cases had a line break somewhere within the "project known as" or "in the City of Shafter" text.

For the 2 exceptions, when I try to look at the OCR'd text, they don't have any. I am using documents previously OCR'd in Adobe. I went back and tried to re-OCR the offending documents and still am not getting the text to work with.
