Question

Regex to LF regex

asked on October 23, 2019

I'll admit that regex is not my strength. I've been working on the following regex, which is meant to use a lookahead to limit the string to at most 16 characters and then check whether an SSN is present according to the following criteria.

^(?=.{9,16}$)(\d{3}+[ |-]{1,3}+\d{2}+[ |-]{1,3}+\d{4})$

Based on my assessment of the LF pages in our repository, I've seen various stray characters that keep simple SSN searches from reaching a high degree of accuracy. \d{3}-\d{2}-\d{4} works, but if the scanned document or converted LF page comes up with a space before or after a dash, it fails. I would like to use the regex above, if possible, to reduce the misses.

I have come up with the following, but I would like to add a lookahead to make certain that the string taken into consideration is at most 16 characters long. At times we also get false positives, which I'm trying to avoid.

(\d{3})+([ |-]{1,3})+(\d{2})+([ |-]{1,3})+(\d{4}) (this works, but doesn't limit length)

Could someone please tell me what the problem is with my regex? Also, if someone has a better/more robust way of doing it, I'd be very appreciative for that information as well.

Replies

replied on October 23, 2019

Hi James,

Here are a few resources that may be of help:

  1. Regular Expressions Cookbook: 4.12. Validating Social Security Numbers
  2. Validating Social Security Numbers through Regular Expressions
  3. How to ignore white space with regex?

It would be helpful to know the context in which you are performing this search. Is it in a workflow? Can you potentially drop out all spaces from the OCR'd text before running the search?

replied on October 23, 2019

Hi Samuel,

Thanks for the reply. It is in a workflow where I'm trying to use auto notation to redact sensitive information. My first stop was SSNs. I used an online regex tester and I thought I had it working, but LF workflow says I'm out of my mind.

The process is to use auto-OCR to create text for all documents on a nightly basis. Once that is complete, my plan is to assign a security tag to all documents that contain some type of sensitive information. Once I prove out the process and its accuracy, I'll redact the information from those documents and allow only certain groups to access them, since redaction applies only to the LF pages, not to the document itself.

Dropping all spaces may work, I suppose.
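
For illustration, dropping the spaces before searching could look roughly like the sketch below. It is written in Python rather than as a Laserfiche Workflow activity, and the names SSN_STRICT, drop_spaces and find_ssn_candidates are made up for this example. It assumes the OCR text is available as a plain string.

```python
import re

# Hypothetical names throughout; this is a sketch, not a Laserfiche API.
# Strict pattern: 3-2-4 digits separated by single dashes, and not
# embedded in a longer run of digits.
SSN_STRICT = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def drop_spaces(text):
    """Remove spaces and tabs but keep line breaks."""
    return re.sub(r"[ \t]+", "", text)


def find_ssn_candidates(ocr_text):
    """Return SSN-shaped strings found after the spaces are dropped."""
    return SSN_STRICT.findall(drop_spaces(ocr_text))


sample = "SSN: 123 - 45 -6789\nPhone 555-1234"
print(find_ssn_candidates(sample))  # ['123-45-6789']
```

One caveat with this approach: character positions in the space-free text no longer line up with the original page text, so it is probably better suited to deciding which documents get the security tag than to placing the redactions themselves.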

replied on October 28, 2019

The restriction to 9-16 chars should be redundant given the 9 digits and up to 6 separator characters, shouldn't it? I also don't understand what most of your `+` quantifiers are doing. Shouldn't the first group be exactly 3 digits, once, rather than at least once?
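
To make that concrete, here is a rough sketch of the pattern with the extra `+` quantifiers removed, shown in Python only for illustration (SSN_LOOSE and find_loose are made-up names, and the pattern has not been tried in Laserfiche Workflow). Two further assumptions are baked in. First, the separator class is written `[ -]` because inside a character class `|` is a literal pipe, so `[ |-]` would also accept `|` as a separator. Second, the `^`/`$` anchors and the length lookahead are replaced by "not next to another digit" checks, on the guess that the pattern is run against a whole page of text rather than a single isolated token. A quantifier stacked on another quantifier, such as `{3}+`, is treated as possessive by some engines and rejected by others, so leaving it out avoids that difference.

```python
import re

# Hypothetical name; a sketch of the simplified pattern, not a drop-in
# Laserfiche Workflow expression.
SSN_LOOSE = re.compile(
    r"(?<!\d)"    # not preceded by a digit
    r"\d{3}"      # exactly three digits, once
    r"[ -]{1,3}"  # one to three spaces or dashes
    r"\d{2}"      # exactly two digits
    r"[ -]{1,3}"  # one to three spaces or dashes
    r"\d{4}"      # exactly four digits
    r"(?!\d)"     # not followed by a digit
)


def find_loose(text):
    """Return every SSN-shaped substring in the text."""
    return [m.group(0) for m in SSN_LOOSE.finditer(text)]


page = "Name: J. Doe  SSN 123 - 45 -6789  Acct 1234-56-78901  Alt 987-65-4321"
print(find_loose(page))  # ['123 - 45 -6789', '987-65-4321']
```

Any match is between 11 and 15 characters long (9 digits plus 2 to 6 separator characters), which is why a separate length check adds nothing.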
