Question

Too large of a regular expression?

asked on March 4, 2022

Is there a limit to how many characters of RegEx that the Pattern Matching activity can handle?

The regex I need is 965 characters long. It has grown to that size over time, and the latest iteration causes the workflow instance to freeze and NOT terminate for at least 10 minutes. It times out several times and then eventually terminates. The only changes made have been to the complexity of the regex in Pattern Matching.

I have attempted to use this technique and that also will not terminate the workflow. 

I think of regex as a low-bandwidth type of processing, so I am a little confused by this issue.

Thanks

 

0 0

Replies

replied on March 8, 2022

I am unsure which specific regular expression algorithm Laserfiche uses. If it is DFA-based, the system can require up to O(2^m) memory, where m is the length of the regular expression, but it runs in O(n) time, where n is the length of your input string. If it is NFA-based, the memory issue goes away, but the run time increases to O(mn). In either case, with a 965-character pattern, I would suggest breaking it down into smaller patterns whose results you can concatenate later with tokens. Regular expressions can be very efficient with small patterns and strings, but they can become overwhelmed quickly.
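As a rough illustration of the "smaller patterns" idea, here is a minimal Python sketch. It is not Laserfiche code: in Workflow this would presumably be several Pattern Matching activities whose results are combined with tokens. The two patterns and the `first_match` helper are hypothetical stand-ins, not pieces of the regex from this thread.

```python
import re

# Hypothetical, simplified stand-ins for the alternatives of one large pattern.
SMALL_PATTERNS = [
    re.compile(r"(\d{2}/\d{2}/\d{4})"),   # a date such as 02/28/2022
    re.compile(r"Check No\.\s+(\d+)"),    # a check number
]

def first_match(text):
    """Try each small pattern in order and return the first captured value."""
    for pattern in SMALL_PATTERNS:
        m = pattern.search(text)
        if m:
            return m.group(1)
    return None

print(first_match("Check No.       410457"))
print(first_match("Check Date      02/28/2022"))
```

Each small pattern can be tested and timed on its own, which also makes it easier to see which alternative is the slow one.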

1 0
replied on March 8, 2022

This is all true in the abstract, but the types of pattern-matching tasks that a regex would be used for in Workflow are typically of low computational complexity. What is more likely is that the regular expression is written inefficiently and it could be modified to run in a reasonable amount of time. Something as simple as turning a quantifier from greedy to reluctant can have a big impact.
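To show what the greedy versus reluctant difference looks like, here is a small Python sketch. Python's `re` module is used only for illustration and is not the engine Workflow runs on; the sample string is made up.

```python
import re

line = "<a> <b> <c>"

# Greedy: .+ grabs as much as it can, then backs off until the rest matches.
greedy = re.search(r"<(.+)>", line)
# Reluctant: .+? grows one character at a time and stops at the first fit.
reluctant = re.search(r"<(.+?)>", line)

print(greedy.group(1))     # a> <b> <c
print(reluctant.group(1))  # a
```

On long inputs the greedy form first runs to the end of the line and then backtracks, so besides changing what is captured it can also change how much work the engine does.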

Can you share the regex and the text that it times out on?

1 0
replied on March 9, 2022

I received similar advice from my VAR on splitting it into smaller chunks. We are actually pursuing an additional Quick Fields license as an alternative.

 

If you remove one of the alternatives after an or ( | ), it runs quickly. The last | I added is what pushed it too far.

RegEx:

.+(?:[Acount]{7}\s[amount]{6})\s+\r\n([\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/]{1,25})\s0\s+[\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/]{1,40}\s+\d{2}\/\d{2}\/\d{4}|.+(?:[Acount]{7}\s[amount]{6})\s+\r\n([\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/]{1,25})\s\d{10}\s+[\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/]{1,40}\s+\d{2}\/\d{2}\/\d{4}|.{40,60}.+\s+\r\n([\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/]{1,25})\s0\s[\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/]{1,40}\s\d{2}\/\d{2}\/\d{4}|.{40,60}.+\s+\r\n([\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/]{1,25})\s\d{10}\s[\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/]{1,40}\s\d{2}\/\d{2}\/\d{4}|\d{2}\/\d{2}\/\d{4}\s[\d.,-]{4,15}\s[\d.,-]{4,15}\s+\r\n([\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/]{1,25})\s\0\s+[\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/]{1,40}\s\s\r\n|\d{2}\/\d{2}\/\d{4}\s[\d.,-]{4,15}\s[\d.,-]{4,15}\s+\r\n([\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/]{1,25})\s\d{10}\s+[\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/]{1,40}\s\s\r\n

 

test text:


  
                                 EDUCATIONAL SERVICE DIST 112                                    
  
    A & J SE000                                  Check No.       410457                                     
    A & J SELECT                                 Check Date      02/28/2022  
                                                 Check Type      Computer       
    PO BOX 789                                                    
    STEVENSON, WA 98648                                           
                                                                   
  
    Vendor Continued Void used                   Check No.       410456      
  
Invoice #               P.O. #         Inv Description               Inv Date                         Gross                    Net  
                        Adjustment Desc                  Adj Amount  Discount Desc                                   Disc Amount   
                                                                     Account Number                               Account Amount  
  
0004                    4002200008     Food and Supplies for Life    01/21/2022                       26.63                  26.63      
                                       Skills in Stevenson. School   
                                       Year 2021-2022.               
  
                                                                     01 E 530 2100 27 5000 2400 0000 0000 0                26.63      
  
0026                    4002200008     Food and Supplies for Life    02/11/2022                       15.06                  15.06      
                                       Skills in Stevenson. School   
                                       Year 2021-2022.               
  
                                                                     01 E 530 2100 27 5000 2400 0000 0000 0                15.06      
  
0038                    4002200008     Food and Supplies for Life    12/03/2021                       21.45                  21.45      
                                       Skills in Stevenson. School   
                                       Year 2021-2022.               
  
                                                                     01 E 530 2100 27 5000 2400 0000 0000 0                21.45      
  
0141                    4002200008     Food and Supplies for Life    12/10/2021                       23.94                  23.94      
                                       Skills in Stevenson. School   
                                       Year 2021-2022.               
  
                                                                     01 E 530 2100 27 5000 2400 0000 0000 0                23.94      
  
800265                  4002200008     Food and Supplies for Life    12/17/2021                       19.06                  19.06      
                                       Skills in Stevenson. School   
                                       Year 2021-2022.               
  
                                                                     01 E 530 2100 27 5000 2400 0000 0000 0                19.06      
  
800265  10/28/21        4002200008     Food and Supplies for Life    10/28/2021                       27.56                  27.56      
                                       Skills in Stevenson. School   
                                       Year 2021-2022.               
  
                                                                     01 E 530 2100 27 5000 2400 0000 0000 0                27.56      
  
                                       CHECK TOTAL                                                   133.70      
  
                        Voucher Continued.....                              
   

 

0 0
replied on March 9, 2022

That last one seems to be missing the starting [.

0 0
replied on March 9, 2022

I am not entirely sure what you are trying to match here. In the above example what data are you trying to partition out? Could you give an example of the expected outcome string?

0 0
replied on March 9, 2022

Justin Caughlan, the invoice number in the first column; there are 6. Thanks!

0 0
replied on March 9, 2022

^(\d*)\s

 

That should get you just the invoice numbers.

0 0
replied on March 9, 2022

Running that regex against that text takes over 1000 ms on my machine, using the regex engine from .NET. That is not fast, but I can't think of a way for that to take 10 minutes on the WF server. Did this exact input cause a timeout? Is it possible that the regex is run enough times to add up to 10 minutes?

Obviously the timeout is the most pressing issue, but readability in a regex is important too. The fragment "[amount]{6}" is kind of unconventional, is there a reason not to just use the literal word "amount"? Is there a way to replace [\d\w\s-!@#$%^&*()_+|~=`{}\[\]:";'<>?,.\/] with character classes, or maybe a negated character set?
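A short Python sketch of both points, for illustration only (Python's `re`, not the Workflow engine). The negated set below assumes the field should never span a line break, which is narrower than the original class because that class includes \s; treat it as a starting point, not a drop-in replacement.

```python
import re

loose = re.compile(r"[amount]{6}")   # any six characters drawn from a, m, o, u, n, t
literal = re.compile(r"amount")      # exactly the word "amount"

print(bool(loose.fullmatch("amount")))    # True
print(bool(loose.fullmatch("nnnnnn")))    # True: the class form accepts this too
print(bool(literal.fullmatch("nnnnnn")))  # False

# A negated set names what to exclude instead of listing every allowed symbol.
# Assumption: 1 to 25 characters that are not line breaks.
field = re.compile(r"[^\r\n]{1,25}")
print(field.match("A & J SELECT\r\nPO BOX 789").group(0))  # A & J SELECT
```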

0 0
replied on March 9, 2022

The fragment for 'amount' is required to successfully omit incorrect strings.

The big string of symbols is in case one of those symbols appears. I tried to use \S but that was not retrieving what I needed. Any other suggestions? Thanks so much. 

0 0
replied on March 9, 2022

Looks like Laserfiche has trouble with the ^ character. \n(\d+)\s seems to work.

0 0
replied on March 9, 2022

^ is the beginning of the input, which in this case would be the top of the text file.
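The behavior can be reproduced in Python, where ^ also matches only at the start of the input unless a multiline flag is set. This is a sketch using a shortened excerpt of the test text with assumed \r\n line endings. .NET has a comparable Multiline option, but whether the Pattern Matching activity exposes or honors it is not confirmed in this thread. The pattern uses \d+ instead of \d* so that lines starting with whitespace do not produce empty matches.

```python
import re

# Shortened excerpt of the test text from the thread (line endings assumed).
text = (
    "Invoice #               P.O. #\r\n"
    "0004                    4002200008     Food and Supplies for Life\r\n"
    "                                       Skills in Stevenson. School\r\n"
    "0026                    4002200008     Food and Supplies for Life\r\n"
    "800265  10/28/21        4002200008     Food and Supplies for Life\r\n"
)

# Without MULTILINE, ^ only matches at the very start of the input.
no_flag = re.findall(r"^(\d+)\s", text)
# With MULTILINE, ^ also matches at the start of every line.
multiline = re.findall(r"^(\d+)\s", text, re.MULTILINE)
# The workaround from the thread anchors on the preceding newline instead.
workaround = re.findall(r"\n(\d+)\s", text)

print(no_flag)     # []
print(multiline)   # ['0004', '0026', '800265']
print(workaround)  # ['0004', '0026', '800265']
```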

0 0
replied on March 9, 2022

Have you looked into using Smart Invoice Capture for this instead of regular expressions? 

1 0
replied on March 9, 2022

Isn't Smart Invoice Capture cloud only?

 

At any rate I am going to come at this from another angle where I need a fraction of the regex. Thanks for the help everyone!!

0 0
replied on March 10, 2022

For now, yes, but it will be available for self-hosted very soon!

0 0
replied on April 30, 2024

@████████ any idea on a timetable for on-prem? I didn't see anything at Empower about it being available for on-prem. I'm looking at it as a cloud add-on to my on-prem system. I would prefer not to have to create a cloud instance to use it.

0 0
replied on May 1, 2024

There is currently no timeline for making smart invoice capture available to self-hosted customers.

0 0
replied on May 1, 2024

Bummer, 2 years ago Tessa said it would be available "very soon".
I'm concerned about the cost of paying for a Cloud instance purely to use smart invoice capture. I know it is the only solution right now, but it would be nice to have a specific package purely for us on-prem customers to use smart invoice capture. I know ABBYY has a similar product that would work with on-prem systems where I would not have to have a separate cloud repository. 

0 0