You are viewing limited content. For full access, please sign in.

Question

Question

Pattern Matching Hangs in workflow on a multi-line regex

asked on March 17, 2017

I've got some text in multiple lines that I want to capture. Using the 'testing area' I'm able to see that it should process fine. However, when I run the workflow, it gets stuck at this pattern (I had all the patterns in one pattern matching step but split them out to see which one was causing the issue). 

 

The workflow will never terminate, it will stay 'running' but will never do anything past that pattern. 

The pattern regex is: Caller Address:\s*((.|\s*)*)Email Address:

Test text is: 

Caller Address: 12345 MAIN ST
NOWHEREVILLE, MB T2T 2T2
Email Address: email@company.com

Result value shows: (which is perfect)

12345 MAIN ST
NOWHEREVILLE, MB T2T 2T2

 

What am I missing here? Issue with WFS? 

 

Thanks in advance for any help. 

Pattern Issue.JPG
0 0

Replies

replied on March 17, 2017

Thanks Ryan. I did some more digging and found that [\s\S]*? will work. 

 

How would you implement your multi part solution? Have the pattern run through it twice and use the values to combine?

1 0
replied on March 18, 2017 Show version history

For your address try this regexp:  (?s)Caller Address:\s(.*)\r\nEmail

That should grab all the lines between the 'Caller Address: ' and the Email line and remove the last return/line feed.

Andrew

 

1 0
replied on March 20, 2017

Thanks Andrew, I will give this a try, very much appreciated. 

0 0
replied on March 17, 2017 Show version history

I am not well versed in regular expressions, but I believe that the nesting quantifier might be the cause of the problem:

https://msdn.microsoft.com/en-us/library/3206d374(v=vs.110).aspx

In you example, the culprit would be the following part of the regular expression:

(.|\s*)*

As explained in the MSDN article linked above, if the asterisk is nested within another asterisk, it will cause an exponential increase in execution time.

The quantifiers that are affected by this issue are:

  • "Match zero or more times" (asterisk)
  • "Match one or more times" (plus)

 

The reason is that

  • Quantifiers are implemented using backtracking.
    See: https://msdn.microsoft.com/en-us/library/dsy130b4(v=vs.110).aspx
  • .NET places the responsibility for an regular expression's performance on the person who constructs the regular expression.
    That rationale is explained in the article on backtracking linked above.
  • Nested quantifiers without an upper bound trigger an exponential increase in execution time.
  • The two quantifiers affected by these two issues (asterisk and plus) are without an upper bound. Thus, even a string of ordinary length will cause a performance impact.

 

Here is a benchmark program I used to learn about this problem:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Diagnostics;

namespace Regex_LFAnswers_117431
{
    class Program
    {
        static void WriteLine(string s)
        {
            Console.WriteLine(s);
            Trace.WriteLine(s);
        }

        /// <summary>
        /// <para>
        /// Realistically, OCR result strings (whether whole page or zone OCR) may contain 
        /// plenty of extra white space.
        /// </para>
        /// <para>
        /// If one of the regular expressions are susceptible to the "exponential 
        /// backtracking problem", and if that regular expression's susceptibility applies 
        /// to whitespaces as a character class, then we can trigger a performance problem
        /// (hang) as the number of whitespaces is increased.
        /// </para>
        /// <para>
        /// For example, setting this to 25 will take approximately one lunchtime (one hour)
        /// on my computer.
        /// </para>
        /// </summary>
        static int NumExtraSpaces
        {
            get;
        } = 14;

        /// <summary>
        /// <para>
        /// Cause of performance problem.
        /// </para>
        /// <para>
        /// The nesting of a non-upper-bounded whitespace quantifier inside another 
        /// non-upper-bounded qualifier is the cause of the problem.
        /// </para>
        /// </summary>
        static Regex BadRegex
        {
            get;
        } = new Regex(@"Caller Address:\s*((.|\s*)*)Email Address:", RegexOptions.Singleline);

        /// <summary>
        /// <para>
        /// Solution to the performance problem.
        /// </para>
        /// <para>
        /// Given the nature of the task (the original intent of the problematic regex),
        /// it can be easily replaced by a zero-or-more wildcard match.
        /// </para>
        /// <para>
        /// To handle the issue of leading and trailing whitespaces (trimming of each line), 
        /// we separate that into a second regex step.
        /// </para>
        /// <para>
        /// In general, simplification of nested regex by separating tasks into multiple 
        /// steps is a good way to fix a regex performance problem.
        /// </para>
        /// </summary>
        static Regex GoodRegex_StepOne
        {
            get;
        } = new Regex(@"Caller Address:(.*)Email Address:", RegexOptions.Singleline);

        /// <summary>
        /// <para>
        /// One way of removing leading and trailing whitespaces from each line.
        /// </para>
        /// <para>
        /// This is one of the three variations that seem to work correctly.
        /// </para>
        /// </summary>
        static Regex StepTwo_VariantOne
        {
            get;
        } = new Regex(@"^\s*(.+?)\s*$", RegexOptions.Multiline);

        /// <summary>
        /// A second way of removing leading and trailing whitespaces from each line.
        /// </summary>
        static Regex StepTwo_VariantTwo
        {
            get;
        } = new Regex(@"^\s*([^\r\n]+)\s*$", RegexOptions.Multiline);

        /// <summary>
        /// A third way of removing leading and trailing whitespaces from each line.
        /// </summary>
        static Regex StepTwo_VariantThree
        {
            get;
        } = new Regex(@"\s*([^\r\n]+)\s*([\r\n]+\s*([^\r\n]+)\s*)*", RegexOptions.Singleline);

        /// <summary>
        /// Test program.
        /// </summary>
        static void Main(string[] args)
        {
            // ======
            string extraString = new string(' ', NumExtraSpaces);
            // ======
            string ocrString = "Caller Address: 12345 MAIN ST\r\n NOWHEREVILLE, MB T2T 2T2\r\n Email Address: email@company.com";
            ocrString = extraString + ocrString + extraString;
            // ======
            Regex regexOne = BadRegex;
            //Regex stepOne = GoodRegex_StepOne;
            // ======
            Regex regexTwo = StepTwo_VariantOne;
            //Regex stepTwo = StepTwo_VariantTwo;
            //Regex stepTwo = StepTwo_VariantThree;
            // ======
            List<string> resultsOne = new List<string>();
            List<string> resultsTwo = new List<string>();
            // ======
            Stopwatch sw = Stopwatch.StartNew();
            // ======
            foreach (Match m1 in (MatchCollection)(regexOne.Matches(ocrString)))
            {
                string v1 = m1.Groups[1].Value;
                resultsOne.Add(v1);
                foreach (Match m2 in (MatchCollection)(regexTwo.Matches(v1)))
                {
                    string v2 = m2.Groups[1].Value;
                    resultsTwo.Add(v2);
                }
            }
            // ======
            sw.Stop();
            // ======
            WriteLine("Matches from first Regex step:");
            for (int k1 = 0; k1 < resultsOne.Count; ++k1)
            {
                string v1 = resultsOne[k1];
                v1 = v1.Replace("\r", "\\r");
                v1 = v1.Replace("\n", "\\n");
                WriteLine($"s1[{k1}] = ^{v1}$");
            }
            // ======
            WriteLine("Matches from second Regex step:");
            for (int k2 = 0; k2 < resultsTwo.Count; ++k2)
            {
                string v2 = resultsTwo[k2];
                v2 = v2.Replace("\r", "\\r");
                v2 = v2.Replace("\n", "\\n");
                WriteLine($"s2[{k2}] = ^{v2}$");
            }
            // ======
            WriteLine($"Total time spent in Regex: {sw.ElapsedMilliseconds} msecs.");
            WriteLine("");
        }
    }
}

 

0 0
You are not allowed to follow up in this post.

Sign in to reply to this post.