I am not well versed in regular expressions, but I believe that the nesting quantifier might be the cause of the problem:
https://msdn.microsoft.com/en-us/library/3206d374(v=vs.110).aspx
In you example, the culprit would be the following part of the regular expression:
(.|\s*)*
As explained in the MSDN article linked above, if the asterisk is nested within another asterisk, it will cause an exponential increase in execution time.
The quantifiers that are affected by this issue are:
- "Match zero or more times" (asterisk)
- "Match one or more times" (plus)
The reason is that
- Quantifiers are implemented using backtracking.
See: https://msdn.microsoft.com/en-us/library/dsy130b4(v=vs.110).aspx
- .NET places the responsibility for an regular expression's performance on the person who constructs the regular expression.
That rationale is explained in the article on backtracking linked above.
- Nested quantifiers without an upper bound trigger an exponential increase in execution time.
- The two quantifiers affected by these two issues (asterisk and plus) are without an upper bound. Thus, even a string of ordinary length will cause a performance impact.
Here is a benchmark program I used to learn about this problem:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Diagnostics;
namespace Regex_LFAnswers_117431
{
class Program
{
static void WriteLine(string s)
{
Console.WriteLine(s);
Trace.WriteLine(s);
}
/// <summary>
/// <para>
/// Realistically, OCR result strings (whether whole page or zone OCR) may contain
/// plenty of extra white space.
/// </para>
/// <para>
/// If one of the regular expressions are susceptible to the "exponential
/// backtracking problem", and if that regular expression's susceptibility applies
/// to whitespaces as a character class, then we can trigger a performance problem
/// (hang) as the number of whitespaces is increased.
/// </para>
/// <para>
/// For example, setting this to 25 will take approximately one lunchtime (one hour)
/// on my computer.
/// </para>
/// </summary>
static int NumExtraSpaces
{
get;
} = 14;
/// <summary>
/// <para>
/// Cause of performance problem.
/// </para>
/// <para>
/// The nesting of a non-upper-bounded whitespace quantifier inside another
/// non-upper-bounded qualifier is the cause of the problem.
/// </para>
/// </summary>
static Regex BadRegex
{
get;
} = new Regex(@"Caller Address:\s*((.|\s*)*)Email Address:", RegexOptions.Singleline);
/// <summary>
/// <para>
/// Solution to the performance problem.
/// </para>
/// <para>
/// Given the nature of the task (the original intent of the problematic regex),
/// it can be easily replaced by a zero-or-more wildcard match.
/// </para>
/// <para>
/// To handle the issue of leading and trailing whitespaces (trimming of each line),
/// we separate that into a second regex step.
/// </para>
/// <para>
/// In general, simplification of nested regex by separating tasks into multiple
/// steps is a good way to fix a regex performance problem.
/// </para>
/// </summary>
static Regex GoodRegex_StepOne
{
get;
} = new Regex(@"Caller Address:(.*)Email Address:", RegexOptions.Singleline);
/// <summary>
/// <para>
/// One way of removing leading and trailing whitespaces from each line.
/// </para>
/// <para>
/// This is one of the three variations that seem to work correctly.
/// </para>
/// </summary>
static Regex StepTwo_VariantOne
{
get;
} = new Regex(@"^\s*(.+?)\s*$", RegexOptions.Multiline);
/// <summary>
/// A second way of removing leading and trailing whitespaces from each line.
/// </summary>
static Regex StepTwo_VariantTwo
{
get;
} = new Regex(@"^\s*([^\r\n]+)\s*$", RegexOptions.Multiline);
/// <summary>
/// A third way of removing leading and trailing whitespaces from each line.
/// </summary>
static Regex StepTwo_VariantThree
{
get;
} = new Regex(@"\s*([^\r\n]+)\s*([\r\n]+\s*([^\r\n]+)\s*)*", RegexOptions.Singleline);
/// <summary>
/// Test program.
/// </summary>
static void Main(string[] args)
{
// ======
string extraString = new string(' ', NumExtraSpaces);
// ======
string ocrString = "Caller Address: 12345 MAIN ST\r\n NOWHEREVILLE, MB T2T 2T2\r\n Email Address: email@company.com";
ocrString = extraString + ocrString + extraString;
// ======
Regex regexOne = BadRegex;
//Regex stepOne = GoodRegex_StepOne;
// ======
Regex regexTwo = StepTwo_VariantOne;
//Regex stepTwo = StepTwo_VariantTwo;
//Regex stepTwo = StepTwo_VariantThree;
// ======
List<string> resultsOne = new List<string>();
List<string> resultsTwo = new List<string>();
// ======
Stopwatch sw = Stopwatch.StartNew();
// ======
foreach (Match m1 in (MatchCollection)(regexOne.Matches(ocrString)))
{
string v1 = m1.Groups[1].Value;
resultsOne.Add(v1);
foreach (Match m2 in (MatchCollection)(regexTwo.Matches(v1)))
{
string v2 = m2.Groups[1].Value;
resultsTwo.Add(v2);
}
}
// ======
sw.Stop();
// ======
WriteLine("Matches from first Regex step:");
for (int k1 = 0; k1 < resultsOne.Count; ++k1)
{
string v1 = resultsOne[k1];
v1 = v1.Replace("\r", "\\r");
v1 = v1.Replace("\n", "\\n");
WriteLine($"s1[{k1}] = ^{v1}$");
}
// ======
WriteLine("Matches from second Regex step:");
for (int k2 = 0; k2 < resultsTwo.Count; ++k2)
{
string v2 = resultsTwo[k2];
v2 = v2.Replace("\r", "\\r");
v2 = v2.Replace("\n", "\\n");
WriteLine($"s2[{k2}] = ^{v2}$");
}
// ======
WriteLine($"Total time spent in Regex: {sw.ElapsedMilliseconds} msecs.");
WriteLine("");
}
}
}