Fix incorrect atomic loop optimization when body contains backtracking#124254
Fix incorrect atomic loop optimization when body contains backtracking#124254stephentoub merged 1 commit intodotnet:mainfrom
Conversation
MakeLoopAtomic wraps a loop in an Atomic group, which discards all backtracking state from the body. However, CanBeMadeAtomic only proves that giving back iterations is unnecessary... it does not account for within-body backtracking. When the body itself contains backtracking constructs, the atomic wrapper incorrectly suppresses that internal backtracking, producing wrong match results. The fix adds a MayContainBacktracking check that walks the loop body tree before allowing MakeLoopAtomic. A loop is only made atomic when the body has no backtracking constructs of its own. The backtracking-construct detection logic is extracted into a reusable IsBacktrackingConstruct property on RegexNode, which RegexTreeAnalyzer now also uses, removing the duplicated switch. Tests are added covering lazy/greedy single-char loops, char-class loops, alternations, and general multi-char loops inside outer loop bodies, as well as cases where the optimization should still apply (non-backtracking bodies).
|
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions |
There was a problem hiding this comment.
Pull request overview
Fixes an unsound regex optimization where wrapping certain loops in an Atomic group could incorrectly suppress within-body backtracking (not just “giving back” loop iterations), leading to wrong match results. The PR adds a conservative check to ensure a loop body has no intrinsic backtracking constructs before applying MakeLoopAtomic, and refactors backtracking-construct detection into a reusable RegexNode property.
Changes:
- Add a subtree walk (
MayContainBacktracking) to blockMakeLoopAtomicwhen the loop body contains backtracking constructs. - Extract backtracking-construct detection into
RegexNode.IsBacktrackingConstructand reuse it fromRegexTreeAnalyzer. - Add regression tests covering nested-loop cases where atomic wrapping must not suppress inner backtracking.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Match.Tests.cs | Adds regression cases for loops whose bodies require backtracking to produce correct matches. |
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexTreeAnalyzer.cs | Replaces a duplicated switch with RegexNode.IsBacktrackingConstruct. |
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs | Adds IsBacktrackingConstruct + MayContainBacktracking and gates MakeLoopAtomic accordingly. |
Comments suppressed due to low confidence (1)
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs:2612
- The MayContainBacktracking summary/remarks mention “variable-width loops”, but the detection is based on
IsBacktrackingConstructwhich keys offM != N(variable iteration count) and alternation. Consider rewording the summary to avoid implying it’s about width rather than iteration variability.
/// <summary>
/// Checks whether a node tree may contain backtracking constructs (variable-width loops or alternations).
/// </summary>
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Show resolved
Hide resolved
|
/ba-g failures are unrelated |
|
/backport to release/10.0 |
|
Started backporting to |
|
Here's the repro app used to validate the servicing fix: // Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// Repro for: https://github.com/dotnet/runtime/pull/124254
// Main PR: https://github.com/dotnet/runtime/pull/124254
// Servicing: https://github.com/dotnet/runtime/pull/124287 (10.0.4)
// Repro from: https://github.com/dotnet/runtime/pull/124287 (test cases)
//
// MakeLoopAtomic incorrectly wraps loops whose body contains backtracking
// constructs (e.g., lazy quantifiers, alternations). The atomic wrapper
// suppresses within-body backtracking, producing wrong match results.
// Expected: Regex matches the full intended substring.
// Actual: Regex fails to match or matches incorrectly.
using System;
using System.Text.RegularExpressions;
class Repro_124254
{
static int Main()
{
int failures = 0;
// Lazy loop inside loop body: .*? must expand past first ) to find correct closing delimiter
failures += Check(@"\w+(\(.*?\))?,", @"Foo(one()?,two,three),", @"Foo(one()?,two,three),");
// Another variant: lazy dot-star inside optional group
failures += Check(@"\w+(<.*?>)?!", "foo<a>b>!", "foo<a>b>!");
// Greedy loop body with backtracking
failures += Check(@"\w+(\(.+\))?,", "Foo(a),(b)x,", "Foo(a),");
// Alternate inside loop body
failures += Check(@"\w+(\((?:a|a\)b)\))?,", "Foo(a)b),", "Foo(a)b),");
if (failures == 0)
{
Console.WriteLine("PASS: All regex patterns matched correctly");
return 0;
}
Console.WriteLine($"FAIL: {failures} pattern(s) failed");
return 1;
}
static int Check(string pattern, string input, string expected)
{
var match = Regex.Match(input, pattern);
if (!match.Success || match.Value != expected)
{
Console.WriteLine($"FAIL: Pattern '{pattern}' on '{input}'");
Console.WriteLine($" Expected: '{expected}'");
Console.WriteLine($" Actual: '{(match.Success ? match.Value : "(no match)")}'");
return 1;
}
return 0;
}
}Results: repro_124254 — Regex atomic loop optimizationMain PR: #124254 Baseline (expect failure — bug should reproduce)
Validation (expect pass — fix should resolve)
|
MakeLoopAtomic wraps a loop in an Atomic group, which discards all backtracking state from the body. However, CanBeMadeAtomic only proves that giving back iterations is unnecessary... it does not account for within-body backtracking. When the body itself contains backtracking constructs, the atomic wrapper incorrectly suppresses that internal backtracking, producing wrong match results.
The fix adds a MayContainBacktracking check that walks the loop body tree before allowing MakeLoopAtomic. A loop is only made atomic when the body has no backtracking constructs of its own. The backtracking-construct detection logic is extracted into a reusable IsBacktrackingConstruct property on RegexNode, which RegexTreeAnalyzer now also uses, removing the duplicated switch.