Handle Capture nodes in TryGetOrdinalCaseInsensitiveString#124842
Handle Capture nodes in TryGetOrdinalCaseInsensitiveString#124842danmoseley wants to merge 7 commits intodotnet:mainfrom
Conversation
TryGetOrdinalCaseInsensitiveString iterates the children of a Concatenate node to extract an ordinal case-insensitive prefix string. Previously it did not handle Capture or nested Concatenate nodes, causing patterns like \b(in)\b with IgnoreCase to miss the optimal LeadingString_OrdinalIgnoreCase search path and fall through to the slower FixedDistanceSets path. Unwrap Capture nodes transparently and recurse into Concatenate children, matching the behavior of FindPrefixesCore. Co-authored-by: Copilot <[email protected]>
|
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions |
There was a problem hiding this comment.
Pull request overview
This PR improves regex prefix analysis for ordinal case-insensitive searches by making TryGetOrdinalCaseInsensitiveString treat Capture nodes as transparent, enabling more patterns (e.g., \b(in)\b with IgnoreCase) to use the faster leading-string search strategy.
Changes:
- Unwrap
Capturenodes during ordinal ignore-case prefix extraction and handle nestedConcatenatenodes via recursion. - Add unit tests validating that capture groups don’t prevent ordinal ignore-case leading-prefix detection.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs |
Makes prefix extraction skip over Capture nodes and recurse into nested Concatenate nodes to find an ordinal ignore-case prefix. |
src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexFindOptimizationsTests.cs |
Adds test cases ensuring capture groups are transparent to ordinal ignore-case prefix extraction. |
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Outdated
Show resolved
Hide resolved
Add TryEnsureSufficientExecutionStack check before the recursive call in TryGetOrdinalCaseInsensitiveString to safely handle deeply nested capture patterns like ((((ab)))) without risking a stack overflow. Add OuterLoop test exercising 2000-deep capture nesting. Co-authored-by: Copilot <[email protected]>
TryGetOrdinalCaseInsensitiveString is also called from the compiler and source generator (EmitConcatenation). Although TryGetJoinableLengthCheckChildRange currently excludes Capture nodes from the joinable range, add an explicit unwrapCaptures parameter (default false) as defense-in-depth so only the prefix analysis caller opts into Capture unwrapping. Co-authored-by: Copilot <[email protected]>
|
@MihuBot regexdiff |
|
180 out of 18857 patterns have generated source code changes. Examples of GeneratedRegex source diffs"\\b(in)\\b" (658 uses)[GeneratedRegex("\\b(in)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 2 characters.
if (pos <= inputSpan.Length - 2)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_409072BF36F03A4496ACC585815833300ABA306360D979616ACDCED385DDC8FB);
+ // The pattern has the literal "in" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_in_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_409072BF36F03A4496ACC585815833300ABA306360D979616ACDCED385DDC8FB = SearchValues.Create(["IN", "iN", "In", "in"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "in".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_in_OrdinalIgnoreCase = SearchValues.Create(["in"], StringComparison.OrdinalIgnoreCase);
}
}"\\b(from).+(to)\\b.+" (316 uses)[GeneratedRegex("\\b(from).+(to)\\b.+", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 8 characters.
if (pos <= inputSpan.Length - 8)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_DA0DF7757216159252C4FA00AB5982AAA4403D2C43304873401C53E36F92CA04);
+ // The pattern has the literal "from" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_from_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_DA0DF7757216159252C4FA00AB5982AAA4403D2C43304873401C53E36F92CA04 = SearchValues.Create(["FROM", "fROM", "FrOM", "frOM", "FRoM", "fRoM", "FroM", "froM", "FROm", "fROm", "FrOm", "frOm", "FRom", "fRom", "From", "from"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "from".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_from_OrdinalIgnoreCase = SearchValues.Create(["from"], StringComparison.OrdinalIgnoreCase);
}
}"(DATEADD|DATEPART)\\(\\s*(YEAR|Y|YY|YYYY|MON ..." (294 uses)[GeneratedRegex("(DATEADD|DATEPART)\\(\\s*(YEAR|Y|YY|YYYY|MONTH|MM|M|DAYOFYEAR|DY|DAY|DD|D|WEEKDAY|DW|HOUR|HH|MINUTE|MI|N|SECOND|SS|S|MILLISECOND|MS)\\s*\\,", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant)] // Any possible match is at least 10 characters.
if (pos <= inputSpan.Length - 10)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_OrdinalIgnoreCase_2AC5E9CD8492EE9AF8BE2E7D112B6E7B0E2EB16F4F0FF47ECAA2B811EE26A081);
+ // The pattern has the literal "date(" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_1DE7C48BB4BC0E30E65E38B4F39A75CA57C22461AE122A6380A42312C9E67BCA);
if (i >= 0)
{
base.runtextpos = pos + i;
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_OrdinalIgnoreCase_2AC5E9CD8492EE9AF8BE2E7D112B6E7B0E2EB16F4F0FF47ECAA2B811EE26A081 = SearchValues.Create(["dateadd", "datepart"], StringComparison.OrdinalIgnoreCase);
+ /// <summary>Supports searching for the string "date(".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_1DE7C48BB4BC0E30E65E38B4F39A75CA57C22461AE122A6380A42312C9E67BCA = SearchValues.Create(["date("], StringComparison.OrdinalIgnoreCase);
}
}"\\b(et\\s*(le|la(s)?)?)\\b.+" (291 uses)[GeneratedRegex("\\b(et\\s*(le|la(s)?)?)\\b.+", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 3 characters.
if (pos <= inputSpan.Length - 3)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_40190A5AE82B92C9577FE9A45CD09B22413116F9859390E6536F6EF2E5085EA1);
+ // The pattern has the literal "et" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_et_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_40190A5AE82B92C9577FE9A45CD09B22413116F9859390E6536F6EF2E5085EA1 = SearchValues.Create(["ET", "eT", "Et", "et"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "et".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_et_OrdinalIgnoreCase = SearchValues.Create(["et"], StringComparison.OrdinalIgnoreCase);
}
}"\\b(em)\\b" (200 uses)[GeneratedRegex("\\b(em)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 2 characters.
if (pos <= inputSpan.Length - 2)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_00298CB1C9B37035848F363BE27E1EB54A4FE98FE07EEFB24B812417AC25856B);
+ // The pattern has the literal "em" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_em_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_00298CB1C9B37035848F363BE27E1EB54A4FE98FE07EEFB24B812417AC25856B = SearchValues.Create(["EM", "eM", "Em", "em"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "em".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_em_OrdinalIgnoreCase = SearchValues.Create(["em"], StringComparison.OrdinalIgnoreCase);
}
}"\\b(avant)\\b" (195 uses)[GeneratedRegex("\\b(avant)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 5 characters.
if (pos <= inputSpan.Length - 5)
{
- // The pattern matches a character in the set [Vv] at index 1.
- // Find the next occurrence. If it can't be found, there's no match.
- ReadOnlySpan<char> span = inputSpan.Slice(pos);
- for (int i = 0; i < span.Length - 4; i++)
+ // The pattern has the literal "avant" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_avant_OrdinalIgnoreCase);
+ if (i >= 0)
{
- int indexOfPos = span.Slice(i + 1).IndexOfAny('V', 'v');
- if (indexOfPos < 0)
- {
- goto NoMatchFound;
- }
- i += indexOfPos;
-
- // The primary set being searched for was found. 2 more sets will be checked so as
- // to minimize the number of places TryMatchAtCurrentPosition is run unnecessarily.
- // Make sure they fit in the remainder of the input.
- if ((uint)(i + 3) >= (uint)span.Length)
- {
- goto NoMatchFound;
- }
-
- if (((span[i + 3] | 0x20) == 'n') &&
- ((span[i] | 0x20) == 'a'))
- {
- base.runtextpos = pos + i;
- return true;
- }
+ base.runtextpos = pos + i;
+ return true;
}
}
// No match found.
- NoMatchFound:
base.runtextpos = inputSpan.Length;
return false;
}
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
+
+ /// <summary>Supports searching for the string "avant".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_avant_OrdinalIgnoreCase = SearchValues.Create(["avant"], StringComparison.OrdinalIgnoreCase);
}
}"(week)(\\s*)(?<number>\\d\\d|\\d|0\\d)" (194 uses)[GeneratedRegex("(week)(\\s*)(?<number>\\d\\d|\\d|0\\d)", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 5 characters.
if (pos <= inputSpan.Length - 5)
{
- // The pattern matches a character in the set [Kk\u212A] at index 3.
- // Find the next occurrence. If it can't be found, there's no match.
- ReadOnlySpan<char> span = inputSpan.Slice(pos);
- for (int i = 0; i < span.Length - 4; i++)
+ // The pattern has the literal "wee" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_wee_OrdinalIgnoreCase);
+ if (i >= 0)
{
- int indexOfPos = span.Slice(i + 3).IndexOfAny('K', 'k', 'K');
- if (indexOfPos < 0)
- {
- goto NoMatchFound;
- }
- i += indexOfPos;
-
- if (((span[i] | 0x20) == 'w') &&
- ((span[i + 1] | 0x20) == 'e'))
- {
- base.runtextpos = pos + i;
- return true;
- }
+ base.runtextpos = pos + i;
+ return true;
}
}
// No match found.
- NoMatchFound:
base.runtextpos = inputSpan.Length;
return false;
}
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
+
+ /// <summary>Supports searching for the string "wee".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_wee_OrdinalIgnoreCase = SearchValues.Create(["wee"], StringComparison.OrdinalIgnoreCase);
}
}"\\b(entre\\s*(le|la(s)?)?)\\b" (194 uses)[GeneratedRegex("\\b(entre\\s*(le|la(s)?)?)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 5 characters.
if (pos <= inputSpan.Length - 5)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_3200475DE471EA58FF8C7B5F0CA4A9515EFACDBAA912EFAC506148E560A6D596);
+ // The pattern has the literal "entre" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_entre_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_3200475DE471EA58FF8C7B5F0CA4A9515EFACDBAA912EFAC506148E560A6D596 = SearchValues.Create(["ENTR", "eNTR", "EnTR", "enTR", "ENtR", "eNtR", "EntR", "entR", "ENTr", "eNTr", "EnTr", "enTr", "ENtr", "eNtr", "Entr", "entr"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "entre".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_entre_OrdinalIgnoreCase = SearchValues.Create(["entre"], StringComparison.OrdinalIgnoreCase);
}
}"(mes)(\\s*)((do|da|de))" (193 uses)[GeneratedRegex("(mes)(\\s*)((do|da|de))", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 5 characters.
if (pos <= inputSpan.Length - 5)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_DC6FBF049DFCA75A0085CE45822CFFFBACDEEEF2607AA4096D769AC2377EF021);
+ // The pattern has the literal "mes" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_mes_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_DC6FBF049DFCA75A0085CE45822CFFFBACDEEEF2607AA4096D769AC2377EF021 = SearchValues.Create(["MES", "mES", "MeS", "meS", "MEs", "mEs", "Mes", "mes"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "mes".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_mes_OrdinalIgnoreCase = SearchValues.Create(["mes"], StringComparison.OrdinalIgnoreCase);
}
}"(semana)(\\s*)((do|da|de))" (193 uses)[GeneratedRegex("(semana)(\\s*)((do|da|de))", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 8 characters.
if (pos <= inputSpan.Length - 8)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_1B7E1CD8AF955A2769ABD6F7FC469F9212B5B795E7DC6CF668A8EE08D2419045);
+ // The pattern has the literal "semana" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_semana_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_1B7E1CD8AF955A2769ABD6F7FC469F9212B5B795E7DC6CF668A8EE08D2419045 = SearchValues.Create(["SEMA", "sEMA", "SeMA", "seMA", "SEmA", "sEmA", "SemA", "semA", "SEMa", "sEMa", "SeMa", "seMa", "SEma", "sEma", "Sema", "sema"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "semana".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_semana_OrdinalIgnoreCase = SearchValues.Create(["semana"], StringComparison.OrdinalIgnoreCase);
}
}For more diff examples, see https://gist.github.com/MihuBot/4212adf85284694d34d378af2233fa23 JIT assembly changesFor a list of JIT diff regressions, see Regressions.md Sample source code for further analysisconst string JsonPath = "RegexResults-1792.json";
if (!File.Exists(JsonPath))
{
await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/FHwNbpHA");
using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}
using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");
record KnownPattern(string Pattern, RegexOptions Options, int Count);
sealed class RegexEntry
{
public required KnownPattern Regex { get; set; }
public required string MainSource { get; set; }
public required string PrSource { get; set; }
public string? FullDiff { get; set; }
public string? ShortDiff { get; set; }
public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
} |
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Show resolved
Hide resolved
|
incidentally for this - /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_DA0DF7757216159252C4FA00AB5982AAA4403D2C43304873401C53E36F92CA04 = SearchValues.Create(["FROM", "fROM", "FrOM", "frOM", "FRoM", "fRoM", "FroM", "froM", "FROm", "fROm", "FrOm", "frOm", "FRom", "fRom", "From", "from"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "from".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_from_OrdinalIgnoreCase = SearchValues.Create(["from"], StringComparison.OrdinalIgnoreCase);Should SearchValues anyway recognize it's been given all case variations and collapse to a single one with ignore case? |
When TryGetOrdinalCaseInsensitiveString recurses into an inner
Concatenation (from an unwrapped Capture) and only partially consumes
it, stop iterating the outer Concatenation. Otherwise subsequent
siblings are incorrectly appended to the prefix string. For example
(abcde|abcfg)\( was producing 'abc(' instead of 'abc'.
Co-authored-by: Copilot <[email protected]>
src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexFindOptimizationsTests.cs
Outdated
Show resolved
Hide resolved
Add tests for adjacent captures, captures at non-zero position, single-char captures (unwrap to Set), empty captures (unwrap to Empty), partial inner Concatenate consumption with different trailing node kinds, and Atomic groups (documents current conservative behavior). Co-authored-by: Copilot <[email protected]>
Atomic groups only affect backtracking behavior, not what text is matched, so they can safely be unwrapped during prefix analysis just like Capture nodes. This allows patterns like ab(?>cd)ef with IgnoreCase to extract the full prefix 'abcdef'. Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
|
With AI we spotted the Atomic case can easily be handled here as well. And we added all the tests we can collectively think of. |
|
@MihuBot regexdiff |
|
Verified each fixe including atomic cause test failures if they are removed. |
|
180 out of 18857 patterns have generated source code changes. Examples of GeneratedRegex source diffs"\\b(in)\\b" (658 uses)[GeneratedRegex("\\b(in)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 2 characters.
if (pos <= inputSpan.Length - 2)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_409072BF36F03A4496ACC585815833300ABA306360D979616ACDCED385DDC8FB);
+ // The pattern has the literal "in" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_in_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_409072BF36F03A4496ACC585815833300ABA306360D979616ACDCED385DDC8FB = SearchValues.Create(["IN", "iN", "In", "in"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "in".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_in_OrdinalIgnoreCase = SearchValues.Create(["in"], StringComparison.OrdinalIgnoreCase);
}
}"\\b(from).+(to)\\b.+" (316 uses)[GeneratedRegex("\\b(from).+(to)\\b.+", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 8 characters.
if (pos <= inputSpan.Length - 8)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_DA0DF7757216159252C4FA00AB5982AAA4403D2C43304873401C53E36F92CA04);
+ // The pattern has the literal "from" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_from_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_DA0DF7757216159252C4FA00AB5982AAA4403D2C43304873401C53E36F92CA04 = SearchValues.Create(["FROM", "fROM", "FrOM", "frOM", "FRoM", "fRoM", "FroM", "froM", "FROm", "fROm", "FrOm", "frOm", "FRom", "fRom", "From", "from"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "from".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_from_OrdinalIgnoreCase = SearchValues.Create(["from"], StringComparison.OrdinalIgnoreCase);
}
}"(DATEADD|DATEPART)\\(\\s*(YEAR|Y|YY|YYYY|MON ..." (294 uses)[GeneratedRegex("(DATEADD|DATEPART)\\(\\s*(YEAR|Y|YY|YYYY|MONTH|MM|M|DAYOFYEAR|DY|DAY|DD|D|WEEKDAY|DW|HOUR|HH|MINUTE|MI|N|SECOND|SS|S|MILLISECOND|MS)\\s*\\,", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant)] // Any possible match is at least 10 characters.
if (pos <= inputSpan.Length - 10)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_OrdinalIgnoreCase_2AC5E9CD8492EE9AF8BE2E7D112B6E7B0E2EB16F4F0FF47ECAA2B811EE26A081);
+ // The pattern has the literal "date" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_date_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_OrdinalIgnoreCase_2AC5E9CD8492EE9AF8BE2E7D112B6E7B0E2EB16F4F0FF47ECAA2B811EE26A081 = SearchValues.Create(["dateadd", "datepart"], StringComparison.OrdinalIgnoreCase);
+ /// <summary>Supports searching for the string "date".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_date_OrdinalIgnoreCase = SearchValues.Create(["date"], StringComparison.OrdinalIgnoreCase);
}
}"\\b(et\\s*(le|la(s)?)?)\\b.+" (291 uses)[GeneratedRegex("\\b(et\\s*(le|la(s)?)?)\\b.+", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 3 characters.
if (pos <= inputSpan.Length - 3)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_40190A5AE82B92C9577FE9A45CD09B22413116F9859390E6536F6EF2E5085EA1);
+ // The pattern has the literal "et" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_et_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_40190A5AE82B92C9577FE9A45CD09B22413116F9859390E6536F6EF2E5085EA1 = SearchValues.Create(["ET", "eT", "Et", "et"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "et".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_et_OrdinalIgnoreCase = SearchValues.Create(["et"], StringComparison.OrdinalIgnoreCase);
}
}"\\b(em)\\b" (200 uses)[GeneratedRegex("\\b(em)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 2 characters.
if (pos <= inputSpan.Length - 2)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_00298CB1C9B37035848F363BE27E1EB54A4FE98FE07EEFB24B812417AC25856B);
+ // The pattern has the literal "em" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_em_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_00298CB1C9B37035848F363BE27E1EB54A4FE98FE07EEFB24B812417AC25856B = SearchValues.Create(["EM", "eM", "Em", "em"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "em".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_em_OrdinalIgnoreCase = SearchValues.Create(["em"], StringComparison.OrdinalIgnoreCase);
}
}"\\b(avant)\\b" (195 uses)[GeneratedRegex("\\b(avant)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 5 characters.
if (pos <= inputSpan.Length - 5)
{
- // The pattern matches a character in the set [Vv] at index 1.
- // Find the next occurrence. If it can't be found, there's no match.
- ReadOnlySpan<char> span = inputSpan.Slice(pos);
- for (int i = 0; i < span.Length - 4; i++)
+ // The pattern has the literal "avant" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_avant_OrdinalIgnoreCase);
+ if (i >= 0)
{
- int indexOfPos = span.Slice(i + 1).IndexOfAny('V', 'v');
- if (indexOfPos < 0)
- {
- goto NoMatchFound;
- }
- i += indexOfPos;
-
- // The primary set being searched for was found. 2 more sets will be checked so as
- // to minimize the number of places TryMatchAtCurrentPosition is run unnecessarily.
- // Make sure they fit in the remainder of the input.
- if ((uint)(i + 3) >= (uint)span.Length)
- {
- goto NoMatchFound;
- }
-
- if (((span[i + 3] | 0x20) == 'n') &&
- ((span[i] | 0x20) == 'a'))
- {
- base.runtextpos = pos + i;
- return true;
- }
+ base.runtextpos = pos + i;
+ return true;
}
}
// No match found.
- NoMatchFound:
base.runtextpos = inputSpan.Length;
return false;
}
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
+
+ /// <summary>Supports searching for the string "avant".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_avant_OrdinalIgnoreCase = SearchValues.Create(["avant"], StringComparison.OrdinalIgnoreCase);
}
}"(week)(\\s*)(?<number>\\d\\d|\\d|0\\d)" (194 uses)[GeneratedRegex("(week)(\\s*)(?<number>\\d\\d|\\d|0\\d)", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 5 characters.
if (pos <= inputSpan.Length - 5)
{
- // The pattern matches a character in the set [Kk\u212A] at index 3.
- // Find the next occurrence. If it can't be found, there's no match.
- ReadOnlySpan<char> span = inputSpan.Slice(pos);
- for (int i = 0; i < span.Length - 4; i++)
+ // The pattern has the literal "wee" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_wee_OrdinalIgnoreCase);
+ if (i >= 0)
{
- int indexOfPos = span.Slice(i + 3).IndexOfAny('K', 'k', 'K');
- if (indexOfPos < 0)
- {
- goto NoMatchFound;
- }
- i += indexOfPos;
-
- if (((span[i] | 0x20) == 'w') &&
- ((span[i + 1] | 0x20) == 'e'))
- {
- base.runtextpos = pos + i;
- return true;
- }
+ base.runtextpos = pos + i;
+ return true;
}
}
// No match found.
- NoMatchFound:
base.runtextpos = inputSpan.Length;
return false;
}
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
+
+ /// <summary>Supports searching for the string "wee".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_wee_OrdinalIgnoreCase = SearchValues.Create(["wee"], StringComparison.OrdinalIgnoreCase);
}
}"\\b(entre\\s*(le|la(s)?)?)\\b" (194 uses)[GeneratedRegex("\\b(entre\\s*(le|la(s)?)?)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 5 characters.
if (pos <= inputSpan.Length - 5)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_3200475DE471EA58FF8C7B5F0CA4A9515EFACDBAA912EFAC506148E560A6D596);
+ // The pattern has the literal "entre" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_entre_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_3200475DE471EA58FF8C7B5F0CA4A9515EFACDBAA912EFAC506148E560A6D596 = SearchValues.Create(["ENTR", "eNTR", "EnTR", "enTR", "ENtR", "eNtR", "EntR", "entR", "ENTr", "eNTr", "EnTr", "enTr", "ENtr", "eNtr", "Entr", "entr"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "entre".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_entre_OrdinalIgnoreCase = SearchValues.Create(["entre"], StringComparison.OrdinalIgnoreCase);
}
}"(mes)(\\s*)((do|da|de))" (193 uses)[GeneratedRegex("(mes)(\\s*)((do|da|de))", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 5 characters.
if (pos <= inputSpan.Length - 5)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_DC6FBF049DFCA75A0085CE45822CFFFBACDEEEF2607AA4096D769AC2377EF021);
+ // The pattern has the literal "mes" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_mes_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_DC6FBF049DFCA75A0085CE45822CFFFBACDEEEF2607AA4096D769AC2377EF021 = SearchValues.Create(["MES", "mES", "MeS", "meS", "MEs", "mEs", "Mes", "mes"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "mes".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_mes_OrdinalIgnoreCase = SearchValues.Create(["mes"], StringComparison.OrdinalIgnoreCase);
}
}"(semana)(\\s*)((do|da|de))" (193 uses)[GeneratedRegex("(semana)(\\s*)((do|da|de))", RegexOptions.IgnoreCase | RegexOptions.Singleline)] // Any possible match is at least 8 characters.
if (pos <= inputSpan.Length - 8)
{
- // The pattern has multiple strings that could begin the match. Search for any of them.
- // If none can be found, there's no match.
- int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_1B7E1CD8AF955A2769ABD6F7FC469F9212B5B795E7DC6CF668A8EE08D2419045);
+ // The pattern has the literal "semana" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
+ // If it can't be found, there's no match.
+ int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_semana_OrdinalIgnoreCase);
if (i >= 0)
{
base.runtextpos = pos + i;
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
- /// <summary>Supports searching for the specified strings.</summary>
- internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_1B7E1CD8AF955A2769ABD6F7FC469F9212B5B795E7DC6CF668A8EE08D2419045 = SearchValues.Create(["SEMA", "sEMA", "SeMA", "seMA", "SEmA", "sEmA", "SemA", "semA", "SEMa", "sEMa", "SeMa", "seMa", "SEma", "sEma", "Sema", "sema"], StringComparison.Ordinal);
+ /// <summary>Supports searching for the string "semana".</summary>
+ internal static readonly SearchValues<string> s_indexOfString_semana_OrdinalIgnoreCase = SearchValues.Create(["semana"], StringComparison.OrdinalIgnoreCase);
}
}For more diff examples, see https://gist.github.com/MihuBot/869d4c7474d361737472fabda64ba5f9 JIT assembly changesFor a list of JIT diff regressions, see Regressions.md Sample source code for further analysisconst string JsonPath = "RegexResults-1794.json";
if (!File.Exists(JsonPath))
{
await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/FHzeoyU");
using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}
using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");
record KnownPattern(string Pattern, RegexOptions Options, int Count);
sealed class RegexEntry
{
public required KnownPattern Regex { get; set; }
public required string MainSource { get; set; }
public required string PrSource { get; set; }
public string? FullDiff { get; set; }
public string? ShortDiff { get; set; }
public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
} |
|
Not ready for review again -- examining diffs -- Dan |
|
for Looking through more. |
Yeah, doing an OrdinalIgnoreCase search for "week" would find "WEEK" but not "WEEK" (that's a Kelvin there at the end) |
|
But even if it did, we currently have a cut-off where we won't use StringValues if there are more than 16 words, and there would be 24 for "week" if the kelvin sign were permitted |
|
here's another puzzle: |
|
Look in https://gist.github.com/MihuBot/869d4c7474d361737472fabda64ba5f9 for AI tried this against base, and it reproes there too. It has an explanation that I haven't validated. Open separate bug? |
|
Oh I see the confusion, I pasted the wrong pattern somehow in my original comment. |
|
It's probably due to how it ends up being represented in the node graph: I don't remember the code, but the prefix analyzer is likely stopping when it gets to the two alternation branches and rationalizes comparing a set against a concatenation. Just a guess. If that's accurate, we could improve such cases with more code. |
|
I'll open a separate bug for that. I think this is ready for re-review then. |

TryGetOrdinalCaseInsensitiveStringiterates the direct children of aConcatenatenode to extract an ordinal case-insensitive prefix string. It handlesOne,Multi,Set,Empty, and zero-width assertions — but when it encounters aCapturenode, it breaks out of the loop, never examining the content inside.For a pattern like
\b(in)\bwithIgnoreCase, the regex tree after lowering is:FindPrefixOrdinalCaseInsensitivedescends throughCapture(0)and callsTryGetOrdinalCaseInsensitiveStringon the innerConcatenate. At child index 1 (Capture(1)), the method breaks — it never finds"in". The pattern falls through to the slowerFixedDistanceSetspath (or, after #124736, uses the multi-string ordinalSearchValuespath with 4 case variants).This change unwraps
Capturenodes transparently and recurses into nestedConcatenatechildren, matching the behavior already present inFindPrefixesCore. This allows\b(in)\bwithIgnoreCaseto use the optimalLeadingString_OrdinalIgnoreCase_LeftToRightstrategy with a single"in"string andOrdinalIgnoreCasecomparison.Follows up on a codegen diff observed in #124736.
Source-generated code diff for
[GeneratedRegex(@"\b(in)\b", RegexOptions.IgnoreCase)]private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan) { int pos = base.runtextpos; // Any possible match is at least 2 characters. if (pos <= inputSpan.Length - 2) { - // The pattern has multiple strings that could begin the match. Search for any of them. - // If none can be found, there's no match. - int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_...); + // The pattern has the literal "in" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence. + // If it can't be found, there's no match. + int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_in_OrdinalIgnoreCase); if (i >= 0) { base.runtextpos = pos + i; return true; } } base.runtextpos = inputSpan.Length; return false; }