Enable case-sensitive LeadingStrings with frequency-based heuristic by danmoseley · Pull Request #124736 · dotnet/runtime

danmoseley · 2026-02-23T02:08:39Z

I thought it would be interesting to see whether AI could take another look at the commented out search strategy originally introduced by @stephentoub in #98791 to see whether we can enable it and keep the wins without the regressions that caused it to be commented out.

AI tried various experiments and hit a dead end. I then recalled the frequency-table approach that ripgrep uses (credit to @BurntSushi), and it turns out that fixes the regressions entirely. This means our engine now has built-in assumptions about char frequencies in ASCII (only) text. That approach has been proven in ripgrep, one of the fastest engines, for 10 years, and it turns out to work OK for regex-redux as well because a, c, g, t are relatively high frequency in English anyway. The code path is unchanged if the pattern has anything other than ASCII (see benchmark results below).

This gives us a nice win on regex-redux, a few other wins in existing tests, and no regressions.

Note: a char frequency table already existed in RegexPrefixAnalyzer.cs for ranking which fixed-distance character sets are most selective. Our new table serves a different purpose: deciding whether to use LeadingStrings vs FixedDistanceSets at all. The two are complementary.

====

When a regex has multiple alternation prefixes (e.g. a|b|c|...), this change decides whether to use SearchValues<string> (Teddy/Aho-Corasick) or fall through to FixedDistanceSets (IndexOfAny) based on the frequency of the starting characters.

High-frequency starters (common letters like lowercase vowels) benefit from multi-string search; low-frequency starters (uppercase, digits, rare consonants) are already excellent IndexOfAny filters. Non-ASCII starters bail out (no frequency data), preserving baseline behavior.

Benchmark results (444 benchmarks, BDN A/B with --statisticalTest 3ms)

Benchmark	Baseline	PR	Ratio	Verdict
RegexRedux_1 (Compiled)	25.77ms	14.27ms	1.81x faster	Faster
Leipzig Tom.*river (Compiled)	6.13ms	1.87ms	3.28x faster	Faster
RegexRedux_5 (Compiled)	2.83ms	2.35ms	1.20x faster	Faster
Sherlock, BinaryData, BoostDocs, Mariomka, SliceSlice	--	--	--	Same
LeadingStrings_NonAscii (all variants)	--	--	--	Same
LeadingStrings_BinaryData (all variants)	--	--	--	Same

The Leipzig win is because the pattern is Tom.{10,25}river|river.{10,25}Tom, so each branch has a short prefix that is common in the text. With this change, the heuristic notices that r is common and T fairly common in English text, so it switches to SearchValues, which looks for Tom and river simultaneously and causes far fewer false starts.
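To make the "false starts" point concrete, here is a minimal sketch that contrasts the two search primitives on a toy input. It assumes .NET 9 or later (where `SearchValues<string>` and the matching `IndexOfAny` overload are available); the sample text and variable names are made up for illustration and are not taken from the benchmark.

```csharp
using System;
using System.Buffers;

// Toy input: 'T' and 'r' occur well before either full prefix does.
string text = "The rain fell near the river";

// Set-based filter: stops at the first 'T' or 'r', whether or not a prefix follows.
int charIndex = text.AsSpan().IndexOfAny('T', 'r');

// Multi-string search: only stops where "Tom" or "river" actually begins.
SearchValues<string> prefixes = SearchValues.Create(["Tom", "river"], StringComparison.Ordinal);
int stringIndex = text.AsSpan().IndexOfAny(prefixes);

Console.WriteLine($"first 'T'/'r' candidate at {charIndex}, first full prefix at {stringIndex}");
```

Every candidate position reported by the set-based search that is not followed by a full prefix is a false start the matcher has to reject; the multi-string search skips those positions entirely.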

The regex-redux win is because it was previously looking for short, very common prefixes naively; now (specifically because the pattern chars are common) it uses SearchValues (i.e. Aho-Corasick/Teddy) to search for the longer strings simultaneously.

No regressions detected. All MannWhitney tests report Same for non-improved benchmarks.

Key design decisions

Frequency table: First 128 entries of Rust's BYTE_FREQUENCIES from @BurntSushi's aho-corasick crate
Threshold: Average rank >= 200 triggers LeadingStrings; below 200 falls through to FixedDistanceSets
Non-ASCII: Returns false (no frequency data), so the heuristic does not engage and behavior is unchanged
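
The three decisions above can be sketched as follows. This is only an illustration of the rule as described here, not the code in the PR: the rank values are placeholders (not the real BYTE_FREQUENCIES entries), and the name `StartersLookCommon` is hypothetical.

```csharp
using System;

// Illustrative ranks only (0 = rarest, 255 = most common). These are NOT the real
// BYTE_FREQUENCIES values; they are placeholders so the sketch can run.
byte[] illustrativeRank = new byte[128];
illustrativeRank['a'] = 240;
illustrativeRank['c'] = 220;
illustrativeRank['g'] = 210;
illustrativeRank['t'] = 245;
illustrativeRank['Q'] = 90;
illustrativeRank['Z'] = 80;

const int Threshold = 200;

// Hypothetical helper: true when the average rank of the prefixes' first chars is >= Threshold.
bool StartersLookCommon(string[] prefixes)
{
    int total = 0;
    foreach (string prefix in prefixes)
    {
        // No frequency data for empty or non-ASCII starters: do not engage the heuristic.
        if (prefix.Length == 0 || prefix[0] >= 128)
        {
            return false;
        }

        total += illustrativeRank[prefix[0]];
    }

    // Compare total against Threshold * count to avoid integer-division rounding.
    return prefixes.Length > 0 && total >= Threshold * prefixes.Length;
}

Console.WriteLine(StartersLookCommon(["agggtaaa", "cgggtaaa"])); // common starters
Console.WriteLine(StartersLookCommon(["Quick", "Zebra"]));       // rare starters
Console.WriteLine(StartersLookCommon(["émile", "alpha"]));       // non-ASCII starter
```

With these placeholder ranks, only the first call would select LeadingStrings; the other two fall through to the existing FixedDistanceSets path.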

Companion benchmarks: dotnet/performance#5126

New benchmark results (not yet in dotnet/performance, won't be picked up by PR bot)

These benchmarks are from the companion PR dotnet/performance#5126.

BenchmarkDotNet v0.16.0-custom.20260127.101, Windows 11 (10.0.26100.7840/24H2/2024Update/HudsonValley)
Intel Core i9-14900K 3.20GHz, 1 CPU, 32 logical and 24 physical cores

Benchmark	Options	Baseline	PR	Ratio	MannWhitney(3ms)
LeadingStrings_BinaryData	None	4,483 us	4,365 us	0.97	Same
LeadingStrings_BinaryData	Compiled	2,188 us	2,184 us	1.00	Same
LeadingStrings_BinaryData	NonBacktracking	3,734 us	3,725 us	1.00	Same
LeadingStrings_NonAscii Count	None	913 us	956 us	1.05	Same
LeadingStrings_NonAscii Count	Compiled	244 us	243 us	1.00	Same
LeadingStrings_NonAscii CountIgnoreCase	None	1,758 us	1,714 us	0.98	Same
LeadingStrings_NonAscii CountIgnoreCase	Compiled	258 us	250 us	0.97	Same
LeadingStrings_NonAscii Count	NonBacktracking	392 us	398 us	1.02	Same
LeadingStrings_NonAscii CountIgnoreCase	NonBacktracking	409 us	431 us	1.05	Same

Binary didn't regress even though it's ASCII with non-English frequencies, because the pattern has chars that are not particularly common in English, so it uses the old codepath. It's hard to hypothesize about what one might search for in a binary file; searching for a bunch of leading lowercase ASCII chars might regress somewhat. We've never particularly tested on binary before, and I don't recall seeing any bugs mentioning binary, so I don't think this is particularly interesting.

NonAscii didn't regress because, as previously mentioned, the non-ASCII leading chars in the pattern (presumably likely for any search of non-ASCII text) cause it to choose the existing codepath.

All MannWhitney tests report Same -- no regressions on binary or non-ASCII input.

danmoseley · 2026-02-23T02:08:55Z

@MihuBot benchmark Regex

Copilot

Pull request overview

This PR enables case-sensitive multi-string prefix optimization for regex patterns by introducing a frequency-based heuristic to decide between SearchValues<string> (Teddy/Aho-Corasick) and IndexOfAny with character sets. Previously, case-sensitive prefixes were disabled due to regressions in patterns with low-frequency starting characters (e.g., uppercase letters, digits). The new heuristic uses empirical byte frequency data from Rust's aho-corasick crate to determine if starting characters are common enough in typical text to warrant multi-string search, or rare enough that IndexOfAny remains a better filter.

Changes:

Uncommented and enabled case-sensitive prefix optimization with a frequency guard
Added HasHighFrequencyStartingChars method to evaluate whether prefix starting characters are high-frequency (threshold >= 200)
Added AsciiCharFrequencyRank table containing the first 128 entries from BurntSushi's BYTE_FREQUENCIES data

.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs

MihuBot · 2026-02-23T03:46:58Z

See benchmark results at https://gist.github.com/MihuBot/d9e8eb967e28c7cdfbcf0682a8b546b3

Copilot

Pull request overview

Copilot reviewed 21 out of 9433 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (4)

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs:232

leadingStringsFrequency uses -1 as the sentinel for "not computed / not applicable", but the subsequent checks use > 0. A valid computed frequency can be 0 (e.g., prefixes starting with '\x00' or other chars with 0 frequency in the table), which would incorrectly skip the LeadingStrings-vs-set comparison. Consider tracking availability via caseSensitivePrefixes is not null and checking leadingStringsFrequency >= 0 (or using a separate bool) so computed-0 still participates in the heuristic.

                if (leadingStringsFrequency > 0)
                {
                    bool preferLeadingStrings = true;

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs:307

Same sentinel issue as above: the fallback if (leadingStringsFrequency > 0) will skip using computed case-sensitive prefixes if the computed value is 0, potentially leaving FindMode as NoSearch when no other strategy is selected. Use an availability check like caseSensitivePrefixes is not null && leadingStringsFrequency >= 0 (or a dedicated flag) instead of > 0.

            // If we have case-sensitive leading prefixes and nothing else was selected, use them.
            if (leadingStringsFrequency > 0)
            {
                LeadingPrefixes = caseSensitivePrefixes!;
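
The sentinel concern raised in the two comments above can be reproduced in isolation. The sketch below is a standalone illustration, not the PR's code: `SumOrSentinel` and `toyFrequency` are hypothetical stand-ins, and the toy frequency function simply assigns 0 to control characters.

```csharp
using System;

// Toy frequency function (an assumption for this sketch): control chars have frequency 0.
Func<char, float> toyFrequency = c => c < ' ' ? 0f : 1f;

// Hypothetical stand-in for the frequency sum; returns -1 when there are no prefixes.
float SumOrSentinel(string[]? prefixes, Func<char, float> frequencyOf)
{
    if (prefixes is null)
    {
        return -1;
    }

    float sum = 0;
    foreach (string prefix in prefixes)
    {
        sum += frequencyOf(prefix[0]);
    }

    return sum;
}

float computed = SumOrSentinel(["\0abc", "\0def"], toyFrequency); // computed, and equal to 0
float missing = SumOrSentinel(null, toyFrequency);                // sentinel: nothing computed

bool strictAcceptsComputed = computed > 0;     // a legitimately computed 0 is treated as "missing"
bool inclusiveAcceptsComputed = computed >= 0; // computed 0 still participates
bool inclusiveAcceptsMissing = missing >= 0;   // the -1 sentinel is still rejected

Console.WriteLine($"{strictAcceptsComputed} {inclusiveAcceptsComputed} {inclusiveAcceptsMissing}");
```

A separate bool, or a nullable float, would express "not computed" without reserving any numeric value at all.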

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexPrefixAnalyzer.cs:1512

The PR description says this change is based on Rust aho-corasick BYTE_FREQUENCIES ranks and a rank threshold (>= 200), but the implementation is using the existing RegexPrefixAnalyzer.Frequency table of percentage occurrences (generated from runtime/Gutenberg text) and compares summed percentages. Either the description needs updating to reflect the actual heuristic/table used, or the code needs to align with the described rank-based approach.

        /// <summary>Percent occurrences in source text (100 * char count / total count).</summary>
        internal static ReadOnlySpan<float> Frequency =>
        [
            0.000f /* '\x00' */, 0.000f /* '\x01' */, 0.000f /* '\x02' */, 0.000f /* '\x03' */, 0.000f /* '\x04' */, 0.000f /* '\x05' */, 0.000f /* '\x06' */, 0.000f /* '\x07' */,

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs:255

This PR introduces new behavior for RegexOptions.Compiled / NonBacktracking where case-sensitive LeadingStrings may now be selected based on a frequency heuristic. There are existing RegexFindOptimizationsTests, but none appear to cover this new decision logic in compiled/NB modes; adding a few targeted test cases (e.g., where compiled should choose LeadingStrings vs LeadingSet depending on starter frequency, and a non-ASCII starter that must not engage the heuristic) would help prevent future regressions.

                // Compute case-sensitive leading prefixes, but don't commit yet. We'll compare
                // their starting-char frequency against the best FixedDistanceSet below to decide
                // which strategy to use.
                caseSensitivePrefixes = RegexPrefixAnalyzer.FindPrefixes(root, ignoreCase: false) is { Length: > 1 } csp ? csp : null;
                leadingStringsFrequency = caseSensitivePrefixes is not null ? SumStartingCharFrequencies(caseSensitivePrefixes) : -1;
            }

            // Build up a list of all of the sets that are a fixed distance from the start of the expression.
            List<FixedDistanceSet>? fixedDistanceSets = RegexPrefixAnalyzer.FindFixedDistanceSets(root, thorough: !interpreter);
            Debug.Assert(fixedDistanceSets is null || fixedDistanceSets.Count != 0);

            // See if we can make a string of at least two characters long out of those sets.  We should have already caught
            // one at the beginning of the pattern, but there may be one hiding at a non-zero fixed distance into the pattern.
            if (fixedDistanceSets is not null &&
                FindFixedDistanceString(fixedDistanceSets) is (string String, int Distance) bestFixedDistanceString)
            {
                FindMode = FindNextStartingPositionMode.FixedDistanceString_LeftToRight;
                FixedDistanceLiteral = ('\0', bestFixedDistanceString.String, bestFixedDistanceString.Distance);
                return;
            }

            // As a backup, see if we can find a literal after a leading atomic loop.  That might be better than whatever sets we find, so
            // we want to know whether we have one in our pocket before deciding whether to use a leading set (we'll prefer a leading
            // set if it's something for which we can search efficiently).
            (RegexNode LoopNode, (char Char, string? String, StringComparison StringComparison, char[]? Chars) Literal)? literalAfterLoop = RegexPrefixAnalyzer.FindLiteralFollowingLeadingLoop(root);

            // If we got such sets, we'll likely use them.  However, if the best of them is something that doesn't support an efficient
            // search and we did successfully find a literal after an atomic loop we could search instead, we prefer the efficient search.
            // For example, if we have a negated set, we will still prefer the literal-after-an-atomic-loop because negated sets typically
            // contain _many_ characters (e.g. [^a] is everything but 'a') and are thus more likely to very quickly match, which means any
            // vectorization employed is less likely to kick in and be worth the startup overhead.
            if (fixedDistanceSets is not null)
            {
                // Sort the sets by "quality", such that whatever set is first is the one deemed most efficient to use.
                // In some searches, we may use multiple sets, so we want the subsequent ones to also be the efficiency runners-up.
                RegexPrefixAnalyzer.SortFixedDistanceSetsByQuality(fixedDistanceSets);

                // If we have case-sensitive leading prefixes, compare the frequency of their starting characters
                // against the best fixed-distance set's characters. If the best set isn't more selective than the
                // starting chars (i.e. its frequency is at least as high), prefer LeadingStrings (SearchValues)
                // which can match full multi-character prefixes simultaneously. Also prefer LeadingStrings when
                // the best set is negated or range-based (no Chars), since those are weak filters.
                if (leadingStringsFrequency > 0)
                {
                    bool preferLeadingStrings = true;
                    if (fixedDistanceSets[0].Chars is { } bestSetChars &&
                        !fixedDistanceSets[0].Negated)
                    {
                        ReadOnlySpan<float> frequency = RegexPrefixAnalyzer.Frequency;
                        Debug.Assert(frequency.Length == 128);
                        float bestSetFrequency = 0;
                        foreach (char c in bestSetChars)
                        {
                            bestSetFrequency += c < frequency.Length ? frequency[c] : 0;
                        }

                        preferLeadingStrings = bestSetFrequency >= leadingStringsFrequency;
                    }

                    if (preferLeadingStrings)
                    {
                        LeadingPrefixes = caseSensitivePrefixes!;
                        FindMode = FindNextStartingPositionMode.LeadingStrings_LeftToRight;
#if SYSTEM_TEXT_REGULAREXPRESSIONS
                        LeadingStrings = SearchValues.Create(LeadingPrefixes, StringComparison.Ordinal);
#endif
                        return;
                    }
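
As a reading aid for the comparison in the snippet above, here is a self-contained sketch of the same decision with made-up numbers. The percentages are illustrative only (not the values in RegexPrefixAnalyzer.Frequency), and `SumFrequencies` and `PreferLeadingStrings` are hypothetical names; how the real code sums the starting chars is not shown in this excerpt.

```csharp
using System;

// Illustrative percentages only; NOT the values in RegexPrefixAnalyzer.Frequency.
float[] illustrativeFrequency = new float[128];
illustrativeFrequency['e'] = 10.0f;
illustrativeFrequency['r'] = 5.0f;
illustrativeFrequency['q'] = 0.1f;

// Sums the illustrative frequency of each char; non-ASCII chars contribute 0.
float SumFrequencies(ReadOnlySpan<char> chars)
{
    float sum = 0;
    foreach (char c in chars)
    {
        sum += c < illustrativeFrequency.Length ? illustrativeFrequency[c] : 0;
    }

    return sum;
}

// Mirrors the shape of the shown check: prefer LeadingStrings when the best set is
// negated, has no explicit chars, or is at least as frequent as the prefix starters.
bool PreferLeadingStrings(float leadingStringsFrequency, string? bestSetChars, bool negated)
{
    if (leadingStringsFrequency <= 0)
    {
        return false;
    }

    if (bestSetChars is null || negated)
    {
        return true;
    }

    return SumFrequencies(bestSetChars) >= leadingStringsFrequency;
}

Console.WriteLine(PreferLeadingStrings(SumFrequencies("q"), "e", negated: false));  // set is a weak filter
Console.WriteLine(PreferLeadingStrings(SumFrequencies("er"), "q", negated: false)); // set is a strong filter
Console.WriteLine(PreferLeadingStrings(SumFrequencies("er"), "q", negated: true));  // negated set
```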

danmoseley · 2026-02-23T04:32:57Z

@MihuBot benchmark Regex

danmoseley · 2026-02-23T04:33:23Z

I also ran the regex tests locally to verify it's still good, so I think this is ready for final (?) review.

danmoseley · 2026-02-23T04:44:19Z

In my local runs, I get these wins. All others are unchanged, with no regressions.

Suite	Benchmark	Options	Mean (Main)	Mean (PR)	Ratio	Alloc (Main)	Alloc (PR)	Alloc Ratio
Leipzig	`Tom.{10,25}river\|river.{10,25}Tom`	Compiled	6,421.7 μs	1,169.9 μs	✅ 0.18	51 B	10 B	0.20
Leipzig	`Tom.{10,25}river\|river.{10,25}Tom`	NonBacktracking	7,183.8 μs	1,442.7 μs	✅ 0.20	16,244 B	4,428 B	0.27
Common	`SplitWords`	Compiled	2,517.20 ns	1,202.89 ns	✅ 0.48	7,432 B	7,432 B	1.00
Common	`MatchesWords`	Compiled	2,551.48 ns	1,240.66 ns	✅ 0.49	3,448 B	3,448 B	1.00
Common	`ReplaceWords`	Compiled	2,423.42 ns	1,233.94 ns	✅ 0.51	6,848 B	6,848 B	1.00
Common	`MatchWord`	Compiled	126.98 ns	64.76 ns	✅ 0.51	208 B	208 B	1.00
RegexRedux_1	`RegexRedux_1`	Compiled	37.44 ms	20.46 ms	✅ 0.55	3.39 MB	3.41 MB	1.01
RegexRedux_5	`RegexRedux_5`	Compiled	4.852 ms	3.558 ms	✅ 0.73	3.21 MB	3.21 MB	1.00
Common	`ReplaceWords`	IgnoreCase, Compiled	1,795.05 ns	1,488.70 ns	✅ 0.83	6,848 B	6,848 B	1.00
Common	`OneNodeBacktracking`	Compiled	79.92 ns	70.05 ns	✅ 0.88	-	-	NA

danmoseley · 2026-02-23T04:56:27Z

How would this affect C# (and F#) on the leaderboard at https://programming-language-benchmarks.vercel.app/problem/regex-redux? (The original does not have a [GeneratedRegex] entry yet.)

Currently the top .NET entry is 6th (AOT+generated). I didn't measure AOT, but assuming the ratio is the same as for JIT, we'd get to 3rd on the list.

Rust would still be over 2x faster, and the main reason is very likely that its regex engine operates on UTF-8 while ours uses UTF-16, so there are just twice the bytes to process, meaning SIMD has to do more chunks...
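
The "twice the bytes" point is easy to check for ASCII-only input. The sample string below is made up for illustration (regex-redux input is DNA-style ASCII text, but this is not the benchmark data).

```csharp
using System;
using System.Text;

// ASCII-only sample text (illustrative, not the actual benchmark input).
string asciiText = "agggtaaa|tttaccct";

int utf8Bytes = Encoding.UTF8.GetByteCount(asciiText);     // 1 byte per ASCII char
int utf16Bytes = Encoding.Unicode.GetByteCount(asciiText); // 2 bytes per char (UTF-16)

Console.WriteLine($"UTF-8: {utf8Bytes} bytes, UTF-16: {utf16Bytes} bytes");
```

For the same fixed-width vector, twice the bytes means half as many characters examined per vector load.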

MihuBot · 2026-02-23T06:08:45Z

See benchmark results at https://gist.github.com/MihuBot/75eeb6e2a31171fca2c8dfc5a605f245

danmoseley · 2026-02-23T06:32:17Z

OK, fishy: MihuBot was good before, but the second run now doesn't match my good local results. Let me see.

MihuBot · 2026-02-24T02:14:07Z

733 out of 18857 patterns have generated source code changes.

Examples of GeneratedRegex source diffs

"\\b(in)\\b" (658 uses)

[GeneratedRegex("\\b(in)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)]

                     // Any possible match is at least 2 characters.
                     if (pos <= inputSpan.Length - 2)
                     {
-                        // The pattern matches a character in the set [Nn] at index 1.
-                        // Find the next occurrence. If it can't be found, there's no match.
-                        ReadOnlySpan<char> span = inputSpan.Slice(pos);
-                        for (int i = 0; i < span.Length - 1; i++)
+                        // The pattern has multiple strings that could begin the match. Search for any of them.
+                        // If none can be found, there's no match.
+                        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_409072BF36F03A4496ACC585815833300ABA306360D979616ACDCED385DDC8FB);
+                        if (i >= 0)
                         {
-                            int indexOfPos = span.Slice(i + 1).IndexOfAny('N', 'n');
-                            if (indexOfPos < 0)
-                            {
-                                goto NoMatchFound;
-                            }
-                            i += indexOfPos;
-                            
-                            if (((span[i] | 0x20) == 'i'))
-                            {
-                                base.runtextpos = pos + i;
-                                return true;
-                            }
+                            base.runtextpos = pos + i;
+                            return true;
                         }
                     }
                     
                     // No match found.
-                    NoMatchFound:
                     base.runtextpos = inputSpan.Length;
                     return false;
                 }
             0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
             0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
         };
+        
+        /// <summary>Supports searching for the specified strings.</summary>
+        internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_409072BF36F03A4496ACC585815833300ABA306360D979616ACDCED385DDC8FB = SearchValues.Create(["IN", "iN", "In", "in"], StringComparison.Ordinal);
     }
 }
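
The ordinal string list in this diff looks like every ASCII casing of the prefix "in", which is how an IgnoreCase pattern can be served by an Ordinal search. A minimal sketch of how such a list could be enumerated is below; `ExpandAsciiCasings` is a hypothetical helper, it assumes every char is an ASCII letter, and it is not the analyzer's actual code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: enumerates all upper/lower casings of an ASCII-letter prefix.
// Bit i of the mask selects lowercase for character i; a clear bit selects uppercase.
List<string> ExpandAsciiCasings(string prefix)
{
    var results = new List<string>();
    int length = prefix.Length;

    for (int mask = 0; mask < (1 << length); mask++)
    {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++)
        {
            chars[i] = (mask & (1 << i)) != 0
                ? char.ToLowerInvariant(prefix[i])
                : char.ToUpperInvariant(prefix[i]);
        }

        results.Add(new string(chars));
    }

    return results;
}

Console.WriteLine(string.Join(", ", ExpandAsciiCasings("in")));
```

The list grows as 2^n in the prefix length, which may be why the longer patterns later in this output use only a short leading portion (for example three characters for the `ltim` pattern).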

"([uú]ltim[oa])\\b" (548 uses)

[GeneratedRegex("([uú]ltim[oa])\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)]

                 private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
                 {
                     int pos = base.runtextpos;
-                    char ch;
                     
                     // Any possible match is at least 6 characters.
                     if (pos <= inputSpan.Length - 6)
                     {
-                        // The pattern matches a character in the set [Mm] at index 4.
-                        // Find the next occurrence. If it can't be found, there's no match.
-                        ReadOnlySpan<char> span = inputSpan.Slice(pos);
-                        for (int i = 0; i < span.Length - 5; i++)
+                        // The pattern has multiple strings that could begin the match. Search for any of them.
+                        // If none can be found, there's no match.
+                        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_5F1D4359E5DF98DCF4B95FDCBFEF2A013E0C46941AEBD0349C6180A4A0372176);
+                        if (i >= 0)
                         {
-                            int indexOfPos = span.Slice(i + 4).IndexOfAny('M', 'm');
-                            if (indexOfPos < 0)
-                            {
-                                goto NoMatchFound;
-                            }
-                            i += indexOfPos;
-                            
-                            if (((ch = span[i]) < 128 ? ("\0\0\0\0\0 \0 "[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\b\0UVuvÚÛúû")) &&
-                                ((span[i + 1] | 0x20) == 'l'))
-                            {
-                                base.runtextpos = pos + i;
-                                return true;
-                            }
+                            base.runtextpos = pos + i;
+                            return true;
                         }
                     }
                     
                     // No match found.
-                    NoMatchFound:
                     base.runtextpos = inputSpan.Length;
                     return false;
                 }
             0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
             0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
         };
+        
+        /// <summary>Supports searching for the specified strings.</summary>
+        internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_5F1D4359E5DF98DCF4B95FDCBFEF2A013E0C46941AEBD0349C6180A4A0372176 = SearchValues.Create(["ULT", "uLT", "ÚLT", "úLT", "UlT", "ulT", "ÚlT", "úlT", "ULt", "uLt", "ÚLt", "úLt", "Ult", "ult", "Últ", "últ"], StringComparison.Ordinal);
     }
 }

"^refs\\/heads\\/tags\\/(.*)|refs\\/heads\\/( ..." (391 uses)

[GeneratedRegex("^refs\\/heads\\/tags\\/(.*)|refs\\/heads\\/(.*)|refs\\/tags\\/(.*)|refs\\/(.*)|origin\\/tags\\/(.*)|origin\\/(.*)$")]

                 private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
                 {
                     int pos = base.runtextpos;
-                    char ch;
                     
                     // Any possible match is at least 5 characters.
                     if (pos <= inputSpan.Length - 5)
                     {
-                        // The pattern matches a character in the set [/i] at index 4.
-                        // Find the next occurrence. If it can't be found, there's no match.
-                        ReadOnlySpan<char> span = inputSpan.Slice(pos);
-                        for (int i = 0; i < span.Length - 4; i++)
+                        // The pattern has multiple strings that could begin the match. Search for any of them.
+                        // If none can be found, there's no match.
+                        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_30C2953D90C322A099E43E6AF5B7857C062C1C20BFC2B2BFB4A15149DFB3AFEF);
+                        if (i >= 0)
                         {
-                            int indexOfPos = span.Slice(i + 4).IndexOfAny('/', 'i');
-                            if (indexOfPos < 0)
-                            {
-                                goto NoMatchFound;
-                            }
-                            i += indexOfPos;
-                            
-                            if ((((ch = span[i + 3]) == 'g') | (ch == 's')) &&
-                                (((ch = span[i + 2]) == 'f') | (ch == 'i')))
-                            {
-                                base.runtextpos = pos + i;
-                                return true;
-                            }
+                            base.runtextpos = pos + i;
+                            return true;
                         }
                     }
                     
                     // No match found.
-                    NoMatchFound:
                     base.runtextpos = inputSpan.Length;
                     return false;
                 }
         
         /// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
         internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
+        
+        /// <summary>Supports searching for the specified strings.</summary>
+        internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_30C2953D90C322A099E43E6AF5B7857C062C1C20BFC2B2BFB4A15149DFB3AFEF = SearchValues.Create(["refs/heads/tags/", "refs/heads/", "refs/tags/", "refs/", "origin/tags/", "origin/"], StringComparison.Ordinal);
     }
 }

"\\b(from).+(to)\\b.+" (316 uses)

[GeneratedRegex("\\b(from).+(to)\\b.+", RegexOptions.IgnoreCase | RegexOptions.Singleline)]

                     // Any possible match is at least 8 characters.
                     if (pos <= inputSpan.Length - 8)
                     {
-                        // The pattern matches a character in the set [Mm] at index 3.
-                        // Find the next occurrence. If it can't be found, there's no match.
-                        ReadOnlySpan<char> span = inputSpan.Slice(pos);
-                        for (int i = 0; i < span.Length - 7; i++)
+                        // The pattern has multiple strings that could begin the match. Search for any of them.
+                        // If none can be found, there's no match.
+                        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_DA0DF7757216159252C4FA00AB5982AAA4403D2C43304873401C53E36F92CA04);
+                        if (i >= 0)
                         {
-                            int indexOfPos = span.Slice(i + 3).IndexOfAny('M', 'm');
-                            if (indexOfPos < 0)
-                            {
-                                goto NoMatchFound;
-                            }
-                            i += indexOfPos;
-                            
-                            if (((span[i] | 0x20) == 'f') &&
-                                ((span[i + 2] | 0x20) == 'o'))
-                            {
-                                base.runtextpos = pos + i;
-                                return true;
-                            }
+                            base.runtextpos = pos + i;
+                            return true;
                         }
                     }
                     
                     // No match found.
-                    NoMatchFound:
                     base.runtextpos = inputSpan.Length;
                     return false;
                 }
             0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
             0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
         };
+        
+        /// <summary>Supports searching for the specified strings.</summary>
+        internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_DA0DF7757216159252C4FA00AB5982AAA4403D2C43304873401C53E36F92CA04 = SearchValues.Create(["FROM", "fROM", "FrOM", "frOM", "FRoM", "fRoM", "FroM", "froM", "FROm", "fROm", "FrOm", "frOm", "FRom", "fRom", "From", "from"], StringComparison.Ordinal);
     }
 }

"\\b(?<year>((1[5-9]|20)\\d{2})|2100)\\b" (309 uses)

[GeneratedRegex("\\b(?<year>((1[5-9]|20)\\d{2})|2100)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)]

                 private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
                 {
                     int pos = base.runtextpos;
-                    char ch;
-                    uint charMinusLowUInt32;
                     
                     // Any possible match is at least 4 characters.
                     if (pos <= inputSpan.Length - 4)
                     {
-                        // The pattern begins with a character in the set [12].
-                        // Find the next occurrence. If it can't be found, there's no match.
-                        ReadOnlySpan<char> span = inputSpan.Slice(pos);
-                        for (int i = 0; i < span.Length - 3; i++)
+                        // The pattern has multiple strings that could begin the match. Search for any of them.
+                        // If none can be found, there's no match.
+                        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_74309B1520D1FC139D75B2BB6987007481E9B777F41E19A028B21AB7FC28BA45);
+                        if (i >= 0)
                         {
-                            int indexOfPos = span.Slice(i).IndexOfAny('1', '2');
-                            if (indexOfPos < 0)
-                            {
-                                goto NoMatchFound;
-                            }
-                            i += indexOfPos;
-                            
-                            // The primary set being searched for was found. 2 more sets will be checked so as
-                            // to minimize the number of places TryMatchAtCurrentPosition is run unnecessarily.
-                            // Make sure they fit in the remainder of the input.
-                            if ((uint)(i + 2) >= (uint)span.Length)
-                            {
-                                goto NoMatchFound;
-                            }
-                            
-                            if (((int)((0xC7C00000U << (short)(charMinusLowUInt32 = (ushort)(span[i + 1] - '0'))) & (charMinusLowUInt32 - 32)) < 0) &&
-                                ((ch = span[i + 2]) < 128 ? char.IsAsciiDigit(ch) : RegexRunner.CharInClass((char)ch, "\0\u0002\u000101\t")))
-                            {
-                                base.runtextpos = pos + i;
-                                return true;
-                            }
+                            base.runtextpos = pos + i;
+                            return true;
                         }
                     }
                     
                     // No match found.
-                    NoMatchFound:
                     base.runtextpos = inputSpan.Length;
                     return false;
                 }
             0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
             0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
         };
+        
+        /// <summary>Supports searching for the specified strings.</summary>
+        internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_74309B1520D1FC139D75B2BB6987007481E9B777F41E19A028B21AB7FC28BA45 = SearchValues.Create(["15", "16", "17", "18", "19", "20", "2100"], StringComparison.Ordinal);
     }
 }

"\\b(et\\s*(le|la(s)?)?)\\b.+" (291 uses)

[GeneratedRegex("\\b(et\\s*(le|la(s)?)?)\\b.+", RegexOptions.IgnoreCase | RegexOptions.Singleline)]

                     // Any possible match is at least 3 characters.
                     if (pos <= inputSpan.Length - 3)
                     {
-                        // The pattern matches a character in the set [Tt] at index 1.
-                        // Find the next occurrence. If it can't be found, there's no match.
-                        ReadOnlySpan<char> span = inputSpan.Slice(pos);
-                        for (int i = 0; i < span.Length - 2; i++)
+                        // The pattern has multiple strings that could begin the match. Search for any of them.
+                        // If none can be found, there's no match.
+                        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_40190A5AE82B92C9577FE9A45CD09B22413116F9859390E6536F6EF2E5085EA1);
+                        if (i >= 0)
                         {
-                            int indexOfPos = span.Slice(i + 1).IndexOfAny('T', 't');
-                            if (indexOfPos < 0)
-                            {
-                                goto NoMatchFound;
-                            }
-                            i += indexOfPos;
-                            
-                            if (((span[i] | 0x20) == 'e'))
-                            {
-                                base.runtextpos = pos + i;
-                                return true;
-                            }
+                            base.runtextpos = pos + i;
+                            return true;
                         }
                     }
                     
                     // No match found.
-                    NoMatchFound:
                     base.runtextpos = inputSpan.Length;
                     return false;
                 }
             0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
             0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
         };
+        
+        /// <summary>Supports searching for the specified strings.</summary>
+        internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_40190A5AE82B92C9577FE9A45CD09B22413116F9859390E6536F6EF2E5085EA1 = SearchValues.Create(["ET", "eT", "Et", "et"], StringComparison.Ordinal);
     }
 }

"\\b(https?://|ftp://|www\\.)[\\w\\d\\._/\\-~ ..." (267 uses)

[GeneratedRegex("\\b(https?://|ftp://|www\\.)[\\w\\d\\._/\\-~%@()+:?&=#!]*[\\w\\d/]")]

                 private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
                 {
                     int pos = base.runtextpos;
-                    char ch;
                     
                     // Any possible match is at least 5 characters.
                     if (pos <= inputSpan.Length - 5)
                     {
-                        // The pattern begins with a character in the set [fhw].
-                        // Find the next occurrence. If it can't be found, there's no match.
-                        ReadOnlySpan<char> span = inputSpan.Slice(pos);
-                        for (int i = 0; i < span.Length - 4; i++)
+                        // The pattern has multiple strings that could begin the match. Search for any of them.
+                        // If none can be found, there's no match.
+                        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_7E253910419F77A121C9FCE57096BB0770898849401D67ABDAF3E17D2F5F21FD);
+                        if (i >= 0)
                         {
-                            int indexOfPos = span.Slice(i).IndexOfAny('f', 'h', 'w');
-                            if (indexOfPos < 0)
-                            {
-                                goto NoMatchFound;
-                            }
-                            i += indexOfPos;
-                            
-                            // The primary set being searched for was found. 2 more sets will be checked so as
-                            // to minimize the number of places TryMatchAtCurrentPosition is run unnecessarily.
-                            // Make sure they fit in the remainder of the input.
-                            if ((uint)(i + 3) >= (uint)span.Length)
-                            {
-                                goto NoMatchFound;
-                            }
-                            
-                            if ((((ch = span[i + 3]) == '.') | (ch == ':') | (ch == 'p')) &&
-                                (((ch = span[i + 1]) == 't') | (ch == 'w')))
-                            {
-                                base.runtextpos = pos + i;
-                                return true;
-                            }
+                            base.runtextpos = pos + i;
+                            return true;
                         }
                     }
                     
                     // No match found.
-                    NoMatchFound:
                     base.runtextpos = inputSpan.Length;
                     return false;
                 }
             0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
             0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
         };
+        
+        /// <summary>Supports searching for the specified strings.</summary>
+        internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_7E253910419F77A121C9FCE57096BB0770898849401D67ABDAF3E17D2F5F21FD = SearchValues.Create(["http", "ftp://", "www."], StringComparison.Ordinal);
     }
 }

"\\b(?<unit>decennio?|ann[oi]|mes[ei]|settima ..." (255 uses)

[GeneratedRegex("\\b(?<unit>decennio?|ann[oi]|mes[ei]|settiman[ae]|giorn[oi])\\b", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]

                 private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
                 {
                     int pos = base.runtextpos;
-                    char ch;
-                    uint charMinusLowUInt32;
                     
                     // Any possible match is at least 4 characters.
                     if (pos <= inputSpan.Length - 4)
                     {
-                        // The pattern begins with a character in the set [adgms].
-                        // Find the next occurrence. If it can't be found, there's no match.
-                        ReadOnlySpan<char> span = inputSpan.Slice(pos);
-                        for (int i = 0; i < span.Length - 3; i++)
+                        // The pattern has multiple strings that could begin the match. Search for any of them.
+                        // If none can be found, there's no match.
+                        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_7F16DACEF575A8087BAF8FAC14BC47F4D2214D513F67C3250A958A2D52D6440A);
+                        if (i >= 0)
                         {
-                            int indexOfPos = span.Slice(i).IndexOfAny(Utilities.s_ascii_92200800);
-                            if (indexOfPos < 0)
-                            {
-                                goto NoMatchFound;
-                            }
-                            i += indexOfPos;
-                            
-                            // The primary set being searched for was found. 2 more sets will be checked so as
-                            // to minimize the number of places TryMatchAtCurrentPosition is run unnecessarily.
-                            // Make sure they fit in the remainder of the input.
-                            if ((uint)(i + 2) >= (uint)span.Length)
-                            {
-                                goto NoMatchFound;
-                            }
-                            
-                            if ((((ch = span[i + 1]) == 'e') | (ch == 'i') | (ch == 'n')) &&
-                                ((int)((0x8018C000U << (short)(charMinusLowUInt32 = (ushort)(span[i + 2] - 'c'))) & (charMinusLowUInt32 - 32)) < 0))
-                            {
-                                base.runtextpos = pos + i;
-                                return true;
-                            }
+                            base.runtextpos = pos + i;
+                            return true;
                         }
                     }
                     
                     // No match found.
-                    NoMatchFound:
                     base.runtextpos = inputSpan.Length;
                     return false;
                 }
             0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
         };
         
-        /// <summary>Supports searching for characters in or not in "adgms".</summary>
-        internal static readonly SearchValues<char> s_ascii_92200800 = SearchValues.Create("adgms");
+        /// <summary>Supports searching for the specified strings.</summary>
+        internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_7F16DACEF575A8087BAF8FAC14BC47F4D2214D513F67C3250A958A2D52D6440A = SearchValues.Create(["decenni", "anni", "anno", "mese", "mesi", "settiman", "giorni", "giorno"], StringComparison.Ordinal);
     }
 }

"(quest['oa]|corrente)" (244 uses)

[GeneratedRegex("(quest['oa]|corrente)", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]

                 private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
                 {
                     int pos = base.runtextpos;
-                    char ch;
                     
                     // Any possible match is at least 6 characters.
                     if (pos <= inputSpan.Length - 6)
                     {
-                        // The pattern begins with a character in the set [cq].
-                        // Find the next occurrence. If it can't be found, there's no match.
-                        ReadOnlySpan<char> span = inputSpan.Slice(pos);
-                        for (int i = 0; i < span.Length - 5; i++)
+                        // The pattern has multiple strings that could begin the match. Search for any of them.
+                        // If none can be found, there's no match.
+                        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_763B6F2775AD886009FE306897080DA0D914A7D65E7AECC2B25D4344D9F8EF63);
+                        if (i >= 0)
                         {
-                            int indexOfPos = span.Slice(i).IndexOfAny('c', 'q');
-                            if (indexOfPos < 0)
-                            {
-                                goto NoMatchFound;
-                            }
-                            i += indexOfPos;
-                            
-                            // The primary set being searched for was found. 2 more sets will be checked so as
-                            // to minimize the number of places TryMatchAtCurrentPosition is run unnecessarily.
-                            // Make sure they fit in the remainder of the input.
-                            if ((uint)(i + 3) >= (uint)span.Length)
-                            {
-                                goto NoMatchFound;
-                            }
-                            
-                            if ((((ch = span[i + 1]) == 'o') | (ch == 'u')) &&
-                                char.IsBetween(span[i + 3], 'r', 's'))
-                            {
-                                base.runtextpos = pos + i;
-                                return true;
-                            }
+                            base.runtextpos = pos + i;
+                            return true;
                         }
                     }
                     
                     // No match found.
-                    NoMatchFound:
                     base.runtextpos = inputSpan.Length;
                     return false;
                 }
         
         /// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
         internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
+        
+        /// <summary>Supports searching for the specified strings.</summary>
+        internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_763B6F2775AD886009FE306897080DA0D914A7D65E7AECC2B25D4344D9F8EF63 = SearchValues.Create(["quest'", "questa", "questo", "corrente"], StringComparison.Ordinal);
     }
 }

"\\b(da(l(l[oae'])?|i|gli)?|tra|fra|entro)(\\ ..." (228 uses)

[GeneratedRegex("\\b(da(l(l[oae'])?|i|gli)?|tra|fra|entro)(\\s+(il|l[aeo']|gli|i))?\\b", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]

                 private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
                 {
                     int pos = base.runtextpos;
-                    uint charMinusLowUInt32;
                     
                     // Any possible match is at least 2 characters.
                     if (pos <= inputSpan.Length - 2)
                     {
-                        // The pattern matches a character in the set [anr] at index 1.
-                        // Find the next occurrence. If it can't be found, there's no match.
-                        ReadOnlySpan<char> span = inputSpan.Slice(pos);
-                        for (int i = 0; i < span.Length - 1; i++)
+                        // The pattern has multiple strings that could begin the match. Search for any of them.
+                        // If none can be found, there's no match.
+                        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_Ordinal_99936EDA919D5A1A2F2418047924F8D2D31078642E1958EC7E8A32CF569E2511);
+                        if (i >= 0)
                         {
-                            int indexOfPos = span.Slice(i + 1).IndexOfAny('a', 'n', 'r');
-                            if (indexOfPos < 0)
-                            {
-                                goto NoMatchFound;
-                            }
-                            i += indexOfPos;
-                            
-                            if (((int)((0xE0008000U << (short)(charMinusLowUInt32 = (ushort)(span[i] - 'd'))) & (charMinusLowUInt32 - 32)) < 0))
-                            {
-                                base.runtextpos = pos + i;
-                                return true;
-                            }
+                            base.runtextpos = pos + i;
+                            return true;
                         }
                     }
                     
                     // No match found.
-                    NoMatchFound:
                     base.runtextpos = inputSpan.Length;
                     return false;
                 }
             0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
             0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
         };
+        
+        /// <summary>Supports searching for the specified strings.</summary>
+        internal static readonly SearchValues<string> s_indexOfAnyStrings_Ordinal_99936EDA919D5A1A2F2418047924F8D2D31078642E1958EC7E8A32CF569E2511 = SearchValues.Create(["da", "tra", "fra", "entro"], StringComparison.Ordinal);
     }
 }

For more diff examples, see https://gist.github.com/MihuBot/e53dc1d081be99a109ec2d2700619645

JIT assembly changes

Total bytes of base: 54227147
Total bytes of diff: 54284087
Total bytes of delta: 56940 (0.11 % of base)
Total relative delta: 972.36
    diff is a regression.
    relative diff is a regression.

For a list of JIT diff regressions, see Regressions.md
For a list of JIT diff improvements, see Improvements.md

Sample source code for further analysis

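using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Text.RegularExpressions;

const string JsonPath = "RegexResults-1788.json";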
if (!File.Exists(JsonPath))
{
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/FHqnss4");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}

MihuBot · 2026-02-24T02:25:22Z

See benchmark results at https://gist.github.com/MihuBot/2bd6f75ed057df7b8fdd4b847f57e444

stephentoub · 2026-02-24T02:26:04Z

Some of the diffs are confusing to me, like why this one wasn't previously already handled by the case-insensitive support:

[GeneratedRegex("\\b(in)\\b", RegexOptions.IgnoreCase | RegexOptions.Singleline)]

MihuBot · 2026-02-24T04:12:09Z

See benchmark results at https://gist.github.com/MihuBot/709e349b42165b2fbd1de51500f56381

danmoseley · 2026-02-24T05:20:15Z

@MihuBot benchmark Regex

danmoseley · 2026-02-24T05:20:38Z

(I think we didn't get MihuBot on x64 for the last commit; I'll run this to compare with ARM64.)

danmoseley · 2026-02-24T05:57:06Z

Below is the AI analysis of the code gen issue, which is correct as far as I can tell.

Question is whether we should explore adding the proposed "fix" to this PR. It seems like this code gen diff should itself be an improvement, albeit inadvertent. These are not patterns from the benchmarks though, so I'll measure one locally.

I spot-checked a bunch of diffs and they all have this pattern, btw, so this one explanation seems to cover them.

===

For \b(in)\b with IgnoreCase | Singleline, the regex tree (after lowering) is:

Capture(0)
  Concatenate
    Boundary          ← \b
    Capture(1)        ← the (in) group
      Concatenate
        Set([Ii])     ← 'i' lowered to case-insensitive set
        Set([Nn])     ← 'n' lowered to case-insensitive set
    Boundary          ← \b

There are two pre-existing gaps that cause this pattern to fall through:

Gap 1: FindPrefixOrdinalCaseInsensitive can't see through Capture groups (line 163)

TryGetOrdinalCaseInsensitiveString (RegexNode.cs:2957) iterates the direct children of the Concatenate. It handles One, Multi, Set, Empty, and zero-width assertions — but when it hits Capture(1), it falls into the else branch (line 3019) and breaks. It never sees the "in" inside the capture. Result: returns null.

Gap 2: FindPrefixes(ignoreCase: true) returns only 1 prefix (line 175)

FindPrefixes can navigate through Capture nodes (line 84-87). It successfully finds the prefix "in" — but returns it as a single-element array ["in"]. The check at line 175 requires { Length: > 1 } (more than 1 string), so it fails.

The > 1 check was designed to exclude single-prefix cases that "should be" handled by FindPrefixOrdinalCaseInsensitive above — but since that also fails (Gap 1), the pattern falls through to FixedDistanceSets entirely.

How the PR "rescues" it

The PR's new code (line 218-220) calls FindPrefixes(root, ignoreCase: false), which case-expands the sets into 4 ordinal variants: ["IN", "iN", "In", "in"]. This passes { Length: > 1 } and uses SearchValues.Create(..., StringComparison.Ordinal).
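
A minimal sketch of the case expansion described above, assuming an ASCII-letter prefix. `ExpandCase`, `FindOrdinalVariants` and `FindIgnoreCase` are hypothetical names for illustration, not the runtime's code. It ignores the `\b` boundaries and only shows the leading-string search. `SearchValues.Create` over strings and the `IndexOfAny` overload taking `SearchValues<string>` need .NET 9 or later.

```csharp
using System;
using System.Buffers;

// Illustration only: these helpers are hypothetical and not part of System.Text.RegularExpressions.
static class LeadingStringsSketch
{
    // Expands an ASCII-letter prefix into every upper/lower case variant,
    // e.g. "in" gives the four strings in, In, iN, IN.
    public static string[] ExpandCase(string prefix)
    {
        int count = 1 << prefix.Length;
        var results = new string[count];
        for (int mask = 0; mask < count; mask++)
        {
            char[] chars = prefix.ToCharArray();
            for (int i = 0; i < chars.Length; i++)
            {
                // Bit i of the mask selects upper case for character i.
                chars[i] = (mask & (1 << i)) != 0
                    ? char.ToUpperInvariant(chars[i])
                    : char.ToLowerInvariant(chars[i]);
            }
            results[mask] = new string(chars);
        }
        return results;
    }

    // The shape the PR ends up with: several ordinal strings searched at once.
    public static int FindOrdinalVariants(ReadOnlySpan<char> input, string prefix)
    {
        SearchValues<string> values = SearchValues.Create(ExpandCase(prefix), StringComparison.Ordinal);
        return input.IndexOfAny(values);
    }

    // The shape Gap 1 would allow if fixed: one string, compared case-insensitively.
    public static int FindIgnoreCase(ReadOnlySpan<char> input, string prefix) =>
        input.IndexOf(prefix.AsSpan(), StringComparison.OrdinalIgnoreCase);
}
```

Both helpers report the same starting index for ASCII input; which one is faster depends on the text, so treat this as a way to reproduce the two shapes rather than as a benchmark.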

Stephen's point

This works but is suboptimal — ideally Gap 1 should be fixed so TryGetOrdinalCaseInsensitiveString descends through Capture nodes (same as it already handles zero-width assertions). Then the pattern would use LeadingString_OrdinalIgnoreCase_LeftToRight with a single "in" string and OrdinalIgnoreCase comparison, which is cleaner and likely faster than searching for 4 ordinal variants.

The fix would be a one-line addition around RegexNode.cs:3014:

else if (child.Kind is RegexNodeKind.Capture)
{
    // Descend into capture group to find the string inside
    // (similar to how FindPrefixesCore handles Capture)
}

...though that would require restructuring since the method iterates children flat rather than recursing.
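
To make the flat-versus-recursive distinction concrete, here is a toy model of the tree shown earlier. `Kind`, `Node` and `PrefixSketch` are hypothetical stand-ins, not the real `RegexNode` API, and the sketch assumes every `Set` is a two-case letter set such as `[Ii]`.

```csharp
#nullable enable
using System;
using System.Text;

// Toy model only: none of these types exist in System.Text.RegularExpressions.
enum Kind { Concatenate, Capture, Boundary, Set, Other }

sealed class Node
{
    public Kind Kind { get; }
    public char Letter { get; }      // for Kind.Set: the letter of a set such as [Ii]
    public Node[] Children { get; }

    public Node(Kind kind, char letter, params Node[] children)
    {
        Kind = kind;
        Letter = letter;
        Children = children;
    }
}

static class PrefixSketch
{
    public static Node Set(char letter) => new Node(Kind.Set, letter);
    public static Node Group(Kind kind, params Node[] children) => new Node(kind, '\0', children);

    // Flat scan over direct children: a Capture ends the scan (Gap 1).
    public static string? FlatPrefix(Node concat)
    {
        var sb = new StringBuilder();
        foreach (Node child in concat.Children)
        {
            if (child.Kind == Kind.Boundary) continue;   // zero-width, skip it
            if (child.Kind != Kind.Set) break;           // Capture lands here
            sb.Append(child.Letter);
        }
        return sb.Length > 0 ? sb.ToString() : null;
    }

    // Recursive scan: descends through Capture and Concatenate nodes.
    public static string? RecursivePrefix(Node root)
    {
        var sb = new StringBuilder();
        Append(root, sb);
        return sb.Length > 0 ? sb.ToString() : null;
    }

    // Returns false when the scan has to stop at an unsupported node.
    private static bool Append(Node node, StringBuilder sb)
    {
        switch (node.Kind)
        {
            case Kind.Boundary:
                return true;
            case Kind.Set:
                sb.Append(node.Letter);
                return true;
            case Kind.Capture:
            case Kind.Concatenate:
                foreach (Node child in node.Children)
                {
                    if (!Append(child, sb)) return false;
                }
                return true;
            default:
                return false;
        }
    }

    // The inner Concatenate of \b(in)\b: Boundary, Capture(Concatenate(Set i, Set n)), Boundary.
    public static Node BuildExample() =>
        Group(Kind.Concatenate,
            Group(Kind.Boundary),
            Group(Kind.Capture, Group(Kind.Concatenate, Set('i'), Set('n'))),
            Group(Kind.Boundary));
}
```

On the example tree the flat scan finds nothing and the recursive scan recovers "in", which is the restructuring the note above alludes to.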

danmoseley · 2026-02-24T06:05:34Z

Looks like for this example pattern it's the same or an improvement (of course it depends on the particular text). I guess searching for in|In|iN|IN is faster than searching for n and each time backing up to check for i. In the text I used, half the n's weren't preceded by i.

Benchmark	Baseline (main)	PR	Ratio	Verdict
`\b(in)\b` IgnoreCase	4,557 ns	4,241 ns	0.93	~7% faster
`\bin\b` IgnoreCase (no capture)	2,649 ns	2,731 ns	1.03	Same
`\b(from).+(to)\b.+` IgnoreCase	79.9 ns	79.7 ns	1.00	Same
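
For reference, the baseline strategy mentioned in this comment (find `[Nn]` one position in, then back up to check for `[Ii]`) can be sketched as below. It is modeled on the removed lines of the `[Tt]` diff earlier in this thread; `FixedDistanceSketch` and `FindCandidate` are hypothetical names, not the generated code itself.

```csharp
using System;

// Illustration only: mirrors the shape of the removed FixedDistanceSets loop.
static class FixedDistanceSketch
{
    // Returns the index of the first position where [Ii] is followed by [Nn], or -1.
    public static int FindCandidate(ReadOnlySpan<char> span)
    {
        for (int i = 0; i < span.Length - 1; i++)
        {
            // Vectorized search for the set at offset 1 of the pattern.
            int indexOfPos = span.Slice(i + 1).IndexOfAny('N', 'n');
            if (indexOfPos < 0)
            {
                return -1;
            }
            i += indexOfPos;   // i now points at the character before the found 'n'

            // Back up one character: is it 'i' or 'I'?
            if ((span[i] | 0x20) == 'i')
            {
                return i;
            }
            // Otherwise this 'n' was a false candidate; keep scanning after it.
        }
        return -1;
    }
}
```

Every `n` that is not preceded by `i` costs one extra trip through the loop, which is the overhead the multi-string search avoids.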

I think I need guidance on whether I should pursue fixing this in this PR. Either way, we'd have a diff: it would just be a code improvement to code not changed in this PR.

-- Dan

danmoseley · 2026-02-24T06:08:14Z

@MihuBot benchmark Regex https://github.com/MihaZupan/performance/tree/compiled-regex-only -medium -arm

MihuBot · 2026-02-24T06:56:55Z

See benchmark results at https://gist.github.com/MihuBot/1bd2835a353c7b0b1f70f8caef9848e8

MihuBot · 2026-02-24T09:31:26Z

See benchmark results at https://gist.github.com/MihuBot/8d26ffeae5729831b9bbdcbf1fc3c316

stephentoub · 2026-02-24T15:24:27Z

> I think I need guidance on whether I should pursue fixing this in this PR.

No, we can do it separately

danmoseley · 2026-02-24T15:32:41Z

OK, AI analysis of the mihubot numbers on the final commit, comparing the two architectures:

MihuBot Benchmark Summary — Both Architectures on Latest Commit (`21a68d75`)

x64: AMD EPYC 9V74 (gist)
ARM64: Neoverse-N2 (gist)

All results below are Compiled mode. Only benchmarks with significant change (ratio ≤ 0.95) on at least one architecture are shown. Everything else is ~1.00. No regressions on either architecture.

Benchmark	x64 Ratio	ARM64 Ratio	x64 Speedup	ARM64 Speedup
Leipzig `Tom.{10,25}river\|river.{10,25}Tom`	0.18	0.48	~5.6x	~2.1x
Common `ReplaceWords`	0.46	0.76	~2.2x	~1.3x
Common `SplitWords`	0.47	0.77	~2.1x	~1.3x
Common `MatchWord`	0.51	0.71	~2.0x	~1.4x
Common `MatchesWords`	0.52	0.76	~1.9x	~1.3x
RegexRedux_1	0.55	0.66	~1.8x	~1.5x
RegexRedux_5	0.74	0.94	~1.4x	~1.1x
Common `ReplaceWords` (IgnoreCase)	0.80	1.00	~1.3x	—

x64 (AVX2, 256-bit) shows larger improvements than ARM64 (NEON, 128-bit) on every row, as expected for the Teddy multi-string SIMD search algorithm. The x64 advantage ranges from about 1.2x (RegexRedux_1) to about 2.7x (Leipzig). The Leipzig pattern sees the biggest gap: 5.6x on x64 vs 2.1x on ARM64.
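
The speedup columns follow from the ratio columns as 1 / ratio, and the gap between architectures is the ARM64 ratio divided by the x64 ratio. A small sketch for rechecking the table; `SpeedupSketch` and its method names are hypothetical.

```csharp
using System;

// Illustration only: arithmetic for rechecking the table above.
static class SpeedupSketch
{
    // Ratio is PR time / baseline time, so the speedup is its reciprocal.
    public static double Speedup(double ratio) => 1.0 / ratio;

    // How many times larger the x64 speedup is than the ARM64 speedup.
    public static double X64Advantage(double x64Ratio, double arm64Ratio) => arm64Ratio / x64Ratio;
}
```

For the Leipzig row, `Speedup(0.18)` rounds to 5.6 and `Speedup(0.48)` rounds to 2.1, matching the table, and `X64Advantage(0.18, 0.48)` rounds to 2.7.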

danmoseley · 2026-02-24T16:13:15Z

I think we're in good shape now and everything's addressed? I can follow up on the code diff issue mentioned above once this is merged.

.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs

stephentoub

Thanks.

danmoseley · 2026-02-24T22:40:14Z

@MihuBot benchmark Regex https://github.com/MihaZupan/performance/tree/compiled-regex-only

danmoseley · 2026-02-24T22:45:49Z

I'll merge once the bot confirms there are no regressions, which I already confirmed locally anyway: zero changes since the last commit.

MihuBot · 2026-02-24T23:31:21Z

See benchmark results at https://gist.github.com/MihuBot/2d25bc04c96cbc8927976c614bdf1ebb

danmoseley · 2026-02-24T23:56:11Z

MihuBot shows 4 regressions, but running (again) locally shows they're noise of some sort:

Benchmark	Options	PR-latest (ns)	Negated-fix (ns)	Ratio
`MatchesSet`	Compiled	25,294	25,205	1.00
`MatchesSet`	IgnoreCase, Compiled	25,332	24,635	0.97
`MatchesWords`	Compiled	1,039	1,022	0.98
`MatchesWords`	IgnoreCase, Compiled	1,137	1,136	1.00

The table above compares the second-to-last commit with the last commit.

Benchmark	Options	Main (ns)	PR latest (ns)	Ratio
`MatchesSet`	Compiled	24,558	25,040	1.02
`MatchesSet`	IgnoreCase, Compiled	25,045	24,896	0.99
`MatchesWords`	Compiled	1,779	960	0.54 ✅
`MatchesWords`	IgnoreCase, Compiled	1,161	1,117	0.96

This one is base vs. the latest commit. The MihuBot regressions are noise. Good to merge.

Copilot AI review requested due to automatic review settings February 23, 2026 02:08

github-actions bot added the area-System.Text.RegularExpressions label Feb 23, 2026

dotnet-policy-service bot assigned danmoseley Feb 23, 2026

Copilot started reviewing on behalf of danmoseley February 23, 2026 02:09

danmoseley mentioned this pull request Feb 23, 2026

Add LeadingStrings benchmarks for binary and non-ASCII regex patterns dotnet/performance#5126

Merged

MihuBot mentioned this pull request Feb 23, 2026

[Benchmark X64] [danmoseley] Enable case-sensitive LeadingStrings with frequ ... MihuBot/runtime-utils#1779

Open

Copilot AI reviewed Feb 23, 2026

stephentoub reviewed Feb 23, 2026

.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs Outdated

danmoseley force-pushed the regex-redux/leading-strings-frequency branch from fc2a7e2 to eb39721 Compare February 23, 2026 02:31

stephentoub reviewed Feb 23, 2026

.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs Outdated

stephentoub reviewed Feb 23, 2026

.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs Outdated

danmoseley force-pushed the regex-redux/leading-strings-frequency branch from eb39721 to e877181 Compare February 23, 2026 04:19

Copilot AI review requested due to automatic review settings February 23, 2026 04:19

Copilot AI reviewed Feb 23, 2026

danmoseley force-pushed the regex-redux/leading-strings-frequency branch 2 times, most recently from da12aa4 to 3744268 Compare February 23, 2026 04:24

Copilot AI review requested due to automatic review settings February 23, 2026 04:24

Copilot started reviewing on behalf of danmoseley February 23, 2026 04:25

Copilot AI reviewed Feb 23, 2026

MihuBot mentioned this pull request Feb 23, 2026

[Benchmark X64] [danmoseley] Enable case-sensitive LeadingStrings with frequ ... MihuBot/runtime-utils#1780

Open

danmoseley mentioned this pull request Feb 23, 2026

Enable case-sensitive LeadingStrings with frequency-based heuristic danmoseley/runtime#31

Closed

build-analysis bot mentioned this pull request Feb 23, 2026

[android][clr] No peer certificates when executing System.Net.Http.Functional.Tests on Android emulator #124526

Open

MihuBot mentioned this pull request Feb 24, 2026

[Benchmark X64] [danmoseley] Enable case-sensitive LeadingStrings with frequ ... MihuBot/runtime-utils#1789

Open

MihuBot mentioned this pull request Feb 24, 2026

[Benchmark ARM64] [danmoseley] Enable case-sensitive LeadingStrings with fre ... MihuBot/runtime-utils#1790

Open

stephentoub reviewed Feb 24, 2026

.../System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs Outdated

Fix negation check

264d8ef

stephentoub approved these changes Feb 24, 2026

MihuBot mentioned this pull request Feb 24, 2026

[Benchmark X64] [danmoseley] Enable case-sensitive LeadingStrings with frequ ... MihuBot/runtime-utils#1791

Open

danmoseley mentioned this pull request Feb 24, 2026

Handle Capture nodes in TryGetOrdinalCaseInsensitiveString danmoseley/runtime#33

Closed

danmoseley enabled auto-merge (squash) February 24, 2026 23:56

danmoseley merged commit b613202 into dotnet:main Feb 25, 2026
88 of 90 checks passed

danmoseley mentioned this pull request Feb 25, 2026

Handle Capture nodes in TryGetOrdinalCaseInsensitiveString #124842

Open

dotnet-maestro bot mentioned this pull request Feb 25, 2026

[main] Source code updates from dotnet/runtime dotnet/dotnet#5088

Merged

danmoseley deleted the regex-redux/leading-strings-frequency branch February 26, 2026 03:19
