Add LeadingStrings benchmarks for binary and non-ASCII regex patterns by danmoseley · Pull Request #5126 · dotnet/performance

danmoseley · 2026-02-23T02:09:33Z

Adds two new benchmark classes to exercise the LeadingStrings vs FixedDistanceSets heuristic in the regex engine:

Perf_Regex_LeadingStrings_BinaryData: 1MB binary corpus (PE-header-like seed duplicated), alternation of binary patterns. Validates no regression on non-text input. (Lots of ASCII here, but obviously not English frequencies.)
Perf_Regex_LeadingStrings_NonAscii: ~100KB Russian text (Anna Karenina opening), alternation of Russian words. Validates no regression on non-ASCII text where the frequency heuristic bails out.

Companion to dotnet/runtime change: dotnet/runtime#124736

Copilot

Pull request overview

This PR adds two new benchmark classes to test regex alternation pattern performance with different input types. The benchmarks are designed to validate that the LeadingStrings vs FixedDistanceSets heuristic in the regex engine doesn't regress on binary and non-ASCII data.

Changes:

Added Perf_Regex_LeadingStrings_BinaryData class with benchmarks for binary data patterns (1MB PE-header-like corpus)
Added Perf_Regex_LeadingStrings_NonAscii class with benchmarks for Russian text patterns (~100KB from Anna Karenina)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.LeadingStrings.cs

Co-authored-by: Copilot <[email protected]>

Remove Perf_Regex_LeadingStrings_BinaryData, keeping only the non-ASCII benchmark. Co-authored-by: Copilot <[email protected]>

Port curated/01-literal and curated/02-literal-alternate benchmarks from https://github.com/BurntSushi/rebar for Russian and Chinese text. Haystacks are OpenSubtitles data (https://opus.nlpl.eu/OpenSubtitles-v2018.php). This replaces the ad-hoc NonAscii benchmark with well-established cross-engine regex benchmarks that test literal and alternation search on non-ASCII text. Co-authored-by: Copilot <[email protected]>

danmoseley · 2026-02-24T05:08:17Z

Benchmark results comparing main (base) vs 21a68d75 (test: "Skip frequency heuristic for single-char FixedDistanceSets")

Built from dotnet/runtime repo. Mann-Whitney U test, two-sided.

Lang	Pattern	Options	Base(us)	Test(us)	Ratio	p-value	Sig
Chinese	literal	Compiled	25.45	25.80	1.01	0.490
Chinese	literal	NonBacktracking	24.27	14.94	0.62	<0.001	***
Chinese	literal	None	21.92	23.07	1.05	0.001	**
Chinese	alternation	Compiled	25.13	27.31	1.09	0.044	*
Chinese	alternation	NonBacktracking	31.03	28.87	0.93	0.378
Chinese	alternation	None	45.93	44.64	0.97	0.023	*
Russian	(?i) literal	Compiled	110.39	110.95	1.00	0.322
Russian	(?i) literal	NonBacktracking	161.59	160.01	0.99	0.106
Russian	(?i) literal	None	190.67	191.38	1.00	0.120
Russian	(?i) alternation	Compiled	1,126.60	1,152.90	1.02	0.044	*
Russian	(?i) alternation	NonBacktracking	7,629.45	7,767.19	1.02	<0.001	***
Russian	(?i) alternation	None	11,849.64	11,796.66	1.00	0.071
Russian	literal	Compiled	48.79	49.88	1.02	0.543
Russian	literal	NonBacktracking	81.00	88.57	1.09	0.102
Russian	literal	None	58.27	57.69	0.99	0.867
Russian	alternation	Compiled	1,214.60	1,228.94	1.01	0.086
Russian	alternation	NonBacktracking	2,331.08	3,208.26	1.38	<0.001	***
Russian	alternation	None	3,074.01	3,164.35	1.03	0.017	*

* p<0.05, ** p<0.01, *** p<0.001

danmoseley · 2026-02-24T05:11:02Z

Hmm, speedups are nice except I wouldn't expect any as this is non ASCII

danmoseley · 2026-02-24T05:34:40Z

Reran with more iterations and the tests are stable and at parity. The issue is some of them are super short (a few MB all in CPU cache doing IndexOf with SIMD) so I needed to do more iterations, which I guess the perf lab would. Merging.

@stephentoub

…124736) I thought it would be interesting to see whether AI could take another look at the [commented out search strategy](https://github.com/dotnet/runtime/blob/99b76018b6e4/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs#L152-L161) originally introduced by @stephentoub in #98791 to see whether we can enable it and keep the wins without the regressions that caused it to be commented out. AI tried various experiments, and got to a dead end. I recalled the frequency table approach that Ripgrep uses (credit to @BurntSushi). Turns out that fixes the regressions entirely. This means our engine now has assumptions built in about char frequencies in ASCII (only) text. That's an approach that's been proven in ripgrep, one of the fastest engines, for 10 years, and it turns out to work OK for regex-redux as well because a, c, g, t are relatively high frequency in English anyway. Code unchanged if pattern has anything other than ASCII (see benchmark results below). This gives us a nice win on regex-redux, a few other wins in existing tests, and no regressions. Note: a char frequency table already existed in `RegexPrefixAnalyzer.cs` for ranking which fixed-distance character sets are most selective. Our new table serves a different purpose: deciding whether to use `LeadingStrings` vs `FixedDistanceSets` at all. The two are complementary. ==== When a regex has multiple alternation prefixes (e.g. `a|b|c|...`), this change decides whether to use `SearchValues<string>` (Teddy/Aho-Corasick) or fall through to `FixedDistanceSets` (`IndexOfAny`) based on the frequency of the starting characters. High-frequency starters (common letters like lowercase vowels) benefit from multi-string search; low-frequency starters (uppercase, digits, rare consonants) are already excellent `IndexOfAny` filters. Non-ASCII starters bail out (no frequency data), preserving baseline behavior. ## Benchmark results (444 benchmarks, BDN A/B with --statisticalTest 3ms) | Benchmark | Baseline | PR | Ratio | Verdict | |-----------|----------|-----|-------|---------| | RegexRedux_1 (Compiled) | 25.77ms | 14.27ms | **1.81x faster** | Faster | | Leipzig Tom.*river (Compiled) | 6.13ms | 1.87ms | **3.28x faster** | Faster | | RegexRedux_5 (Compiled) | 2.83ms | 2.35ms | **1.20x faster** | Faster | | Sherlock, BinaryData, BoostDocs, Mariomkas, SliceSlice | -- | -- | -- | Same | | LeadingStrings_NonAscii (all variants) | -- | -- | -- | Same | | LeadingStrings_BinaryData (all variants) | -- | -- | -- | Same | Leipzig win is because the pattern is `Tom.{10,25}river|river.{10,25}Tom` so there is a short prefix that is common in the text; with this change it notices `r` is common and `T` fairly common in English text, so it switches to `SearchValues` which looks for `Tom` and `river` simultaneously, causing far fewer false starts. regex-redux win is because it was previously looking for short, very common prefixes naively, and now (specifically because the pattern chars are common) it changed to use `SearchValues` (ie Aho-Corasick/Teddy) to search for the longer strings simultaneously. No regressions detected. All MannWhitney tests report Same for non-improved benchmarks. ## Key design decisions - **Frequency table**: First 128 entries of Rust's `BYTE_FREQUENCIES` from @BurntSushi's aho-corasick crate - **Threshold**: Average rank >= 200 triggers `LeadingStrings`; below 200 falls through to `FixedDistanceSets` - **Non-ASCII**: Returns false (no frequency data), so the heuristic does not engage and behavior is unchanged Companion benchmarks: dotnet/performance#5126 ## New benchmark results (not yet in dotnet/performance, won't be picked up by PR bot) These benchmarks are from the companion PR dotnet/performance#5126. ``` BenchmarkDotNet v0.16.0-custom.20260127.101, Windows 11 (10.0.26100.7840/24H2/2024Update/HudsonValley) Intel Core i9-14900K 3.20GHz, 1 CPU, 32 logical and 24 physical cores ``` | Benchmark | Options | Baseline | PR | Ratio | MannWhitney(3ms) | |-----------|---------|----------|-----|-------|------------------| | LeadingStrings_BinaryData | None | 4,483 us | 4,365 us | 0.97 | Same | | LeadingStrings_BinaryData | Compiled | 2,188 us | 2,184 us | 1.00 | Same | | LeadingStrings_BinaryData | NonBacktracking | 3,734 us | 3,725 us | 1.00 | Same | | LeadingStrings_NonAscii Count | None | 913 us | 956 us | 1.05 | Same | | LeadingStrings_NonAscii Count | Compiled | 244 us | 243 us | 1.00 | Same | | LeadingStrings_NonAscii CountIgnoreCase | None | 1,758 us | 1,714 us | 0.98 | Same | | LeadingStrings_NonAscii CountIgnoreCase | Compiled | 258 us | 250 us | 0.97 | Same | | LeadingStrings_NonAscii Count | NonBacktracking | 392 us | 398 us | 1.02 | Same | | LeadingStrings_NonAscii CountIgnoreCase | NonBacktracking | 409 us | 431 us | 1.05 | Same | Binary didn't regress even though it's ASCII with non English frequencies because the pattern has chars that are not particularly common in English, so it uses the old codepath. It's hard to hypothesize about what one might search for in a binary file; searching for a bunch of leading lower case ASCII chars might regress somewhat. We've never particularly tested on binary before, and I don't recall seeing any bugs mentioning binary, so I don't think this is particularly interesting. NonASCII didn't regress since as previously mentioned, the non ASCII leading chars in the pattern (presumably likely for any searching of non ASCII text) causes it to choose the existing codepath. All MannWhitney tests report Same -- no regressions on binary or non-ASCII input. --------- Co-authored-by: Copilot <[email protected]>

danmoseley mentioned this pull request Feb 23, 2026

Enable case-sensitive LeadingStrings with frequency-based heuristic dotnet/runtime#124736

Merged

danmoseley requested a review from Copilot February 23, 2026 04:41

Copilot started reviewing on behalf of danmoseley February 23, 2026 04:42 View session

Copilot AI reviewed Feb 23, 2026

View reviewed changes

src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.LeadingStrings.cs Outdated Show resolved Hide resolved

danmoseley requested a review from stephentoub February 24, 2026 00:09

stephentoub reviewed Feb 24, 2026

View reviewed changes

src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.LeadingStrings.cs Outdated Show resolved Hide resolved

stephentoub reviewed Feb 24, 2026

View reviewed changes

src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.LeadingStrings.cs Outdated Show resolved Hide resolved

danmoseley and others added 3 commits February 23, 2026 20:26

Add LeadingStrings benchmarks for binary and non-ASCII regex patterns

56770ad

Co-authored-by: Copilot <[email protected]>

Remove binary data benchmark class

9b4a3a0

Remove Perf_Regex_LeadingStrings_BinaryData, keeping only the non-ASCII benchmark. Co-authored-by: Copilot <[email protected]>

danmoseley force-pushed the regex-redux/benchmarks branch from 3a03edd to a3f6116 Compare February 24, 2026 05:06

stephentoub approved these changes Feb 24, 2026

View reviewed changes

danmoseley merged commit 24575dc into dotnet:main Feb 24, 2026
74 checks passed

danmoseley deleted the regex-redux/benchmarks branch February 24, 2026 05:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add LeadingStrings benchmarks for binary and non-ASCII regex patterns#5126

Add LeadingStrings benchmarks for binary and non-ASCII regex patterns#5126
danmoseley merged 3 commits intodotnet:mainfrom
danmoseley:regex-redux/benchmarks

danmoseley commented Feb 23, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

danmoseley commented Feb 24, 2026

Uh oh!

danmoseley commented Feb 24, 2026

Uh oh!

danmoseley commented Feb 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

danmoseley commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

danmoseley commented Feb 24, 2026

Uh oh!

danmoseley commented Feb 24, 2026

Uh oh!

danmoseley commented Feb 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

danmoseley commented Feb 23, 2026 •

edited

Loading