Skip to content

Add LeadingStrings benchmarks for binary and non-ASCII regex patterns#5126

Merged
danmoseley merged 3 commits intodotnet:mainfrom
danmoseley:regex-redux/benchmarks
Feb 24, 2026
Merged

Add LeadingStrings benchmarks for binary and non-ASCII regex patterns#5126
danmoseley merged 3 commits intodotnet:mainfrom
danmoseley:regex-redux/benchmarks

Conversation

@danmoseley
Copy link
Member

@danmoseley danmoseley commented Feb 23, 2026

Adds two new benchmark classes to exercise the LeadingStrings vs FixedDistanceSets heuristic in the regex engine:

  • Perf_Regex_LeadingStrings_BinaryData: 1MB binary corpus (PE-header-like seed duplicated), alternation of binary patterns. Validates no regression on non-text input. (Lots of ASCII here, but obviously not English frequencies.)
  • Perf_Regex_LeadingStrings_NonAscii: ~100KB Russian text (Anna Karenina opening), alternation of Russian words. Validates no regression on non-ASCII text where the frequency heuristic bails out.

Companion to dotnet/runtime change: dotnet/runtime#124736

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds two new benchmark classes to test regex alternation pattern performance with different input types. The benchmarks are designed to validate that the LeadingStrings vs FixedDistanceSets heuristic in the regex engine doesn't regress on binary and non-ASCII data.

Changes:

  • Added Perf_Regex_LeadingStrings_BinaryData class with benchmarks for binary data patterns (1MB PE-header-like corpus)
  • Added Perf_Regex_LeadingStrings_NonAscii class with benchmarks for Russian text patterns (~100KB from Anna Karenina)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

danmoseley and others added 3 commits February 23, 2026 20:26
Remove Perf_Regex_LeadingStrings_BinaryData, keeping only the non-ASCII benchmark.

Co-authored-by: Copilot <[email protected]>
Port curated/01-literal and curated/02-literal-alternate benchmarks from
https://github.com/BurntSushi/rebar for Russian and Chinese text.
Haystacks are OpenSubtitles data (https://opus.nlpl.eu/OpenSubtitles-v2018.php).

This replaces the ad-hoc NonAscii benchmark with well-established cross-engine
regex benchmarks that test literal and alternation search on non-ASCII text.

Co-authored-by: Copilot <[email protected]>
@danmoseley danmoseley force-pushed the regex-redux/benchmarks branch from 3a03edd to a3f6116 Compare February 24, 2026 05:06
@danmoseley
Copy link
Member Author

Benchmark results comparing main (base) vs 21a68d75 (test: "Skip frequency heuristic for single-char FixedDistanceSets")

Built from dotnet/runtime repo. Mann-Whitney U test, two-sided.

Lang Pattern Options Base(us) Test(us) Ratio p-value Sig
Chinese literal Compiled 25.45 25.80 1.01 0.490
Chinese literal NonBacktracking 24.27 14.94 0.62 <0.001 ***
Chinese literal None 21.92 23.07 1.05 0.001 **
Chinese alternation Compiled 25.13 27.31 1.09 0.044 *
Chinese alternation NonBacktracking 31.03 28.87 0.93 0.378
Chinese alternation None 45.93 44.64 0.97 0.023 *
Russian (?i) literal Compiled 110.39 110.95 1.00 0.322
Russian (?i) literal NonBacktracking 161.59 160.01 0.99 0.106
Russian (?i) literal None 190.67 191.38 1.00 0.120
Russian (?i) alternation Compiled 1,126.60 1,152.90 1.02 0.044 *
Russian (?i) alternation NonBacktracking 7,629.45 7,767.19 1.02 <0.001 ***
Russian (?i) alternation None 11,849.64 11,796.66 1.00 0.071
Russian literal Compiled 48.79 49.88 1.02 0.543
Russian literal NonBacktracking 81.00 88.57 1.09 0.102
Russian literal None 58.27 57.69 0.99 0.867
Russian alternation Compiled 1,214.60 1,228.94 1.01 0.086
Russian alternation NonBacktracking 2,331.08 3,208.26 1.38 <0.001 ***
Russian alternation None 3,074.01 3,164.35 1.03 0.017 *

* p<0.05, ** p<0.01, *** p<0.001

@danmoseley
Copy link
Member Author

Hmm, speedups are nice except I wouldn't expect any as this is non ASCII

@danmoseley
Copy link
Member Author

Reran with more iterations and the tests are stable and at parity. The issue is some of them are super short (a few MB all in CPU cache doing IndexOf with SIMD) so I needed to do more iterations, which I guess the perf lab would. Merging.

@danmoseley danmoseley merged commit 24575dc into dotnet:main Feb 24, 2026
74 checks passed
@danmoseley danmoseley deleted the regex-redux/benchmarks branch February 24, 2026 05:58
danmoseley added a commit to dotnet/runtime that referenced this pull request Feb 25, 2026
…124736)

I thought it would be interesting to see whether AI could take another
look at the [commented out search
strategy](https://github.com/dotnet/runtime/blob/99b76018b6e4/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs#L152-L161)
originally introduced by @stephentoub in #98791 to see whether we can
enable it and keep the wins without the regressions that caused it to be
commented out.

AI tried various experiments, and got to a dead end. I recalled the
frequency table approach that Ripgrep uses (credit to @BurntSushi).
Turns out that fixes the regressions entirely. This means our engine now
has assumptions built in about char frequencies in ASCII (only) text.
That's an approach that's been proven in ripgrep, one of the fastest
engines, for 10 years, and it turns out to work OK for regex-redux as
well because a, c, g, t are relatively high frequency in English anyway.
Code unchanged if pattern has anything other than ASCII (see benchmark
results below).

This gives us a nice win on regex-redux, a few other wins in existing
tests, and no regressions.

Note: a char frequency table already existed in `RegexPrefixAnalyzer.cs`
for ranking which fixed-distance character sets are most selective. Our
new table serves a different purpose: deciding whether to use
`LeadingStrings` vs `FixedDistanceSets` at all. The two are
complementary.

====

When a regex has multiple alternation prefixes (e.g. `a|b|c|...`), this
change decides whether to use `SearchValues<string>`
(Teddy/Aho-Corasick) or fall through to `FixedDistanceSets`
(`IndexOfAny`) based on the frequency of the starting characters.

High-frequency starters (common letters like lowercase vowels) benefit
from multi-string search; low-frequency starters (uppercase, digits,
rare consonants) are already excellent `IndexOfAny` filters. Non-ASCII
starters bail out (no frequency data), preserving baseline behavior.

## Benchmark results (444 benchmarks, BDN A/B with --statisticalTest
3ms)

| Benchmark | Baseline | PR | Ratio | Verdict |
|-----------|----------|-----|-------|---------|
| RegexRedux_1 (Compiled) | 25.77ms | 14.27ms | **1.81x faster** |
Faster |
| Leipzig Tom.*river (Compiled) | 6.13ms | 1.87ms | **3.28x faster** |
Faster |
| RegexRedux_5 (Compiled) | 2.83ms | 2.35ms | **1.20x faster** | Faster
|
| Sherlock, BinaryData, BoostDocs, Mariomkas, SliceSlice | -- | -- | --
| Same |
| LeadingStrings_NonAscii (all variants) | -- | -- | -- | Same |
| LeadingStrings_BinaryData (all variants) | -- | -- | -- | Same |

Leipzig win is because the pattern is
`Tom.{10,25}river|river.{10,25}Tom` so there is a short prefix that is
common in the text; with this change it notices `r` is common and `T`
fairly common in English text, so it switches to `SearchValues` which
looks for `Tom` and `river` simultaneously, causing far fewer false
starts.

regex-redux win is because it was previously looking for short, very
common prefixes naively, and now (specifically because the pattern chars
are common) it changed to use `SearchValues` (ie Aho-Corasick/Teddy) to
search for the longer strings simultaneously.

No regressions detected. All MannWhitney tests report Same for
non-improved benchmarks.

## Key design decisions

- **Frequency table**: First 128 entries of Rust's `BYTE_FREQUENCIES`
from @BurntSushi's aho-corasick crate
- **Threshold**: Average rank >= 200 triggers `LeadingStrings`; below
200 falls through to `FixedDistanceSets`
- **Non-ASCII**: Returns false (no frequency data), so the heuristic
does not engage and behavior is unchanged

Companion benchmarks: dotnet/performance#5126

## New benchmark results (not yet in dotnet/performance, won't be picked
up by PR bot)

These benchmarks are from the companion PR
dotnet/performance#5126.

```
BenchmarkDotNet v0.16.0-custom.20260127.101, Windows 11 (10.0.26100.7840/24H2/2024Update/HudsonValley)
Intel Core i9-14900K 3.20GHz, 1 CPU, 32 logical and 24 physical cores
```

| Benchmark | Options | Baseline | PR | Ratio | MannWhitney(3ms) |
|-----------|---------|----------|-----|-------|------------------|
| LeadingStrings_BinaryData | None | 4,483 us | 4,365 us | 0.97 | Same |
| LeadingStrings_BinaryData | Compiled | 2,188 us | 2,184 us | 1.00 |
Same |
| LeadingStrings_BinaryData | NonBacktracking | 3,734 us | 3,725 us |
1.00 | Same |
| LeadingStrings_NonAscii Count | None | 913 us | 956 us | 1.05 | Same |
| LeadingStrings_NonAscii Count | Compiled | 244 us | 243 us | 1.00 |
Same |
| LeadingStrings_NonAscii CountIgnoreCase | None | 1,758 us | 1,714 us |
0.98 | Same |
| LeadingStrings_NonAscii CountIgnoreCase | Compiled | 258 us | 250 us |
0.97 | Same |
| LeadingStrings_NonAscii Count | NonBacktracking | 392 us | 398 us |
1.02 | Same |
| LeadingStrings_NonAscii CountIgnoreCase | NonBacktracking | 409 us |
431 us | 1.05 | Same |

Binary didn't regress even though it's ASCII with non English
frequencies because the pattern has chars that are not particularly
common in English, so it uses the old codepath. It's hard to hypothesize
about what one might search for in a binary file; searching for a bunch
of leading lower case ASCII chars might regress somewhat. We've never
particularly tested on binary before, and I don't recall seeing any bugs
mentioning binary, so I don't think this is particularly interesting.

NonASCII didn't regress since as previously mentioned, the non ASCII
leading chars in the pattern (presumably likely for any searching of non
ASCII text) causes it to choose the existing codepath.

All MannWhitney tests report Same -- no regressions on binary or
non-ASCII input.

---------

Co-authored-by: Copilot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants