Add LeadingStrings benchmarks for binary and non-ASCII regex patterns#5126
Add LeadingStrings benchmarks for binary and non-ASCII regex patterns#5126danmoseley merged 3 commits intodotnet:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds two new benchmark classes to test regex alternation pattern performance with different input types. The benchmarks are designed to validate that the LeadingStrings vs FixedDistanceSets heuristic in the regex engine doesn't regress on binary and non-ASCII data.
Changes:
- Added
Perf_Regex_LeadingStrings_BinaryDataclass with benchmarks for binary data patterns (1MB PE-header-like corpus) - Added
Perf_Regex_LeadingStrings_NonAsciiclass with benchmarks for Russian text patterns (~100KB from Anna Karenina)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.LeadingStrings.cs
Outdated
Show resolved
Hide resolved
src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.LeadingStrings.cs
Outdated
Show resolved
Hide resolved
src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.LeadingStrings.cs
Outdated
Show resolved
Hide resolved
Co-authored-by: Copilot <[email protected]>
Remove Perf_Regex_LeadingStrings_BinaryData, keeping only the non-ASCII benchmark. Co-authored-by: Copilot <[email protected]>
Port curated/01-literal and curated/02-literal-alternate benchmarks from https://github.com/BurntSushi/rebar for Russian and Chinese text. Haystacks are OpenSubtitles data (https://opus.nlpl.eu/OpenSubtitles-v2018.php). This replaces the ad-hoc NonAscii benchmark with well-established cross-engine regex benchmarks that test literal and alternation search on non-ASCII text. Co-authored-by: Copilot <[email protected]>
3a03edd to
a3f6116
Compare
|
Benchmark results comparing Built from
* p<0.05, ** p<0.01, *** p<0.001 |
|
Hmm, speedups are nice except I wouldn't expect any as this is non ASCII |
|
Reran with more iterations and the tests are stable and at parity. The issue is some of them are super short (a few MB all in CPU cache doing IndexOf with SIMD) so I needed to do more iterations, which I guess the perf lab would. Merging. |
…124736) I thought it would be interesting to see whether AI could take another look at the [commented out search strategy](https://github.com/dotnet/runtime/blob/99b76018b6e4/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFindOptimizations.cs#L152-L161) originally introduced by @stephentoub in #98791 to see whether we can enable it and keep the wins without the regressions that caused it to be commented out. AI tried various experiments, and got to a dead end. I recalled the frequency table approach that Ripgrep uses (credit to @BurntSushi). Turns out that fixes the regressions entirely. This means our engine now has assumptions built in about char frequencies in ASCII (only) text. That's an approach that's been proven in ripgrep, one of the fastest engines, for 10 years, and it turns out to work OK for regex-redux as well because a, c, g, t are relatively high frequency in English anyway. Code unchanged if pattern has anything other than ASCII (see benchmark results below). This gives us a nice win on regex-redux, a few other wins in existing tests, and no regressions. Note: a char frequency table already existed in `RegexPrefixAnalyzer.cs` for ranking which fixed-distance character sets are most selective. Our new table serves a different purpose: deciding whether to use `LeadingStrings` vs `FixedDistanceSets` at all. The two are complementary. ==== When a regex has multiple alternation prefixes (e.g. `a|b|c|...`), this change decides whether to use `SearchValues<string>` (Teddy/Aho-Corasick) or fall through to `FixedDistanceSets` (`IndexOfAny`) based on the frequency of the starting characters. High-frequency starters (common letters like lowercase vowels) benefit from multi-string search; low-frequency starters (uppercase, digits, rare consonants) are already excellent `IndexOfAny` filters. Non-ASCII starters bail out (no frequency data), preserving baseline behavior. ## Benchmark results (444 benchmarks, BDN A/B with --statisticalTest 3ms) | Benchmark | Baseline | PR | Ratio | Verdict | |-----------|----------|-----|-------|---------| | RegexRedux_1 (Compiled) | 25.77ms | 14.27ms | **1.81x faster** | Faster | | Leipzig Tom.*river (Compiled) | 6.13ms | 1.87ms | **3.28x faster** | Faster | | RegexRedux_5 (Compiled) | 2.83ms | 2.35ms | **1.20x faster** | Faster | | Sherlock, BinaryData, BoostDocs, Mariomkas, SliceSlice | -- | -- | -- | Same | | LeadingStrings_NonAscii (all variants) | -- | -- | -- | Same | | LeadingStrings_BinaryData (all variants) | -- | -- | -- | Same | Leipzig win is because the pattern is `Tom.{10,25}river|river.{10,25}Tom` so there is a short prefix that is common in the text; with this change it notices `r` is common and `T` fairly common in English text, so it switches to `SearchValues` which looks for `Tom` and `river` simultaneously, causing far fewer false starts. regex-redux win is because it was previously looking for short, very common prefixes naively, and now (specifically because the pattern chars are common) it changed to use `SearchValues` (ie Aho-Corasick/Teddy) to search for the longer strings simultaneously. No regressions detected. All MannWhitney tests report Same for non-improved benchmarks. ## Key design decisions - **Frequency table**: First 128 entries of Rust's `BYTE_FREQUENCIES` from @BurntSushi's aho-corasick crate - **Threshold**: Average rank >= 200 triggers `LeadingStrings`; below 200 falls through to `FixedDistanceSets` - **Non-ASCII**: Returns false (no frequency data), so the heuristic does not engage and behavior is unchanged Companion benchmarks: dotnet/performance#5126 ## New benchmark results (not yet in dotnet/performance, won't be picked up by PR bot) These benchmarks are from the companion PR dotnet/performance#5126. ``` BenchmarkDotNet v0.16.0-custom.20260127.101, Windows 11 (10.0.26100.7840/24H2/2024Update/HudsonValley) Intel Core i9-14900K 3.20GHz, 1 CPU, 32 logical and 24 physical cores ``` | Benchmark | Options | Baseline | PR | Ratio | MannWhitney(3ms) | |-----------|---------|----------|-----|-------|------------------| | LeadingStrings_BinaryData | None | 4,483 us | 4,365 us | 0.97 | Same | | LeadingStrings_BinaryData | Compiled | 2,188 us | 2,184 us | 1.00 | Same | | LeadingStrings_BinaryData | NonBacktracking | 3,734 us | 3,725 us | 1.00 | Same | | LeadingStrings_NonAscii Count | None | 913 us | 956 us | 1.05 | Same | | LeadingStrings_NonAscii Count | Compiled | 244 us | 243 us | 1.00 | Same | | LeadingStrings_NonAscii CountIgnoreCase | None | 1,758 us | 1,714 us | 0.98 | Same | | LeadingStrings_NonAscii CountIgnoreCase | Compiled | 258 us | 250 us | 0.97 | Same | | LeadingStrings_NonAscii Count | NonBacktracking | 392 us | 398 us | 1.02 | Same | | LeadingStrings_NonAscii CountIgnoreCase | NonBacktracking | 409 us | 431 us | 1.05 | Same | Binary didn't regress even though it's ASCII with non English frequencies because the pattern has chars that are not particularly common in English, so it uses the old codepath. It's hard to hypothesize about what one might search for in a binary file; searching for a bunch of leading lower case ASCII chars might regress somewhat. We've never particularly tested on binary before, and I don't recall seeing any bugs mentioning binary, so I don't think this is particularly interesting. NonASCII didn't regress since as previously mentioned, the non ASCII leading chars in the pattern (presumably likely for any searching of non ASCII text) causes it to choose the existing codepath. All MannWhitney tests report Same -- no regressions on binary or non-ASCII input. --------- Co-authored-by: Copilot <[email protected]>
Adds two new benchmark classes to exercise the LeadingStrings vs FixedDistanceSets heuristic in the regex engine:
Companion to dotnet/runtime change: dotnet/runtime#124736