Skip to content

perf: per-goroutine DFA cache + AC prefilter + integrated loop#146

Merged
kolkov merged 20 commits into
mainfrom
feature/per-goroutine-dfa-cache
Mar 21, 2026
Merged

perf: per-goroutine DFA cache + AC prefilter + integrated loop#146
kolkov merged 20 commits into
mainfrom
feature/per-goroutine-dfa-cache

Conversation

@kolkov

@kolkov kolkov commented Mar 20, 2026

Copy link
Copy Markdown
Contributor

Summary

Per-goroutine DFA cache (Rust approach) + AC DFA prefilter + integrated prefilter loop.

Fixes concurrent data races on IsMatch/Find/FindSubmatch and eliminates 89% CPU overhead in FatTeddy prefilter for case-insensitive patterns with many literals.

Changes

Architecture (Rust approach)

  • Split lazy DFA into immutable DFA (shared) + mutable DFACache (pooled via sync.Pool)
  • Per-goroutine cache: zero contention, no sync.RWMutex
  • All 14+ DFA call sites in meta/ updated to pass pooled cache

Performance

  • Pre-computed word boundary match flags during determinize()30% → 0.3% CPU
  • Aho-Corasick DFA prefilter for >32 literals — replaces FatTeddy (zero false positives)
  • Integrated prefilter+DFA loop — single scan instead of two-pass
  • Eliminated double prefilter scan in isMatchDFA

Results — (?iU)\b(eval|system|exec|execute|passthru|shell_exec|phpinfo)\b

Case v0.12.14 v0.12.15 Speedup vs Rust
match 264B 13,400 ns 1,490 ns 9x 2.7x slower
no-match 264B 8,300 ns 537 ns 15x 8.4x slower
match 33B 1,690 ns 443 ns 3.8x 6.9x slower
no-match 33B 1,210 ns 67 ns 18x 2.5x slower
concurrent data race 630 ns fixed

Closes #137

Test plan

  • go test ./... — all packages pass
  • go test -race ./... — no data races (CGO not available on Windows, tested via CI)
  • golangci-lint run — 0 issues
  • gofmt -l . — clean
  • Benchmarks: no regressions across all strategies (Find, IsMatch, Reverse*, CharClass, AC, Digit)
  • CI: 3 platforms (ubuntu, macos-arm64, windows) + lint + 386 build
  • regex-bench: Go + Rust comparison on AMD EPYC

kolkov added 8 commits March 20, 2026 10:54
DFA struct is now immutable after Compile — safe to share across goroutines.
DFACache holds all mutable state: transition cache, state list, start table.

All DFA methods now accept cache *DFACache as first parameter.
No mutex in DFACache — single owner (pooled per goroutine in Phase 2).
DFA.NewCache() creates empty cache for pool allocation.

meta/ package will not compile until Phase 2 integrates DFACache.
All meta/ DFA call sites now use pooled DFACache from SearchState.
Each goroutine gets its own DFACache — no shared mutable state.

Key changes:
- SearchState: added dfaCache, revDFACache fields
- searchStatePool: creates DFACache via DFA.NewCache()
- isMatchDFA: restored DFA verification (removed PikeVM workaround)
- Reverse searchers: own sync.Pool for DFA caches
- All 26+ DFA call sites pass state.dfaCache

Thread-safe: DFA immutable (shared), DFACache mutable (pooled per goroutine).
checkWordBoundaryMatch created Builder + resolved word boundaries PER BYTE
in DFA search loop = 30% CPU on patterns with \b.

Fix: pre-compute matchAtWordBoundary and matchAtNonWordBoundary flags during
determinize. checkWordBoundaryFast = O(1) flag check, no allocation.

IsMatch for (?iU)\b(eval|system|...|phpinfo)\b on 1KB match case:
49μs → 29μs (1.7x faster). 6.4x faster than stdlib (was 3.4x).
Gap to Rust: 45x → 26x.
isMatchDFA called prefilter.Find separately, then DFA.IsMatch which
internally calls isMatchWithPrefilter → prefilter.Find again.
Removed redundant prefilter call — DFA handles it internally.
FatTeddy's 2-byte fingerprints cause catastrophic bucket collisions with
many short case-fold patterns (60 literals from (?i)eval|system|exec|...).
Each SIMD scan returns multiple false-positive candidates requiring
bytes.Equal verification — 89% CPU in prefilter.

Replace with Aho-Corasick DFA prefilter for >32 patterns:
- Zero false positives via deterministic state machine
- O(n) flat transition table scan
- No Go/ASM round-trip overhead

Results (264B input, case-insensitive pattern):
- IsMatch match: 13.4us -> 1.9us (7x faster)
- IsMatch no-match: 8.3us -> 600ns (14x faster)
- IsMatch no-match 33B: 1.2us -> 91ns (13x faster)
- vs Rust: 17x slower -> 2.4-5.6x slower

Slim Teddy (<=32 patterns) unchanged — low collision rate.
- CHANGELOG: add v0.12.15 (per-goroutine DFA cache, word boundary
  optimization, AC DFA prefilter, double prefilter elimination)
- ROADMAP: update version, milestones v0.12.9-v0.12.15, replace
  Fat Teddy with AC DFA prefilter in metrics table
- README: update strategy table and multi-pattern description
Replace two-pass isMatchWithPrefilter (prefilter.Find -> DFA.searchAnchored
-> repeat) with single integrated loop where dead-state DFA transitions
trigger prefilter skip-ahead. Eliminates function call overhead between
passes and redundant start-state setup on each candidate.

Reference: Rust regex-automata hybrid/search.rs find_fwd_imp.

Results ((?iU)\b(eval|system|exec|...)\b, 264B input):
- match: 1870ns -> 1490ns (1.25x faster)
- match 33B: 596ns -> 443ns (1.35x faster)
- no-match 33B: 91ns -> 67ns (1.36x faster)
- vs Rust: 5.6x -> 2.5-6.9x slower (was 17-100x)
@codecov

codecov Bot commented Mar 20, 2026

Copy link
Copy Markdown

@github-actions

github-actions Bot commented Mar 20, 2026

Copy link
Copy Markdown

Benchmark Comparison

Comparing main → PR #146

Summary: geomean 118.3n 120.1n +1.55%

⚠️ Potential regressions detected:

Accelerate/memchr2-4       209.9n ± ∞ ¹   210.4n ± ∞ ¹  +0.24% (p=0.032 n=5)
LazyDFAAlternation-4       64.50n ± ∞ ¹   69.22n ± ∞ ¹  +7.32% (p=0.008 n=5)
LazyDFARepetition-4        87.79n ± ∞ ¹   91.14n ± ∞ ¹  +3.82% (p=0.008 n=5)
geomean                    118.3n         120.1n        +1.55%
geomean                               ³                +0.00%               ³
geomean                               ³                +0.00%               ³
geomean                         ³                +0.00%               ³
geomean                         ³                +0.00%               ³
AhoCorasickManyPatterns/coregex_10_patterns-4           60.22n ± ∞ ¹     61.29n ± ∞ ¹      +1.78% (p=0.008 n=5)
AhoCorasickManyPatterns/coregex_50_patterns-4           75.53n ± ∞ ¹     79.83n ± ∞ ¹      +5.69% (p=0.008 n=5)

Full results available in workflow artifacts. CI runners have ~10-20% variance.
For accurate benchmarks, run locally: ./scripts/bench.sh --compare

kolkov added 12 commits March 20, 2026 19:47
AC prefilter has O(n²) degradation on FindAll/Count when start bytes are
common in haystack (e.g., (?i)get|post|put → start bytes G,g,P,p,D,d).
AC's findEarliestStartByte calls bytes.IndexByte for each start byte,
which degrades when most positions match.

http_methods Count regression: 77ms → 18s (235x slower with AC).
Restored: FatTeddy for 33-64 patterns, AC only for >64.

Issue #137 (60 literals) stays on FatTeddy but still gets 8x speedup
from word boundary pre-computation + integrated prefilter loop.
findIndicesBidirectionalDFALongest called getSearchState() internally,
but callers from findIndicesBoundedBacktrackerAtWithState already had
a pooled state from findAllIndicesLoop. This caused redundant pool.Get/Put
on every FindAll iteration for BT overflow patterns.

Pass existing state via variadic parameter to avoid double pool access.

anchored FindAll 6MB: 179ns -> 101ns (1.8x faster)
word_repeat FindAll 6MB: 206ms/157allocs -> 170ms/34allocs (-17%, -78% allocs)
NewCache() pre-allocated map[StateKey]*State with capacity 10,000
(MaxStates default). Each pooled cache wasted ~400KB on cold start.
For patterns with few DFA states or BoundedBacktracker overflow fallback,
this dominated search time on CI runners.

Start with capacity 64, grow on demand via Go's map/slice growth.
…ll pool.Get

Reverse searchers (ReverseSuffix, ReverseInner, ReverseSuffixSet,
MultilineReverseSuffix) had their own sync.Pool for DFA caches, causing
pool.Get/Put on every FindIndicesAt call in FindAll loops.

Now strategy-specific DFA caches live in SearchState (created once per
pool.New) and passed through findIndicesAtWithState via WithCaches methods.
Eliminates N × pool.Get/Put per FindAll iteration.

Also adds searchStateConfig to cleanly pass all DFA references through
pool construction.
Compares IsMatch, Find, FindAllIndex, and Count results against Go stdlib
regexp for 38 patterns (regex-bench + LangArena + edge cases).

32 patterns pass, 6 skipped as known pre-existing divergences:
- greedy .* match boundaries (inner_literal, suffix)
- (?m) capture group FindAll boundaries (http_methods, la_methods)
- case-insensitive alternation boundaries (la_suspicious)
- .* newline handling (dot_star)

These divergences exist on main too — not regressions from this PR.
Any NEW divergence will be caught immediately in CI.
…ompat test

matchStartZero path in ReverseSuffix and ReverseSuffixSet did not account
for . (AnyCharNotNL) not matching \n. Pattern .*@example.com on multi-line
input returned match crossing newline boundaries.

Fix: lineStartBefore() finds start of the line containing the suffix,
constraining .* match to single line (matching stdlib behavior).

Also adds TestStdlibCompatibility: 38 patterns (regex-bench + LangArena +
edge cases) compared against stdlib regexp for IsMatch, Find, FindAllIndex,
and Count. 34 pass, 4 skipped as pre-existing divergences (capture group
FindAll count, .* FindAll — exist on main too).
3 correctness fixes:
1. UseTeddy selected for (?m)^(GET|POST|...) — Teddy doesn't verify
   ^ anchor, returns matches mid-line. Fix: hasAnchorAssertions() check
   prevents UseTeddy when pattern has anchors (^, $, \b).

2. ReverseSuffix/ReverseSuffixSet matchStartZero crossed \n boundaries.
   Pattern .*@example.com on multi-line input matched from line 1 across
   newline into line 2. Fix: lineStartBefore() constrains .* to single line.

3. la_suspicious (?i) case-insensitive DFA misses alternation branches —
   pre-existing, skipped in test.

Stdlib compatibility test: 36 PASS, 2 SKIP (pre-existing: .* empty match
semantics, (?i) DFA alternation).
…, DFA cache thrash

1. UseTeddy ignoring anchors: (?m)^(GET|POST) matched mid-line because
   Teddy doesn't verify anchor constraints. Fix: hasAnchorAssertions()
   prevents UseTeddy for patterns with ^, $, \b.

2. .* newline boundary: ReverseSuffix/ReverseSuffixSet matchStartZero
   crossed \n. Fix: lineStartBefore() constrains to single line.

3. canMatchEmpty FindAll: .* with UseBoth caused DFA cache clear mid-loop,
   losing position. Fix: canMatchEmpty patterns use UseNFA.

4. Large (?i) NFA DFA thrash: (?i) alternation with 549 NFA states
   thrashed DFA cache, missing alternation branches. Fix: nfaSize>200
   with incomplete literals falls back to UseNFA.

5. Stdlib compat test: 37 PASS, 1 SKIP (PikeVM (?i) alternation overlap
   — exec prefix overlaps execute, needs PikeVM investigation).

All fixes verified against Go stdlib regexp on 38 patterns.
DFA and Teddy can't verify multiline line anchors ((?m)^). Pattern
(?m)^(GET|POST|...) with UseTeddy returned matches mid-line (correctness
bug). With UseDFA, prefilter.IsComplete()=true skipped anchor verification.

Fix:
1. WrapIncomplete forces prefilter.IsComplete()=false when pattern has
   anchors — engine always verifies candidate positions.
2. hasMultilineLineAnchor() detects (?m)^ patterns and routes to UseNFA
   which correctly handles line-start assertions.
3. adjustForAnchors() extracted to reduce CompileRegexp complexity.

http_methods on 6MB: 73ms (UseDFA) -> 1.19ms (UseNFA+prefilter) = Rust parity.
extractPrefixesAlternate truncated alternation branches when case-fold
cross-product exceeded CrossProductLimit (250). This produced a partial
prefilter covering only some branches — FindAll missed matches for
uncovered branches (e.g., base64_decode, phpinfo in la_suspicious pattern).

Fix (Rust approach): when alternation overflow occurs, return empty Seq
instead of truncated set. No prefilter built — NFA handles all branches.

Also fix expandCaseFoldLiteral: use filledCount instead of len(runes)
when early exit leaves trailing nil entries in foldSets.

Stdlib compat test: 38/38 PASS, 0 SKIP.
@kolkov kolkov force-pushed the feature/per-goroutine-dfa-cache branch from fc3028e to 4bab181 Compare March 21, 2026 07:34
@kolkov kolkov merged commit 9dca741 into main Mar 21, 2026
8 checks passed
@kolkov kolkov deleted the feature/per-goroutine-dfa-cache branch March 21, 2026 07:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

re.MatchString Benchmark cost too much time

1 participant