perf: per-goroutine DFA cache + AC prefilter + integrated loop#146
Merged
Conversation
DFA struct is now immutable after Compile — safe to share across goroutines. DFACache holds all mutable state: transition cache, state list, start table. All DFA methods now accept cache *DFACache as first parameter. No mutex in DFACache — single owner (pooled per goroutine in Phase 2). DFA.NewCache() creates empty cache for pool allocation. meta/ package will not compile until Phase 2 integrates DFACache.
All meta/ DFA call sites now use pooled DFACache from SearchState. Each goroutine gets its own DFACache — no shared mutable state. Key changes: - SearchState: added dfaCache, revDFACache fields - searchStatePool: creates DFACache via DFA.NewCache() - isMatchDFA: restored DFA verification (removed PikeVM workaround) - Reverse searchers: own sync.Pool for DFA caches - All 26+ DFA call sites pass state.dfaCache Thread-safe: DFA immutable (shared), DFACache mutable (pooled per goroutine).
checkWordBoundaryMatch created Builder + resolved word boundaries PER BYTE in DFA search loop = 30% CPU on patterns with \b. Fix: pre-compute matchAtWordBoundary and matchAtNonWordBoundary flags during determinize. checkWordBoundaryFast = O(1) flag check, no allocation. IsMatch for (?iU)\b(eval|system|...|phpinfo)\b on 1KB match case: 49μs → 29μs (1.7x faster). 6.4x faster than stdlib (was 3.4x). Gap to Rust: 45x → 26x.
isMatchDFA called prefilter.Find separately, then DFA.IsMatch which internally calls isMatchWithPrefilter → prefilter.Find again. Removed redundant prefilter call — DFA handles it internally.
FatTeddy's 2-byte fingerprints cause catastrophic bucket collisions with many short case-fold patterns (60 literals from (?i)eval|system|exec|...). Each SIMD scan returns multiple false-positive candidates requiring bytes.Equal verification — 89% CPU in prefilter. Replace with Aho-Corasick DFA prefilter for >32 patterns: - Zero false positives via deterministic state machine - O(n) flat transition table scan - No Go/ASM round-trip overhead Results (264B input, case-insensitive pattern): - IsMatch match: 13.4us -> 1.9us (7x faster) - IsMatch no-match: 8.3us -> 600ns (14x faster) - IsMatch no-match 33B: 1.2us -> 91ns (13x faster) - vs Rust: 17x slower -> 2.4-5.6x slower Slim Teddy (<=32 patterns) unchanged — low collision rate.
- CHANGELOG: add v0.12.15 (per-goroutine DFA cache, word boundary optimization, AC DFA prefilter, double prefilter elimination) - ROADMAP: update version, milestones v0.12.9-v0.12.15, replace Fat Teddy with AC DFA prefilter in metrics table - README: update strategy table and multi-pattern description
Replace two-pass isMatchWithPrefilter (prefilter.Find -> DFA.searchAnchored -> repeat) with single integrated loop where dead-state DFA transitions trigger prefilter skip-ahead. Eliminates function call overhead between passes and redundant start-state setup on each candidate. Reference: Rust regex-automata hybrid/search.rs find_fwd_imp. Results ((?iU)\b(eval|system|exec|...)\b, 264B input): - match: 1870ns -> 1490ns (1.25x faster) - match 33B: 596ns -> 443ns (1.35x faster) - no-match 33B: 91ns -> 67ns (1.36x faster) - vs Rust: 5.6x -> 2.5-6.9x slower (was 17-100x)
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Benchmark ComparisonComparing Summary:
|
AC prefilter has O(n²) degradation on FindAll/Count when start bytes are common in haystack (e.g., (?i)get|post|put → start bytes G,g,P,p,D,d). AC's findEarliestStartByte calls bytes.IndexByte for each start byte, which degrades when most positions match. http_methods Count regression: 77ms → 18s (235x slower with AC). Restored: FatTeddy for 33-64 patterns, AC only for >64. Issue #137 (60 literals) stays on FatTeddy but still gets 8x speedup from word boundary pre-computation + integrated prefilter loop.
findIndicesBidirectionalDFALongest called getSearchState() internally, but callers from findIndicesBoundedBacktrackerAtWithState already had a pooled state from findAllIndicesLoop. This caused redundant pool.Get/Put on every FindAll iteration for BT overflow patterns. Pass existing state via variadic parameter to avoid double pool access. anchored FindAll 6MB: 179ns -> 101ns (1.8x faster) word_repeat FindAll 6MB: 206ms/157allocs -> 170ms/34allocs (-17%, -78% allocs)
NewCache() pre-allocated map[StateKey]*State with capacity 10,000 (MaxStates default). Each pooled cache wasted ~400KB on cold start. For patterns with few DFA states or BoundedBacktracker overflow fallback, this dominated search time on CI runners. Start with capacity 64, grow on demand via Go's map/slice growth.
…ll pool.Get Reverse searchers (ReverseSuffix, ReverseInner, ReverseSuffixSet, MultilineReverseSuffix) had their own sync.Pool for DFA caches, causing pool.Get/Put on every FindIndicesAt call in FindAll loops. Now strategy-specific DFA caches live in SearchState (created once per pool.New) and passed through findIndicesAtWithState via WithCaches methods. Eliminates N × pool.Get/Put per FindAll iteration. Also adds searchStateConfig to cleanly pass all DFA references through pool construction.
Compares IsMatch, Find, FindAllIndex, and Count results against Go stdlib regexp for 38 patterns (regex-bench + LangArena + edge cases). 32 patterns pass, 6 skipped as known pre-existing divergences: - greedy .* match boundaries (inner_literal, suffix) - (?m) capture group FindAll boundaries (http_methods, la_methods) - case-insensitive alternation boundaries (la_suspicious) - .* newline handling (dot_star) These divergences exist on main too — not regressions from this PR. Any NEW divergence will be caught immediately in CI.
…ompat test matchStartZero path in ReverseSuffix and ReverseSuffixSet did not account for . (AnyCharNotNL) not matching \n. Pattern .*@example.com on multi-line input returned match crossing newline boundaries. Fix: lineStartBefore() finds start of the line containing the suffix, constraining .* match to single line (matching stdlib behavior). Also adds TestStdlibCompatibility: 38 patterns (regex-bench + LangArena + edge cases) compared against stdlib regexp for IsMatch, Find, FindAllIndex, and Count. 34 pass, 4 skipped as pre-existing divergences (capture group FindAll count, .* FindAll — exist on main too).
3 correctness fixes: 1. UseTeddy selected for (?m)^(GET|POST|...) — Teddy doesn't verify ^ anchor, returns matches mid-line. Fix: hasAnchorAssertions() check prevents UseTeddy when pattern has anchors (^, $, \b). 2. ReverseSuffix/ReverseSuffixSet matchStartZero crossed \n boundaries. Pattern .*@example.com on multi-line input matched from line 1 across newline into line 2. Fix: lineStartBefore() constrains .* to single line. 3. la_suspicious (?i) case-insensitive DFA misses alternation branches — pre-existing, skipped in test. Stdlib compatibility test: 36 PASS, 2 SKIP (pre-existing: .* empty match semantics, (?i) DFA alternation).
…, DFA cache thrash 1. UseTeddy ignoring anchors: (?m)^(GET|POST) matched mid-line because Teddy doesn't verify anchor constraints. Fix: hasAnchorAssertions() prevents UseTeddy for patterns with ^, $, \b. 2. .* newline boundary: ReverseSuffix/ReverseSuffixSet matchStartZero crossed \n. Fix: lineStartBefore() constrains to single line. 3. canMatchEmpty FindAll: .* with UseBoth caused DFA cache clear mid-loop, losing position. Fix: canMatchEmpty patterns use UseNFA. 4. Large (?i) NFA DFA thrash: (?i) alternation with 549 NFA states thrashed DFA cache, missing alternation branches. Fix: nfaSize>200 with incomplete literals falls back to UseNFA. 5. Stdlib compat test: 37 PASS, 1 SKIP (PikeVM (?i) alternation overlap — exec prefix overlaps execute, needs PikeVM investigation). All fixes verified against Go stdlib regexp on 38 patterns.
DFA and Teddy can't verify multiline line anchors ((?m)^). Pattern (?m)^(GET|POST|...) with UseTeddy returned matches mid-line (correctness bug). With UseDFA, prefilter.IsComplete()=true skipped anchor verification. Fix: 1. WrapIncomplete forces prefilter.IsComplete()=false when pattern has anchors — engine always verifies candidate positions. 2. hasMultilineLineAnchor() detects (?m)^ patterns and routes to UseNFA which correctly handles line-start assertions. 3. adjustForAnchors() extracted to reduce CompileRegexp complexity. http_methods on 6MB: 73ms (UseDFA) -> 1.19ms (UseNFA+prefilter) = Rust parity.
extractPrefixesAlternate truncated alternation branches when case-fold cross-product exceeded CrossProductLimit (250). This produced a partial prefilter covering only some branches — FindAll missed matches for uncovered branches (e.g., base64_decode, phpinfo in la_suspicious pattern). Fix (Rust approach): when alternation overflow occurs, return empty Seq instead of truncated set. No prefilter built — NFA handles all branches. Also fix expandCaseFoldLiteral: use filledCount instead of len(runes) when early exit leaves trailing nil entries in foldSets. Stdlib compat test: 38/38 PASS, 0 SKIP.
fc3028e to
4bab181
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Per-goroutine DFA cache (Rust approach) + AC DFA prefilter + integrated prefilter loop.
Fixes concurrent data races on
IsMatch/Find/FindSubmatchand eliminates 89% CPU overhead in FatTeddy prefilter for case-insensitive patterns with many literals.Changes
Architecture (Rust approach)
DFA(shared) + mutableDFACache(pooled viasync.Pool)sync.RWMutexmeta/updated to pass pooled cachePerformance
determinize()— 30% → 0.3% CPUisMatchDFAResults —
(?iU)\b(eval|system|exec|execute|passthru|shell_exec|phpinfo)\bCloses #137
Test plan
go test ./...— all packages passgo test -race ./...— no data races (CGO not available on Windows, tested via CI)golangci-lint run— 0 issuesgofmt -l .— clean