Skip to content

fix: FatTeddy AVX2 + prefilter acceleration + AC v0.2.1 upgrade#143

Merged
kolkov merged 11 commits into
mainfrom
feature/fatteddy-prefilter-trim
Mar 18, 2026
Merged

fix: FatTeddy AVX2 + prefilter acceleration + AC v0.2.1 upgrade#143
kolkov merged 11 commits into
mainfrom
feature/fatteddy-prefilter-trim

Conversation

@kolkov

@kolkov kolkov commented Mar 18, 2026

Copy link
Copy Markdown
Contributor

Summary

FatTeddy correctness fix, prefilter acceleration, prefix optimization, AC upgrade.

Fixes

  • FatTeddy AVX2 ANDL→ORL — lane combining bug missed single-lane patterns (GET, PUT). 11456 → 34368 matches.
  • FatTeddy DECQ SI — tail loop boundary at k*16 positions.

Performance

  • VPTEST hot loop — 8 instructions → 1 for candidate detection. 24% faster scan.
  • Batch FindAllPositions — one ASM call per 64KB chunk, all candidates in buffer. 39ms → 22ms (1.8x).
  • Prefilter-accelerated isMatch/FindIndices — candidate loop with anchored DFA for large NFAs (>100 states). re.MatchString Benchmark cost too much time #137 match case: 176μs → 27μs.
  • Cascading prefix trim — >64 literals trimmed via [(4,64),(3,64),(2,64)]. auth_attempts 128→18 → Teddy. 34ms → 7ms.

Dependencies

  • ahocorasick v0.1.0 → v0.2.1 — DFA + SIMD prefilter, 11-22x throughput.

LangArena total: 757ms → 144ms (5.3x faster). Gap to Rust: 13x → 2.5x.

Test plan

  • Full test suite passes
  • Lint clean (0 issues)
  • All 13 LangArena match counts verified vs stdlib
  • FatTeddy methods: 34368 matches ✅
  • CI passes
  • regex-bench validation

kolkov added 6 commits March 18, 2026 08:50
teddyFatAVX2_2 used ANDL to combine low/high 128-bit lane candidate masks.
This required a match in BOTH lanes, but patterns assigned to a single lane
(e.g., GET variants in buckets 0-7 = low lane only) were zeroed out.
Fix: ORL — candidate exists if EITHER lane has a match.

Also adds DECQ SI at handle_tail to cover prev0 carry-over position.
- TestFatTeddyAVX2_SingleLaneBuckets: verifies GET/POST/PUT found
  with patterns distributed across specific lanes
- TestFatTeddyCaseFoldRegression: validates match count vs stdlib
  for (?i)get|post|put via meta-engine (750/750 on 1000 log lines)
FatTeddy AVX2 is now correct (ORL fix). Remove:
- buildACPrefilter: replaced ALL FatTeddy with AC (workaround)
- ac_prefilter.go: AC wrapper type (unused)
- fatTeddyFallback path in findIndicesTeddyAt (FatTeddy handles all sizes)

LangArena methods (?i)get|post|put: AC 41ms → FatTeddy 11ms
isMatchDFA now loops through prefilter candidates and verifies each with
anchored DFA instead of single unanchored PikeVM scan from first candidate.

Issue #137 match case: 176us → 27us (6.5x faster, 3.7x faster than stdlib)
…states)

For patterns with >100 NFA states (e.g., (?i) case-fold), bidirectional DFA
cache-thrashes. Prefilter candidate loop with anchored DFA verification
avoids building full DFA state table.

Guard: nfaStateCount > 100 — small NFAs (34-57 states) use bidirectional
DFA directly (no cache thrashing, lower overhead than candidate loop).

Also adds nfaStateCount field to Engine for the guard check.
When >64 literals extracted, try cascading trim: keep 4 bytes → dedup,
if still >64 → keep 3 bytes → dedup, if still >64 → keep 2 bytes → dedup.
Falls back to original if trimming can't reduce below 64.

Matches Rust's optimize_for_prefix_by_preference ATTEMPTS table.
auth_attempts (?i)/login|/signin: 128 literals → 18 four-byte → Teddy.
34ms → 10ms.
@codecov

codecov Bot commented Mar 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 19.04762% with 85 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
prefilter/teddy_fat.go 15.68% 42 Missing and 1 partial ⚠️
meta/find_indices.go 0.00% 24 Missing and 1 partial ⚠️
literal/extractor.go 0.00% 10 Missing and 1 partial ⚠️
meta/ismatch.go 64.70% 5 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@github-actions

github-actions Bot commented Mar 18, 2026

Copy link
Copy Markdown

Benchmark Comparison

Comparing main → PR #143

Summary: geomean 117.7n 118.5n +0.68%

⚠️ Potential regressions detected:

Accelerate/memchr1-4       108.7n ± ∞ ¹   109.7n ± ∞ ¹  +0.92% (p=0.008 n=5)
LazyDFASimpleLiteral-4     60.15n ± ∞ ¹   62.02n ± ∞ ¹  +3.11% (p=0.008 n=5)
LazyDFARepetition-4        86.78n ± ∞ ¹   87.76n ± ∞ ¹  +1.13% (p=0.008 n=5)
geomean                    117.7n         118.5n        +0.68%
geomean                               ³                +0.00%               ³
geomean                               ³                +0.00%               ³
geomean              33.56n         33.56n        +0.01%
geomean                         ³                +0.00%               ³
geomean                         ³                +0.00%               ³
AhoCorasickVsStdlib/coregex_IsMatch-4                   1.748µ ± ∞ ¹   17.642µ ± ∞ ¹  +909.27% (p=0.008 n=5)

Full results available in workflow artifacts. CI runners have ~10-20% variance.
For accurate benchmarks, run locally: ./scripts/bench.sh --compare

kolkov added 4 commits March 18, 2026 12:16
…etection

VPTEST Y6,Y6 (1 instruction, sets ZF if all zero) replaces:
VPXOR+VPCMPEQB+VPMOVMSKB+NOTL+MOVL+ANDL+SHRL+ORL (8 instructions).

ORL lane combining moved to found_candidate cold path (only on actual
candidates, not every 16-byte chunk).

FatTeddy 35 patterns on 6MB: 55ms → 41ms (24% faster).
Batch ASM: fatTeddyAVX2_2_batch scans entire haystack in one call,
writes all (pos, bucketMask) candidates to pre-allocated buffer.
Eliminates Go→ASM round trip overhead (11.6 GB/s SIMD vs 283 MB/s with round trips).

FindAllPositions: batch-based FindAll using single ASM call.
Find: uses original per-candidate path (batch too slow for first-match).
Flat DFA with premultiplied state IDs, match flag in high bit,
SIMD skip-ahead prefilter via bytes.IndexByte.

Throughput: Find 3.4 GB/s, IsMatch 5.9-7.0 GB/s (was 260-545 MB/s).
@kolkov kolkov changed the title fix: FatTeddy AVX2 correctness + prefilter acceleration + cascading trim fix: FatTeddy AVX2 + prefilter acceleration + AC v0.2.1 upgrade Mar 18, 2026
@kolkov kolkov force-pushed the feature/fatteddy-prefilter-trim branch from 248a57d to 48f0a4f Compare March 18, 2026 11:30
@kolkov kolkov merged commit 7a29fab into main Mar 18, 2026
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant