fix: FatTeddy AVX2 + prefilter acceleration + AC v0.2.1 upgrade #143
Merged
teddyFatAVX2_2 used ANDL to combine low/high 128-bit lane candidate masks. This required a match in BOTH lanes, but patterns assigned to a single lane (e.g., GET variants in buckets 0-7 = low lane only) were zeroed out. Fix: ORL — candidate exists if EITHER lane has a match. Also adds DECQ SI at handle_tail to cover prev0 carry-over position.
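To make the lane-combining bug concrete, here is a minimal Go sketch of the two strategies. It is not the assembly from this PR: `combineAND`, `combineOR`, and the mask values are hypothetical, and each bit simply stands for "candidate at this offset" in one 128-bit lane.

```go
package main

import "fmt"

// combineAND mimics the old ANDL behaviour: a position survives only if
// BOTH lanes report a candidate at it.
func combineAND(lo, hi uint32) uint32 { return lo & hi }

// combineOR mimics the ORL fix: a position survives if EITHER lane
// reports a candidate at it.
func combineOR(lo, hi uint32) uint32 { return lo | hi }

func main() {
	// Hypothetical masks: bit i set means "candidate at offset i".
	lo := uint32(0b0000_0100) // low lane (buckets 0-7) hit at offset 2
	hi := uint32(0)           // high lane (buckets 8-15) saw nothing

	fmt.Printf("AND: %08b\n", combineAND(lo, hi)) // candidate is lost
	fmt.Printf("OR:  %08b\n", combineOR(lo, hi))  // candidate is kept
}
```

With a pattern that lives only in the low lane, the AND variant drops the candidate while the OR variant keeps it, which is the failure mode described above.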
- TestFatTeddyAVX2_SingleLaneBuckets: verifies GET/POST/PUT found with patterns distributed across specific lanes
- TestFatTeddyCaseFoldRegression: validates match count vs stdlib for (?i)get|post|put via meta-engine (750/750 on 1000 log lines)
FatTeddy AVX2 is now correct (ORL fix). Remove:
- buildACPrefilter: replaced ALL FatTeddy with AC (workaround)
- ac_prefilter.go: AC wrapper type (unused)
- fatTeddyFallback path in findIndicesTeddyAt (FatTeddy handles all sizes)

LangArena methods (?i)get|post|put: AC 41ms → FatTeddy 11ms
isMatchDFA now loops through prefilter candidates and verifies each one with an anchored DFA, instead of running a single unanchored PikeVM scan from the first candidate. Issue #137 match case: 176us → 27us (6.5x faster, 3.7x faster than stdlib)
…states) For patterns with >100 NFA states (e.g., (?i) case-fold), bidirectional DFA cache-thrashes. Prefilter candidate loop with anchored DFA verification avoids building full DFA state table. Guard: nfaStateCount > 100 — small NFAs (34-57 states) use bidirectional DFA directly (no cache thrashing, lower overhead than candidate loop). Also adds nfaStateCount field to Engine for the guard check.
When >64 literals extracted, try cascading trim: keep 4 bytes → dedup, if still >64 → keep 3 bytes → dedup, if still >64 → keep 2 bytes → dedup. Falls back to original if trimming can't reduce below 64. Matches Rust's optimize_for_prefix_by_preference ATTEMPTS table. auth_attempts (?i)/login|/signin: 128 literals → 18 four-byte → Teddy. 34ms → 10ms.
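A minimal sketch of the cascading trim, assuming "fits" means at most 64 literals after dedup. `cascadeTrim`, `trimAndDedup`, and `caseVariants` are hypothetical helpers written for this example, and the toy case-variant generator is not the engine's real literal extractor, so its counts differ from the 128 → 18 reported above.

```go
package main

import "fmt"

const maxLiterals = 64 // assumed limit, taken from the description above

// trimAndDedup keeps the first n bytes of each literal and drops duplicates,
// preserving first-seen order.
func trimAndDedup(lits []string, n int) []string {
	seen := make(map[string]struct{}, len(lits))
	out := make([]string, 0, len(lits))
	for _, l := range lits {
		if len(l) > n {
			l = l[:n]
		}
		if _, ok := seen[l]; ok {
			continue
		}
		seen[l] = struct{}{}
		out = append(out, l)
	}
	return out
}

// cascadeTrim tries 4-, then 3-, then 2-byte prefixes and returns the first
// set that fits. If none fits, the original literals are returned unchanged.
func cascadeTrim(lits []string) []string {
	if len(lits) <= maxLiterals {
		return lits
	}
	for _, keep := range []int{4, 3, 2} {
		if t := trimAndDedup(lits, keep); len(t) <= maxLiterals {
			return t
		}
	}
	return lits
}

// caseVariants expands every ASCII lowercase letter into both cases
// (a toy stand-in for (?i) literal extraction).
func caseVariants(s string) []string {
	out := []string{""}
	for i := 0; i < len(s); i++ {
		c := s[i]
		var next []string
		for _, p := range out {
			next = append(next, p+string(c))
			if c >= 'a' && c <= 'z' {
				next = append(next, p+string(c-32))
			}
		}
		out = next
	}
	return out
}

func main() {
	lits := append(caseVariants("/login"), caseVariants("/signin")...)
	fmt.Println(len(lits), "->", len(cascadeTrim(lits)))
}
```

In this toy setup the 96 generated variants collapse to 16 four-byte prefixes (8 for `/log`, 8 for `/sig`), so the first step of the cascade already fits under the limit.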
…etection VPTEST Y6,Y6 (1 instruction, sets ZF if all zero) replaces: VPXOR+VPCMPEQB+VPMOVMSKB+NOTL+MOVL+ANDL+SHRL+ORL (8 instructions). ORL lane combining moved to found_candidate cold path (only on actual candidates, not every 16-byte chunk). FatTeddy 35 patterns on 6MB: 55ms → 41ms (24% faster).
Batch ASM: fatTeddyAVX2_2_batch scans entire haystack in one call, writes all (pos, bucketMask) candidates to pre-allocated buffer. Eliminates Go→ASM round trip overhead (11.6 GB/s SIMD vs 283 MB/s with round trips). FindAllPositions: batch-based FindAll using single ASM call. Find: uses original per-candidate path (batch too slow for first-match).
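The batch interface can be pictured with a pure-Go stand-in. Everything here is an assumption made for illustration: `candidate`, `scanBatch`, and the byte-to-bucket table are hypothetical, and the scalar loop only imitates the shape of the call (one invocation fills a pre-allocated buffer of (pos, bucketMask) pairs), not the SIMD kernel itself.

```go
package main

import "fmt"

// candidate is one (pos, bucketMask) record written by the batch scan.
type candidate struct {
	pos        int
	bucketMask uint16
}

// scanBatch walks the whole haystack in one call and appends a record for
// every byte whose bucket mask is non-zero. It returns the number of records
// written and the offset to resume from if buf filled up.
func scanBatch(haystack []byte, byteToBuckets *[256]uint16, buf []candidate) (n, resume int) {
	for i, b := range haystack {
		m := byteToBuckets[b]
		if m == 0 {
			continue
		}
		if n == len(buf) {
			return n, i // buffer full: caller drains buf and calls again
		}
		buf[n] = candidate{pos: i, bucketMask: m}
		n++
	}
	return n, len(haystack)
}

func main() {
	var table [256]uint16
	table['G'] = 1 << 0 // hypothetical bucket 0: patterns starting with G
	table['P'] = 1 << 1 // hypothetical bucket 1: patterns starting with P

	buf := make([]candidate, 8) // allocated once, reused across calls
	n, resume := scanBatch([]byte("GET /a POST /b PUT /c"), &table, buf)
	fmt.Println(n, resume, buf[:n])
}
```

The caller pays the call overhead once per buffer rather than once per candidate, which is why the batch path suits FindAll while first-match Find keeps the per-candidate path.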
Flat DFA with premultiplied state IDs, match flag in high bit, SIMD skip-ahead prefilter via bytes.IndexByte. Throughput: Find 3.4 GB/s, IsMatch 5.9-7.0 GB/s (was 260-545 MB/s).
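The three ideas in that description can be shown on a toy automaton. This sketch is not the engine's DFA: `flatDFA` and `newABDFA` are hypothetical, the table is hand-built for the literal "ab", and the layout (stride 256, flag in bit 31) is an assumption chosen to illustrate premultiplied IDs, a high-bit match flag, and `bytes.IndexByte` skip-ahead.

```go
package main

import (
	"bytes"
	"fmt"
)

const (
	stride    = 256              // one table row per state, one column per byte
	matchFlag = uint32(1) << 31  // high bit marks "target state is a match"
	idMask    = matchFlag - 1    // low 31 bits hold the premultiplied state ID
)

// flatDFA keeps every transition in one slice. A state ID is already
// multiplied by stride, so a lookup is table[sid+byte] with no multiply.
type flatDFA struct {
	table     []uint32
	firstByte byte // every match must begin with this byte
}

// newABDFA hand-builds an unanchored DFA that finds the literal "ab".
func newABDFA() *flatDFA {
	const s0, s1, s2 = 0 * stride, 1 * stride, 2 * stride
	t := make([]uint32, 3*stride)
	for b := 0; b < stride; b++ {
		t[s0+b] = s0             // start: stay until we see 'a'
		t[s1+b] = s0             // after 'a': anything else resets
		t[s2+b] = s2 | matchFlag // match state is absorbing
	}
	t[s0+'a'] = s1
	t[s1+'a'] = s1
	t[s1+'b'] = s2 | matchFlag
	return &flatDFA{table: t, firstByte: 'a'}
}

// isMatch runs the DFA, jumping ahead with bytes.IndexByte whenever the
// automaton is back in the start state.
func (d *flatDFA) isMatch(h []byte) bool {
	sid := uint32(0)
	for i := 0; i < len(h); i++ {
		if sid == 0 {
			j := bytes.IndexByte(h[i:], d.firstByte)
			if j < 0 {
				return false
			}
			i += j
		}
		next := d.table[sid+uint32(h[i])]
		if next&matchFlag != 0 {
			return true // one bit test, no separate match-state lookup
		}
		sid = next & idMask
	}
	return false
}

func main() {
	d := newABDFA()
	fmt.Println(d.isMatch([]byte("xxxxab")), d.isMatch([]byte("ba")))
}
```

Folding the match flag into the transition entry keeps the hot loop to one load, one bit test, and one mask per byte, while the skip-ahead lets the start state consume long non-matching runs at `bytes.IndexByte` speed.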
Summary
FatTeddy correctness fix, prefilter acceleration, prefix optimization, AC upgrade.
Fixes
Performance
Dependencies
LangArena total: 757ms → 144ms (5.3x faster). Gap to Rust: 13x → 2.5x.
Test plan