Comparing changes

* fix: FatTeddy AVX2 ANDL→ORL (correct lane combining for 33-64 patterns)
  teddyFatAVX2_2 used ANDL to combine low/high 128-bit lane candidate masks. This required a match in BOTH lanes, but patterns assigned to a single lane (e.g., GET variants in buckets 0-7 = low lane only) were zeroed out. Fix: ORL, so a candidate exists if EITHER lane has a match. Also adds DECQ SI at handle_tail to cover the prev0 carry-over position.

* test: FatTeddy AVX2 regression tests (lane combining + case-fold)
  - TestFatTeddyAVX2_SingleLaneBuckets: verifies GET/POST/PUT are found with patterns distributed across specific lanes
  - TestFatTeddyCaseFoldRegression: validates match count vs stdlib for (?i)get|post|put via meta-engine (750/750 on 1000 log lines)

* refactor: remove AC replacement and small-haystack fallback
  FatTeddy AVX2 is now correct (ORL fix). Remove:
  - buildACPrefilter: replaced ALL FatTeddy with AC (workaround)
  - ac_prefilter.go: AC wrapper type (unused)
  - fatTeddyFallback path in findIndicesTeddyAt (FatTeddy handles all sizes)

  LangArena methods (?i)get|post|put: AC 41ms → FatTeddy 11ms

* perf: prefilter candidate loop in isMatchDFA with anchored verification
  isMatchDFA now loops through prefilter candidates and verifies each with an anchored DFA instead of a single unanchored PikeVM scan from the first candidate. Issue #137 match case: 176us → 27us (6.5x faster, 3.7x faster than stdlib)

* perf: prefilter candidate loop in findIndicesDFA for large NFA (>100 states)
  For patterns with >100 NFA states (e.g., (?i) case-fold), the bidirectional DFA cache-thrashes. A prefilter candidate loop with anchored DFA verification avoids building the full DFA state table. Guard: nfaStateCount > 100. Small NFAs (34-57 states) use the bidirectional DFA directly (no cache thrashing, lower overhead than the candidate loop). Also adds an nfaStateCount field to Engine for the guard check.

* perf: Rust-style cascading prefix trim for >64 literals
  When >64 literals are extracted, try a cascading trim: keep 4 bytes → dedup, if still >64 → keep 3 bytes → dedup, if still >64 → keep 2 bytes → dedup. Falls back to the original if trimming can't reduce below 64. Matches Rust's optimize_for_prefix_by_preference ATTEMPTS table. auth_attempts (?i)/login|/signin: 128 literals → 18 four-byte → Teddy. 34ms → 10ms.

* perf: VPTEST in FatTeddy hot loop (replace 8-instruction candidate detection)
  VPTEST Y6,Y6 (1 instruction, sets ZF if all zero) replaces: VPXOR+VPCMPEQB+VPMOVMSKB+NOTL+MOVL+ANDL+SHRL+ORL (8 instructions). ORL lane combining moved to the found_candidate cold path (only on actual candidates, not every 16-byte chunk). FatTeddy 35 patterns on 6MB: 55ms → 41ms (24% faster).

* wip: FatTeddy batch ASM + FindAllPositions (intermediate save)
  - Batch ASM: fatTeddyAVX2_2_batch scans the entire haystack in one call and writes all (pos, bucketMask) candidates to a pre-allocated buffer. Eliminates Go→ASM round trip overhead (11.6 GB/s SIMD vs 283 MB/s with round trips).
  - FindAllPositions: batch-based FindAll using a single ASM call.
  - Find: uses the original per-candidate path (batch too slow for first-match).

* deps: upgrade ahocorasick v0.1.0 → v0.2.1 (DFA + SIMD prefilter, 11-22x)
  Flat DFA with premultiplied state IDs, match flag in high bit, SIMD skip-ahead prefilter via bytes.IndexByte. Throughput: Find 3.4 GB/s, IsMatch 5.9-7.0 GB/s (was 260-545 MB/s).

* docs: release v0.12.13 (changelog, roadmap, lint fix)

* fix: add hasAVX2 and batch stubs for non-amd64 platforms (macOS ARM64)
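The ANDL→ORL fix above can be illustrated in plain Go. This is a minimal sketch, not the project's assembly: it assumes each 128-bit lane yields a 16-bit candidate mask with one bit per byte position in the chunk, and the names combineAND and combineOR are hypothetical.

```go
package main

import "fmt"

// combineAND models the old ANDL behaviour: a position survives only if
// BOTH the low-lane and high-lane masks flag it.
// (Hypothetical helper, for illustration only.)
func combineAND(lo, hi uint16) uint16 { return lo & hi }

// combineOR models the fixed ORL behaviour: a position survives if
// EITHER lane flags it.
// (Hypothetical helper, for illustration only.)
func combineOR(lo, hi uint16) uint16 { return lo | hi }

func main() {
	// Assumed scenario: a pattern lives only in a low-lane bucket and
	// matches at byte position 2, so the high lane reports nothing.
	lo := uint16(0b0000_0100)
	hi := uint16(0)

	fmt.Println("AND keeps candidate:", combineAND(lo, hi) != 0)
	fmt.Println("OR keeps candidate: ", combineOR(lo, hi) != 0)
}
```

With AND, the single-lane candidate is dropped, which is the failure mode the commit message describes for patterns assigned to one lane only.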
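The cascading prefix trim from the commit list can be sketched as follows. This is a simplified illustration written from the commit message alone: the limit of 64 and the 4 → 3 → 2 order come from that message, while cascadeTrim and trimAndDedup are hypothetical names and may not match the actual implementation.

```go
package main

import "fmt"

// maxLiterals is the literal limit quoted in the commit message.
const maxLiterals = 64

// trimAndDedup truncates every literal to at most n bytes and removes
// duplicates, preserving first-seen order.
func trimAndDedup(lits []string, n int) []string {
	seen := make(map[string]struct{}, len(lits))
	out := make([]string, 0, len(lits))
	for _, l := range lits {
		if len(l) > n {
			l = l[:n]
		}
		if _, dup := seen[l]; dup {
			continue
		}
		seen[l] = struct{}{}
		out = append(out, l)
	}
	return out
}

// cascadeTrim tries prefix lengths 4, 3, 2 in that order and returns the
// first trimmed set that fits within maxLiterals. If no attempt fits,
// it falls back to the original literals.
func cascadeTrim(lits []string) []string {
	if len(lits) <= maxLiterals {
		return lits
	}
	for _, n := range []int{4, 3, 2} {
		if t := trimAndDedup(lits, n); len(t) <= maxLiterals {
			return t
		}
	}
	return lits
}

func main() {
	// 100 made-up literals that all share the 4-byte prefix "abcd".
	var lits []string
	for i := 0; i < 100; i++ {
		lits = append(lits, fmt.Sprintf("abcd%03d", i))
	}
	fmt.Println(len(lits), "->", len(cascadeTrim(lits)))
}
```

Trying the longest prefix first keeps the prefilter literals as selective as possible; shorter prefixes are only used when the longer ones still exceed the limit.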

Commits on Mar 18, 2026
