* fix: FatTeddy AVX2 ANDL→ORL — correct lane combining for 33-64 patterns
teddyFatAVX2_2 used ANDL to combine low/high 128-bit lane candidate masks.
ANDL requires a match in BOTH lanes, so candidates for patterns assigned to a
single lane (e.g., GET variants in buckets 0-7, low lane only) were zeroed out.
Fix: use ORL, so a candidate exists if EITHER lane has a match.
Also adds DECQ SI at handle_tail to cover prev0 carry-over position.
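Scalar illustration of the lane-combining bug (a sketch only, not the ASM). It assumes each lane yields a 16-bit mask with bit i set when byte i of the chunk is a candidate; combineAND and combineOR are hypothetical names standing in for ANDL and ORL:

```go
package main

import "fmt"

// combineAND mirrors the old ANDL behaviour: a position survives only if
// both the low lane (buckets 0-7) and the high lane (buckets 8-15) flagged it.
func combineAND(lo, hi uint16) uint16 { return lo & hi }

// combineOR mirrors the ORL fix: a position survives if either lane flagged it.
func combineOR(lo, hi uint16) uint16 { return lo | hi }

func main() {
	// Hypothetical masks: bit i set means "candidate at byte i of the chunk".
	lo := uint16(0b0000_0000_0000_0100) // a low-lane bucket matched at byte 2
	hi := uint16(0)                     // no high-lane bucket matched
	fmt.Printf("AND: %016b\n", combineAND(lo, hi))
	fmt.Printf("OR:  %016b\n", combineOR(lo, hi))
}
```

With a pattern that lives only in the low lane, the AND result is empty while the OR result keeps the candidate.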
* test: FatTeddy AVX2 regression tests — lane combining + case-fold
- TestFatTeddyAVX2_SingleLaneBuckets: verifies GET/POST/PUT found
with patterns distributed across specific lanes
- TestFatTeddyCaseFoldRegression: validates match count vs stdlib
for (?i)get|post|put via meta-engine (750/750 on 1000 log lines)
* refactor: remove AC replacement and small-haystack fallback
FatTeddy AVX2 is now correct (ORL fix). Remove:
- buildACPrefilter: replaced ALL FatTeddy with AC (workaround)
- ac_prefilter.go: AC wrapper type (unused)
- fatTeddyFallback path in findIndicesTeddyAt (FatTeddy handles all sizes)
LangArena methods (?i)get|post|put: AC 41ms → FatTeddy 11ms
* perf: prefilter candidate loop in isMatchDFA with anchored verification
isMatchDFA now loops through prefilter candidates and verifies each with an
anchored DFA, instead of running a single unanchored PikeVM scan from the first candidate.
Issue #137 match case: 176us → 27us (6.5x faster, 3.7x faster than stdlib)
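Shape of the candidate loop, as a minimal sketch. The names isMatchCandidates, nextGET and verifyGET are hypothetical; a literal search and a prefix check stand in for the real prefilter and the anchored DFA:

```go
package main

import (
	"bytes"
	"fmt"
)

// isMatchCandidates asks next for the following candidate offset at or after
// 'at' (-1 when there is none) and lets verify decide whether a match is
// anchored exactly at that offset. It stops at the first verified candidate.
func isMatchCandidates(hay []byte, next func(hay []byte, at int) int, verify func(hay []byte, pos int) bool) bool {
	for at := 0; at <= len(hay); {
		pos := next(hay, at)
		if pos < 0 {
			return false
		}
		if verify(hay, pos) {
			return true
		}
		at = pos + 1 // rejected candidate: resume just past it
	}
	return false
}

// nextGET is a toy prefilter: the next occurrence of the literal "GET".
func nextGET(hay []byte, at int) int {
	i := bytes.Index(hay[at:], []byte("GET"))
	if i < 0 {
		return -1
	}
	return at + i
}

// verifyGET is a toy anchored check for the pattern "GET /" at pos.
func verifyGET(hay []byte, pos int) bool {
	return bytes.HasPrefix(hay[pos:], []byte("GET /"))
}

func main() {
	fmt.Println(isMatchCandidates([]byte("GETTER then GET /index"), nextGET, verifyGET))
	fmt.Println(isMatchCandidates([]byte("GETTER only"), nextGET, verifyGET))
}
```

Each rejected candidate costs only a short anchored check, so the scan never restarts an unanchored search over the rest of the haystack.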
* perf: prefilter candidate loop in findIndicesDFA for large NFA (>100 states)
For patterns with >100 NFA states (e.g., (?i) case-fold), the bidirectional DFA
cache-thrashes. A prefilter candidate loop with anchored DFA verification
avoids building the full DFA state table.
Guard: nfaStateCount > 100 — small NFAs (34-57 states) use bidirectional
DFA directly (no cache thrashing, lower overhead than candidate loop).
Also adds nfaStateCount field to Engine for the guard check.
* perf: Rust-style cascading prefix trim for >64 literals
When >64 literals extracted, try cascading trim: keep 4 bytes → dedup,
if still >64 → keep 3 bytes → dedup, if still >64 → keep 2 bytes → dedup.
Falls back to the original literal set if trimming can't reduce it to 64 or fewer.
Matches Rust's optimize_for_prefix_by_preference ATTEMPTS table.
auth_attempts (?i)/login|/signin: 128 literals → 18 four-byte → Teddy.
34ms → 10ms.
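Sketch of the cascade (illustrative only; trimPrefixes, cascadeTrim and caseVariants are hypothetical helpers, not the engine's functions). The toy ASCII-only case expansion does not reproduce the engine's literal extraction, so its counts differ from the 128 → 18 reported above:

```go
package main

import "fmt"

// trimPrefixes keeps at most n leading bytes of each literal and removes
// duplicates, preserving first-seen order.
func trimPrefixes(lits []string, n int) []string {
	seen := make(map[string]struct{}, len(lits))
	out := make([]string, 0, len(lits))
	for _, l := range lits {
		if len(l) > n {
			l = l[:n]
		}
		if _, dup := seen[l]; dup {
			continue
		}
		seen[l] = struct{}{}
		out = append(out, l)
	}
	return out
}

// cascadeTrim tries 4-, 3-, then 2-byte prefixes and returns the first
// deduplicated set that fits within limit. If none fits, or the input
// already fits, the original set is returned unchanged.
func cascadeTrim(lits []string, limit int) []string {
	if len(lits) <= limit {
		return lits
	}
	for _, keep := range []int{4, 3, 2} {
		if t := trimPrefixes(lits, keep); len(t) <= limit {
			return t
		}
	}
	return lits
}

// caseVariants expands an ASCII string into all upper/lower combinations
// (a toy stand-in for case-fold literal extraction).
func caseVariants(s string) []string {
	out := []string{""}
	for i := 0; i < len(s); i++ {
		lo, up := s[i], s[i]
		if lo >= 'a' && lo <= 'z' {
			up = lo - 32
		} else if lo >= 'A' && lo <= 'Z' {
			lo = up + 32
		}
		var next []string
		for _, p := range out {
			next = append(next, p+string(rune(lo)))
			if up != lo {
				next = append(next, p+string(rune(up)))
			}
		}
		out = next
	}
	return out
}

func main() {
	lits := append(caseVariants("/login"), caseVariants("/signin")...)
	trimmed := cascadeTrim(lits, 64)
	fmt.Println(len(lits), "->", len(trimmed))
}
```

Shorter prefixes produce more false-positive candidates, which is why the cascade starts at 4 bytes and only shortens when the set still does not fit.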
* perf: VPTEST in FatTeddy hot loop — replace 8-instruction candidate detection
VPTEST Y6,Y6 (1 instruction, sets ZF if all zero) replaces:
VPXOR+VPCMPEQB+VPMOVMSKB+NOTL+MOVL+ANDL+SHRL+ORL (8 instructions).
ORL lane combining moved to found_candidate cold path (only on actual
candidates, not every 16-byte chunk).
FatTeddy 35 patterns on 6MB: 55ms → 41ms (24% faster).
* wip: FatTeddy batch ASM + FindAllPositions (intermediate save)
Batch ASM: fatTeddyAVX2_2_batch scans entire haystack in one call,
writes all (pos, bucketMask) candidates to pre-allocated buffer.
Eliminates Go→ASM round trip overhead (11.6 GB/s SIMD vs 283 MB/s with round trips).
FindAllPositions: batch-based FindAll using single ASM call.
Find: uses original per-candidate path (batch too slow for first-match).
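Sketch of the batch calling convention in plain Go (not the ASM kernel). The candidate struct, scanBatch signature and first-byte table are assumptions for illustration; the real kernel matches with SIMD, not a per-byte table lookup:

```go
package main

import "fmt"

// candidate is a hypothetical (pos, bucketMask) record written by the scan.
type candidate struct {
	pos        int
	bucketMask uint16
}

// scanBatch scans hay from start in one call and writes candidates into the
// caller's pre-allocated buf. It returns how many were written and the offset
// to resume from (len(hay) once the scan is complete).
func scanBatch(hay []byte, start int, firstByte *[256]uint16, buf []candidate) (n, resume int) {
	for i := start; i < len(hay); i++ {
		m := firstByte[hay[i]]
		if m == 0 {
			continue
		}
		if n == len(buf) {
			return n, i // buffer full: caller drains buf and resumes at i
		}
		buf[n] = candidate{pos: i, bucketMask: m}
		n++
	}
	return n, len(hay)
}

// demoTable maps 'G' to bucket 0 and 'P' to bucket 1 (toy bucket assignment).
func demoTable() *[256]uint16 {
	var t [256]uint16
	t['G'] = 1 << 0
	t['P'] = 1 << 1
	return &t
}

func main() {
	buf := make([]candidate, 8) // allocated once, reused across calls
	n, resume := scanBatch([]byte("GET POST PUT"), 0, demoTable(), buf)
	fmt.Println(buf[:n], resume)
}
```

The point of the shape is that the per-call overhead is paid once per buffer of candidates, not once per candidate.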
* deps: upgrade ahocorasick v0.1.0 → v0.2.1 (DFA + SIMD prefilter, 11-22x)
Flat DFA with premultiplied state IDs, match flag in high bit,
SIMD skip-ahead prefilter via bytes.IndexByte.
Throughput: Find 3.4 GB/s, IsMatch 5.9-7.0 GB/s (was 260-545 MB/s).
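Sketch of the table layout described above, for a toy DFA that finds the literal "ab". This is an assumption-laden illustration, not the ahocorasick library's code; stride, matchFlag, classes, table and findEnd are hypothetical names:

```go
package main

import (
	"bytes"
	"fmt"
)

const (
	stride    = 3               // byte classes: 0 = other, 1 = 'a', 2 = 'b'
	matchFlag = uint32(1) << 31 // high bit marks "this transition enters a match state"
)

// classes maps each input byte to its byte class.
var classes = func() (c [256]uint8) {
	c['a'], c['b'] = 1, 2
	return
}()

// table is one flat slice of transitions. State IDs are premultiplied by
// stride (0, 3, 6), so a lookup is table[id+class] with no multiplication.
var table = [...]uint32{
	0, 3, 0, // ID 0: nothing seen
	0, 3, 2*stride | matchFlag, // ID 3: seen 'a'; 'b' completes "ab"
	0, 3, 0, // ID 6: match state
}

// findEnd returns the end offset of the first "ab" in hay, or -1.
func findEnd(hay []byte) int {
	// Skip-ahead: jump to the first byte that can start a match.
	i := bytes.IndexByte(hay, 'a')
	if i < 0 {
		return -1
	}
	state := uint32(0)
	for ; i < len(hay); i++ {
		// Strip the flag before indexing. It is never set here because we
		// return on the first match, but a FindAll-style loop would need it.
		state = table[(state&^matchFlag)+uint32(classes[hay[i]])]
		if state&matchFlag != 0 {
			return i + 1
		}
	}
	return -1
}

func main() {
	fmt.Println(findEnd([]byte("xxaab")), findEnd([]byte("ba")))
}
```

Carrying the match flag in the state ID means the hot loop checks for a match with one bit test and no separate lookup.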
* docs: release v0.12.13 — changelog, roadmap, lint fix
* fix: add hasAVX2 and batch stubs for non-amd64 platforms (macOS ARM64)