Comparing changes

base repository: coregx/coregex
base: v0.12.12
head repository: coregx/coregex
compare: v0.12.13
  • 1 commit
  • 19 files changed
  • 1 contributor

Commits on Mar 18, 2026

  1. fix: FatTeddy AVX2 + prefilter acceleration + AC v0.2.1 (#143)

    * fix: FatTeddy AVX2 ANDL→ORL — correct lane combining for 33-64 patterns
    
    teddyFatAVX2_2 used ANDL to combine low/high 128-bit lane candidate masks.
    This required a match in BOTH lanes, so candidates for patterns assigned to
    a single lane (e.g., GET variants in buckets 0-7 = low lane only) were zeroed out.
    Fix: ORL — candidate exists if EITHER lane has a match.
    
    Also adds DECQ SI at handle_tail to cover prev0 carry-over position.
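
The lane-combining bug can be illustrated in plain Go. This is a minimal sketch, not the assembly from the commit: it assumes each 128-bit lane has already been reduced to a 16-bit mask with one bit per haystack byte, and the names `combineAND` and `combineOR` are hypothetical.

```go
package main

import "fmt"

// combineAND models the old behaviour: a byte position survives only if
// BOTH lanes flagged it.
func combineAND(lo, hi uint16) uint16 { return lo & hi }

// combineOR models the fix: a byte position is a candidate if EITHER lane
// flagged it.
func combineOR(lo, hi uint16) uint16 { return lo | hi }

func main() {
	// Hypothetical masks for one 16-byte chunk. Bit i set means
	// "some bucket in this lane matched at byte i".
	var lo uint16 = 1 << 4 // low lane (buckets 0-7) hit at byte 4
	var hi uint16 = 0      // high lane (buckets 8-15) saw nothing

	fmt.Printf("AND: %016b\n", combineAND(lo, hi)) // candidate lost
	fmt.Printf("OR:  %016b\n", combineOR(lo, hi))  // candidate kept
}
```

With AND, a pattern that lives only in the low-lane buckets can never produce a candidate unless some high-lane pattern happens to hit the same byte, which is the failure described above.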
    
    * test: FatTeddy AVX2 regression tests — lane combining + case-fold
    
    - TestFatTeddyAVX2_SingleLaneBuckets: verifies GET/POST/PUT found
      with patterns distributed across specific lanes
    - TestFatTeddyCaseFoldRegression: validates match count vs stdlib
      for (?i)get|post|put via meta-engine (750/750 on 1000 log lines)
    
    * refactor: remove AC replacement and small-haystack fallback
    
    FatTeddy AVX2 is now correct (ORL fix). Remove:
    - buildACPrefilter: replaced ALL FatTeddy with AC (workaround)
    - ac_prefilter.go: AC wrapper type (unused)
    - fatTeddyFallback path in findIndicesTeddyAt (FatTeddy handles all sizes)
    
    LangArena methods (?i)get|post|put: AC 41ms → FatTeddy 11ms
    
    * perf: prefilter candidate loop in isMatchDFA with anchored verification
    
    isMatchDFA now loops through prefilter candidates and verifies each one with
    an anchored DFA, instead of running a single unanchored PikeVM scan from the
    first candidate.
    
    Issue #137 match case: 176us → 27us (6.5x faster, 3.7x faster than stdlib)
    
    * perf: prefilter candidate loop in findIndicesDFA for large NFA (>100 states)
    
    For patterns with >100 NFA states (e.g., (?i) case-fold), bidirectional DFA
    cache-thrashes. Prefilter candidate loop with anchored DFA verification
    avoids building full DFA state table.
    
    Guard: nfaStateCount > 100 — small NFAs (34-57 states) use bidirectional
    DFA directly (no cache thrashing, lower overhead than candidate loop).
    
    Also adds nfaStateCount field to Engine for the guard check.
    
    * perf: Rust-style cascading prefix trim for >64 literals
    
    When >64 literals extracted, try cascading trim: keep 4 bytes → dedup,
    if still >64 → keep 3 bytes → dedup, if still >64 → keep 2 bytes → dedup.
    Falls back to the original literal set if trimming cannot bring the count
    to 64 or fewer.
    
    Matches Rust's optimize_for_prefix_by_preference ATTEMPTS table.
    auth_attempts (?i)/login|/signin: 128 literals → 18 four-byte → Teddy.
    34ms → 10ms.
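
The cascade described above can be sketched as follows. This is a simplified model, not the commit's implementation: `trimPrefixes` and its `limit` parameter are hypothetical, and it ignores any exact/inexact bookkeeping a real literal extractor would need once literals are shortened.

```go
package main

import "fmt"

// trimPrefixes tries keeping 4, then 3, then 2 leading bytes of each
// literal, deduplicating after each attempt. It returns the first result
// with at most limit literals, or the original slice if none qualifies.
func trimPrefixes(lits []string, limit int) []string {
	if len(lits) <= limit {
		return lits
	}
	for _, keep := range []int{4, 3, 2} {
		seen := make(map[string]struct{}, len(lits))
		var out []string
		for _, l := range lits {
			if len(l) > keep {
				l = l[:keep]
			}
			if _, dup := seen[l]; !dup {
				seen[l] = struct{}{}
				out = append(out, l)
			}
		}
		if len(out) <= limit {
			return out
		}
	}
	return lits // trimming did not help; keep the original set
}

func main() {
	lits := []string{"/login", "/logout", "/logs", "/signin", "/signup"}
	fmt.Println(trimPrefixes(lits, 2)) // limit of 2 chosen only to keep the demo small
}
```

Shortened literals match more text than the originals, so they are only suitable as a prefilter whose hits are verified afterwards.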
    
    * perf: VPTEST in FatTeddy hot loop — replace 8-instruction candidate detection
    
    VPTEST Y6,Y6 (1 instruction, sets ZF if all zero) replaces:
    VPXOR+VPCMPEQB+VPMOVMSKB+NOTL+MOVL+ANDL+SHRL+ORL (8 instructions).
    
    ORL lane combining moved to found_candidate cold path (only on actual
    candidates, not every 16-byte chunk).
    
    FatTeddy 35 patterns on 6MB: 55ms → 41ms (24% faster).
    
    * wip: FatTeddy batch ASM + FindAllPositions (intermediate save)
    
    Batch ASM: fatTeddyAVX2_2_batch scans entire haystack in one call,
    writes all (pos, bucketMask) candidates to pre-allocated buffer.
    Eliminates Go→ASM round trip overhead (11.6 GB/s SIMD vs 283 MB/s with round trips).
    
    FindAllPositions: batch-based FindAll using single ASM call.
    Find: uses original per-candidate path (batch too slow for first-match).
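
The batch shape can be sketched in pure Go. This is an assumption-laden stand-in for the assembly routine: the scalar table lookup replaces the SIMD scan, and `candidate`, `scanBatch`, and the `table` layout are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// candidate records where a possible match starts and which buckets fired.
type candidate struct {
	pos        int
	bucketMask uint16
}

// scanBatch walks the whole haystack in one call and writes every
// candidate into buf, reusing its capacity instead of returning to the
// caller after each hit.
func scanBatch(haystack []byte, table *[256]uint16, buf []candidate) []candidate {
	buf = buf[:0]
	for i, b := range haystack {
		if m := table[b]; m != 0 {
			buf = append(buf, candidate{pos: i, bucketMask: m})
		}
	}
	return buf
}

func main() {
	var table [256]uint16
	table['G'] = 1 << 0 // bucket 0: patterns starting with 'G'
	table['P'] = 1 << 1 // bucket 1: patterns starting with 'P'

	buf := make([]candidate, 0, 64) // pre-allocated once, reused per call
	for _, c := range scanBatch([]byte("GET a POST b PUT c"), &table, buf) {
		fmt.Printf("pos=%d mask=%04b\n", c.pos, c.bucketMask)
	}
}
```

A first-match search gains little from this shape because it pays for scanning the entire haystack before the first candidate can be verified, which is consistent with Find keeping the per-candidate path.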
    
    * deps: upgrade ahocorasick v0.1.0 → v0.2.1 (DFA + SIMD prefilter, 11-22x)
    
    Flat DFA with premultiplied state IDs, match flag in high bit,
    SIMD skip-ahead prefilter via bytes.IndexByte.
    
    Throughput: Find 3.4 GB/s, IsMatch 5.9-7.0 GB/s (was 260-545 MB/s).
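
A flat DFA with premultiplied state IDs and a match flag in the high bit can be sketched as below. This is a hand-built toy automaton for the single literal "ab", not the ahocorasick library's actual layout; `stride`, `matchFlag`, `buildAB`, and `isMatch` are hypothetical names.

```go
package main

import "fmt"

const (
	stride    = 256             // one table row per state, one column per byte
	matchFlag = uint32(1) << 31 // high bit marks "this transition enters a match"
)

// buildAB returns the transition table of an unanchored DFA for "ab".
// State IDs are premultiplied: a state's ID is already its row offset,
// so a lookup is trans[state+byte] with no multiplication.
func buildAB() []uint32 {
	const s0, s1, s2 = 0 * stride, 1 * stride, 2 * stride
	trans := make([]uint32, 3*stride) // zero value sends every byte to s0
	for _, s := range []int{s0, s1, s2} {
		trans[s+'a'] = s1 // 'a' always (re)starts a potential match
	}
	trans[s1+'b'] = s2 | matchFlag // "ab" seen: flag the transition
	return trans
}

// isMatch runs the DFA and stops at the first flagged transition.
func isMatch(trans []uint32, haystack []byte) bool {
	state := uint32(0)
	for _, b := range haystack {
		next := trans[state+uint32(b)]
		if next&matchFlag != 0 {
			return true
		}
		state = next
	}
	return false
}

func main() {
	dfa := buildAB()
	fmt.Println(isMatch(dfa, []byte("xxaab")), isMatch(dfa, []byte("aXb")))
}
```

Folding the match flag into the transition value lets the inner loop test for a match with one mask of a value it has already loaded, rather than consulting a separate per-state table.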
    
    * docs: release v0.12.13 — changelog, roadmap, lint fix
    
    * fix: add hasAVX2 and batch stubs for non-amd64 platforms (macOS ARM64)
    kolkov authored Mar 18, 2026
    7a29fab