fix(classifiers): correct [CLS]/[SEP] framing for all NER chunks and wire regex+NER union merge #2258

Merged
bug-ops merged 3 commits into main from 2247-ner-chunk-cls-sep on Mar 27, 2026


bug-ops commented Mar 27, 2026

Closes #2247, #2248.

Summary

  • fix(classifiers): middle NER chunks lack [CLS]/[SEP] framing #2247: Every NER chunk (first, middle, last) is now framed with [CLS] at position 0 and [SEP] at end before the DeBERTa forward pass. Special-token labels are stripped from token_labels before BIO decode, eliminating spurious entity spans at chunk boundaries. Same fix applied to Phase 1 CandleClassifier injection detection. Chunk constants extracted to shared classifier/mod.rs.

  • feat(classifiers): wire regex+NER union merge into ContentSanitizer #2248: ContentSanitizer::sanitize() now uses a unified regex+NER union merge pipeline when pii_enabled = true: PiiFilter::detect_spans() and CandlePiiClassifier NER run sequentially, span lists are merged with O(n) char→byte precompute, overlapping spans are deduped, and a single-pass redaction is applied. Eliminates the double-redaction offset corruption from the prior independent-path design.
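The chunk framing described in the first item above can be illustrated with a minimal Rust sketch. This is not the code from this PR: the names `CLS_ID`, `SEP_ID`, `frame_chunks`, and `strip_special_labels` and the token id values are assumptions made for illustration (real ids come from the tokenizer vocabulary), and the sketch uses non-overlapping chunks for simplicity.

```rust
// Hypothetical special-token ids; real values come from the tokenizer vocabulary.
const CLS_ID: u32 = 1;
const SEP_ID: u32 = 2;

/// Split `token_ids` (which contain no special tokens) into chunks of at most
/// `max_len` tokens, each framed as [CLS] ... [SEP].
/// `max_len` counts the two special tokens.
fn frame_chunks(token_ids: &[u32], max_len: usize) -> Vec<Vec<u32>> {
    assert!(max_len > 2, "max_len must leave room for [CLS] and [SEP]");
    let body_len = max_len - 2;
    token_ids
        .chunks(body_len)
        .map(|chunk| {
            // Every chunk gets the same framing: first, middle, and last alike.
            let mut framed = Vec::with_capacity(chunk.len() + 2);
            framed.push(CLS_ID);
            framed.extend_from_slice(chunk);
            framed.push(SEP_ID);
            framed
        })
        .collect()
}

/// Drop the labels predicted at the [CLS] and [SEP] positions of one framed
/// chunk, so that only labels for real tokens reach BIO decoding.
fn strip_special_labels<T: Clone>(labels: &[T]) -> Vec<T> {
    if labels.len() < 2 {
        return Vec::new();
    }
    labels[1..labels.len() - 1].to_vec()
}
```

With `max_len = 5`, five input tokens yield two framed chunks, and each chunk's label vector is trimmed by one position at each end before the per-chunk labels are concatenated for decoding.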

Test plan

  • 21 new tests: NER chunk framing (all positions, boundary, special-label stripping), span merge (overlapping, adjacent, contained, empty), char→byte map (ASCII, Unicode), single-pass redact
  • 6752 tests passed (was 6731, +21)
  • cargo clippy --workspace --features full -- -D warnings: clean
  • cargo +nightly fmt --check: clean
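For readers unfamiliar with the char→byte map exercised by the tests above, here is a minimal sketch of the idea, assuming NER spans arrive as character offsets while Rust string slicing needs byte offsets. The function names `char_to_byte_map` and `char_span_to_bytes` are hypothetical and not taken from the PR.

```rust
/// Build a lookup table in one O(n) pass: entry `i` is the byte offset of the
/// `i`-th char, and the final entry equals `text.len()` so that exclusive end
/// offsets can be looked up too.
fn char_to_byte_map(text: &str) -> Vec<usize> {
    let mut map: Vec<usize> = text.char_indices().map(|(byte, _)| byte).collect();
    map.push(text.len());
    map
}

/// Convert a half-open char span to a half-open byte span.
/// Returns `None` if the span is inverted or out of range.
fn char_span_to_bytes(map: &[usize], start: usize, end: usize) -> Option<(usize, usize)> {
    if start > end {
        return None;
    }
    Some((*map.get(start)?, *map.get(end)?))
}
```

Building the table once per input keeps each span conversion O(1), instead of rescanning the string for every span.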

github-actions bot added labels on Mar 27, 2026: documentation (Improvements or additions to documentation), llm (zeph-llm crate (Ollama, Claude)), rust (Rust code changes), core (zeph-core crate), bug (Something isn't working), size/XL (Extra large PR (500+ lines))
bug-ops linked an issue Mar 27, 2026 that may be closed by this pull request
bug-ops enabled auto-merge (squash) March 27, 2026 12:14
bug-ops force-pushed the 2247-ner-chunk-cls-sep branch 2 times, most recently from 3e40eb6 to 2af0826, March 27, 2026 12:39
bug-ops added 3 commits March 27, 2026 13:40
…wire regex+NER union merge

Closes #2247, #2248.

- Every chunk (first, middle, last) now framed with [CLS]/[SEP] before DeBERTa
  forward pass; special-token labels stripped from token_labels before BIO decode
  to eliminate spurious entity spans at chunk boundaries. Same fix applied to
  Phase 1 CandleClassifier injection detection. Chunk constants extracted to
  shared classifier/mod.rs.

- ContentSanitizer::sanitize() now uses unified regex+NER union merge when
  pii_enabled=true: PiiFilter::detect_spans() + CandlePiiClassifier NER run
  sequentially, span lists merged with O(n) char-to-byte precompute, overlapping
  spans deduped, single-pass redaction. Eliminates double-redaction offset
  corruption from prior independent-path design.
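
The union merge and single-pass redaction described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `Span`, `merge_spans`, and `redact` are hypothetical names, spans are half-open byte ranges on char boundaries, and adjacent spans are merged along with overlapping ones.

```rust
/// Half-open byte range `[start, end)` into the original text (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Span {
    start: usize,
    end: usize,
}

/// Union of regex and NER spans: sort by start, then fold overlapping,
/// contained, or adjacent spans into one. Empty spans are dropped.
fn merge_spans(mut spans: Vec<Span>) -> Vec<Span> {
    spans.retain(|s| s.start < s.end);
    spans.sort_by_key(|s| (s.start, s.end));
    let mut merged: Vec<Span> = Vec::with_capacity(spans.len());
    for span in spans {
        if let Some(last) = merged.last_mut() {
            if span.start <= last.end {
                last.end = last.end.max(span.end);
                continue;
            }
        }
        merged.push(span);
    }
    merged
}

/// Redact all spans in a single left-to-right pass. Offsets always refer to
/// the original text, so earlier replacements cannot shift later spans.
/// Assumes `merged` is sorted, non-overlapping, and on char boundaries.
fn redact(text: &str, merged: &[Span], placeholder: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut cursor = 0;
    for span in merged {
        out.push_str(&text[cursor..span.start]);
        out.push_str(placeholder);
        cursor = span.end;
    }
    out.push_str(&text[cursor..]);
    out
}
```

Redacting once against the original offsets is what avoids the offset corruption mentioned above: two independent redaction passes would have the second pass apply offsets computed on text that the first pass had already shortened or lengthened.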
…n config field

Restores items dropped during auto-merge with origin/main:
- pub mod ner declaration in classifier/mod.rs
- NerSpan struct definition in classifier/mod.rs
- spans field on ClassificationResult (with vec![] default for sequence classifiers)
- Aligns apply_pii_ner_classifier to use classifiers.pii_model (ner_model was unified
  into pii_model in Phase 2 PR #2251)
- Preserves both apply_pii_classifier and apply_pii_ner_classifier calls in runner.rs
- Keeps with_pii_detector and with_pii_ner_classifier builder methods in agent/builder.rs
…ifiers) block

Without the classifiers feature the import and mut were flagged as unused
by -D warnings (CI bundle checks). Move the import inside the cfg block
and use cfg_attr to suppress the unused_mut lint on the spans binding.
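
The pattern this commit describes can be sketched as below. It is a hedged illustration, not the actual code: `digit_spans` and `detect_spans` are hypothetical stand-ins, and the body of the `cfg` block only imitates a classifier contributing extra spans.

```rust
/// Hypothetical baseline detector: byte spans of ASCII digit runs.
fn digit_spans(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start = None;
    for (i, b) in text.bytes().enumerate() {
        match (b.is_ascii_digit(), start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                spans.push((s, i));
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        spans.push((s, text.len()));
    }
    spans
}

fn detect_spans(text: &str) -> Vec<(usize, usize)> {
    // `spans` is only mutated when the `classifiers` feature is on, so the
    // `unused_mut` lint is silenced only in the configuration without it.
    #[cfg_attr(not(feature = "classifiers"), allow(unused_mut))]
    let mut spans = digit_spans(text);

    #[cfg(feature = "classifiers")]
    {
        // The import lives inside the cfg block, so it is never flagged as
        // unused when the feature is disabled.
        use std::iter;
        // Stand-in for classifier output (assumption for illustration).
        spans.extend(iter::once((0, text.len())));
    }

    spans
}
```

Scoping the `allow` with `cfg_attr` keeps the lint active in feature-enabled builds, where an unused `mut` would indicate a real problem.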
bug-ops force-pushed the 2247-ner-chunk-cls-sep branch from 2af0826 to 6009df3, March 27, 2026 12:41
bug-ops merged commit 913a33a into main Mar 27, 2026
25 checks passed
bug-ops deleted the 2247-ner-chunk-cls-sep branch March 27, 2026 12:49