fix(classifiers): correct [CLS]/[SEP] framing for all NER chunks and wire regex+NER union merge#2258
Merged
…wire regex+NER union merge

Closes #2247, #2248.

- Every chunk (first, middle, last) is now framed with [CLS]/[SEP] before the DeBERTa forward pass; special-token labels are stripped from token_labels before BIO decode to eliminate spurious entity spans at chunk boundaries. The same fix is applied to Phase 1 CandleClassifier injection detection. Chunk constants are extracted to the shared classifier/mod.rs.
- ContentSanitizer::sanitize() now uses a unified regex+NER union merge when pii_enabled=true: PiiFilter::detect_spans() and CandlePiiClassifier NER run sequentially, the span lists are merged with an O(n) char-to-byte precompute, overlapping spans are deduped, and redaction happens in a single pass. This eliminates the double-redaction offset corruption of the prior independent-path design.
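The chunk framing described in this commit can be illustrated with a minimal sketch. It is not the PR's code: the token ids, the chunk size, and the function names `frame_chunks` and `strip_special_labels` are hypothetical, and chunks are assumed not to overlap, which the commit message does not specify.

```rust
// Hypothetical values for illustration; the PR keeps its real chunk
// constants in classifier/mod.rs and takes token ids from the tokenizer.
const CLS_ID: u32 = 1;
const SEP_ID: u32 = 2;
const MAX_CHUNK_LEN: usize = 8; // total length, including [CLS] and [SEP]

/// Split content token ids into chunks and frame every chunk
/// (first, middle, last) as [CLS] body [SEP].
fn frame_chunks(content_ids: &[u32]) -> Vec<Vec<u32>> {
    let body_len = MAX_CHUNK_LEN - 2; // leave room for the two special tokens
    content_ids
        .chunks(body_len)
        .map(|body| {
            let mut framed = Vec::with_capacity(body.len() + 2);
            framed.push(CLS_ID);
            framed.extend_from_slice(body);
            framed.push(SEP_ID);
            framed
        })
        .collect()
}

/// Drop the labels predicted for [CLS] (first position) and [SEP] (last
/// position) so that BIO decoding only sees labels for real content tokens.
fn strip_special_labels<T>(token_labels: &[T]) -> &[T] {
    if token_labels.len() < 2 {
        return &[];
    }
    &token_labels[1..token_labels.len() - 1]
}

fn main() {
    let chunks = frame_chunks(&[10, 11, 12, 13, 14, 15, 16, 17]);
    for chunk in &chunks {
        println!("{:?}", chunk);
    }
    // Labels for one framed chunk: the first and last belong to [CLS]/[SEP].
    let labels = ["O", "B-NAME", "I-NAME", "O"];
    println!("{:?}", strip_special_labels(&labels));
}
```

Stripping by position works here only because every chunk carries exactly one leading and one trailing special token, which is the invariant this commit establishes.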
…n config field

Restores items dropped during auto-merge with origin/main:

- pub mod ner declaration in classifier/mod.rs
- NerSpan struct definition in classifier/mod.rs
- spans field on ClassificationResult (with vec![] default for sequence classifiers)
- Aligns apply_pii_ner_classifier to use classifiers.pii_model (ner_model was unified into pii_model in Phase 2 PR #2251)
- Preserves both apply_pii_classifier and apply_pii_ner_classifier calls in runner.rs
- Keeps with_pii_detector and with_pii_ner_classifier builder methods in agent/builder.rs
…ifiers) block

Without the classifiers feature, the import and the mut were flagged as unused by -D warnings (CI bundle checks). Move the import inside the cfg block and use cfg_attr to suppress the unused_mut lint on the spans binding.
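The pattern this commit describes might look roughly like the sketch below. The `Span` type, `regex_spans`, `detect`, and the `ner_stub` module are hypothetical stand-ins; only the `classifiers` feature name and the `spans` binding come from the commit message.

```rust
/// Hypothetical span type standing in for the PR's NerSpan.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

/// Stand-in for the regex detector: flags the first literal "secret".
fn regex_spans(text: &str) -> Vec<Span> {
    text.find("secret")
        .map(|start| Span { start, end: start + "secret".len() })
        .into_iter()
        .collect()
}

/// Hypothetical NER path, compiled only when the feature is enabled.
#[cfg(feature = "classifiers")]
mod ner_stub {
    use super::Span;

    pub fn ner_spans(_text: &str) -> Vec<Span> {
        Vec::new()
    }
}

fn detect(text: &str) -> Vec<Span> {
    // Without the feature nothing mutates `spans`, so a plain `let mut`
    // would trip `unused_mut` under `-D warnings`. The cfg_attr applies the
    // allow only in builds where the feature is off.
    #[cfg_attr(not(feature = "classifiers"), allow(unused_mut))]
    let mut spans = regex_spans(text);

    #[cfg(feature = "classifiers")]
    {
        // The import sits inside the gated block, so it cannot be reported
        // as unused when the block is compiled out.
        use ner_stub::ner_spans;
        spans.extend(ner_spans(text));
    }

    spans
}

fn main() {
    println!("{:?}", detect("a secret"));
}
```

Scoping the `allow` with `cfg_attr` keeps the lint active in feature-enabled builds, where an unused `mut` would indicate a real mistake.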
Closes #2247, #2248.
Summary
- #2247 (fix(classifiers): middle NER chunks lack [CLS]/[SEP] framing): Every NER chunk (first, middle, last) is now framed with `[CLS]` at position 0 and `[SEP]` at the end before the DeBERTa forward pass. Special-token labels are stripped from `token_labels` before BIO decode, eliminating spurious entity spans at chunk boundaries. The same fix is applied to Phase 1 `CandleClassifier` injection detection. Chunk constants are extracted to the shared `classifier/mod.rs`.
- #2248 (feat(classifiers): wire regex+NER union merge into ContentSanitizer): `ContentSanitizer::sanitize()` now uses a unified regex+NER union merge pipeline when `pii_enabled = true`: `PiiFilter::detect_spans()` and `CandlePiiClassifier` NER run sequentially, the span lists are merged with an O(n) char→byte precompute, overlapping spans are deduped, and a single-pass redaction is applied. This eliminates the double-redaction offset corruption of the prior independent-path design.

Test plan
- `cargo clippy --workspace --features full -- -D warnings`: clean
- `cargo +nightly fmt --check`: clean
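For readers unfamiliar with the union merge summarized for #2248, here is a minimal sketch of the idea. It is not the `ContentSanitizer` implementation. The `Span` type, all function names, and the `[PII]` placeholder are hypothetical. It also assumes that the regex path reports byte offsets while the NER path reports character offsets, which is one plausible reading of the "char→byte precompute".

```rust
/// Hypothetical span in byte offsets, end exclusive.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

/// Build a char-index -> byte-offset table in one O(n) pass.
/// The extra final entry maps "one past the last char" to text.len().
fn char_to_byte_table(text: &str) -> Vec<usize> {
    let mut table: Vec<usize> = text.char_indices().map(|(byte, _)| byte).collect();
    table.push(text.len());
    table
}

/// Convert char-offset spans into byte-offset spans, skipping empty or
/// out-of-range spans.
fn to_byte_spans(char_spans: &[(usize, usize)], table: &[usize]) -> Vec<Span> {
    char_spans
        .iter()
        .filter(|&&(start, end)| start < end && end < table.len())
        .map(|&(start, end)| Span { start: table[start], end: table[end] })
        .collect()
}

/// Union-merge: sort by start, then fold overlapping spans together.
fn union_merge(mut spans: Vec<Span>) -> Vec<Span> {
    spans.sort_by_key(|s| (s.start, s.end));
    let mut merged: Vec<Span> = Vec::new();
    for span in spans {
        if let Some(last) = merged.last_mut() {
            if span.start < last.end {
                last.end = last.end.max(span.end);
                continue;
            }
        }
        merged.push(span);
    }
    merged
}

/// Redact every merged span in one left-to-right pass over the ORIGINAL
/// text, so no offset is ever applied to an already-modified string.
fn redact(text: &str, merged: &[Span], placeholder: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut cursor = 0;
    for span in merged {
        out.push_str(&text[cursor..span.start]);
        out.push_str(placeholder);
        cursor = span.end;
    }
    out.push_str(&text[cursor..]);
    out
}

/// Glue: regex spans (bytes) plus NER spans (chars) -> redacted text.
fn sanitize_sketch(text: &str, regex_spans: Vec<Span>, ner_char_spans: &[(usize, usize)]) -> String {
    let table = char_to_byte_table(text);
    let mut all = regex_spans;
    all.extend(to_byte_spans(ner_char_spans, &table));
    redact(text, &union_merge(all), "[PII]")
}

fn main() {
    let text = "Zoë: zoe@example.com, tel 555-0100";
    // Regex hits in bytes: the e-mail address and the phone number.
    let regex = vec![Span { start: 6, end: 21 }, Span { start: 27, end: 35 }];
    // NER hits in chars: the name, and a fragment overlapping the e-mail.
    let ner = [(0, 3), (9, 20)];
    println!("{}", sanitize_sketch(text, regex, &ner));
}
```

Because every offset refers to the original text and redaction happens once, a replacement never shifts a span found by the other detector. That shifting is the kind of offset corruption the summary attributes to the earlier independent-path design.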