fixing sentencepiece detection for transformers 5.0+ (still backwards compatible) #373

Merged
HenryNdubuaku merged 3 commits into main from sentencepiece-detection-transformers-5.0
Feb 23, 2026

Collaborator

@ncylich ncylich commented Feb 20, 2026

Tested with both Transformers 4.57.7 and 5.2.0 across the following models:

  • Gemma 3 270M
  • Qwen3 0.6B
  • LFM2 1.2B
  • Nomic Embed Text v2 MoE

The detection first checks whether the model is local and looks in its folder for a SentencePiece model, then checks the HF cache, and if nothing is cached there, falls back to the remote file list.
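For readers who want to see the lookup order concretely, here is a minimal sketch of that three-step search. It is not the code from this PR: the function name `find_sp_model_sketch`, the candidate file names in `SP_FILENAMES`, the injected `cache_lookup` and `list_remote_files` callables, and the choice to stop after step 1 for a local folder are all assumptions made for illustration.

```python
import os
from typing import Callable, Iterable, Optional

# Assumed candidate file names; the PR does not list which names it checks.
SP_FILENAMES = ("tokenizer.model", "spiece.model", "sentencepiece.bpe.model")


def find_sp_model_sketch(
    model_ref: str,
    cache_lookup: Optional[Callable[[str, str], object]] = None,
    list_remote_files: Optional[Callable[[str], Iterable[str]]] = None,
) -> Optional[str]:
    """Hypothetical helper: return a SentencePiece model location, or None."""
    # Step 1: a local model directory is searched directly.
    if os.path.isdir(model_ref):
        for name in SP_FILENAMES:
            path = os.path.join(model_ref, name)
            if os.path.isfile(path):
                return path
        # Assumption: a local folder is not also looked up on the Hub.
        return None

    # Step 2: ask the cache for each candidate file.
    # Only a string result counts as a hit; None or a sentinel is a miss.
    if cache_lookup is not None:
        for name in SP_FILENAMES:
            cached = cache_lookup(model_ref, name)
            if isinstance(cached, str):
                return cached

    # Step 3: fall back to the remote file list.
    # This returns the repo-relative file name, not a downloaded path.
    if list_remote_files is not None:
        remote = set(list_remote_files(model_ref))
        for name in SP_FILENAMES:
            if name in remote:
                return name

    return None
```

With `huggingface_hub` installed, `huggingface_hub.try_to_load_from_cache` and `huggingface_hub.list_repo_files` could plausibly be passed as `cache_lookup` and `list_remote_files`. Injecting them keeps the sketch runnable offline and does not depend on which Transformers version is installed.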

Copilot AI review requested due to automatic review settings February 20, 2026 01:55
@ncylich ncylich force-pushed the sentencepiece-detection-transformers-5.0 branch from 0314014 to afcedf3 Compare February 20, 2026 01:55
Contributor

Copilot AI left a comment

Pull request overview

Updates the HuggingFace tokenizer conversion logic to more reliably detect and locate SentencePiece .model files across newer Transformers versions, while aiming to stay compatible with older setups.

Changes:

  • Added _find_sp_model() to locate a SentencePiece model from a local directory, the HF cache, or the Hub repo file list.
  • Switched SentencePiece vs BPE branching in convert_hf_tokenizer() to rely on the discovered .model path rather than tokenizer attributes.
  • Adjusted Gemma special-token backfill to use the discovered SentencePiece model path.
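The branching change in the second bullet can be illustrated with a small sketch. The helper name `pick_tokenizer_branch` and its string return values are hypothetical and do not come from the PR. The point is only that the decision depends on whether a `.model` path was found, not on attributes of the loaded tokenizer object, which presumably differ between Transformers versions.

```python
from typing import Optional


def pick_tokenizer_branch(sp_model_path: Optional[str]) -> str:
    """Hypothetical helper: choose the conversion branch from the discovered path."""
    # A non-empty path means a SentencePiece .model file was located;
    # None (or an empty string) means the BPE branch applies.
    return "sentencepiece" if sp_model_path else "bpe"
```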

ncylich and others added 2 commits February 19, 2026 22:57
@HenryNdubuaku HenryNdubuaku merged commit addad5d into main Feb 23, 2026
1 of 2 checks passed
ncylich added a commit that referenced this pull request Feb 24, 2026
… compatible) (#373)

* fixing sentencepiece detection for transformers 5.0+ (still backwards compatible)

Signed-off-by: Noah Cylich <[email protected]>

* cleaned unused import and made file finding more consistent

Signed-off-by: Noah Cylich <[email protected]>

* cleanup

Signed-off-by: HenryNdubuaku <[email protected]>

---------

Signed-off-by: Noah Cylich <[email protected]>
Signed-off-by: HenryNdubuaku <[email protected]>
Co-authored-by: HenryNdubuaku <[email protected]>
cattermelon1234 pushed a commit to cattermelon1234/cactus that referenced this pull request Feb 28, 2026
… compatible) (cactus-compute#373)


3 participants