Path to this page:
./
wip/py-tokenizers,
TODO: Short description of the package
Branch: CURRENT,
Version: 0.22.1,
Package name: py313-tokenizers-0.22.1,
Maintainer: pkgsrc-usersProvides an implementation of today's most used tokenizers, with a
focus on performance and versatility.
Bindings over the Rust implementation. If you are interested in the
High-level design, you can go check it there.
Main features:
* Train new vocabularies and tokenize using 4 pre-made tokenizers
(Bert WordPiece and the 3 most common BPE versions).
* Extremely fast (both training and tokenization), thanks to the Rust
implementation. Takes less than 20 seconds to tokenize a GB of text
on a server's CPU.
* Easy to use, but also extremely versatile.
* Designed for research and production.
* Normalization comes with alignments tracking. It's always possible
to get the part of the original sentence that corresponds to a given
token.
* Does all the pre-processing: Truncate, Pad, add the special tokens
your model needs.
Master sites:
Filesize: 354.612 KB
Version history: (Expand)
- (2025-11-22) Package added to pkgsrc.se, version py313-tokenizers-0.22.1 (created)