Idea: Add str.tokenize() — a simple method for clean alphanumeric word tokenization
Summary: Add a small, Unicode-aware string method that returns “clean” tokens by extracting consecutive alphanumeric runs (with a few pragmatic internal characters), so common text-cleanup workflows don’t require regex boilerplate.
Note: This is a draft proposal; specifics may need verification or adjustment based on core-dev feedback and existing discussions.
Motivation
The built-in str.split() is excellent for splitting on whitespace, but in many modern Python tasks—search indexing, embeddings, word counts, basic NLP, LLM data prep—people immediately need tokens without attached punctuation. Today, that usually means reaching for re.findall(...) or manual stripping in loops.
Common workaround today
import re
tokens = re.findall(r"\w+", text) # or variations with strip/punctuation
This pattern shows up constantly in tutorials and real code because split() keeps punctuation attached to words (e.g., "hello,").
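The manual-stripping variant mentioned above typically looks something like this (an illustrative sketch, not code from any particular project; note that string.punctuation is ASCII-only, so typographic quotes, dashes, and ellipses slip through):
import string

text = 'He said "don\'t" — and left…'

# Peel ASCII punctuation off each whitespace-separated chunk, then drop empties.
# string.punctuation is ASCII-only, so the em-dash and the ellipsis survive intact.
tokens = [w.strip(string.punctuation) for w in text.split()]
tokens = [w for w in tokens if w]

print(tokens)
# ['He', 'said', "don't", '—', 'and', 'left…']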
Proposal
Add a simple string method for the most common tokenization need:
text.tokenize() # -> list[str]
Intended Behavior
- Split on any character that is not part of a valid token.
- A token is a maximal sequence consisting of alphanumeric characters (isalnum()), with dots (.), hyphens (-), and apostrophes allowed only when internal (i.e., surrounded by alphanumerics on both sides).
- Apostrophes include both the ASCII ' and the common typographic apostrophe ’.
- Quotes are not token characters:
  - Double quotes (") and typographic double quotes (“ ”) are always separators.
  - Single-quote characters are kept only when they function as internal apostrophes (rule above). Quoting like 'word' will still split cleanly to ['word'].
- Underscores (_) and all other non-alphanumeric characters (whitespace, punctuation, symbols, standalone dots/hyphens/apostrophes) act as separators.
- Collapse consecutive separators (no empty strings produced).
- Ignore leading and trailing separators.
- Empty input → [].
- Fully Unicode-aware: str.isalnum() correctly handles accented letters, non-Latin scripts, and digits from any language.
- Preserve original case (no lower-casing or normalization).
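For concreteness, the rules above are roughly what the following regex expresses (a sketch to aid discussion, not the proposed implementation; here [^\W_] means "Unicode letter or digit, excluding the underscore"):
import re

# Rough regex rendering of the intended behavior: an alphanumeric run,
# optionally continued by . - ' ’ when another alphanumeric run follows.
TOKEN_RE = re.compile(r"[^\W_]+(?:['’.\-][^\W_]+)*")

for sample in ["Don't stop!", 'He said "hi"', "file_name_v2", "Python-3.12, GPT-4o"]:
    print(TOKEN_RE.findall(sample))
# ["Don't", 'stop']
# ['He', 'said', 'hi']
# ['file', 'name', 'v2']
# ['Python-3.12', 'GPT-4o']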
Deliberate design choices (open for discussion):
- Dots, hyphens, and apostrophes are kept only internally — this preserves URLs (domain.com), versions (Python-3.12), model names (GPT-4o), compound terms (COVID-19), and contractions/possessives (don't, Jessica’s) while still stripping punctuation in normal prose.
- Quotes are separators — surrounding quotes are ignored without becoming part of tokens; typographic quotes don’t “stick” to words.
- Underscores are treated as separators — aligns with modern code style guides (snake_case identifiers are conceptually separate words) and avoids treating file_name as a single token.
- No underscore in tokens — diverges from regex \w for cleaner real-world output (most text processing wants “file name”, not “file_name”).
This behavior aims for a practical balance: simple enough for a stdlib method, yet smart enough to produce immediately usable tokens in the vast majority of real-world text (social media, web scraping, search indexing, LLM prep, log parsing, etc.).
Reference implementation (pure Python sketch)
def tokenize(text: str) -> list[str]:
    """Return a list of clean tokens from text.

    Tokens are sequences of alphanumeric characters.
    Dots (.), hyphens (-), and apostrophes (both ' and ’) are kept when they appear
    *between* alphanumerics (e.g., "Python-3.12", "domain.com", "COVID-19",
    "don't", "Jessica’s").
    Double quotes (" and “ ”) are separators. Underscores (_) and all other
    non-alphanumeric characters act as separators.
    """
    if not text:
        return []

    keep_internal = {".", "-", "'", "’"}
    double_quote_separators = {'"', "“", "”", "„", "‟", "«", "»", "‹", "›"}

    tokens: list[str] = []
    current: list[str] = []
    n = len(text)

    for i, ch in enumerate(text):
        if ch.isalnum():
            current.append(ch)
            continue
        if ch in double_quote_separators:
            if current:
                tokens.append("".join(current))
                current.clear()
            continue
        if ch in keep_internal:
            prev_is_alnum = bool(current) and current[-1].isalnum()
            next_is_alnum = i + 1 < n and text[i + 1].isalnum()
            if prev_is_alnum and next_is_alnum:
                current.append(ch)  # internal . or - or apostrophe → keep
                continue
        # Any other character (space, _, punctuation, standalone . / - / apostrophe)
        if current:
            tokens.append("".join(current))
            current.clear()

    if current:
        tokens.append("".join(current))
    return tokens
(Implementation would likely live in CPython’s string internals for performance and consistency, but the above demonstrates the intended semantics.)
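A few edge-case checks for the sketch above, covering rules that the examples below don't exercise directly (empty input, leading/trailing separators, non-Latin scripts); this assumes the tokenize() function from the sketch is in scope:
assert tokenize("") == []
assert tokenize("   ...   ") == []                         # only separators → no tokens
assert tokenize("--hello--world--") == ["hello", "world"]  # leading/trailing/doubled separators
assert tokenize("Münchner Straße 42") == ["Münchner", "Straße", "42"]
assert tokenize("ναι, это работает") == ["ναι", "это", "работает"]
assert tokenize("'word'") == ["word"]                      # surrounding quotes stripped
assert tokenize("rock 'n' roll") == ["rock", "n", "roll"]  # apostrophes here are not internal → separators
print("edge-case checks pass")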
10 Real-World Examples (each includes split() vs tokenize())
- Classic quote
text = "To be or not to be, that is the question!!!"
text.split()
# ['To', 'be', 'or', 'not', 'to', 'be,', 'that', 'is', 'the', 'question!!!']
text.tokenize()
# ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
- Contractions and possessives (apostrophes kept)
text = "Don't stop — Jessica’s bike isn't new."
text.split()
# ["Don't", 'stop', '—', 'Jessica’s', 'bike', "isn't", 'new.']
text.tokenize()
# ["Don't", 'stop', 'Jessica’s', 'bike', "isn't", 'new']
- Quotes are separators (not internal)
text = 'He said "don\'t" and walked away.'
text.split()
# ['He', 'said', '"don\'t"', 'and', 'walked', 'away.']
text.tokenize()
# ['He', 'said', "don't", 'and', 'walked', 'away']
- Keyword quoting
text = "Alignment research focuses on 'superintelligence' risks and scalable oversight."
text.split()
# ['Alignment', 'research', 'focuses', 'on', "'superintelligence'", 'risks', 'and', 'scalable', 'oversight.']
text.tokenize()
# ['Alignment', 'research', 'focuses', 'on', 'superintelligence', 'risks', 'and', 'scalable', 'oversight']
- Python version
text = "Python3.12 is awesome!!!"
text.split()
# ['Python3.12', 'is', 'awesome!!!']
text.tokenize()
# ['Python3.12', 'is', 'awesome']
- Email parsing
text = "[email protected] -- email me"
text.split()
# ['[email protected]', '--', 'email', 'me']
text.tokenize()
# ['user', 'domain.com', 'email', 'me']
- Price extraction
text = "Price: $19.99 (limited time)"
text.split()
# ['Price:', '$19.99', '(limited', 'time)']
text.tokenize()
# ['Price', '19.99', 'limited', 'time']
- News headline
text = "COVID-19 cases rising in 2025..."
text.split()
# ['COVID-19', 'cases', 'rising', 'in', '2025...']
text.tokenize()
# ['COVID-19', 'cases', 'rising', 'in', '2025']
- Filename with underscores
text = "file_name_v2_final_final.txt"
text.split()
# ['file_name_v2_final_final.txt']
text.tokenize()
# ['file', 'name', 'v2', 'final', 'final.txt']
- Social media post
text = "RT @elonmusk: Mars mission in 2026!!! 🚀🌑"
text.split()
# ['RT', '@elonmusk:', 'Mars', 'mission', 'in', '2026!!!', '🚀🌑']
text.tokenize()
# ['RT', 'elonmusk', 'Mars', 'mission', 'in', '2026']
Why a new method (instead of flags on split())
- Single-purpose, readable, and discoverable (like strip(), replace(), removeprefix(), etc.).
- No backward-compat risk from changing split() behavior.
- Eliminates repeated regex boilerplate for a very common case.
- Aligns with vocabulary in many NLP ecosystems (“tokenize” as the entry-level operation).
Name collision concerns
The existing stdlib tokenize module is for parsing Python source code. This proposal targets general text tokenization; the two live in different namespaces (a module vs. a str method) and address different domains.
Questions for Discussion
- Naming: str.tokenize() vs str.tokens() vs str.word_tokens() vs something else?
- Apostrophes: This draft keeps internal apostrophes (' and ’) so don't stays intact. Any edge cases where this is harmful?
- Allowed internal punctuation: Currently keeps dots, hyphens, and apostrophes only when between alphanumerics. Keep minimal, or broaden?
- Underscore handling: Treated as separator (like space), so file_name → ['file', 'name']. Prefer matching \w and keeping underscores?
- Alphanumeric base: Uses str.isalnum() (letters + digits, Unicode-aware). Any other characters worth including?
- Normalization & case: No case folding or Unicode normalization (keeps original case, NFC as-is). Add optional parameters, or keep raw/predictable?
- Performance/implementation: Worth implementing in C (like split()) for large texts? Any Unicode edge cases to watch?
- API shape: Standalone method str.tokenize() vs an optional param on split(tokenize=True)? (A purely hypothetical sketch follows this list.)
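To make the normalization and API-shape questions concrete, here is one purely hypothetical way optional parameters could layer on top of the pure-Python sketch above; the parameter names and semantics are invented for illustration and are not part of the proposal:
import unicodedata

# Hypothetical wrapper (discussion aid only): opt-in casefolding and Unicode
# normalization on top of the tokenize() sketch, whose default output stays raw.
def tokenize_opts(text, *, casefold=False, form=None):
    tokens = tokenize(text)                # the pure-Python sketch from earlier
    if form is not None:                   # e.g. "NFC" or "NFKC"
        tokens = [unicodedata.normalize(form, t) for t in tokens]
    if casefold:
        tokens = [t.casefold() for t in tokens]
    return tokens

print(tokenize_opts("Don't STOP!", casefold=True))
# ["don't", 'stop']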
Thanks for reading—curious what core devs and power users think about scope, naming, and edge cases.