Add str.tokenize() for basic alphanumeric word tokenization

Idea: Add str.tokenize() — a simple method for clean alphanumeric word tokenization

Summary: Add a small, Unicode-aware string method that returns “clean” tokens by extracting consecutive alphanumeric runs (with a few pragmatic internal characters), so common text-cleanup workflows don’t require regex boilerplate.

Note: This is a draft proposal; specifics may need verification or adjustment based on core-dev feedback and existing discussions.


Motivation

The built-in str.split() is excellent for splitting on whitespace, but in many modern Python tasks—search indexing, embeddings, word counts, basic NLP, LLM data prep—people immediately need tokens without attached punctuation. Today, that usually means reaching for re.findall(...) or manual stripping in loops.

Common workaround today

import re

tokens = re.findall(r"\w+", text)  # or variations with strip/punctuation

This pattern shows up constantly in tutorials and real code because split() keeps punctuation attached to words (e.g., "hello,").
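
For instance (a small illustrative snippet, not taken from any particular codebase):

import re

text = "Hello, world... it's 2025!"

text.split()
# ['Hello,', 'world...', "it's", '2025!']

re.findall(r"\w+", text)
# ['Hello', 'world', 'it', 's', '2025']  (punctuation gone, but the contraction is split)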


Proposal

Add a simple string method for the most common tokenization need:

text.tokenize()  # -> list[str]

Intended Behavior

  • Split on any character that is not part of a valid token.
  • A token is a maximal sequence consisting of alphanumeric characters (isalnum()), with dots (.), hyphens (-), and apostrophes allowed only when internal (i.e., surrounded by alphanumerics on both sides).
    • Apostrophes include both the ASCII apostrophe (') and the common typographic apostrophe (’).
  • Quotes are not token characters:
    • Double quotes (") and typographic double quotes (“ ”) are always separators.
    • Single-quote characters are only kept when they function as internal apostrophes (rule above). Quoting like 'word' will still split cleanly to ['word'].
  • Underscores (_) and all other non-alphanumeric characters (whitespace, punctuation, symbols, standalone dots/hyphens/apostrophes) act as separators.
  • Collapse consecutive separators (no empty strings produced).
  • Ignore leading and trailing separators.
  • Empty input → [].
  • Fully Unicode-aware: str.isalnum() correctly handles accented letters, non-Latin scripts, and digits from any language.
  • Preserve original case (no lower-casing or normalization).

Deliberate design choices (open for discussion):

  • Dots, hyphens, and apostrophes are kept only internally — this preserves URLs (domain.com), versions (Python-3.12), model names (GPT-4o), compound terms (COVID-19), and contractions/possessives (don't, Jessica’s) while still stripping punctuation in normal prose.
  • Quotes are separators — surrounding quotes are ignored without becoming part of tokens; typographic quotes don’t “stick” to words.
  • Underscores are treated as separators — aligns with modern code style guides (snake_case identifiers are conceptually separate words) and avoids treating file_name as a single token.
  • No underscore in tokens — diverges from regex \w for cleaner real-world output (most text processing wants “file name”, not “file_name”).

This behavior aims for a practical balance: simple enough for a stdlib method, yet smart enough to produce immediately usable tokens in the vast majority of real-world text (social media, web scraping, search indexing, LLM prep, log parsing, etc.).
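
For comparison with today’s tools, the rules above roughly correspond to a single re.findall pattern. This is an approximation only (str.isalnum() and the regex word class differ on a few Unicode categories) and is not part of the proposal itself:

import re

# Approximation of the proposed rules: alphanumerics (no underscore), with
# internal . - ' ’ allowed only between alphanumeric runs.
TOKEN_RE = re.compile(r"[^\W_]+(?:[.\-'’][^\W_]+)*")

TOKEN_RE.findall('He said "don\'t": see file_name_v2 and Python-3.12.')
# ['He', 'said', "don't", 'see', 'file', 'name', 'v2', 'and', 'Python-3.12']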


Reference implementation (pure Python sketch)

def tokenize(text: str) -> list[str]:
    """Return a list of clean tokens from text.

    Tokens are sequences of alphanumeric characters.
    Dots (.), hyphens (-), and apostrophes (both ' and ’) are kept when they appear
    *between* alphanumerics (e.g., "Python-3.12", "domain.com", "COVID-19",
    "don't", "Jessica’s").

    Double quotes (" and “ ”) are separators. Underscores (_) and all other
    non-alphanumeric characters act as separators.
    """
    if not text:
        return []

    keep_internal = {".", "-", "'", "’"}
    double_quote_separators = {'"', "“", "”", "„", "‟", "«", "»", "‹", "›"}

    tokens: list[str] = []
    current: list[str] = []
    n = len(text)

    for i, ch in enumerate(text):
        if ch.isalnum():
            current.append(ch)
            continue

        if ch in double_quote_separators:
            if current:
                tokens.append("".join(current))
                current.clear()
            continue

        if ch in keep_internal:
            prev_is_alnum = bool(current) and current[-1].isalnum()
            next_is_alnum = i + 1 < n and text[i + 1].isalnum()
            if prev_is_alnum and next_is_alnum:
                current.append(ch)  # internal . or - or apostrophe → keep
                continue

        # Any other character (space, _, punctuation, standalone . / - / apostrophe)
        if current:
            tokens.append("".join(current))
            current.clear()

    if current:
        tokens.append("".join(current))

    return tokens

(Implementation would likely live in CPython’s string internals for performance and consistency, but the above demonstrates the intended semantics.)
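
For example, a few quick checks of the edge-case rules above, run against the pure-Python sketch (illustrative, not an exhaustive test suite):

tokenize("")
# []

tokenize("   ...  ---  ")
# []  (only separators; consecutive separators collapse, no empty strings)

tokenize("__init__ weights_v2")
# ['init', 'weights', 'v2']  (underscores act as separators)

tokenize('“Hello,” she said.')
# ['Hello', 'she', 'said']  (typographic quotes and trailing punctuation stripped)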


10 Real-World Examples (each includes split() vs tokenize())

  1. Classic quote
text = "To be or not to be, that is the question!!!"

text.split()
# ['To', 'be', 'or', 'not', 'to', 'be,', 'that', 'is', 'the', 'question!!!']

text.tokenize()
# ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
  2. Contractions and possessives (apostrophes kept)
text = "Don't stop — Jessica’s bike isn't new."

text.split()
# ["Don't", 'stop', '—', 'Jessica’s', 'bike', "isn't", 'new.']

text.tokenize()
# ["Don't", 'stop', 'Jessica’s', 'bike', "isn't", 'new']
  3. Quotes are separators (not internal)
text = 'He said "don\'t" and walked away.'

text.split()
# ['He', 'said', '"don\'t"', 'and', 'walked', 'away.']

text.tokenize()
# ['He', 'said', "don't", 'and', 'walked', 'away']
  4. Keyword quoting
text = "Alignment research focuses on 'superintelligence' risks and scalable oversight."

text.split()
# ['Alignment', 'research', 'focuses', 'on', "'superintelligence'", 'risks', 'and', 'scalable', 'oversight.']

text.tokenize()
# ['Alignment', 'research', 'focuses', 'on', 'superintelligence', 'risks', 'and', 'scalable', 'oversight']
  5. Python version
text = "Python3.12 is awesome!!!"

text.split()
# ['Python3.12', 'is', 'awesome!!!']

text.tokenize()
# ['Python3.12', 'is', 'awesome']
  6. Email parsing
text = "[email protected] -- email me"

text.split()
# ['user@domain.com', '--', 'email', 'me']

text.tokenize()
# ['user', 'domain.com', 'email', 'me']
  7. Price extraction
text = "Price: $19.99 (limited time)"

text.split()
# ['Price:', '$19.99', '(limited', 'time)']

text.tokenize()
# ['Price', '19.99', 'limited', 'time']
  8. News headline
text = "COVID-19 cases rising in 2025..."

text.split()
# ['COVID-19', 'cases', 'rising', 'in', '2025...']

text.tokenize()
# ['COVID-19', 'cases', 'rising', 'in', '2025']
  9. Filename with underscores
text = "file_name_v2_final_final.txt"

text.split()
# ['file_name_v2_final_final.txt']

text.tokenize()
# ['file', 'name', 'v2', 'final', 'final.txt']
  10. Social media post
text = "RT @elonmusk: Mars mission in 2026!!! 🚀🌑"

text.split()
# ['RT', '@elonmusk:', 'Mars', 'mission', 'in', '2026!!!', '🚀🌑']

text.tokenize()
# ['RT', 'elonmusk', 'Mars', 'mission', 'in', '2026']

Why a new method (instead of flags on split())

  • Single-purpose, readable, and discoverable (like strip(), replace(), removeprefix(), etc.).
  • No backward-compat risk from changing split() behavior.
  • Eliminates repeated regex boilerplate for a very common case.
  • Aligns with vocabulary in many NLP ecosystems (“tokenize” as the entry-level operation).

Name collision concerns

The existing stdlib tokenize module is for parsing Python source code. This proposal targets general text tokenization; the names are in different namespaces and domains.


Questions for Discussion

  1. Naming: str.tokenize() vs str.tokens() vs str.word_tokens() vs something else?
  2. Apostrophes: This draft keeps internal apostrophes (' and ’) so don't stays intact. Any edge cases where this is harmful?
  3. Allowed internal punctuation: Currently keeps ., -, and apostrophes only when between alphanumerics. Keep minimal, or broaden?
  4. Underscore handling: Treated as separator (like space), so file_name → ['file', 'name']. Prefer matching \w and keeping underscores?
  5. Alphanumeric base: Uses str.isalnum() (letters + digits, Unicode-aware). Any other characters worth including?
  6. Normalization & case: No case folding or Unicode normalization (keeps original case, NFC as-is). Add optional parameters, or keep raw/predictable?
  7. Performance/implementation: Worth implementing in C (like split()) for large texts? Any Unicode edge cases to watch?
  8. API shape: Standalone method str.tokenize() vs optional param on split(tokenize=True)?

Thanks for reading—curious what core devs and power users think about scope, naming, and edge cases.

1 Like

The rules are very complicated and not obvious. If you need different rules, you still need regex boilerplate. Why can’t this be a pypi package?

11 Likes

You’re right that full, perfect tokenization for all languages and edge cases is extremely complicated — that’s why libraries like spaCy, NLTK, or Hugging Face tokenizers exist with models and language-specific rules.

This proposal isn’t trying to replace those.

It’s aiming for the 80–90% case that nearly every Python developer hits constantly: basic clean word-like tokens from messy text (web scraping, logs, user input, search prep, LLM data cleaning) — without pulling in re and writing findall(r'\w+', text) or manual punctuation stripping every single time.

Yes, you can put this on PyPI — and many have (e.g., textacy, wordsegment, tiny tokenizers). But:

  • Most developers don’t know they exist or don’t want another dependency for something this fundamental.

  • str.split() is already there and used everywhere — but gives dirty output that almost always needs cleanup.

  • Having a good default built-in lowers the barrier dramatically, especially for beginners, scripts, and teaching.

Think of it like str.removeprefix() / removesuffix() (added in 3.9) — you could always do it with slicing or regex, but now there’s a clear, readable way.

Or pathlib — you could use os.path, but the object-oriented API is so much better it’s now the default in docs.

A solid str.tokenize() (with sensible, documented rules) would become the new default people reach for — just like split() today — but without the punctuation garbage.

PyPI is great for advanced or experimental stuff. The stdlib is for raising the floor — giving everyone a better starting point.

If the rules here aren’t perfect for your use case — fair! Use a dedicated tokenizer. But for the vast majority of quick-and-dirty text processing in Python scripts today, this would save thousands of lines of duplicated regex boilerplate.

That’s the goal: not perfection, but massive reduction in friction for the common case.

Yet another proposal where I have no idea how much of the proposal came from an LLM rather than actually from the OP.

Not going to engage. Stop using unacknowledged AI to generate your posts.

11 Likes

I spent around 6 hours creating a well-thought-out post, so I wouldn’t waste the community’s time with a frivolous idea. Ironically, I thought it would get less engagement if I just threw up a short, unpolished idea without doing any research.

I created the function/method to clean up some trouble I was having with split(). I did use an LLM to help research if similar solutions exist and to help come up with test cases that show how it works better. I often write knowledgebase articles and technical manuals as part of my job. I regret that the polished presentation was offensive. I will try to keep it more raw next time so it’s received better.

Hope all is well.

1 Like

While I try not to accuse people of posting LLM-generated proposals unless it’s very clear that they did, your post does look rather similar to what LLMs produce. Can you at least confirm whether you did or did not use an LLM to generate (or even “polish”) your proposal?

The proposal is long enough that I lost interest before reading it fully, but I did skim it enough to know that my main question isn’t answered: why not just use re.findall(r'\w+', text)? It’s not a “workaround”, as you describe it, it’s using features of the language which are explicitly designed for use cases like this, to solve the problem.

The rules may not match your rules for tokenize(), but that itself demonstrates that there isn’t a single, “obvious” right answer here for how to define a token, and so giving users a toolkit that lets them define tokens however they want (which is exactly what re.findall is) is far better than building one definition into the language, and leaving everyone with different needs no better off.
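
For example, slightly different token definitions are just slightly different patterns (illustrative patterns on made-up sample text, not recommendations):

import re

text = "file_name_v2: don't touch Python-3.12"

re.findall(r"\w+", text)
# ['file_name_v2', 'don', 't', 'touch', 'Python', '3', '12']  (underscores kept)

re.findall(r"[^\W_]+", text)
# ['file', 'name', 'v2', 'don', 't', 'touch', 'Python', '3', '12']  (underscores split)

re.findall(r"\w+(?:['’]\w+)*", text)
# ['file_name_v2', "don't", 'touch', 'Python', '3', '12']  (contractions kept)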

13 Likes

I used an LLM to provide a template and to proofread the proposal, assuming it would help communicate in a cleaner, clearer vernacular. I wrote the original code myself while attending edube.org's Python courses. One of the challenges was writing split() from scratch, and what I ended up with is a way to make split() even better with tokenize(). I saw that split() left words mangled, with punctuation often attached to them. I thought I would share the idea with people who might also find it useful. Based on the replies, I’m guessing that both providing a lengthy proposal and the need to tokenize aren’t as valuable as I had hoped.

1 Like

It looks both too specialised and too limited to be a string method.

Consider numbers, for example.

English uses “,” as a thousands separator and “.” as a decimal point, whereas other languages do the opposite, or use " " as a thousands separator, or whatever.

A user might want to preserve those numbers, perhaps removing the thousands separator.

A user might want to treat “_” as part of an identifier, or split the string into tokens, including the separators.

A module would be a better fit, allowing for more customisation options to be added later.
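
For example, feeding locale-formatted numbers to the pure-Python sketch from the proposal (illustrative):

tokenize("Total: 1,234.56")
# ['Total', '1', '234.56']   (English thousands separator splits the number)

tokenize("Gesamt: 1.234,56")
# ['Gesamt', '1.234', '56']  (German decimal comma splits the number)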

8 Likes

Thanks for the clarification.

LLMs in my experience have a strong tendency to be overly verbose, which is both detrimental to the clarity of the proposal, and also inconsiderate of the time of people reading it. I personally don’t mind people using LLMs to review their proposals[1], but the text you post should always be your own words[2] - I want to know what you think, not what an LLM considers good arguments.

There is an extremely relevant quote from Antoine de Saint-Exupéry:

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

It’s a lot more effort to describe your proposal concisely, but it’s well worth that effort, and important to do. LLMs are counter-productive when it comes to making an argument concise. You need to realise that getting a language change accepted is a lot of work, and if you’re not even willing to put in the work at the start to ensure that your proposal is brief and to the point, it suggests you won’t be willing to do what’s needed to write a PEP and get it accepted, so why should people bother engaging with your proposal?

And yes, for the people who are familiar with my other posts, I’m 100% aware of the irony of me, of all people, pointing out that brevity is important :slightly_smiling_face:


  1. Actually, I do mind, but I don’t think it’s my place to tell people what tools they can use for their own purposes ↩︎

  2. Yes, even if English isn’t your native language - I’d rather read your best attempt to express yourself in English than an LLM’s perfect grammar making arguments you didn’t make yourself ↩︎

10 Likes

Heh, same :slight_smile:

You’re absolutely right. The nuances get several layers deep. What seems like an elephant in the room to me is that I haven’t actually seen anyone ever use just a plain split() method. It is always preceded or followed by a string-cleaning method to correct the dirty list. I started as a process engineer before learning to code, and the rule was that when 80% of the time a workaround is needed, it’s usually better to change the original process. I wanted a cleaned_split() method, but after researching it, realized it would be better to rename it to tokenize(), since that method is already used in at least one other language. Do you ever use a bare split()?

Feedback well received. I’m one of those weird dinosaurs that actually doesn’t speak much until I have an original idea, and then I don’t shut up. My spouse and bosses have frequently told me to get to the point. I don’t know if it comes from public speaking or teaching, where I have to talk for hours at a time, or the fact that I usually drink too much coffee, but I will set a goal in the new year to simmer down to a meaningful reduction. Do you think it would be valuable to do that to this post now, or just take the criticism and move forward?

1 Like

Looking at the first numbered example from your first post, I can’t imagine a method called tokenize() (which by name alone I’d expect to only do some kind of splitting and whitespace removal) just throwing away characters like @ or !. It could be argued whether such tokens should be split away from the word (like in a programming language tokenizer), or kept together (like .split()), but discarding them entirely sounds just wrong.

What seems like an elephant in the room to me is that I haven’t actually seen anyone ever use just a plain split() method. It is always preceded or followed by a string-cleaning method to correct the dirty list.

Can you show these actual real-world examples? (I mean the inputs that require custom hand-written splitting, not the implementations themselves)

1 Like

I’ve sometimes used a bare split(), other times used re.findall(), depending on what I needed to do.

2 Likes

@pf_moore, I just found this in my research. All of you have probably heard it a number of times. It supports your key points pretty well. import this:

1  Beautiful is better than ugly.
2  Explicit is better than implicit.
3  Simple is better than complex.
4  Complex is better than complicated.
5  Flat is better than nested.
6  Sparse is better than dense.
7  Readability counts.
8  Special cases aren't special enough to break the rules.
9  Although practicality beats purity.
10 Errors should never pass silently.
11 Unless explicitly silenced.
12 In the face of ambiguity, refuse the temptation to guess.
13 There should be one-- and preferably only one --obvious way to do it.
14 Although that way may not be obvious at first unless you're Dutch.
15 Now is better than never.
16 Although never is often better than *right now*.
17 If the implementation is hard to explain, it's a bad idea.
18 If the implementation is easy to explain, it may be a good idea.
19 Namespaces are one honking great idea -- let's do more of those!

It seems like lines 1-7 support your philosophy, and that of others on this thread, that Python is a language of simplicity and brevity. Knowing this, I know more about the culture of its programmers.

I would argue that the case I am trying to make is from lines 8-9. It seems like the world has evolved so that split() is the special case and no longer the primary use, and that it’s practical, for efficiency’s sake, not to have to call two or more methods for 80-90% of the use cases.

I’m afraid of antagonizing the TLDR/LLM opposers, but since you explicitly asked for a short list of examples:

  1. Facebook Research irt-leaderboard – MRQA eval script: re.sub(...) + ' '.join(text.split()) for answer normalization (split alone doesn’t remove articles/punct or normalize spacing). (GitHub)
  2. Apple ml-qrecc – evaluate_qa.py: QA metrics do regex-based cleanup + whitespace normalization using split()/join() (GitHub)
  3. AWS sample “semantic search” notebook: classic doc.lower().split() followed by token.strip(string.punctuation)—split alone leaves "word," / "word." garbage. (GitHub)
  4. Cohere developer notebook (rerank demo): same lower().split() then strip(string.punctuation) to get usable tokens for search/IR-ish preprocessing. (GitHub)
  5. StanfordNLP dspy – gsm8k.py dataset loader: split() plus extra cleanup (replace(",", ""), etc.) because dataset text has formatting that split() can’t sanitize. (GitHub)
  6. Princeton-NLP HELMET – eval_alce.py: mixes split() with .rstrip(string.punctuation) / re.sub(r"\n+", " ", ...) because raw split output is messy for eval/reporting. (GitHub)

These are big tech names using workarounds to get tokens because Python doesn’t have a clear solution. To me it shows a glaring violation of Python principle 13 from above: "There should be one-- and preferably only one --obvious way to do it."

Ah, yes. I didn’t realise you hadn’t seen the Zen before. It’s easy to take it too literally, but it encapsulates many of the important design principles of Python.

I don’t see your point regarding (8). It’s intended[1] to argue against proposals that say things like “it’s OK for this proposal to not conform to the usual rules because it’s a special case”. But your proposal is for a new str method, which is perfectly in line with the general language approach to such things. So (8) doesn’t really apply - although I could argue that your specific rules for tokenisation are a “special case” of string parsing, and as such aren’t special enough to warrant breaking the “rule” that you should use a regex and re.findall :wink: But that’s a bit of a stretch.

split() is definitely not a special case, in the way you suggest. I use it all the time for simple parsing of data that’s in a known word-based format. Your suggestion that 80-90% of cases where text is split into tokens need this kind of cleanup definitely needs some facts to back it up - in my experience, it’s a massive over-statement, and completely misrepresents the reality.

Another often-quoted “rule” which I think applies here is “not every 3-line function should be a builtin”. Which is saying that re.findall(r'\w+', text) is fine, and doesn’t need a builtin form[2].


  1. At least the way I interpret it :slightly_smiling_face: ↩︎

  2. And nor does your more complex definition, it just needs a more complex regex… ↩︎

7 Likes

I didn’t check all the examples, but these don’t all seem to use the same rules. So your specific proposal for str.tokenize() will almost certainly be wrong for them.

Having a number of building blocks that users can combine to get the precise tokenisation/normalisation rules that are needed for their use case isn’t a failing - quite the opposite, it’s a strength of the language. Forcing everyone to use the “standard” normalisation rules or roll their own will only result in either people ignoring the standard method, or people being frustrated that Python doesn’t give them the results they want.

7 Likes

As shown, this tokenizer does not preserve the notion of sentences and is thus usable only for very short texts. IMO, for one line of text a plain split() is just fine, but my point is different. Even a tokenizer for short texts would be English-only without some language-specific settings. The decimal comma was mentioned already, but in many languages a space is used as a thousands separator, so “10 000” should be one token in those languages. And did you know the Greek word “ό,τι”? Yes, it’s one word, and it should be just one token too. Etc.
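
For illustration, the pure-Python sketch from the proposal gives exactly these splits:

tokenize("10 000")
# ['10', '000']  (space thousands separator splits the number)

tokenize("ό,τι")
# ['ό', 'τι']  (a single Greek word becomes two tokens)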

1 Like

Fair enough. I only use Python for NLP and API calls so maybe it’s not the majority of cases for everyone. Perhaps it’s only 80-90% of the cases in my environment and that can’t be universally extended.

1 Like