Conversation

@Qubitium
Contributor

@Qubitium Qubitium commented Oct 3, 2025

What does this PR do?

Adds (Full) Regex and (Partial) Tokenization GIL=0 free-threading support. Tested up to Python 3.14T.

In simple terms, Transformers code that relies on regex will segfault under true concurrency. I have confirmed with the regex maintainer that the latest GIL=0–compatible regex package is not GIL=0 safe. This has been corroborated by the submitted unit test in this PR, which demonstrates the segfault. Regex caches and reuses compiled patterns internally and executes them, which makes it inherently unsafe without the GIL.

The core idea is not to make the entire Transformers library GIL=0 safe all at once, but rather to adapt it piece by piece until everything is as GIL=0 compatible as possible.

This PR wraps existing regex calls in a thread-locked, serialized execution pipeline. Files that import regex only need to adjust their import path to pull regex from utils.safe, which minimizes migration pain:

- import regex as re
+ from ...utils.safe import regex as re

The current safe wrapper is modular and can be extended to cover additional modules in the future as needed or as issues are discovered.
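A minimal sketch of the thread-locked proxy idea (using stdlib re so the snippet is self-contained; the actual utils/safe.py in this PR targets the third-party regex package and handles more edge cases such as reentrancy and metadata):

```python
# Sketch only, NOT the PR's utils/safe.py: a proxy that serializes every
# callable attribute of the wrapped module behind one reentrant lock.
import re as _re
import threading

_LOCK = threading.RLock()  # reentrant: locked callables may call back into regex


class _SafeModuleProxy:
    """Serialize calls into the wrapped module's callables behind one lock."""

    def __init__(self, module):
        self._module = module

    def __getattr__(self, name):
        attr = getattr(self._module, name)
        if callable(attr):
            def _locked(*args, **kwargs):
                with _LOCK:
                    return attr(*args, **kwargs)
            return _locked
        return attr


regex = _SafeModuleProxy(_re)

# Drop-in usage mirroring `from ...utils.safe import regex as re`
match = regex.match(r"(?P<word>\w+)", "Transformers models")
print(match.group("word"))
```

Note that compiled pattern objects returned by e.g. regex.compile are not wrapped in this sketch; a real wrapper would also need to cover those.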

Since Transformers evolves more rapidly than most packages, introducing a local wrapper may be a pragmatic step while waiting for an upstream fix, which may be delayed. Additionally, many if not most Python APIs are not inherently thread-safe, so applying local wrappers is often necessary beyond just external packages.

Two new unit test files are included:

A non-crashing test suite that proves the code works under GIL=0 with thread load.

A crash-demonstration test that proves regex code paths segfault without the added protection under thread load.

utils/safe.py

tests/utils/test_safe.py

tests/utils/test_safe_crash.py
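A hypothetical sketch of the crash-demonstration approach (not the PR's actual test_safe_crash.py): run the threaded regex workload in a subprocess so a segfault kills the child process instead of the test runner, then inspect the return code. Under a free-threaded build you would additionally set PYTHON_GIL=0 in the child's environment.

```python
# Run a threaded matching workload in a child process and check for SIGSEGV
# (-11 on Linux). Stdlib re is used here so the sketch is self-contained; the
# PR's test targets the third-party regex package, which does crash.
import subprocess
import sys
import textwrap

WORKER_SCRIPT = textwrap.dedent(
    """
    import re, threading

    def worker():
        for _ in range(10_000):
            re.match(r"(?P<head>\\w+)\\s+(?P<tail>\\w+)", "Transformers models")

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    """
)

result = subprocess.run([sys.executable, "-c", WORKER_SCRIPT])
crashed = result.returncode == -11  # negative signal number == SIGSEGV
print("segfault observed:", crashed)
```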

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @itazap @SunMarc @Cyrilvallez @gante @ydshieh @stevhliu

Member

@Rocketknight1 Rocketknight1 left a comment


This seems like a really cool idea! I made one comment about locked, but it makes a lot of sense otherwise.

One potential simplification, though, is that in a lot of cases we import regex when really we only need re; this is a leftover from a time when the built-in re was significantly behind the third-party regex lib. What's the thread-safety status of the built-in re?

@Qubitium
Contributor Author

Qubitium commented Oct 6, 2025

This seems like a really cool idea! I made one comment about locked, but it makes a lot of sense otherwise.

One potential simplification, though, is that in a lot of cases we import regex when really we only need re; this is a leftover from a time when the built-in re was significantly behind the third-party regex lib. What's the thread-safety status of the built-in re?

Another issue with re is that its API is highly unstable. There are breaking changes as recent as 3.12, and I just took a casual glance at the docs: https://docs.python.org/3/library/re.html.

Per the docs, re also caches compiled regexes, so I think we can assume, without looking at the implementation, that it may well also be thread-unsafe. Need to double-check against the internal code.

Note The compiled versions of the most recent patterns passed to [re.compile()](https://docs.python.org/3/library/re.html#re.compile) and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
Changed in version 3.12: In [bytes](https://docs.python.org/3/library/stdtypes.html#bytes) patterns, group name can only contain bytes in the ASCII range (b'\x00'-b'\x7f').
Changed in version 3.12: Group id can only contain ASCII digits. In [bytes](https://docs.python.org/3/library/stdtypes.html#bytes) patterns, group name can only contain bytes in the ASCII range (b'\x00'-b'\x7f').
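The caching behavior quoted above is easy to observe with stdlib re: identical pattern/flag pairs come back as the same compiled object from the module-level cache, exactly the kind of shared mutable state that raises thread-safety questions.

```python
# Observing re's module-level pattern cache (holds up to 512 entries).
import re

a = re.compile(r"\d+")
b = re.compile(r"\d+")
print(a is b)   # same cached object for identical pattern + flags

re.purge()      # clear the internal cache
c = re.compile(r"\d+")
print(a is c)   # a fresh object after the purge
```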

Collaborator

@ArthurZucker ArthurZucker left a comment


Hey! Nice PR, we are removing all of these files in favor of just using tokenizers as the backend, and having one sentencepiece_wrapper file, from which we will use your safe import!

@ArthurZucker
Collaborator

ArthurZucker commented Oct 6, 2025

But happy to review / merge in the mean time!

@Qubitium
Contributor Author

Qubitium commented Oct 6, 2025

@Rocketknight1 @ArthurZucker Ready for re-review/review. Changes since last review:

  1. Fixed: the _hf_safe_callable_cache dict was not thread-locked, so safe itself would be unsafe. Big oof. CI test added.
  2. Fixed: some metadata such as __package__ doesn't always exist (hit this in my usage with PyTorch). CI test added.
  3. Fixed: a reentrant usage case where regex callables can actually call themselves. CI test added.
  4. Properties and any helpers now carry an ugly _hf_safe_ prefix to minimize namespace collisions. It's ugly but it works, I hope.
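On item 3 (reentrancy): if a lock-wrapped regex callable can re-enter itself, for example via a replacement callback passed to sub(), a plain Lock would deadlock. A reentrant RLock lets the same thread re-acquire, as this minimal illustration (stdlib re, not the PR's code) shows:

```python
# Demonstrating why the safe wrapper needs a reentrant lock: the replacement
# callback re-acquires the lock while the outer sub() call still holds it.
import re
import threading

lock = threading.RLock()  # threading.Lock() would deadlock below

def locked_sub(pattern, repl, text):
    with lock:
        return re.sub(pattern, repl, text)

def replacement(match):
    # re-enters the locked pipeline while the lock is already held
    with lock:
        return match.group(0).upper()

print(locked_sub(r"\w+", replacement, "safe regex"))
```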

make fixup is failing on unrelated code, so I'm not sure whether the HF repo formatting is auto-applied.

@Rocketknight1
Member

Got it! It seems good now, but can we confirm the status of the built-in re in 3.14T? If we can avoid thread-safety wrappers by just swapping to that, I think I'd prefer it, even if there are minor API changes between versions. I don't think we should be affected by those since I doubt we have non-ASCII capture group names.

Of course, if the built-in re has the same thread-safety issues, then we should definitely go with something like this PR

@Qubitium
Contributor Author

Qubitium commented Oct 8, 2025

@Rocketknight1 I believe the move to re should be a separate PR (asap, given the results of the following benchmark), as it would need to validate that all existing regex patterns are input/output compatible; but re is the future here. pcre2 with JIT is good but can't beat re in multi-threaded mode, at least for this simple bench.

Based on the added test (the same one regex crashed on), re is thread-safe and very performant, even compared with pcre2. Added an re test in test_safe_crash.py to show it doesn't crash in the same env where the regex lib does.

  • safe.regex is performance-equivalent to regex: test results are within the margin of error; different runs show one slightly faster than the other.
  • re is thread-safe.
  • regex is 10x slower than re for the simple pattern and 7x slower for named groups. This was unexpected.
  • pcre2 is thread-safe (no internal pattern caching). It is on par with re in the threaded tests but slower single-threaded, even with JIT enabled.

The benchmark only tested 2 regex match patterns: simple + named-group matching.

Conclusion: Transformers should move to re for any tokenization-related code.

(vm314t) root@gpu-base:~/transformers# PYTHON_GIL=0 python benchmark_re_vs_regex.py
Single-thread loops per matcher: 10000
Threaded loops: 8 threads x 10000 loops (PYTHON_GIL=0)

Pattern case: simple
  expr: (Transformers|transformers|models|Models)


Single-thread
Library        Time (s)  Ops/s        Matches    Total Ops
-----------  ----------  ---------  ---------  -----------
re               0.0057  1,752,847      10000        10000
regex            0.0592  169,017        10000        10000
safe.regex       0.0606  165,077        10000        10000
pcre2            0.0148  677,142        10000        10000
pcre2 (jit)      0.0146  682,972        10000        10000


Threaded
Library        Time (s)  Ops/s      Matches    Total Ops
-----------  ----------  -------  ---------  -----------
re               0.1749  457,282      80000        80000
safe.regex       1.8877  42,381       80000        80000
pcre2            0.3338  239,667      80000        80000
pcre2 (jit)      0.3201  249,944      80000        80000

Pattern case: named-groups
  expr: (?P<head>\w+)\s+(?P<tail>\w+)


Single-thread
Library        Time (s)  Ops/s      Matches    Total Ops
-----------  ----------  -------  ---------  -----------
re               0.0125  802,368      10000        10000
regex            0.0793  126,103      10000        10000
safe.regex       0.0668  149,606      10000        10000
pcre2            0.0152  660,057      10000        10000
pcre2 (jit)      0.015   665,377      10000        10000


Threaded
Library        Time (s)  Ops/s      Matches    Total Ops
-----------  ----------  -------  ---------  -----------
re               0.2172  368,257      80000        80000
safe.regex       1.826   43,811       80000        80000
pcre2            0.2245  356,328      80000        80000
pcre2 (jit)      0.2467  324,233      80000        80000

Benchmark code:

#!/usr/bin/env python
"""Micro benchmark comparing stdlib re with transformers.utils.safe.regex and pcre2"""

import os
import re
import threading
import time
from typing import Callable, Sequence, Tuple

from tabulate import tabulate

from transformers.utils.safe import regex as safe_regex
import regex
import pcre2

SINGLE_THREAD_LOOPS = 10_000
THREAD_LOOPS = 10_000
THREAD_COUNT = 8
SIMPLE_PATTERN = r"(Transformers|transformers|models|Models)"
SIMPLE_TEXT = "Transformers by Hugging Face enable models across tasks and devices."

COMPLEX_PATTERN = r"(?P<head>\w+)\s+(?P<tail>\w+)"
COMPLEX_TEXT = "Transformers models"


def string_matcher_factory(
    match_func: Callable[[str, str], object]
) -> Callable[[str, str], Callable[[], bool]]:
    def factory(pattern: str, text: str) -> Callable[[], bool]:
        def _match() -> bool:
            return match_func(pattern, text) is not None

        return _match

    return factory


def pcre2_compiled_factory(jit: bool) -> Callable[[str, str], Callable[[], bool]]:
    def factory(pattern: str, text: str) -> Callable[[], bool]:
        compiled = _get_pcre2_pattern(pattern, jit=jit)
        match = compiled.match

        def _match() -> bool:
            return match(text) is not None

        return _match

    return factory


def benchmark_single(label: str, match_callable: Callable[[], bool]):
    start = time.perf_counter()
    matches = 0
    for _ in range(SINGLE_THREAD_LOOPS):
        matches += match_callable()
    duration = time.perf_counter() - start
    ops_per_sec = SINGLE_THREAD_LOOPS / duration if duration else float("inf")
    return duration, ops_per_sec, matches


def benchmark_threaded(label: str, match_callable: Callable[[], bool]):
    total_matches = [0] * THREAD_COUNT

    def worker(slot: int) -> None:
        local_match = match_callable
        count = 0
        for _ in range(THREAD_LOOPS):
            count += local_match()
        total_matches[slot] = count

    threads = [threading.Thread(target=worker, args=(idx,), name=f"{label}-worker-{idx}") for idx in range(THREAD_COUNT)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    duration = time.perf_counter() - start
    total_ops = THREAD_COUNT * THREAD_LOOPS
    total = sum(total_matches)
    ops_per_sec = total_ops / duration if duration else float("inf")
    return duration, ops_per_sec, total

MatcherFactory = Callable[[str, str], Callable[[], bool]]


_PCRE2_CACHE: dict[tuple[str, bool], pcre2.Pattern] = {}
_PCRE2_CACHE_LOCK = threading.Lock()


def _get_pcre2_pattern(pattern: str, *, jit: bool) -> pcre2.Pattern:
    key = (pattern, jit)
    cached = _PCRE2_CACHE.get(key)
    if cached is not None:
        return cached

    with _PCRE2_CACHE_LOCK:
        cached = _PCRE2_CACHE.get(key)
        if cached is None:
            cached = pcre2.compile(pattern, jit=jit)
            _PCRE2_CACHE[key] = cached
        return cached


SINGLE_THREAD_LIBRARIES: Sequence[Tuple[str, MatcherFactory]] = (
    ("re", string_matcher_factory(re.match)),
    ("regex", string_matcher_factory(regex.match)),
    ("safe.regex", string_matcher_factory(safe_regex.match)),
    ("pcre2", pcre2_compiled_factory(jit=False)),
    ("pcre2 (jit)", pcre2_compiled_factory(jit=True)),
)


THREADED_LIBRARIES: Sequence[Tuple[str, MatcherFactory]] = (
    ("re", string_matcher_factory(re.match)),
    ("safe.regex", string_matcher_factory(safe_regex.match)),
    ("pcre2", pcre2_compiled_factory(jit=False)),
    ("pcre2 (jit)", pcre2_compiled_factory(jit=True)),
)


PATTERN_CASES: Sequence[Tuple[str, str, str]] = (
    ("simple", SIMPLE_PATTERN, SIMPLE_TEXT),
    ("named-groups", COMPLEX_PATTERN, COMPLEX_TEXT),
)


if __name__ == "__main__":
    print(f"Single-thread loops per matcher: {SINGLE_THREAD_LOOPS}")
    total_thread_ops = THREAD_COUNT * THREAD_LOOPS
    os.environ["PYTHON_GIL"] = "0"
    print(
        f"Threaded loops: {THREAD_COUNT} threads x {THREAD_LOOPS} loops (PYTHON_GIL={os.environ['PYTHON_GIL']})"
    )

    for case_label, pattern, text in PATTERN_CASES:
        print()
        print(f"Pattern case: {case_label}")
        print(f"  expr: {pattern}")

        single_rows = []
        for label, matcher_factory in SINGLE_THREAD_LIBRARIES:
            duration, ops_per_sec, matches = benchmark_single(
                label, matcher_factory(pattern, text)
            )
            single_rows.append(
                [label, f"{duration:.4f}", f"{ops_per_sec:,.0f}", matches, SINGLE_THREAD_LOOPS]
            )
        print("\n\nSingle-thread")
        print(
            tabulate(
                single_rows,
                headers=["Library", "Time (s)", "Ops/s", "Matches", "Total Ops"],
            )
        )

        threaded_rows = []
        for label, matcher_factory in THREADED_LIBRARIES:
            duration, ops_per_sec, matches = benchmark_threaded(
                label, matcher_factory(pattern, text)
            )
            threaded_rows.append(
                [
                    label,
                    f"{duration:.4f}",
                    f"{ops_per_sec:,.0f}",
                    matches,
                    total_thread_ops,
                ]
            )
        print("\n\nThreaded")
        print(
            tabulate(
                threaded_rows,
                headers=["Library", "Time (s)", "Ops/s", "Matches", "Total Ops"],
            )
        )

@Qubitium
Contributor Author

Qubitium commented Oct 8, 2025

@Rocketknight1 @ArthurZucker Existing tokenizer code uses \p{...} Unicode properties for matching, which stdlib re does not yet support, so re is out of the question. So the future question is whether Transformers should migrate to pcre2 instead, which matches re in thread safety and threaded speed, with full PCRE-compatible syntax support.

# tokenization_qwen2.py
import regex as re
...
PRETOKENIZE_REGEX = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
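A quick stdlib-only check (not from the PR) confirms the incompatibility: re rejects Unicode property escapes like \p{L}, while the third-party regex package accepts them.

```python
# stdlib re treats \p as an unknown escape and raises re.error at compile
# time; this is why a plain regex -> re swap fails for these tokenizers.
import re

try:
    re.compile(r"\p{L}+")
    supports_unicode_props = True
except re.error:
    supports_unicode_props = False

print("stdlib re supports \\p{...}:", supports_unicode_props)
```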

Files using regex and \p{...}

  - src/transformers/models/bart/tokenization_bart.py
  - src/transformers/models/blenderbot/tokenization_blenderbot.py
  - src/transformers/models/clip/tokenization_clip.py
  - src/transformers/models/clvp/tokenization_clvp.py
  - src/transformers/models/codegen/tokenization_codegen.py
  - src/transformers/models/deberta/tokenization_deberta.py
  - src/transformers/models/deprecated/tapex/tokenization_tapex.py
  - src/transformers/models/got_ocr2/convert_got_ocr2_weights_to_hf.py
  - src/transformers/models/gpt2/tokenization_gpt2.py
  - src/transformers/models/layoutlmv3/tokenization_layoutlmv3.py
  - src/transformers/models/led/tokenization_led.py
  - src/transformers/models/longformer/tokenization_longformer.py
  - src/transformers/models/luke/tokenization_luke.py
  - src/transformers/models/markuplm/tokenization_markuplm.py
  - src/transformers/models/mllama/convert_mllama_weights_to_hf.py
  - src/transformers/models/mvp/tokenization_mvp.py
  - src/transformers/models/qwen2/tokenization_qwen2.py
  - src/transformers/models/roberta/tokenization_roberta.py
  - src/transformers/models/whisper/tokenization_whisper.py

Comment on lines +117 to +143
@pytest.mark.xfail(strict=False, reason="Compiled regex still crashes under PYTHON_GIL=0")
def test_compiled_regex_thread_safety_crashes_under_gil0():
    script = _regex_thread_script(
        imports="import regex",
        setup_code="""
compiled_pattern = regex.compile(pattern_text)

def match_once():
    return compiled_pattern.match(text_to_match)
""",
    )

    result, message = _run_regex_thread_script("compiled regex", script)

    if result.returncode == 0:
        pytest.fail("compiled regex unexpectedly behaved thread-safely\n" + message)

    if result.returncode == -11:
        message += "\nProcess terminated with SIGSEGV (Segmentation fault)."

    if message:
        sys.stderr.write(message + "\n")
        sys.stderr.flush()

    pytest.fail(message)


Contributor Author

@Qubitium Qubitium Oct 8, 2025


Added a new test to show that regex crashes with GIL=0 even when the pattern does not go through the caching mechanism. This is strange and unexpected. It appears the regex lib is fully unsafe: each compiled pattern may be holding, or sharing with other patterns, package-level persistent buffers or runtime stack state.

@Qubitium
Contributor Author

Qubitium commented Oct 8, 2025

Updated the benchmark to add a python-pcre comparison. Looks like python-pcre is the clear winner. Unfortunately, the python-pcre package has not been updated in 10 years, with performance PRs left unreviewed/unmerged, and PCRE 8.xx is hard-deprecated by pcre.org.

Single-thread loops per matcher: 10000
Threaded loops: 8 threads x 10000 loops (PYTHON_GIL=0)

Pattern case: simple
  expr: (Transformers|transformers|models|Models)


Single-thread
Library        Time (s)  Ops/s        Matches    Total Ops
-----------  ----------  ---------  ---------  -----------
re               0.0059  1,683,494      10000        10000
regex            0.069   144,929        10000        10000
safe.regex       0.0623  160,490        10000        10000
pcre             0.0055  1,827,192      10000        10000
pcre (jit)       0.0057  1,759,222      10000        10000
pcre2            0.0159  629,413        10000        10000
pcre2 (jit)      0.0145  690,844        10000        10000


Threaded
Library        Time (s)  Ops/s        Matches    Total Ops
-----------  ----------  ---------  ---------  -----------
re               0.109   734,058        80000        80000
safe.regex       1.7284  46,286         80000        80000
pcre             0.0528  1,516,316      80000        80000
pcre (jit)       0.0544  1,471,912      80000        80000
pcre2            0.1472  543,591        80000        80000
pcre2 (jit)      0.1346  594,521        80000        80000

Pattern case: named-groups
  expr: (?P<head>\w+)\s+(?P<tail>\w+)


Single-thread
Library        Time (s)  Ops/s        Matches    Total Ops
-----------  ----------  ---------  ---------  -----------
re               0.0193  517,076        10000        10000
regex            0.0825  121,173        10000        10000
safe.regex       0.0619  161,673        10000        10000
pcre             0.0055  1,815,004      10000        10000
pcre (jit)       0.0064  1,571,113      10000        10000
pcre2            0.0167  598,186        10000        10000
pcre2 (jit)      0.0153  653,871        10000        10000


Threaded
Library        Time (s)  Ops/s        Matches    Total Ops
-----------  ----------  ---------  ---------  -----------
re               0.1167  685,416        80000        80000
safe.regex       1.7527  45,643         80000        80000
pcre             0.0528  1,515,403      80000        80000
pcre (jit)       0.0517  1,546,954      80000        80000
pcre2            0.123   650,669        80000        80000
pcre2 (jit)      0.1326  603,108        80000        80000

Updated benchmark.py:

#!/usr/bin/env python
"""Micro benchmark comparing stdlib re with transformers.utils.safe.regex, regex, python-pcre, and pcre2."""

import os
import re
import threading
import time
from typing import Callable, Sequence, Tuple

from tabulate import tabulate

from transformers.utils.safe import regex as safe_regex
import regex
import pcre2

try:
    import pcre
except ImportError as exc:  # pragma: no cover - optional dependency
    raise ImportError(
        "python-pcre is required for benchmark_re_vs_regex.py; install it with `pip install python-pcre`."
    ) from exc

SINGLE_THREAD_LOOPS = 10_000
THREAD_LOOPS = 10_000
THREAD_COUNT = 8
SIMPLE_PATTERN = r"(Transformers|transformers|models|Models)"
SIMPLE_TEXT = "Transformers by Hugging Face enable models across tasks and devices."

COMPLEX_PATTERN = r"(?P<head>\w+)\s+(?P<tail>\w+)"
COMPLEX_TEXT = "Transformers models"


def string_matcher_factory(
    match_func: Callable[[str, str], object]
) -> Callable[[str, str], Callable[[], bool]]:
    def factory(pattern: str, text: str) -> Callable[[], bool]:
        def _match() -> bool:
            return match_func(pattern, text) is not None

        return _match

    return factory


def pcre2_compiled_factory(jit: bool) -> Callable[[str, str], Callable[[], bool]]:
    def factory(pattern: str, text: str) -> Callable[[], bool]:
        compiled = _get_pcre2_pattern(pattern, jit=jit)
        match = compiled.match

        def _match() -> bool:
            return match(text) is not None

        return _match

    return factory


def benchmark_single(label: str, match_callable: Callable[[], bool]):
    start = time.perf_counter()
    matches = 0
    for _ in range(SINGLE_THREAD_LOOPS):
        matches += match_callable()
    duration = time.perf_counter() - start
    ops_per_sec = SINGLE_THREAD_LOOPS / duration if duration else float("inf")
    return duration, ops_per_sec, matches


def benchmark_threaded(label: str, match_callable: Callable[[], bool]):
    total_matches = [0] * THREAD_COUNT

    def worker(slot: int) -> None:
        local_match = match_callable
        count = 0
        for _ in range(THREAD_LOOPS):
            count += local_match()
        total_matches[slot] = count

    threads = [threading.Thread(target=worker, args=(idx,), name=f"{label}-worker-{idx}") for idx in range(THREAD_COUNT)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    duration = time.perf_counter() - start
    total_ops = THREAD_COUNT * THREAD_LOOPS
    total = sum(total_matches)
    ops_per_sec = total_ops / duration if duration else float("inf")
    return duration, ops_per_sec, total

MatcherFactory = Callable[[str, str], Callable[[], bool]]


_PCRE_CACHE: dict[tuple[str, bool], pcre.Pattern] = {}
_PCRE_CACHE_LOCK = threading.Lock()
_PCRE2_CACHE: dict[tuple[str, bool], pcre2.Pattern] = {}
_PCRE2_CACHE_LOCK = threading.Lock()


def _get_pcre_pattern(pattern: str, *, jit: bool) -> pcre.Pattern:
    key = (pattern, jit)
    cached = _PCRE_CACHE.get(key)
    if cached is not None:
        return cached

    with _PCRE_CACHE_LOCK:
        cached = _PCRE_CACHE.get(key)
        if cached is None:
            cached = pcre.compile(pattern)
            if jit:
                cached.study(pcre.STUDY_JIT)
                if hasattr(cached, "set_jit_stack"):
                    cached.set_jit_stack(32 * 1024, 512 * 1024)
            _PCRE_CACHE[key] = cached
        return cached


def _get_pcre2_pattern(pattern: str, *, jit: bool) -> pcre2.Pattern:
    key = (pattern, jit)
    cached = _PCRE2_CACHE.get(key)
    if cached is not None:
        return cached

    with _PCRE2_CACHE_LOCK:
        cached = _PCRE2_CACHE.get(key)
        if cached is None:
            cached = pcre2.compile(pattern, jit=jit)
            _PCRE2_CACHE[key] = cached
        return cached



def pcre_compiled_factory(jit: bool) -> Callable[[str, str], Callable[[], bool]]:
    def factory(pattern: str, text: str) -> Callable[[], bool]:
        compiled = _get_pcre_pattern(pattern, jit=jit)
        match = compiled.match

        def _match() -> bool:
            return match(text) is not None

        return _match

    return factory


SINGLE_THREAD_LIBRARIES: Sequence[Tuple[str, MatcherFactory]] = (
    ("re", string_matcher_factory(re.match)),
    ("regex", string_matcher_factory(regex.match)),
    ("safe.regex", string_matcher_factory(safe_regex.match)),
    ("pcre", pcre_compiled_factory(jit=False)),
    ("pcre (jit)", pcre_compiled_factory(jit=True)),
    ("pcre2", pcre2_compiled_factory(jit=False)),
    ("pcre2 (jit)", pcre2_compiled_factory(jit=True)),
)


THREADED_LIBRARIES: Sequence[Tuple[str, MatcherFactory]] = (
    ("re", string_matcher_factory(re.match)),
    ("safe.regex", string_matcher_factory(safe_regex.match)),
    ("pcre", pcre_compiled_factory(jit=False)),
    ("pcre (jit)", pcre_compiled_factory(jit=True)),
    ("pcre2", pcre2_compiled_factory(jit=False)),
    ("pcre2 (jit)", pcre2_compiled_factory(jit=True)),
)


PATTERN_CASES: Sequence[Tuple[str, str, str]] = (
    ("simple", SIMPLE_PATTERN, SIMPLE_TEXT),
    ("named-groups", COMPLEX_PATTERN, COMPLEX_TEXT),
)


if __name__ == "__main__":
    print(f"Single-thread loops per matcher: {SINGLE_THREAD_LOOPS}")
    total_thread_ops = THREAD_COUNT * THREAD_LOOPS
    os.environ["PYTHON_GIL"] = "0"
    print(
        f"Threaded loops: {THREAD_COUNT} threads x {THREAD_LOOPS} loops (PYTHON_GIL={os.environ['PYTHON_GIL']})"
    )

    for case_label, pattern, text in PATTERN_CASES:
        print()
        print(f"Pattern case: {case_label}")
        print(f"  expr: {pattern}")

        single_rows = []
        for label, matcher_factory in SINGLE_THREAD_LIBRARIES:
            duration, ops_per_sec, matches = benchmark_single(
                label, matcher_factory(pattern, text)
            )
            single_rows.append(
                [label, f"{duration:.4f}", f"{ops_per_sec:,.0f}", matches, SINGLE_THREAD_LOOPS]
            )
        print("\n\nSingle-thread")
        print(
            tabulate(
                single_rows,
                headers=["Library", "Time (s)", "Ops/s", "Matches", "Total Ops"],
            )
        )

        threaded_rows = []
        for label, matcher_factory in THREADED_LIBRARIES:
            duration, ops_per_sec, matches = benchmark_threaded(
                label, matcher_factory(pattern, text)
            )
            threaded_rows.append(
                [
                    label,
                    f"{duration:.4f}",
                    f"{ops_per_sec:,.0f}",
                    matches,
                    total_thread_ops,
                ]
            )
        print("\n\nThreaded")
        print(
            tabulate(
                threaded_rows,
                headers=["Library", "Time (s)", "Ops/s", "Matches", "Total Ops"],
            )
        )

@Rocketknight1
Copy link
Member

Interesting, and thank you for the comprehensive benchmarks! I'd have to discuss with the team, but I suspect our stance is that we want to support free-threading, but not at the cost of other PRs that have a high chance of breaking things. If we can do a drop-in replacement for regex then I think we can support that, but if we have to move to a new regex engine and start changing patterns across the codebase then we might have to just wait for patches to regex itself.

That said, regex performance is usually not a bottleneck for tokenization, let alone LLM inference in general, so if pcre2 works perfectly but is just a little slower than regex or re then that's probably fine!

@Qubitium
Contributor Author

Qubitium commented Oct 8, 2025

Interesting, and thank you for the comprehensive benchmarks! I'd have to discuss with the team, but I suspect our stance is that we want to support free-threading, but not at the cost of other PRs that have a high chance of breaking things. If we can do a drop-in replacement for regex then I think we can support that, but if we have to move to a new regex engine and start changing patterns across the codebase then we might have to just wait for patches to regex itself.

That said, regex performance is usually not a bottleneck for tokenization, let alone LLM inference in general, so if pcre2 works perfectly but is just a little slower than regex or re then that's probably fine!

Agreed. That's for a future (potential) PR and can be discussed in more detail later. I actually went over all the PCRE wrappers and/or compatible regex engines for Python, and each has problems or issues that I want to address in my own pcre pkg. They are either unmaintained, compile PCRE from source (ouch), and/or have usability/compat or GIL=0 issues.

Let's just return our focus to this simpler PR instead, which has no compat issues but inherits all the slowness of regex.

@Rocketknight1
Member

Rocketknight1 commented Oct 9, 2025

Yeah - based on your benchmarks (which were really helpful, thank you!), I think a good series of PRs would be:

  1. This PR
  2. Move all regexes that don't need features from regex to re, this should be 95% of regexes in the library
  3. Adapt regexes with the \p tag from regex to re if possible (can it be done a different way?)
  4. After Python 3.11, replace regex with re for regexes using atomic grouping or possessive quantifiers. Support for these was only added to re in 3.11.
  5. If possible, after Py3.11 we can remove regex and the Safe wrapper entirely
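On step 4: atomic groups `(?>...)` and possessive quantifiers (`a++`) only landed in stdlib re in Python 3.11; earlier interpreters reject them at compile time, as this quick check (stdlib only, not from the PR) shows:

```python
# Probe whether the running interpreter's re supports atomic groups.
import re
import sys

try:
    atomic_ok = re.match(r"(?>ab|a)c", "abc") is not None
except re.error:
    atomic_ok = False  # pre-3.11 raises on the (?> extension

print(sys.version_info[:2], "atomic groups usable:", atomic_ok)
```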

I kind of changed my mind a bit: although I said initially that keeping existing patterns intact made more sense, big performance gains and removing the need for thread-safe wrappers might be worth altering a few regexes.

@Rocketknight1
Member

Either way, I think this is almost ready! @Qubitium can you check the CI errors? I think in some cases models have relative imports, but they're in deprecated/ so the relative import needs to go one further level up before it works

Also, since this is a library-wide change that adds a wrapper to regex, cc @ArthurZucker @Cyrilvallez as core maintainers to make sure they're okay with it

@Qubitium
Contributor Author

Qubitium commented Oct 9, 2025

Either way, I think this is almost ready! @Qubitium can you check the CI errors? I think in some cases models have relative imports, but they're in deprecated/ so the relative import needs to go one further level up before it works

Relative-path imports for the 2 deprecated tokenizers are fixed.

Also, since this is a library-wide change that adds a wrapper to regex, cc @ArthurZucker @Cyrilvallez as core maintainers to make sure they're okay with it

@ArthurZucker @Cyrilvallez Ready for reviews. This PR is a stop-gap, short-term solution.

@Qubitium
Contributor Author

Qubitium commented Oct 9, 2025

I kind of changed my mind a bit: although I said initially that keeping existing patterns intact made more sense, big performance gains and removing the need for thread-safe wrappers might be worth altering a few regexes.

Hold your thoughts. Just released the pypcre package on PyPI, which is my reimagining of what a fast, thread-safe, and usable (Pythonic) pcre2-based regular expression library should be. It is also re-compatible out of the box and one toggle away from regex compatibility. Oh, and we added a special test_transformers_regex.py test case which pulled all of Transformers' patterns and validated compat. Still a little early, but I will likely make a new PR to replace safe.regex with pypcre once I am confident it is a cake with no compromises. Validated on Linux, macOS, Windows, and WSL, with FreeBSD and Solaris pending.

https://github.com/ModelCloud/PyPcre/releases/tag/v0.1.0

@Rocketknight1
Copy link
Member

Very interesting, but we might be wary of depending on a library that new, until we're confident that it's going to be maintained over time!

Still, if it's straightforward bindings to a well-maintained lib and it solves all the regex problems, we might consider it. Definitely leaving that one to the core maintainers, though 😅

@Qubitium
Copy link
Contributor Author

Qubitium commented Oct 11, 2025

Very interesting, but we might be wary of depending on a library that new, until we're confident that it's going to be maintained over time!

Still, if it's straightforward bindings to a well-maintained lib and it solves all the regex problems, we might consider it. Definitely leaving that one to the core maintainers, though 😅

Challenge accepted. lol. pypcre v0.2.0 just dropped. Not only is it fully GIL=0 compliant, it is also much faster than anything out there. The more threads you use, the greater our advantage.

Btw, I know this is getting off topic, but since it is technically relevant I will drop this:

After looking at regex and re internals, it has become clear to me why re and regex may seem fast (single-threaded) when compared to PCRE-linked packages.

Both re and regex do not need to allocate extra memory when doing regular expression searches on the target (aka subject), because they both operate directly on Python's raw UCS-2/UCS-4-like (custom) internal string encoding. PCRE, on the other hand, expects and only operates on UTF-encoded byte strings, so almost every call into PCRE requires a new allocation plus a deallocation (post-call) for the subject.

re is fast not because it has a better regular expression engine or is smarter; it is not. PCRE is hands down the best, imho. But re is optimized for Python with a subset of PCRE features. regex tries to be fast using the same raw Python string memory searches, but it fails at thread safety and concurrency.

However, despite regex using Python memory and avoiding one allocation and one deallocation per call, we (as in pypcre) are already beating regex at latency if you use the module-level API (the most common), like re.match(), since most users are too lazy to write pattern = re.compile() + pattern.match().
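As an aside for readers, the module-level vs. pre-compiled distinction being discussed looks like this in stdlib `re` (a minimal illustration of the usage pattern, not pypcre code):

```python
import re

# Module-level API: every call looks the pattern up in re's internal
# compiled-pattern cache (compiling it on a miss) before matching.
m = re.match(r"\d+", "123 apples")
print(m.group())  # 123

# Pre-compiled pattern: compilation cost is paid once up front, so the
# per-call work is just the match itself.
pattern = re.compile(r"\d+")
print(pattern.match("123 apples").group())  # 123
```

The two forms return identical results; the difference is only where the compile/cache-lookup cost is paid.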

@Rocketknight1
Copy link
Member

Rocketknight1 commented Oct 13, 2025

Hey! After some internal discussion, people aren't ready to depend on ModelCloud/PyPcre! The work you're doing is very impressive, but transformers is a big library that cares a lot about stability, and everyone's worried about a hard dep on a 1-week-old library, lol

I think the plan we want is this series of PRs instead:

  1. Move as many regexes as possible to re, since most of them are already compatible. Remove regex imports in files that don't need it.

  2. Rewrite regexes with unsupported features like \p{...}

  3. For regexes with features supported after 3.11, like atomic groups, we can either rewrite them, or if that's not possible then those files can have a conditional import of regex only for Python 3.10.x. If regex isn't installed then they can raise an error telling the user to either install regex or update their Python version.

Even at the cost of performance in some cases, depending on the internal lib maximizes future support and minimizes maintenance and dependency headaches we have to worry about later.
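Step 3's conditional import could be sketched roughly like this (my illustration of the proposal, not code from this PR; the alias `regex_engine` and the error message are made up):

```python
import sys

if sys.version_info >= (3, 11):
    # Atomic groups ((?>...)) and possessive quantifiers landed in the
    # stdlib re module in Python 3.11.
    import re as regex_engine
else:
    try:
        # Older Pythons fall back to the third-party `regex` package.
        import regex as regex_engine
    except ImportError:
        raise ImportError(
            "This tokenizer uses regex features that require Python >= 3.11 "
            "or the `regex` package; please install `regex` or upgrade Python."
        )

# Atomic group: (?>ab) matches "ab" without allowing backtracking into it.
print(bool(regex_engine.search(r"(?>ab)c", "abc")))  # True
```

Either branch exposes the same `match`/`search`/`compile` surface, so call sites need no further changes.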

@Qubitium
Copy link
Contributor Author

Qubitium commented Oct 14, 2025

Hey! After some internal discussion, people aren't ready to depend on ModelCloud/PyPcre! The work you're doing is very impressive, but transformers is a big library that cares a lot about stability, and everyone's worried about a hard dep on a 1-week-old library, lol

Challenge accepted. Now is not the time or place to fight this good fight, but Transformers will come back to pypcre. =) The time will come sooner rather than later. I will sit on the rock like Master Turtle and wait for the tea leaves to drop (they are about to).

  1. Move as many regexes as possible to re, since most of them are already compatible. Remove regex imports in files that don't need it.

Agreed. This is the best short term fix.

  2. Rewrite regexes with unsupported features like \p{...}

I would not recommend this. \p is powerful syntax. Once you use it, you never go back. Any conversion of \p to a replacement is asking users to give up nice high-level syntax and drop down to lower-level constructs. I have also written a non-open-sourced Golang PCRE wrapper, using the same performance techniques as pypcre, and maintained it for many years, for exactly this reason: why should I abandon \p{} just because Golang does not give it to me?
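To make the tradeoff concrete, here is a rough sketch (my own illustration, not code from this PR) of what rewriting `\p{L}+` without Unicode property support can look like: stdlib `re` has no `\p{...}`, so one fallback is manual Unicode-category checks.

```python
import unicodedata

def letter_runs(text: str) -> list[str]:
    # Approximate the regex-package pattern r"\p{L}+" by grouping
    # consecutive characters whose Unicode category starts with "L"
    # (letter, in any script).
    runs, current = [], []
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs

print(letter_runs("héllo 世界 42"))  # ['héllo', '世界']
```

A one-token `\p{L}+` becomes a hand-rolled scanner, which is exactly the kind of regression being argued against here.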

On another tangent, but GIL=0 related: all Transformers code paths that use Triton kernels are not GIL=0 safe. I tried to use safe.py wrappers and failed, so I upstreamed the fix instead.

Fix autotune thread safety (crash) under GIL=0 PR #8437 - so if you guys can help secondarily review/validate that PR for me, that would be great too, since it affects Transformers as much as Triton. Newer models like Qwen3-Next recommend flash linear attention and demand Triton. Any model that uses Triton kernels with autotune enabled is also affected.

@Rocketknight1 Can you do a final review on your end and check if there is anything else needed for this PR? The GitHub status shows you were requested as a reviewer but have not yet completed a review of this PR. I know others also need to review this, but it would be good to get as many reviewers onboard in the meantime.

@Rocketknight1
Copy link
Member

Rocketknight1 commented Oct 14, 2025

Hmmn, I'm unsure! I think we'd like to pause this PR for the moment and start by moving regexes to re, and then after that we can see what's left, and how many patterns actually need to be rewritten or wrapped with safe.regex. Sorry for the confusion - there was definitely some internal discussion about this!

Would you be willing to leave this for the moment and make a separate PR to switch files to re where possible? If not it's totally okay, we can try to assign someone to it.

@github-actions
Copy link
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: bart, bertweet, blenderbot, blenderbot_small, clip, clvp, codegen, ctrl, deberta, deepseek_vl, deepseek_vl_hybrid, depth_pro, fastspeech2_conformer, got_ocr2, gpt2, gpt_oss

@Qubitium
Copy link
Contributor Author

Would you be willing to leave this for the moment and make a separate PR to switch files to re where possible? If not it's totally okay, we can try to assign someone to it.

Sure. Let's pause this PR until you guys decide. Please assign someone to do the re conversion work, as that would fix 90% of the gotchas in threading. Though I hope Transformers does not go backwards and reinvent the wheel with \p. PCRE2 is as stable and advanced a platform for regular expressions as you can get. Intel has their own subset, maximized for parallel throughput but unable to backtrack; Google has RE2, constrained for security reasons; the list goes on and on. Everyone is a subset of PCRE2, but no one is above it or matches it when it comes to feature set.
