[ty] Add more and update existing projects in `ty_benchmark` by MichaReiser · Pull Request #21536 · astral-sh/ruff

MichaReiser · 2025-11-20T09:21:01Z

Summary

This PR adds more projects to ty_benchmark and updates existing benchmarks. It also adds pyrefly as a benchmark target. I also made some improvements to result rendering and added a check that the command fails if any type checker exits due to an error other than typing errors (requires hyperfine 1.20 or newer).

I don't consider this the final set of projects, and I'm happy to add (or remove) projects based on your feedback. Overall, it's fairly tricky to select a set of projects because any project that isn't a library tends to use a mypy plugin or non-strict type checking options. That either results in a lot of diagnostics for type checkers other than the one the project uses, or it requires customizing each type checker's configuration to roughly the same settings, which I'm not convinced is worth the effort.

We should be careful about drawing early conclusions from the benchmark, especially when comparing ty and pyrefly, because both type checkers are still missing crucial, but different, typing features. ty is probably a little further behind (at least until the beta, where we add many of those large missing features).

Closes astral-sh/ty#241

I'm not 100% convinced whether we want the snapshotting mechanism, but it's sort of nice to have some way of measuring if the projects still do what one expects them to.
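To make the snapshot idea concrete, here is a minimal sketch of how such a check could work. It is not the code from this PR: the helper name `check_snapshot`, its `accept` parameter, and the snapshot file layout are assumptions made for illustration.

```python
import difflib
from pathlib import Path


def check_snapshot(snapshot_path: Path, actual: str, accept: bool = False) -> bool:
    """Compare `actual` tool output against a stored snapshot.

    Hypothetical helper: returns True when the output matches the snapshot,
    or when the snapshot was just (re)written because `accept` is set.
    """
    if accept:
        # Accepting creates or overwrites the snapshot with the new output.
        snapshot_path.parent.mkdir(parents=True, exist_ok=True)
        snapshot_path.write_text(actual, encoding="utf-8")
        return True

    if not snapshot_path.exists():
        print(f"missing snapshot: {snapshot_path}")
        return False

    expected = snapshot_path.read_text(encoding="utf-8")
    if expected == actual:
        return True

    # Print a unified diff so an accidental change is easy to spot.
    diff = difflib.unified_diff(
        expected.splitlines(keepends=True),
        actual.splitlines(keepends=True),
        fromfile=str(snapshot_path),
        tofile="actual",
    )
    print("".join(diff))
    return False
```

A check like this only detects that the diagnostics changed. Whether the change is a regression still needs a human to look at the diff.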

Test Plan

black
-----

Benchmark 1: ty
  Time (mean ± σ):      57.0 ms ±   2.5 ms    [User: 334.9 ms, System: 39.9 ms]
  Range (min … max):    52.8 ms …  64.1 ms    49 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: Pyrefly
  Time (mean ± σ):     159.5 ms ±   9.4 ms    [User: 567.9 ms, System: 179.3 ms]
  Range (min … max):   145.7 ms … 181.8 ms    17 runs

  Warning: Ignoring non-zero exit code.

Benchmark 3: mypy
  Time (mean ± σ):      1.188 s ±  0.027 s    [User: 1.132 s, System: 0.052 s]
  Range (min … max):    1.167 s …  1.261 s    10 runs

Benchmark 4: mypy (warm)
  Time (mean ± σ):     130.3 ms ±   1.0 ms    [User: 97.8 ms, System: 30.7 ms]
  Range (min … max):   129.1 ms … 133.7 ms    22 runs

Benchmark 5: Pyright
  Time (mean ± σ):      1.880 s ±  0.069 s    [User: 20.128 s, System: 0.912 s]
  Range (min … max):    1.808 s …  2.040 s    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ty ran
    2.29 ± 0.10 times faster than mypy (warm)
    2.80 ± 0.21 times faster than Pyrefly
   20.85 ± 1.02 times faster than mypy
   33.01 ± 1.88 times faster than Pyright

-------------------------------------------------------------------------------

discord.py
----------

Benchmark 1: ty
  Time (mean ± σ):     203.9 ms ±  12.6 ms    [User: 1232.4 ms, System: 94.1 ms]
  Range (min … max):   196.9 ms … 247.0 ms    14 runs

  Warning: Ignoring non-zero exit code.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.

Benchmark 2: Pyrefly
  Time (mean ± σ):     297.6 ms ±  19.7 ms    [User: 2368.9 ms, System: 260.9 ms]
  Range (min … max):   262.9 ms … 333.3 ms    11 runs

  Warning: Ignoring non-zero exit code.

Benchmark 3: mypy
  Time (mean ± σ):      6.795 s ±  0.254 s    [User: 6.683 s, System: 0.105 s]
  Range (min … max):    6.352 s …  7.053 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 4: mypy (warm)
  Time (mean ± σ):      5.804 s ±  0.285 s    [User: 5.709 s, System: 0.090 s]
  Range (min … max):    5.315 s …  6.121 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 5: Pyright
  Time (mean ± σ):      4.377 s ±  0.074 s    [User: 53.869 s, System: 1.943 s]
  Range (min … max):    4.227 s …  4.459 s    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ty ran
    1.46 ± 0.13 times faster than Pyrefly
   21.46 ± 1.37 times faster than Pyright
   28.46 ± 2.24 times faster than mypy (warm)
   33.32 ± 2.40 times faster than mypy

-------------------------------------------------------------------------------

homeassistant
-------------

Benchmark 1: ty
  Time (mean ± σ):      1.932 s ±  0.057 s    [User: 20.511 s, System: 2.858 s]
  Range (min … max):    1.841 s …  2.016 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: Pyrefly
  Time (mean ± σ):      5.377 s ±  0.018 s    [User: 22.429 s, System: 35.348 s]
  Range (min … max):    5.353 s …  5.410 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 3: mypy
  Time (mean ± σ):     41.547 s ±  0.465 s    [User: 39.661 s, System: 1.780 s]
  Range (min … max):   40.391 s … 42.065 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 4: mypy (warm)
  Time (mean ± σ):      2.849 s ±  0.024 s    [User: 1.771 s, System: 1.072 s]
  Range (min … max):    2.821 s …  2.897 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 5: Pyright
  Time (mean ± σ):     52.909 s ±  2.973 s    [User: 467.397 s, System: 29.649 s]
  Range (min … max):   48.636 s … 57.453 s    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ty ran
    1.47 ± 0.05 times faster than mypy (warm)
    2.78 ± 0.08 times faster than Pyrefly
   21.50 ± 0.68 times faster than mypy
   27.38 ± 1.74 times faster than Pyright

-------------------------------------------------------------------------------

isort
-----

Benchmark 1: ty
  Time (mean ± σ):      41.5 ms ±   3.9 ms    [User: 163.1 ms, System: 20.1 ms]
  Range (min … max):    36.8 ms …  52.0 ms    66 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: Pyrefly
  Time (mean ± σ):     233.3 ms ±  22.2 ms    [User: 734.0 ms, System: 131.1 ms]
  Range (min … max):   197.6 ms … 263.6 ms    14 runs

  Warning: Ignoring non-zero exit code.

Benchmark 3: mypy
  Time (mean ± σ):     565.9 ms ±   1.5 ms    [User: 533.8 ms, System: 30.0 ms]
  Range (min … max):   563.7 ms … 569.3 ms    10 runs

Benchmark 4: mypy (warm)
  Time (mean ± σ):     111.1 ms ±   1.7 ms    [User: 85.4 ms, System: 24.2 ms]
  Range (min … max):   108.7 ms … 115.3 ms    24 runs

Benchmark 5: Pyright
  Time (mean ± σ):      6.068 s ±  0.108 s    [User: 28.978 s, System: 1.129 s]
  Range (min … max):    5.872 s …  6.296 s    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ty ran
    2.68 ± 0.25 times faster than mypy (warm)
    5.62 ± 0.75 times faster than Pyrefly
   13.63 ± 1.27 times faster than mypy
  146.12 ± 13.86 times faster than Pyright

-------------------------------------------------------------------------------

jinja
-----

Benchmark 1: ty
  Time (mean ± σ):     125.6 ms ±  11.1 ms    [User: 336.7 ms, System: 28.6 ms]
  Range (min … max):   111.7 ms … 145.3 ms    24 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: Pyrefly
  Time (mean ± σ):     166.6 ms ±   5.6 ms    [User: 554.6 ms, System: 103.5 ms]
  Range (min … max):   159.0 ms … 177.1 ms    17 runs

  Warning: Ignoring non-zero exit code.

Benchmark 3: mypy
  Time (mean ± σ):     683.8 ms ±   5.3 ms    [User: 648.0 ms, System: 33.3 ms]
  Range (min … max):   678.0 ms … 693.6 ms    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 4: mypy (warm)
  Time (mean ± σ):     378.2 ms ±   2.7 ms    [User: 339.1 ms, System: 36.9 ms]
  Range (min … max):   375.5 ms … 383.0 ms    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 5: Pyright
  Time (mean ± σ):      3.394 s ±  0.156 s    [User: 29.392 s, System: 1.257 s]
  Range (min … max):    3.220 s …  3.707 s    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ty ran
    1.33 ± 0.13 times faster than Pyrefly
    3.01 ± 0.27 times faster than mypy (warm)
    5.45 ± 0.48 times faster than mypy
   27.02 ± 2.68 times faster than Pyright

-------------------------------------------------------------------------------

pandas
------

Benchmark 1: ty
  Time (mean ± σ):     708.6 ms ± 177.8 ms    [User: 4575.6 ms, System: 289.8 ms]
  Range (min … max):   451.3 ms … 999.0 ms    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: Pyrefly
  Time (mean ± σ):      1.762 s ±  0.024 s    [User: 17.146 s, System: 1.652 s]
  Range (min … max):    1.728 s …  1.803 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 3: mypy
  Time (mean ± σ):     21.097 s ±  0.099 s    [User: 20.840 s, System: 0.249 s]
  Range (min … max):   20.990 s … 21.284 s    10 runs

Benchmark 4: mypy (warm)
  Time (mean ± σ):     260.8 ms ±   1.9 ms    [User: 163.1 ms, System: 95.6 ms]
  Range (min … max):   258.5 ms … 264.1 ms    11 runs

Benchmark 5: Pyright
  Time (mean ± σ):     21.739 s ±  1.244 s    [User: 178.356 s, System: 6.204 s]
  Range (min … max):   20.221 s … 23.368 s    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  mypy (warm) ran
    2.72 ± 0.68 times faster than ty
    6.76 ± 0.10 times faster than Pyrefly
   80.90 ± 0.69 times faster than mypy
   83.37 ± 4.81 times faster than Pyright

-------------------------------------------------------------------------------

pandas-stubs
------------

Benchmark 1: ty
  Time (mean ± σ):      77.7 ms ±  20.6 ms    [User: 307.2 ms, System: 58.0 ms]
  Range (min … max):    57.2 ms … 119.6 ms    47 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: Pyrefly
  Time (mean ± σ):     375.5 ms ±  63.1 ms    [User: 1137.2 ms, System: 530.6 ms]
  Range (min … max):   313.7 ms … 482.7 ms    10 runs

Benchmark 3: mypy
  Time (mean ± σ):      6.344 s ±  0.618 s    [User: 6.186 s, System: 0.153 s]
  Range (min … max):    5.514 s …  7.611 s    10 runs

Benchmark 4: mypy (warm)
  Time (mean ± σ):     231.5 ms ±   4.6 ms    [User: 161.1 ms, System: 68.5 ms]
  Range (min … max):   224.1 ms … 238.3 ms    12 runs

Benchmark 5: Pyright
  Time (mean ± σ):      8.211 s ±  0.184 s    [User: 43.375 s, System: 2.357 s]
  Range (min … max):    8.016 s …  8.659 s    10 runs

Summary
  ty ran
    2.98 ± 0.79 times faster than mypy (warm)
    4.83 ± 1.52 times faster than Pyrefly
   81.68 ± 23.08 times faster than mypy
  105.72 ± 28.14 times faster than Pyright

-------------------------------------------------------------------------------

prefect
-------

Benchmark 1: ty
  Time (mean ± σ):     205.5 ms ±  95.2 ms    [User: 775.7 ms, System: 124.1 ms]
  Range (min … max):    88.9 ms … 432.5 ms    28 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: Pyrefly
  Time (mean ± σ):     681.9 ms ±  11.3 ms    [User: 2525.8 ms, System: 1435.2 ms]
  Range (min … max):   660.8 ms … 694.6 ms    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 3: mypy
  Time (mean ± σ):     779.0 ms ±  48.5 ms    [User: 739.7 ms, System: 36.6 ms]
  Range (min … max):   733.3 ms … 849.8 ms    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 4: mypy (warm)
  Time (mean ± σ):     253.1 ms ±   4.9 ms    [User: 209.3 ms, System: 41.4 ms]
  Range (min … max):   249.0 ms … 262.0 ms    11 runs

  Warning: Ignoring non-zero exit code.

Benchmark 5: Pyright
  Time (mean ± σ):     12.305 s ±  0.840 s    [User: 90.406 s, System: 3.771 s]
  Range (min … max):   11.105 s … 13.787 s    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ty ran
    1.23 ± 0.57 times faster than mypy (warm)
    3.32 ± 1.54 times faster than Pyrefly
    3.79 ± 1.77 times faster than mypy
   59.86 ± 28.02 times faster than Pyright

-------------------------------------------------------------------------------

pytorch
-------

Benchmark 1: ty
  Time (mean ± σ):      1.883 s ±  0.112 s    [User: 15.373 s, System: 0.967 s]
  Range (min … max):    1.628 s …  2.022 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: Pyrefly
  Time (mean ± σ):      2.881 s ±  0.025 s    [User: 21.954 s, System: 9.919 s]
  Range (min … max):    2.828 s …  2.915 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 3: mypy
  Time (mean ± σ):     30.414 s ±  0.623 s    [User: 29.780 s, System: 0.540 s]
  Range (min … max):   29.751 s … 31.527 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 4: mypy (warm)
  Time (mean ± σ):     28.403 s ±  0.232 s    [User: 27.680 s, System: 0.684 s]
  Range (min … max):   28.012 s … 28.752 s    10 runs

  Warning: Ignoring non-zero exit code.

Benchmark 5: Pyright
  Time (mean ± σ):     18.750 s ±  0.674 s    [User: 203.035 s, System: 12.257 s]
  Range (min … max):   17.315 s … 19.962 s    10 runs

  Warning: Ignoring non-zero exit code.

Summary
  ty ran
    1.53 ± 0.09 times faster than Pyrefly
    9.96 ± 0.69 times faster than Pyright
   15.08 ± 0.90 times faster than mypy (warm)
   16.15 ± 1.01 times faster than mypy
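
For reading the "times faster" lines: the figures above are consistent with a ratio of means whose uncertainty comes from standard first-order error propagation. The sketch below assumes that formula (it is not taken from hyperfine's source) and uses the pytorch numbers for ty and Pyrefly. The function name `speedup` is made up for this example.

```python
import math


def speedup(
    fast_mean: float, fast_std: float, slow_mean: float, slow_std: float
) -> tuple[float, float]:
    """Return (ratio, uncertainty) of slow/fast using first-order error propagation."""
    ratio = slow_mean / fast_mean
    # Relative errors of the two means add in quadrature.
    rel_err = math.sqrt((fast_std / fast_mean) ** 2 + (slow_std / slow_mean) ** 2)
    return ratio, ratio * rel_err


# Values in seconds, copied from the pytorch results above (ty vs. Pyrefly).
ratio, err = speedup(1.883, 0.112, 2.881, 0.025)
print(f"{ratio:.2f} ± {err:.2f}")  # 1.53 ± 0.09
```

This also explains why projects where ty's own σ is large (pandas, prefect) show wide error bars on every ratio: ty's relative error enters each comparison.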

MichaReiser · 2025-11-20T09:21:21Z

scripts/ty_benchmark/src/benchmark/cases.py

No, github doesn't recognize the move :(

astral-sh-bot · 2025-11-20T09:30:50Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

AlexWaygood

A few small things from skimming!

scripts/ty_benchmark/src/benchmark/__init__.py

scripts/ty_benchmark/pyproject.toml

scripts/ty_benchmark/src/benchmark/projects.py

scripts/ty_benchmark/src/benchmark/snapshot.py

sharkdp

Thank you!

scripts/ty_benchmark/README.md

sharkdp · 2025-11-24T08:07:42Z

scripts/ty_benchmark/README.md

+
+The benchmark script supports snapshoting the results when running with `--snapshot` and `--accept`.
+The goal of those snapshots is to catch accidental regressions. They are not intended
+as a testing tool. E.g. the snapshot runner doesn't account for platform differences so that


Can/should we "pin" the platform to an arbitrary one, i.e. pass the equivalent of ty's --python-platform=linux to all type checkers?

It's tricky because, you know, Python.

Solving for one platform requires installing the dependency for that platform, and that failed for me at least for some of the projects on macos.

scripts/ty_benchmark/pyproject.toml

sharkdp · 2025-11-24T08:10:46Z

scripts/ty_benchmark/snapshots/pandas-stubs_ty.txt

+pandas-stubs/_typing.pyi:861:44: error[invalid-argument-type] Argument to class `dtype` is incorrect: Expected `generic[Any]`, found `typing.TypeVar`
+pandas-stubs/_typing.pyi:865:48: error[invalid-argument-type] Argument to class `dtype` is incorrect: Expected `generic[Any]`, found `typing.TypeVar`
+pandas-stubs/_typing.pyi:877:53: error[invalid-argument-type] Argument to class `dtype` is incorrect: Expected `generic[Any]`, found `typing.TypeVar`


I'll fix those soon 😄
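
Snapshot lines in this shape (`path:line:column: severity[rule] message`) are easy to summarize, for example to check that a project still produces roughly the expected number of diagnostics per rule. The following is a rough sketch under that assumption. The regex and the `count_by_rule` helper are hypothetical and are not part of the benchmark package.

```python
import re
from collections import Counter

# Matches lines such as:
#   some/file.pyi:861:44: error[invalid-argument-type] message...
DIAGNOSTIC_RE = re.compile(
    r"^(?P<path>.+?):(?P<line>\d+):(?P<column>\d+): "
    r"(?P<severity>\w+)\[(?P<rule>[\w-]+)\] (?P<message>.*)$"
)


def count_by_rule(snapshot_text: str) -> Counter[str]:
    """Count diagnostics per rule; lines that don't match the pattern are skipped."""
    counts: Counter[str] = Counter()
    for line in snapshot_text.splitlines():
        match = DIAGNOSTIC_RE.match(line.strip())
        if match:
            counts[match["rule"]] += 1
    return counts


sample = """\
pandas-stubs/_typing.pyi:861:44: error[invalid-argument-type] Argument to class `dtype` is incorrect: Expected `generic[Any]`, found `typing.TypeVar`
pandas-stubs/_typing.pyi:865:48: error[invalid-argument-type] Argument to class `dtype` is incorrect: Expected `generic[Any]`, found `typing.TypeVar`
pandas-stubs/_typing.pyi:877:53: error[invalid-argument-type] Argument to class `dtype` is incorrect: Expected `generic[Any]`, found `typing.TypeVar`
"""
print(count_by_rule(sample))
```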

sharkdp · 2025-11-24T08:12:13Z

scripts/ty_benchmark/src/benchmark/__init__.py

+            # pyright exit codes: https://docs.basedpyright.com/v1.31.6/configuration/command-line/#pyright-exit-codes
+            # pyrefly exit codes: Not documented
+            # ty: https://docs.astral.sh/ty/reference/exit-codes/
+            "-i=1",


Minor: I prefer to always use the --long-form of options when calling tools from scripts

Suggested change

"-i=1",

"--ignore-failure=1",
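
For context, a sketch of how the long-form option could fit into a hyperfine invocation built from Python. The helper `hyperfine_args` and the example commands are assumptions for illustration, not the code in `__init__.py`. Passing an exit code to `--ignore-failure` relies on hyperfine 1.20 or newer, as noted in the PR summary, and the sketch assumes exit code 1 is how the checkers signal "type errors found".

```python
import shlex


def hyperfine_args(commands: dict[str, list[str]], runs: int = 10) -> list[str]:
    """Build a hyperfine command line for named benchmark commands (hypothetical helper)."""
    args = [
        "hyperfine",
        "--runs",
        str(runs),
        # Assumption: exit code 1 means "type errors found". Tolerate only that,
        # so crashes or configuration errors still fail the benchmark run.
        "--ignore-failure=1",
    ]
    for name, command in commands.items():
        # hyperfine takes each benchmarked command as a single shell string.
        args += ["--command-name", name, shlex.join(command)]
    return args


args = hyperfine_args({"ty": ["ty", "check"], "mypy": ["mypy", "."]})
print(shlex.join(args))
```

The resulting list can be handed to `subprocess.run(args, check=True)`, so that a non-zero exit from hyperfine itself surfaces as an exception.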

scripts/ty_benchmark/src/benchmark/tool.py

MichaReiser added internal An internal refactor or improvement ty Multi-file analysis & type inference labels Nov 20, 2025

MichaReiser mentioned this pull request Nov 20, 2025

add a test for ty_benchmark that asserts we have roughly the number of expected errors astral-sh/ty#241

Closed

MichaReiser force-pushed the micha/more-benchmarks branch from 1f5303f to 4cb553b Compare November 20, 2025 12:12

MichaReiser marked this pull request as ready for review November 20, 2025 12:21

MichaReiser requested review from AlexWaygood, carljm, dcreager and sharkdp as code owners November 20, 2025 12:21

MichaReiser force-pushed the micha/more-benchmarks branch from d196985 to e10b1e9 Compare November 20, 2025 12:54

MichaReiser mentioned this pull request Nov 20, 2025

[ty] Add error counter to the ty benchmark package #21114

Closed

AlexWaygood reviewed Nov 20, 2025

MichaReiser requested a review from charliermarsh November 20, 2025 18:32

carljm removed their request for review November 20, 2025 21:59

charliermarsh approved these changes Nov 21, 2025

MichaReiser added 5 commits November 21, 2025 17:49

More benchmarks

db1dcf7

Add diagnostic snapshots

e19d426

Use Pyright's node executable

f1127de

Use persistent configurations

801a18b

Update ty

f69f278

sharkdp approved these changes Nov 24, 2025

sharkdp reviewed Nov 24, 2025

scripts/ty_benchmark/src/benchmark/tool.py

MichaReiser added 2 commits November 24, 2025 15:15

Small windows fixes

9bb6a25

Better windows support

4303563

MichaReiser force-pushed the micha/more-benchmarks branch from a3bdc54 to 4303563 Compare November 24, 2025 18:13

MichaReiser added 3 commits November 24, 2025 19:18

Code review feedback

386afe6

More windows fixes

de18a8d

Skip homeassistant on Windows

1b5c877

MichaReiser merged commit 747c39a into main Nov 25, 2025
37 checks passed

MichaReiser deleted the micha/more-benchmarks branch November 25, 2025 07:58
