Guidelines for acceptable performance optimizations?

Python has been my main language since 2003, and I figured I should finally try to contribute directly to the language itself. My first idea was to look for simple micro-optimization opportunities in the stdlib. My first PR was rejected (fair), with the advice that I should look for clear runtime improvements. Trouble is, I am not sure what the threshold for "clear" is, so I am not sure what kinds of changes I should be looking for.

I have tried to find documents or guidelines about this, but so far no luck. If you know of any such documents, could you point me to them?

If there is no such thing, could we come up with something?

Of course, even with threshold guidelines a PR could be rejected because it could be seen as more trouble than it is worth.

I can think of some different categories like:

  • import speed optimization
  • non-import speed optimizations for code that is typically not in hot loops (urllib, logging, …?)
  • typical hot loop performance optimizations (enum, pathlib, …?)
  • test speed optimizations
  • other (doc generation, …?)

Can you think of other categories? Do any numbers come to mind when you think of these categories?

Worth doing at all?

2 Likes

Hey @heikkitoivonen, thank you for trying to contribute to CPython performance!

The standard way CPython performance is evaluated is with the PyPerformance benchmark suite, which you can run locally to measure the impact of proposed optimizations (check out the usage docs, make sure to compile CPython with optimizations enabled, including PGO, and pay attention to benchmark stability considerations).

Any optimization that results in a visible impact on the benchmark suite should be “acceptable”. Often, the impact reported is the geometric mean across all benchmarks, but sometimes (especially with micro-benchmarks) you may report stat-sig results on individual benchmarks (or subsets), and usually those are also “acceptable”, assuming there are no significant regressions in other benchmarks (or in the overall geometric mean). Do expect noisy benchmarks. I recommend running several A/A tests on your machine to get a sense of benchmark stability and noise level, so you have a better idea of what “stat-sig impact” looks like (e.g. some benchmarks have a noise level of ~0.1%, so improving one by 0.5% may be stat-sig, while others have a noise level of ~3%, so a 0.5% change wouldn’t be considered a meaningful improvement).
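To get a sense of that noise floor in practice, a minimal pyperf microbenchmark can help; the sketch below uses a placeholder workload (the benchmark name and the string-joining loop are arbitrary, not something from the suite):

    # bench_noise_check.py -- minimal pyperf microbenchmark (placeholder workload)
    import pyperf

    def workload():
        # Stand-in workload; swap in the code path you intend to optimize.
        return ",".join(str(i) for i in range(200))

    runner = pyperf.Runner()
    runner.bench_func("noise_check", workload)

Running it twice against the same unmodified build (e.g. with -o a.json and -o b.json) and comparing the results with python -m pyperf compare_to a.json b.json shows what a pure A/A difference looks like before you start comparing builds.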

Also keep in mind that while you’re probably developing and measuring with one specific setup, it is quite possible that an improvement you measure there looks different (or is even a regression) in other configurations. Typically relevant parameters include the OS, CPU architecture, compiler and compiler version, etc.

Another often overlooked consideration is memory. It may be possible to achieve significant runtime improvements at the expense of increasing memory footprint, and it’s important to consider the trade-off.
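For a first-order look at that trade-off, the stdlib’s tracemalloc module is usually enough; a rough sketch (the precomputed table below is a made-up stand-in for whatever extra state an optimization keeps around):

    # Rough check of the extra memory an optimization keeps alive, using tracemalloc.
    import tracemalloc

    tracemalloc.start()
    cache = {n: str(n) for n in range(100_000)}   # hypothetical precomputed table
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"current: {current / 1024:.0f} KiB, peak: {peak / 1024:.0f} KiB")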

You may also consider contributing additional benchmarks to PyPerformance itself (or creating custom benchmarks that aren’t part of PyPerformance).

It’s important to remember that “performance” is always in the context of a specific workload, and we understand that PyPerformance doesn’t represent all workloads out there. In some cases, you may have an optimization that doesn’t have any visible impact on PyPerformance but does considerably improve some real-world workload. In such a case, the optimization may be “acceptable” based on its reported impact on that real-world workload.

Hope this helps!

3 Likes

Thanks for the response. I am aware of the benchmark suite, but I have not run it yet because I was thinking the micro-optimizations in my initial PR probably would not show up there, being so small (for example, a ~2 μs improvement when importing a specific module).
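For changes at that scale, python -X importtime gives a per-module breakdown of import cost at microsecond granularity, which is probably the right first tool. A coarser end-to-end check looks something like the rough sketch below (the module name is just an example); it spawns fresh interpreters so nothing is cached in sys.modules, and it includes interpreter startup, so only the before/after difference is meaningful:

    # Rough cold-import timing: spawn fresh interpreters so the module is not
    # already cached in sys.modules. Interpreter startup is included, so compare
    # the before/after difference rather than the absolute number.
    import statistics
    import subprocess
    import sys
    import time

    def cold_import_time(module, repeat=20):
        timings = []
        for _ in range(repeat):
            start = time.perf_counter()
            subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
            timings.append(time.perf_counter() - start)
        return statistics.median(timings)

    print(f"import pathlib: {cold_import_time('pathlib') * 1e3:.1f} ms")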

If we only look at the benchmark suite, I think we could be leaving reasonable improvements out of scope. For example, imagine someone comes up with a micro-optimization idea that improves lots of stdlib calls by 1 μs each; if we made 1,000 such line changes, the benchmarks might show a ~1% improvement above noise in some cases. As a whole that could be deemed significant, but being such a large change it would probably be rejected.

But if we had additional guidelines saying that for hot-path optimizations we can consider 1 μs improvements even if they don’t show up in the benchmark suite, then a PR could be made to change just a few lines in a specific module. Over time these small PRs could add up to a significant impact.

Having said all that, I will go play with the benchmark suite to try and figure out what kind of changes would be visible in that.

Here are a few random thoughts:

CPython is a somewhat conservative project, most importantly because it’s easy to introduce subtle behaviour changes that maintainers may not notice (see also Hyrum’s law).

So perf PRs carry a burden of proof that contributions need to meet. They should demonstrate impact, in order of preference: broadly across pyperformance; on a user’s real-life production workload; narrowly on a single pyperformance benchmark; or in a microbenchmark. If there are trade-offs being made (e.g. on readability or memory usage), they need to be justified.

Microoptimisations can be a moving target, and CPython determines which way the targets move. For instance, the microoptimisation of caching method lookups now actually regresses performance on modern versions of CPython. CPython core devs often prefer to trust that future versions of the interpreter will optimise idiomatic, readable Python, rather than accepting trade-offs in idiomatic style, readability, or churn.
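As a concrete illustration of that moving target, here is the classic bound-method-caching pattern (a generic example, not taken from any particular PR); which version wins depends on the interpreter version, which is exactly why such changes age badly:

    # Classic microoptimisation: hoist the bound-method lookup out of the loop.
    # Whether this helps or hurts depends on the CPython version, since newer
    # interpreters specialise the attribute lookup and the call anyway.
    def build_idiomatic(n):
        out = []
        for i in range(n):
            out.append(i * i)   # attribute looked up on every iteration
        return out

    def build_cached(n):
        out = []
        append = out.append     # cached bound method
        for i in range(n):
            append(i * i)
        return out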

The smaller the benefit, the more robust and representative the benchmark demonstrating the benefit should be.

Changes with questionable benefit often beget more changes with questionable benefit, in a way that can tax maintainers, and CPython tends to weigh churn as a significant cost. It’s usually the changes with the most questionable benefit that generate the most taxing debate — after all, the changes with inarguable benefit tend to be inarguable.

If you have some change in mind, it doesn’t hurt to open an issue to discuss it! As you can maybe tell, the guidelines here aren’t precisely defined, so as long as you’re making a good faith attempt to learn where the lines are (and are willing to accept the occasional no), everyone is more than happy for would-be contributors to explore potential contributions.

5 Likes

My personal categorisation would be:

  1. Core: compiler, JIT, etc.
  2. Extension / object modules
  3. Pure Python modules

(2) and (3) each could be split into 2 parts:
A. Code optimizations
B. Algorithm work

I think you are correct.
There are many things that could be worthwhile to do that will not show up in the benchmark suite.

That is true, but there are various things to consider.

I think the one that stuck with me most is that a more generic optimization covering a whole area can be more valuable than many special cases. Special-case complexity is often undesirable: it bloats the code and makes the whole picture harder to digest for those working on changes that could improve performance across the whole domain.

E.g. special-casing comparisons within min() for different object types, versus a strategy of auto-special-casing comparisons everywhere.

@picnixz mentioned the following guidelines in my PR, so I'm adding them here as well:

  • 2-3% improvements on macro benchmarks are fine.

  • ≥10% improvements on micro benchmarks are fine.

For micro benchmarks, we should also consider how the function is used in applications and whether it is actually a bottleneck. If the function is very important, a 5% improvement on a microbenchmark is also fine.
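For producing those microbenchmark numbers, pyperf's timeit helper is one convenient option; a minimal sketch (the target call is an arbitrary example), run once on the baseline build and once on the patched build:

    # Minimal pyperf microbenchmark for a single stdlib call (arbitrary example).
    # Run with -o baseline.json on the unpatched build and -o patched.json on the
    # patched one, then compare: python -m pyperf compare_to baseline.json patched.json
    import pyperf

    runner = pyperf.Runner()
    runner.timeit(
        "posixpath.join",
        stmt="join('/usr', 'local', 'lib')",
        setup="from posixpath import join",
    )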