
@AJMansfield (Contributor) commented on Sep 26, 2025

Summary

This reworks `mp_print_strn` to use a stack-allocated padding buffer rather than special-cased hardcoded ROM strings, in order to reduce code size and improve string formatting performance.

Note that this is at least as performant as before, even for zeroes and spaces! On my RP2350 Cortex-M33 hardware, spaces are about 1% faster for the short-padding cases, and 3.4% faster for the long-padding cases.
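
To make the idea concrete, the core of the change amounts to something like the sketch below. This is an illustrative outline only, not the exact code in this PR: the `print_strn_t` callback type and the `pad_with_buffer` helper name are stand-ins.

```c
#include <stddef.h>
#include <string.h>

#define PAD_BUF_SIZE 20 // illustrative here; the chosen size is discussed under trade-offs below

// Illustrative stand-in for the shape of the mp_print_t output callback.
typedef int (*print_strn_t)(void *data, const char *str, size_t len);

// Emit `width` copies of `fill` by repeatedly flushing a small stack buffer,
// instead of choosing between hardcoded ROM strings of spaces or zeroes.
static int pad_with_buffer(print_strn_t out, void *data, char fill, size_t width) {
    char buf[PAD_BUF_SIZE];
    memset(buf, fill, sizeof(buf));
    int chrs = 0;
    while (width > 0) {
        size_t chunk = width < sizeof(buf) ? width : sizeof(buf);
        chrs += out(data, buf, chunk);
        width -= chunk;
    }
    return chrs;
}
```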

Testing

This PR passes all existing string formatting tests. It also includes a number of additional internal_bench benchmarks that verify the claimed performance improvement.

I've run these benchmarks using my Pico 2 (Cortex-M33 @ 300 MHz), with the following results:

| test | original (718b28a) | updated (674d78f) |
| --- | ---: | ---: |
| internal_bench/strformat-1-int.py | 0.697 | 0.681 |
| internal_bench/strformat-2.1.1-int-pad5-space.py | 0.720 | 0.713 |
| internal_bench/strformat-2.1.2-int-pad5-unusual.py | 0.720 | 0.713 |
| internal_bench/strformat-2.1.3-int-pad5-center.py | 0.718 | 0.714 |
| internal_bench/strformat-2.1.4-int-pad5-left.py | 0.723 | 0.712 |
| internal_bench/strformat-2.2.1-int-pad20-space.py | 0.779 | 0.788 |
| internal_bench/strformat-2.2.2-int-pad20-unusual.py | 0.848 | 0.788 |
| internal_bench/strformat-2.2.3-int-pad20-center.py | 0.848 | 0.786 |
| internal_bench/strformat-2.2.4-int-pad20-left.py | 0.847 | 0.788 |
| internal_bench/strformat-2.3.1-int-pad100-space.py | 0.977 | 0.979 |
| internal_bench/strformat-2.3.2-int-pad100-unusual.py | 1.429 | 0.979 |
| internal_bench/strformat-2.3.3-int-pad100-center.py | 1.423 | 0.987 |
| internal_bench/strformat-2.3.4-int-pad100-left.py | 1.417 | 0.990 |
| internal_bench/strformat-2.4.1-int-pad500-space.py | 2.202 | 2.127 |
| internal_bench/strformat-2.4.2-int-pad500-unusual.py | 4.690 | 2.127 |
| internal_bench/strformat-2.4.3-int-pad500-center.py | 4.658 | 2.144 |
| internal_bench/strformat-2.4.4-int-pad500-left.py | 4.625 | 2.198 |

Note that the run-to-run variation I've observed testing this on my embedded board is in the sub-10 µs range. However, I've also observed a variation in the 5-8 ms range related to the git repository state: building from a dirty workspace that still has staged changes leads to benchmark results nearly 8 ms faster than building otherwise-identical code after committing those exact changes. Even if a similar variation is affecting this comparison, though, the data is still enough to conclude that performance in the long-padding and unusual-character cases is better, and that performance in the short-padding cases is not worse.

Trade-offs and Alternatives

I've done some cursory tests for alternate values of PAD_BUF_SIZE, but the results definitely won't generalize to other architectures, and probably not even to other implementations of the same architecture. The buffer size of 20 is chosen as the smallest size that easily admits a later implementation of #18092 to support padding with grouping characters, while not pessimizing the short-padding cases any more than required.

I've also explored alternatives involving using alloca for the padding buffer, but the conditionals and fallback logic needed to bound stack usage for the pathological cases end up pessimizing code size beyond what's reasonable for the very marginal additional speed gains.
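
For reference, the `alloca` variant I experimented with looked roughly like the sketch below (illustrative only; the `MAX_ALLOCA_PAD` cap and the helper name are made up for this example). It's the bounding branch plus the fallback chunking loop that eats the code-size savings.

```c
#include <alloca.h> // alloca is non-standard; location of the declaration varies by platform
#include <stddef.h>
#include <string.h>

#define MAX_ALLOCA_PAD 64 // hypothetical cap needed to bound stack usage

typedef int (*print_strn_t)(void *data, const char *str, size_t len);

static int pad_with_alloca(print_strn_t out, void *data, char fill, size_t width) {
    // Without this clamp a pathological width would blow the stack; with it,
    // the clamp forces the same chunked loop as the fixed-buffer version anyway.
    size_t buf_len = width < MAX_ALLOCA_PAD ? width : MAX_ALLOCA_PAD;
    char *buf = alloca(buf_len);
    memset(buf, fill, buf_len);
    int chrs = 0;
    while (width > 0) {
        size_t chunk = width < buf_len ? width : buf_len;
        chrs += out(data, buf, chunk);
        width -= chunk;
    }
    return chrs;
}
```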

codecov bot commented on Sep 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.39%. Comparing base (4963ae7) to head (fda2e37).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #18147   +/-   ##
=======================================
  Coverage   98.39%   98.39%           
=======================================
  Files         171      171           
  Lines       22290    22291    +1     
=======================================
+ Hits        21933    21934    +1     
  Misses        357      357           


github-actions bot commented on Sep 26, 2025

Code size report:

  mpy-cross:   +96 +0.025% 
   bare-arm:   -56 -0.099% 
minimal x86:  +150 +0.080% 
   unix x64:  +160 +0.019% standard
      stm32:   -52 -0.013% PYBV10
     mimxrt:   -48 -0.013% TEENSY40
        rp2:   -64 -0.007% RPI_PICO_W
       samd:   -56 -0.021% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:   -27 -0.006% VIRT_RV32

@AJMansfield force-pushed the buffer-padding-on-stack branch 3 times, most recently from e0121d9 to 4c8dd39, on September 26, 2025 17:59
@AJMansfield force-pushed the buffer-padding-on-stack branch 2 times, most recently from 01459a8 to 91ef7c6, on September 26, 2025 19:37
@dpgeorge added the py-core (Relates to py/ directory in source) label on Sep 30, 2025
@dpgeorge (Member) commented on Oct 4, 2025

I like this PR in that it reduces code size (well, on non-x86 archs...), and has performance measurements to also back it up.

I have a feeling that the performance improvement won't be as much on targets that don't rely heavily on caching external SPI flash, but I'd need to test it myself.

My main concern with this PR is that it clashes with #18092. If we consider this PR as the "optimization" and the other as the "fix", then this optimization is perhaps premature. It's very easy to optimize code that is wrong if it doesn't matter that it stays wrong (although here the situation is more subtle because the code is mostly right and the optimization here still passes all existing tests).

So I'm on the fence whether it's better to optimize first, then try to fix. Or fix, then try to optimize. The issue with optimizing first is that the optimization may need to be undone to apply the fix. Or vice versa, once the fix goes in the optimization may no longer work.

@robert-hh (Contributor) commented

IMHO correctness beats speed. Means: first fix, then optimize.

@AJMansfield (Contributor, Author) commented

> I have a feeling that the performance improvement won't be as much on targets that don't rely heavily on caching external SPI flash, but I'd need to test it myself.

Good call to test that! If there's a chip where ROM is actually faster than RAM, this could still be slower.

On the RP2 boards where I've been testing, though, XIP cache hits are single-cycle just like SRAM --- so under the speculative assumption that its 16kB is enough to fit the full set of code and data used by the outer test loop, in theory the cache lines for the original version's buffers should be staying hot for the entire test run and have equivalent access costs.

> My main concern with this PR is that it clashes with #18092. If we consider this PR as the "optimization" and the other as the "fix", then this optimization is perhaps premature. It's very easy to optimize code that is wrong if it doesn't matter that it stays wrong (although here the situation is more subtle because the code is mostly right and the optimization here still passes all existing tests).
>
> So I'm on the fence whether it's better to optimize first, then try to fix. Or fix, then try to optimize. The issue with optimizing first is that the optimization may need to be undone to apply the fix. Or vice versa, once the fix goes in the optimization may no longer work.

> IMHO correctness beats speed. Means: first fix, then optimize.

Agreed! The argument I'd put for going the other way though is simply that the logic for implementing #18092 gets dramatically simpler when you have a writeable padding buffer like this PR adds.

In fact, that PR was the original reason for looking into this --- it just turned out that after rebasing to factor out 'padding buffer on stack' as a common ancestor to the various micro-optimisation sets I've been testing against each other, it's still a performance advantage even all on its own.

Feel free to merge #18092 first if that's preferred, and I'll polish up and add one of my commits to this PR that implements the grouping separators --- probably AJMansfield@412aaad from my leading-zeros-alt2 branch or similar.

I can confirm, based on my preliminary tests, that doing the grouping on top of this change has an even larger performance margin over #18092 / 068e110 than the current grouping-less 91ef7c6 has over the current grouping-less master. I've just not gone to the effort yet of fully rebasing it out as a separate change, and crossing all the t's and dotting all the i's on its benchmarks, since there's still one more micro-optimisation I'm investigating: specifically, figuring out how explicit I have to be to get the compiler to unroll and reorder the comma vs underscore cases, so that the memset becomes a few unrolled assignments of "000," etc. for the comma case.
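
For the curious, the grouping fill I'm describing is essentially the sketch below: a hypothetical outline of just the buffer-fill step, since the real grouping logic in #18092 also has to get the separator phase right relative to the digits.

```c
#include <string.h>

#define PAD_BUF_SIZE 20 // divides evenly by both stride 4 ("000,") and stride 5 ("0000_")

// Fill the pad buffer with a repeating zero/separator pattern, e.g.
// "000,000,000,..." for the comma case (stride 4) or "0000_0000_..." for the
// underscore case (stride 5). memset plus a strided loop is what I'm hoping
// the compiler will turn into a handful of unrolled word stores.
static void fill_grouped(char buf[PAD_BUF_SIZE], char sep, unsigned stride) {
    memset(buf, '0', PAD_BUF_SIZE);
    for (unsigned i = stride - 1; i < PAD_BUF_SIZE; i += stride) {
        buf[i] = sep;
    }
}
```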

@dpgeorge (Member) commented on Oct 6, 2025

> Feel free to merge #18092 first if that's preferred, and I'll polish up and add one of my commits to this PR that implements the grouping separators

OK, let's do that, merge the fix #18092 first. Then follow up work can optimize it for code size.

This reworks `mp_print_strn` to use a stack-allocated padding buffer
rather than special-cased hardcoded ROM strings in order to reduce
code size and improve string formatting performance.

Note that this is actually just as performant, even for zeroes and
spaces! On my RP2350 Cortex M33 hardware, spaces are about 1% faster
for short-padding cases, and 3.4% faster for long-padding cases.

I've done some cursory tests for alternate values of `PAD_BUF_SIZE`, but
the results definitely won't generalize to other architectures, and
probably not even to other implementations of the same architecture.
The buffer size of 20 is chosen as the smallest size that divides into
both stride 4 (the comma case) and stride 5 (the underscore case), to
avoid pessimizing the short-padding cases any more than required.

I've also explored alternatives involving using `alloca` for the padding
buffer, but the conditionals and fallback logic needed to bound stack
usage for the pathological cases end up pessimizing code size beyond
what's reasonable for the very marginal additional speed gains.

Signed-off-by: Anson Mansfield <[email protected]>
@AJMansfield force-pushed the buffer-padding-on-stack branch 2 times, most recently from eb173b4 to dfbe3db, on October 6, 2025 21:48
@AJMansfield (Contributor, Author) commented

Alright, I'm finally out of micro-optimization ideas, here's my final proposal.

Benchmark Results: Unix, 0a19224 (original)
internal_bench/strformat:
    1.441s (+00.00%) internal_bench/strformat-1-int.py
    1.458s (+01.20%) internal_bench/strformat-2.1.1-int-pad5-space.py
    1.772s (+22.98%) internal_bench/strformat-2.1.2-int-pad5-unusual.py
    1.674s (+16.15%) internal_bench/strformat-2.1.3-int-pad5-center.py
    1.622s (+12.52%) internal_bench/strformat-2.1.4-int-pad5-left.py
    1.640s (+13.83%) internal_bench/strformat-2.2.1-int-pad20-space.py
    1.730s (+20.06%) internal_bench/strformat-2.2.2-int-pad20-unusual.py
    1.816s (+26.01%) internal_bench/strformat-2.2.3-int-pad20-center.py
    1.794s (+24.49%) internal_bench/strformat-2.2.4-int-pad20-left.py
    2.167s (+50.38%) internal_bench/strformat-2.3.1-int-pad100-space.py
    3.463s (+140.30%) internal_bench/strformat-2.3.2-int-pad100-unusual.py
    3.429s (+137.94%) internal_bench/strformat-2.3.3-int-pad100-center.py
    3.554s (+146.58%) internal_bench/strformat-2.3.4-int-pad100-left.py
    5.266s (+265.37%) internal_bench/strformat-2.4.1-int-pad500-space.py
    11.768s (+716.54%) internal_bench/strformat-2.4.2-int-pad500-unusual.py
    11.596s (+704.64%) internal_bench/strformat-2.4.3-int-pad500-center.py
    11.297s (+683.88%) internal_bench/strformat-2.4.4-int-pad500-left.py
1 tests performed (17 individual testcases)
Benchmark Results: RP2, 0a19224 (original)
internal_bench/strformat:
    0.709s (+00.00%) internal_bench/strformat-1-int.py
    0.719s (+01.46%) internal_bench/strformat-2.1.1-int-pad5-space.py
    0.720s (+01.47%) internal_bench/strformat-2.1.2-int-pad5-unusual.py
    0.722s (+01.84%) internal_bench/strformat-2.1.3-int-pad5-center.py
    0.720s (+01.47%) internal_bench/strformat-2.1.4-int-pad5-left.py
    0.775s (+09.34%) internal_bench/strformat-2.2.1-int-pad20-space.py
    0.843s (+18.91%) internal_bench/strformat-2.2.2-int-pad20-unusual.py
    0.843s (+18.83%) internal_bench/strformat-2.2.3-int-pad20-center.py
    0.843s (+18.87%) internal_bench/strformat-2.2.4-int-pad20-left.py
    0.991s (+39.71%) internal_bench/strformat-2.3.1-int-pad100-space.py
    1.424s (+100.79%) internal_bench/strformat-2.3.2-int-pad100-unusual.py
    1.418s (+99.96%) internal_bench/strformat-2.3.3-int-pad100-center.py
    1.413s (+99.27%) internal_bench/strformat-2.3.4-int-pad100-left.py
    2.304s (+224.88%) internal_bench/strformat-2.4.1-int-pad500-space.py
    4.684s (+560.57%) internal_bench/strformat-2.4.2-int-pad500-unusual.py
    4.652s (+555.99%) internal_bench/strformat-2.4.3-int-pad500-center.py
    4.620s (+551.51%) internal_bench/strformat-2.4.4-int-pad500-left.py
1 tests performed (17 individual testcases)
Benchmark Results: Unix, dfbe3db (updated)
internal_bench/strformat:
    1.554s (+00.00%) internal_bench/strformat-1-int.py
    1.549s (-00.31%) internal_bench/strformat-2.1.1-int-pad5-space.py
    1.567s (+00.85%) internal_bench/strformat-2.1.2-int-pad5-unusual.py
    1.567s (+00.85%) internal_bench/strformat-2.1.3-int-pad5-center.py
    1.532s (-01.42%) internal_bench/strformat-2.1.4-int-pad5-left.py
    1.680s (+08.09%) internal_bench/strformat-2.2.1-int-pad20-space.py
    1.632s (+05.03%) internal_bench/strformat-2.2.2-int-pad20-unusual.py
    1.694s (+08.98%) internal_bench/strformat-2.2.3-int-pad20-center.py
    1.631s (+04.93%) internal_bench/strformat-2.2.4-int-pad20-left.py
    2.213s (+42.43%) internal_bench/strformat-2.3.1-int-pad100-space.py
    2.221s (+42.91%) internal_bench/strformat-2.3.2-int-pad100-unusual.py
    2.280s (+46.70%) internal_bench/strformat-2.3.3-int-pad100-center.py
    2.234s (+43.77%) internal_bench/strformat-2.3.4-int-pad100-left.py
    5.058s (+225.48%) internal_bench/strformat-2.4.1-int-pad500-space.py
    5.091s (+227.59%) internal_bench/strformat-2.4.2-int-pad500-unusual.py
    5.153s (+231.62%) internal_bench/strformat-2.4.3-int-pad500-center.py
    5.073s (+226.44%) internal_bench/strformat-2.4.4-int-pad500-left.py
1 tests performed (17 individual testcases)
Benchmark Results: RP2, dfbe3db (updated)
internal_bench/strformat:
    0.713s (+00.00%) internal_bench/strformat-1-int.py
    0.725s (+01.66%) internal_bench/strformat-2.1.1-int-pad5-space.py
    0.725s (+01.67%) internal_bench/strformat-2.1.2-int-pad5-unusual.py
    0.727s (+01.96%) internal_bench/strformat-2.1.3-int-pad5-center.py
    0.725s (+01.61%) internal_bench/strformat-2.1.4-int-pad5-left.py
    0.789s (+10.61%) internal_bench/strformat-2.2.1-int-pad20-space.py
    0.789s (+10.60%) internal_bench/strformat-2.2.2-int-pad20-unusual.py
    0.795s (+11.39%) internal_bench/strformat-2.2.3-int-pad20-center.py
    0.790s (+10.76%) internal_bench/strformat-2.2.4-int-pad20-left.py
    0.979s (+37.27%) internal_bench/strformat-2.3.1-int-pad100-space.py
    0.979s (+37.30%) internal_bench/strformat-2.3.2-int-pad100-unusual.py
    0.995s (+39.50%) internal_bench/strformat-2.3.3-int-pad100-center.py
    0.991s (+38.98%) internal_bench/strformat-2.3.4-int-pad100-left.py
    2.124s (+197.84%) internal_bench/strformat-2.4.1-int-pad500-space.py
    2.124s (+197.85%) internal_bench/strformat-2.4.2-int-pad500-unusual.py
    2.150s (+201.43%) internal_bench/strformat-2.4.3-int-pad500-center.py
    2.197s (+208.03%) internal_bench/strformat-2.4.4-int-pad500-left.py
1 tests performed (17 individual testcases)

For whatever reason, the performance advantage I was seeing on my grouping implementation seems to have disappeared after rebasing to the current master, and at this point I feel like I've been chasing linker feng-shui ghosts for a while now trying to figure out why. The way the x86 port balloons both in size and gets slightly worse runtime with the same change is a bit odd, but it's at least not more than a tiny fraction slower for the short-padding cases on embedded, and is definitely much smaller.

@AJMansfield force-pushed the buffer-padding-on-stack branch from dfbe3db to fda2e37, on October 6, 2025 22:31
@agatti (Contributor) commented on Oct 6, 2025

> Alright, I'm finally out of micro-optimization ideas, here's my final proposal.

Stupid idea: how about keeping a buffer filled with spaces around, and using a new one from the stack only if the fill value is not a space?

I reckon most padding operations are done with either zeroes or spaces anyway (I can't tell which one is most popular, though, so you may want to s/space/zero/ in the previous sentence). You'll probably increase the footprint on x86 (although this could be mitigated by keeping that buffer in .data and filling it at startup rather than putting it in .rodata/.text), but if that's a huge concern users can always use UPX or other executable packers :)

@AJMansfield (Contributor, Author) commented

> Stupid idea: how about keeping a buffer filled with spaces around, and using a new one from the stack only if the fill value is not a space?

Not a stupid idea, though in this case the entire point of this PR is replacing an implementation that did that with an implementation that doesn't, lol. The space savings here are twofold: Not needing the extra .rodata, and not needing all of the extra conditional logic to decide whether or not to use it.

@agatti (Contributor) commented on Oct 6, 2025

Oh, nevermind then - I guess I missed an important chunk of context here :|

I can think of a couple of ways to probably get it faster at the expense of a smaller footprint saving, but you'd have to drop down to asm for that and keep the buffer size a multiple of the machine word (i.e. `fillreg = zext(fillbyte) * 0x01010101; buf[0] = fillreg; ...; buf[N] = fillreg;`).
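
In plain C (without dropping to asm), that byte-broadcast trick would look roughly like the sketch below, assuming a 32-bit machine word and a buffer size that stays a multiple of it:

```c
#include <stdint.h>
#include <string.h>

#define PAD_BUF_SIZE 20 // must remain a multiple of the word size for this to work

// Broadcast the fill byte into a whole word, then store it word by word.
// memcpy sidesteps alignment and strict-aliasing issues; compilers normally
// lower a fixed 4-byte memcpy to a single word store.
static void fill_by_words(char buf[PAD_BUF_SIZE], char fill) {
    uint32_t fillreg = (uint8_t)fill * 0x01010101u;
    for (unsigned i = 0; i < PAD_BUF_SIZE; i += sizeof(fillreg)) {
        memcpy(&buf[i], &fillreg, sizeof(fillreg));
    }
}
```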

That's probably adding a bit too much complexity, I guess.

@dpgeorge (Member) commented on Oct 6, 2025

> For whatever reason, the performance advantage I was seeing on my grouping implementation seems to have disappeared after rebasing to the current master, and at this point I feel like I've been chasing linker feng-shui ghosts for a while now trying to figure out why

You're running into the general issue that it's very hard to benchmark things, and caching has a large effect. The rp2 and esp32 ports are well known to have fluctuating performance just based on object layout in a file. Similarly the unix port is hard to get consistent results from.

I did have some success with older STM32 parts, eg STM32F4xx. You need something without I- and D-caches and with high performance internal flash. But even then performance can vary depending on the alignment of instructions (eg 2 byte vs 4 byte).

If you can optimize the code simply to reduce its compiled size, without worrying about performance (because printing has lots of other slow parts, like IO) then that's already a big win and worth merging.

@AJMansfield marked this pull request as draft on November 8, 2025 14:45