py/mpprint: Use a padding buffer on the stack. #18147
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

| Coverage Diff | master | #18147 | +/- |
|---|---|---|---|
| Coverage | 98.39% | 98.39% | |
| Files | 171 | 171 | |
| Lines | 22290 | 22291 | +1 |
| Hits | 21933 | 21934 | +1 |
| Misses | 357 | 357 | |
Code size report: (details collapsed)
Force-pushed from e0121d9 to 4c8dd39.
Force-pushed from 01459a8 to 91ef7c6.
I like this PR in that it reduces code size (well, on non-x86 archs...), and has performance measurements to back it up. I have a feeling that the performance improvement won't be as much on targets that don't rely heavily on caching external SPI flash, but I'd need to test it myself.

My main concern with this PR is that it clashes with #18092. If we consider this PR the "optimization" and the other the "fix", then this optimization is perhaps premature. It's very easy to optimize code that is wrong if it doesn't matter that it stays wrong (although here the situation is more subtle, because the code is mostly right and the optimization still passes all existing tests).

So I'm on the fence whether it's better to optimize first, then try to fix, or fix first, then try to optimize. The issue with optimizing first is that the optimization may need to be undone to apply the fix. Or vice versa: once the fix goes in, the optimization may no longer work.
IMHO correctness beats speed. Meaning: first fix, then optimize.
Good call to test that! If there's a chip where ROM is actually faster than RAM, this could still be slower. On the RP2 boards where I've been testing, though, XIP cache hits are single-cycle, just like SRAM --- so under the speculative assumption that its 16kB is enough to fit the full set of code and data used by the outer test loop, the cache lines for the original version's buffers should in theory stay hot for the entire test run, with access costs equivalent to SRAM.
Agreed! The argument I'd make for going the other way, though, is simply that the logic for implementing #18092 gets dramatically simpler once you have a writeable padding buffer like the one this PR adds. In fact, that PR was the original reason for looking into this --- it just turned out that, after rebasing to factor out "padding buffer on stack" as a common ancestor of the various micro-optimisation sets I've been testing against each other, it's a performance advantage even all on its own.

Feel free to merge #18092 first if that's preferred, and I'll polish up and add to this PR one of my commits that implements the grouping separators --- probably AJMansfield@412aaad from my branch.

I can confirm, based on my preliminary tests, that doing the grouping on top of this change has an even larger performance margin over #18092 / 068e110 than the current grouping-less 91ef7c6 has over the current grouping-less master. I've just not yet gone to the effort of fully rebasing it out as a separate change, and crossing all the t's and dotting all the i's on its benchmarks, since there's still one more micro-optimisation I'm investigating. (Specifically: figuring out how explicit I have to be to get the compiler to unroll and reorder the comma vs underscore cases, so as to turn the memset operation into a handful of unrolled assignments.)
This reworks `mp_print_strn` to use a stack-allocated padding buffer rather than special-cased hardcoded ROM strings, in order to reduce code size and improve string formatting performance.

Note that this is actually just as performant, even for zeroes and spaces! On my RP2350 Cortex M33 hardware, spaces are about 1% faster for short-padding cases, and 3.4% faster for long-padding cases.

I've done some cursory tests for alternate values of `PAD_BUF_SIZE`, but the results definitely won't generalize to other architectures, and probably not even to other implementations of the same architecture. The buffer size of 20 is chosen as the smallest size divisible by both stride 4 (the comma case) and stride 5 (the underscore case), to avoid pessimizing the short-padding cases any more than required.

I've also explored alternatives using `alloca` for the padding buffer, but the conditionals and fallback logic needed to bound stack usage for the pathological cases end up pessimizing code size beyond what's reasonable for the very marginal additional speed gains.

Signed-off-by: Anson Mansfield <[email protected]>
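To make the stride arithmetic concrete (a hypothetical illustration, not code from the PR): 20 is the least common multiple of 4 and 5, so a 20-byte buffer tiles evenly with either grouping pattern.

```c
// Hypothetical illustration only: 20 = lcm(4, 5), so a 20-byte pad
// buffer holds a whole number of either group pattern.
static const char comma_groups[21]      = "000,000,000,000,000,";  // five 4-char "000," groups
static const char underscore_groups[21] = "0000_0000_0000_0000_";  // four 5-char "0000_" groups
```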
Force-pushed from eb173b4 to dfbe3db.
Alright, I'm finally out of micro-optimization ideas; here's my final proposal.

Benchmark results (collapsed details):
- Benchmark Results: Unix, 0a19224 (original)
- Benchmark Results: RP2, 0a19224 (original)
- Benchmark Results: Unix, dfbe3db (updated)
- Benchmark Results: RP2, dfbe3db (updated)

For whatever reason, the performance advantage I was seeing from my grouping implementation seems to have disappeared after rebasing onto the current master, and at this point I feel like I've been chasing linker feng-shui ghosts for a while now trying to figure out why.

The way the x86 port both balloons in size and gets slightly worse runtime with the same change is a bit odd, but on embedded it's at worst a tiny fraction slower for the short-padding cases, and it's definitely much smaller.
Force-pushed from dfbe3db to fda2e37.
Stupid idea: how about keeping a buffer filled with spaces around, and using a new one from the stack only if the fill value is not a space? I reckon most padding operations are done with either zeroes or spaces anyway; I can't tell which one is most popular, though, so you may want to s/space/zero/ in the previous sentence. You'll probably increase the footprint on x86 (although this could be mitigated by keeping that buffer in .data and filling it at startup rather than putting it in .rodata/.text), but if that's a huge concern users can always use UPX or other executable packers :)
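A rough sketch of that suggestion (untested, with invented names; not the actual MicroPython code):

```c
#include <string.h>

#define PAD_CHUNK 16

// Shared read-only buffer for the common case; lives in .rodata
// (or .data, if filled at startup as suggested above). Exactly 16
// spaces: the array size matches the literal, so no NUL is stored.
static const char spaces_buf[PAD_CHUNK] = "                ";

// Pick a pad source: reuse the shared buffer for spaces, otherwise
// fill a caller-provided stack buffer with the requested character.
static const char *pick_pad_buf(char fill, char *stack_buf) {
    if (fill == ' ') {
        return spaces_buf;
    }
    memset(stack_buf, fill, PAD_CHUNK);
    return stack_buf;
}
```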
Not a stupid idea, though in this case the entire point of this PR is replacing an implementation that did exactly that with one that doesn't, lol. The space savings here are twofold: not needing the extra .rodata, and not needing all of the extra conditional logic to decide whether or not to use it.
Oh, never mind then - I guess I missed an important chunk of context here :| I can think of a couple of ways to probably make it faster at the expense of a smaller footprint saving, but you'd have to drop down to asm for that and keep the buffer size a multiple of the machine word (i.e. `fillreg = zext(fillbyte) * 0x01010101; buf[0] = fillreg; ...; buf[N] = fillreg;`). That's probably adding a bit too much complexity, I guess.
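In portable C (rather than asm), that trick might look something like the sketch below; the names are invented, and `memcpy` is used to sidestep alignment concerns:

```c
#include <stdint.h>
#include <string.h>

#define PAD_BUF_SIZE 20  // already a multiple of 4, as it happens

// Replicate the fill byte into every lane of a 32-bit word, then
// store the buffer one word at a time; memcpy avoids alignment UB
// and compiles to a plain word store on most targets.
static void fill_words(char *buf, char fill) {
    uint32_t fillreg = (uint32_t)(uint8_t)fill * 0x01010101u;
    for (size_t i = 0; i < PAD_BUF_SIZE; i += sizeof(fillreg)) {
        memcpy(&buf[i], &fillreg, sizeof(fillreg));
    }
}
```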
You're running into the general issue that it's very hard to benchmark things, and caching has a large effect. The rp2 and esp32 ports are well known to have fluctuating performance based just on object layout in a file. Similarly, the unix port is hard to get consistent results from.

I did have some success with older STM32 parts, eg STM32F4xx. You need something without I- and D-caches and with high-performance internal flash. But even then performance can vary depending on the alignment of instructions (eg 2 byte vs 4 byte).

If you can optimize the code simply to reduce its compiled size, without worrying about performance (because printing has lots of other slow parts, like IO), then that's already a big win and worth merging.
Summary
This reworks `mp_print_strn` to use a stack-allocated padding buffer rather than special-cased hardcoded ROM strings, in order to reduce code size and improve string formatting performance.

Note that this is actually just as performant, even for zeroes and spaces! On my RP2350 Cortex M33 hardware, spaces are about 1% faster for short-padding cases, and 3.4% faster for long-padding cases.
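In outline, the approach is something like the following sketch; it is simplified, and the function name and writer callback are placeholders rather than the exact py/mpprint.c code:

```c
#include <stddef.h>
#include <string.h>

#define PAD_BUF_SIZE 20

// Simplified sketch: fill a small stack buffer with the pad character
// once, then emit it in chunks until the requested width is covered.
static int print_padding(void (*write_cb)(const char *, size_t),
                         char fill, int width) {
    char pad_buf[PAD_BUF_SIZE];
    memset(pad_buf, fill, sizeof(pad_buf));
    int written = 0;
    while (width > 0) {
        size_t chunk = (size_t)width < sizeof(pad_buf) ? (size_t)width : sizeof(pad_buf);
        write_cb(pad_buf, chunk);
        width -= (int)chunk;
        written += (int)chunk;
    }
    return written;
}
```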
Testing
This PR passes all existing string formatting tests. It also includes a number of additional `internal_bench` benchmarks that verify the claimed performance improvement.
I've run these benchmarks using my Pico2 Cortex M33 @ 300MHz, with the following results:
Benchmark results (collapsed details):
- 718b28a
- 674d78f
Note that the run-to-run variation I've observed testing this on my embedded board is in the sub-10us range; however, there's also been another observed variation in the 5-8ms range related to the git repository state (i.e. building from a dirty workspace that still has staged changes leads to benchmark results nearly 8ms faster than building otherwise-identical code after committing those exact changes). Even if a similar variation is affecting this comparison, though, it's still enough to conclude that performance in the long-padding and unusual-character cases is better, and that performance for the short-padding cases is not worse.
Trade-offs and Alternatives
I've done some cursory tests for alternate values of `PAD_BUF_SIZE`, but the results definitely won't generalize to other architectures, and probably not even to other implementations of the same architecture. The buffer size of 20 is chosen as the smallest size that easily admits a later implementation of #18092 to support padding with grouping characters, while not pessimizing the short-padding cases any more than required.

I've also explored alternatives using `alloca` for the padding buffer, but the conditionals and fallback logic needed to bound stack usage for the pathological cases end up pessimizing code size beyond what's reasonable for the very marginal additional speed gains.
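For context, the kind of bounded-`alloca` fallback being described might look roughly like this; a sketch with invented names and an arbitrary cap, not code from the PR:

```c
#include <alloca.h>   // glibc location; alloca may live elsewhere on other toolchains
#include <stddef.h>
#include <string.h>

#define PAD_ALLOCA_MAX 64  // illustrative cap, not a value from the PR

// Hypothetical shape of the alloca variant: cap the allocation so a
// pathological width can't blow the stack, which in turn forces the
// same chunking loop (plus extra conditionals) that the fixed-size
// buffer version needs anyway.
static void print_padding_alloca(void (*write_cb)(const char *, size_t),
                                 char fill, size_t width) {
    size_t n = width < PAD_ALLOCA_MAX ? width : PAD_ALLOCA_MAX;
    char *pad_buf = alloca(n);
    memset(pad_buf, fill, n);
    while (width > 0) {
        size_t chunk = width < n ? width : n;
        write_cb(pad_buf, chunk);
        width -= chunk;
    }
}
```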