improve MallocUsage() accuracy #28531
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks: for details see https://corecheck.dev/bitcoin/bitcoin/pulls/28531.

Reviews: see the guideline for information on the review process.

Conflicts: reviewers, this pull request conflicts with the following ones. If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental): possible typos and grammar issues.
The results of the old and new versions of this function should differ only slightly; it would be bad if the new one gave very different results, because node operators might find too much memory being consumed, or (not as bad) not enough. To manually test this, I ran …
Changing to draft because CI on a couple of the platforms failed the new test, so the physical memory calculation will need to be different on those platforms; I'll try to figure that out.
I can reproduce the same results with glibc 2.31 and 2.38, in 32-bit and 64-bit. I don't think it needs to be perfect on all platforms; I think your new calculation is fine. I played a bit with a reproducer, here is mine: https://godbolt.org/z/s971sbnhG I refactored your …
🤔 There hasn't been much activity lately and the CI seems to be failing. If no one reviewed the current pull request by commit hash, a rebase can be considered. While the CI failure may be a false positive, the CI hasn't been running for some time, so there may be a real issue hiding as well. A rebase triggers the latest CI and makes sure that no silent merge conflicts have snuck in.
Force-pushed b469486 to 04f3fbe
Force-pushed to 04f3fbe308728d2fb18fad418603cab97f0b9f59 - rebase only. (CI is still expected to fail; I'm working on that now.)
Force-pushed cee653f to 86b71ec

Force-pushed 6bfdad9 to 58ebf90
@martinus On earlier runs, some other tests have failed, but I'm not sure if those are spurious failures. I'm not sure what to do now; any thoughts or suggestions are welcome!
@LarryRuane There is no reason to assume that the memory allocation overhead on those platforms, with substantially different C libraries, would match the formula we inferred on x86_64 and ARM Linux. If our goal is actually accurate memory usage calculations on all systems, we will need to derive a formula for each supported system separately.
Force-pushed 58ebf90 to e0fa518
@sipa - I don't think the test failures could be due to differences in memory allocation overhead across platforms. The actual memory allocation overhead isn't being tested at all; to do that would probably require a test that uses …

I don't mind giving up and closing this PR, but as @martinus said in an earlier comment, this calculation doesn't need to be perfect, just better. I think this test failure must be due to differences in …

[UPDATE: the following discussion about …]

I thought I had it figured out; I verified with the debugger that as the test adds the 10k entries to the two maps, it has to repeatedly grow the bucket array. The allocation sizes (in bytes) for the bucket array that I saw here on Ubuntu were: 104, 232, 472, 1026, 2056, 4328, 8872, 18856, 40696, 82184...

The tradeoff one makes when using the pool resource allocator is that allocations that are freed, but whose sizes are never requested again, are in effect leaked (until the resource is destroyed). The pool resource works great when freeing and allocating the same sizes (rounded up to the alignment, usually 8 bytes) repeatedly, which is of course the case for the individual map entries. So the latest force-push simply calls …

It's difficult when CI fails on platforms that I don't have access to. Do others have a solution to that? Can I install macOS on Ubuntu or Windows 10 (I have both) as a VM? Maybe I can set up my Windows laptop to build and run the debugger, but that seems like a lot of work.
@LarryRuane The problem here is that the …

The inconsistency now comes when using …

So, with your new MallocUsage the numbers are most likely more correct, but this seems to trigger the incorrect underestimation for …
@martinus - thanks, that makes perfect sense! I hadn't thought of the possibility of the nodes having two pointers instead of one (a doubly- instead of a singly-linked list). Seems like a bad design, but anyway. I don't think there would be much benefit to fixing …
Force-pushed e0fa518 to c6cba06

Force-pushed c3d1f4b to eec7112
I understand the difficulties of guessing the correct memory usage of such an app (especially when the user cannot even provide total memory usage requirements, just …). For reference, the commits I used are:
**The PR introduces a ~2% slowdown**

```shell
COMMITS="3f83c744ac28b700090e15b5dda2260724a56f49 4228018ace848adcafd197776cbb4afc2400bf16"; \
STOP=888888; DBCACHE=450; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
hyperfine \
  --sort command \
  --runs 1 \
  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && ninja -C build bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=10000 -printtoconsole=0; sleep 10" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
```

```
Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = 3f83c744ac28b700090e15b5dda2260724a56f49)
  Time (abs ≡): 19573.674 s [User: 35453.443 s, System: 2822.880 s]

Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = 4228018ace848adcafd197776cbb4afc2400bf16)
  Time (abs ≡): 20031.162 s [User: 36761.112 s, System: 2992.418 s]

Relative speed comparison
  1.00  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = 3f83c744ac28b700090e15b5dda2260724a56f49)
  1.02  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = 4228018ace848adcafd197776cbb4afc2400bf16)
```

To understand the performance and memory implications, I ran the same benchmark with Massif profiling, revealing that after this PR we're using ~4% less total memory:

- 3f83c74 has 790MB peak
- 4228018ace848adcafd197776cbb4afc2400bf16 has 760MB peak

It's not immediately obvious if this is more accurate than before, so I've checked the dbcache-related memory usages, which we can see in the logs as:

- 3f83c74 - 414,449,664B for nodes + 47,738,776B for buckets = ~440MiB dbcache usage
- 4228018ace848adcafd197776cbb4afc2400bf16 - 371,195,904B for nodes + 47,738,776B for buckets = ~400MiB dbcache usage

Which reveals that the new … The actual memory allocated for the bucket arrays of the cache appears identical at peak times.

I would expect clang to behave slightly differently - but it takes a week to measure these, so I would like some explanations first before I spend more time re-measuring the scenarios, since it seems that MallocUsage was already surprisingly accurate.
🚧 At least one of the CI tasks failed.

Hints: Try to run the tests locally, according to the documentation. However, a CI failure may still …

Leave a comment here, if you need help tracking down a confusing failure.
Co-authored-by: Sebastian Falbesoner <[email protected]>
Co-authored-by: Pieter Wuille <[email protected]>
Co-authored-by: Martin Leitner-Ankerl <[email protected]>
Co-authored-by: l0rinc <[email protected]>

# Changes to test/functional/test_framework/mempool_util.py:

## fill_mempool() should call sync_mempools() before returning

We saw a case where a test (p2p_1p1c_network.py) called raise_network_minfee(), which called fill_mempool() using node 0. Then raise_network_minfee() returned, and the test called rescan_utxos(), which called getrawmempool() using a different node (node 1) followed by getrawtransaction() on each returned transaction, and the test asserted because a transaction was not found.

This was caused by the timing window between the call to getrawmempool() and fetching the individual transactions; the transactions were still being propagated on the P2P network. During this window, a transaction (returned by getrawmempool()) was evicted (the mempool is close to full during this test), and did not exist in the mempool by the time it was attempted to be fetched.

It might make more sense for rescan_utxos() to call sync_mempools() just before calling getrawmempool(), but it can't because rescan_utxos() is part of the MiniWallet class, which doesn't have access to test_framework (but that could probably be changed).

## ensure that `fill_mempool` leaves some room in mempool

Without this change, fill_mempool() may leave the mempool very close to its memory size limit (-maxmempool). This can cause tests to incorrectly fail when they submit another transaction expecting it to succeed. Note that without this change, the same test that fails on one platform may succeed on another, because their memory allocation accounting algorithms (how they calculate memory usage, that is, MallocUsage()) may be slightly different.
This commit is temporary because it makes one of the new tests fail intentionally so we can gather some information from CI across the platforms. (Second attempt; fixes the "private field not used" CI error.)
Force-pushed b12d848 to b36b234
```cpp
#include <boost/test/unit_test.hpp>
// …
BOOST_AUTO_TEST_SUITE(util_malloc_usage_tests)
```
@LarryRuane, most of these tests will be executed on your own fork as well (except a few Cirrus fuzzers).
You can freely experiment there - see for example my attempts at l0rinc#20
I didn't know that; it will prevent clutter here with that temporary stuff, thanks! I'll do that from now on (I want to do a few more similar experiments).
yeah, I also have a few experiments where I'm not sure about the state of CI so I push to my local fork first - though the remaining CIs still surprise me sometimes after upstream push :)
I wonder if perhaps it would be feasible to run a little runtime self-calibration at startup to find the malloc overhead parameters, so that they would be correct on all platforms. It would make calls to …
As @LarryRuane noticed, the new address-diff test sometimes disagrees with …

I had to read a bit to understand why - I never needed the gory details until now. 🙂 It seems to me that macOS's nano allocator, and the allocators used by TSan/ASan, store per-chunk metadata in shadow tables rather than inline with the user block. For example, this might be related to what we're seeing: https://github.com/aosm/libmalloc/blob/master/src/nano_malloc.c#L181.

So maybe we could specialize the tests by platform/compiler/arch/OS (without duplicating …).

I personally would avoid the runtime self-calibration path - rather, abstracting away our findings and generalizing them based on self-contained test behavior sounds more predictable to me.

If we decide to keep the current settings (even though my massif memory allocation measurements, which should actually measure everything instead of just hoping that it instruments every call, do show it to be less accurate), and if we're still over-counting in the end, we can probably increase the default …
Could turn into draft while CI is red? |
After the changes in bitcoin#25325 `getcoinscachesizestate` always ends the test early, see: https://maflcko.github.io/b-c-cov/test_bitcoin.coverage/src/test/validation_flush_tests.cpp.gcov.html The test revival was extracted from a related PR where it was discovered, see: bitcoin#28531 (comment) Co-authored-by: Larry Ruane <[email protected]>
🐙 This pull request conflicts with the target branch and needs rebase.
…eSizeState` switches from OK→LARGE→CRITICAL

554befd test: revive `getcoinscachesizestate` (Lőrinc)
64ed0fa refactor: modernize `LargeCoinsCacheThreshold` (Lőrinc)
1b40dc0 refactor: extract `LargeCoinsCacheThreshold` from `GetCoinsCacheSizeState` (Lőrinc)

Pull request description:

After the changes in #25325 `getcoinscachesizestate` [always ended the test early](https://maflcko.github.io/b-c-cov/test_bitcoin.coverage/src/test/validation_flush_tests.cpp.gcov.html#L65):

| File | Line Rate | Line Total | Line Hit | Branch Rate | Branch Total | Branch Hit |
|------|----------:|-----------:|---------:|------------:|-------------:|-----------:|
| validation_flush_tests.cpp | **31.5 %** | 54 | 17 | 22.3 % | 242 | 54 |

The test revival was [extracted from a related PR](#28531 (comment)) where it was [discovered](#28531 (comment)).

ACKs for top commit:
  achow101: ACK 554befd
  LarryRuane: ACK 554befd
  w0xlt: ACK 554befd

Tree-SHA512: f5057254de8fb3fa627dd20fee6818cfadeb2e9f629f9972059ad7b32e01fcd7dc9922eff9da2d363b36a9f0954d9bc1c4131d47b2a9c6cc348d9864953b91be
The `MallocUsage()` function takes an allocation size as an argument and returns the amount of physical memory consumed, which is greater due to memory allocator overhead and alignment. It was first added in 2015 (first commit of #6102), but its accuracy has degraded as memory allocation libraries have evolved. It's used where it's important that large data structures, such as the coins cache and mempool, use a predictable, configurable (limited) amount of physical memory (see the `-dbcache` and `-maxmempool` configuration options), as well as in a few other places.

sipa figured out a concise, efficient expression that this function can use, and that's what's implemented here.
Also add a unit test, which is more helpful than usual in this case since platforms, operating systems, and libraries vary significantly in this area.