Bug #71027: [arm64] run-cli-tests failing on crushtool due to mempool-related assertion
Status: Closed
Description
Occasionally failing in 'make check arm64':

/home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/test-map-vary-r-0.t: failed
--- /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/test-map-vary-r-0.t
+++ /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/test-map-vary-r-0.t.err
@@ -3079,3 +3079,6 @@
   rule 3 (delltestrule) num_rep 4 result size == 1:\t27/1024 (esc)
   rule 3 (delltestrule) num_rep 4 result size == 2:\t997/1024 (esc)
   crushtool successfully built or modified map.  Use '-o <file>' to write it out.
+ /usr/lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:659: typename std::add_lvalue_reference<element_type>::type std::unique_ptr<mempool::shard_t[]>::operator[](size_t) const [_Tp = mempool::shard_t[], _Dp = std::default_delete<mempool::shard_t[]>]: Assertion 'get() != pointer()' failed.
+ *** Caught signal (Aborted) **
+ in thread f9ce09c0ec80 thread_name:service
Updated by Casey Bodley 10 months ago
/home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/reclassify.t: failed
--- /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/reclassify.t
+++ /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/reclassify.t.err
@@ -408,6 +408,9 @@
   rule 0 had 0/10240 mismatched mappings (0)
   rule 1 had 0/10240 mismatched mappings (0)
   maps appear equivalent
+ /usr/lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:659: typename std::add_lvalue_reference<element_type>::type std::unique_ptr<mempool::shard_t[]>::operator[](size_t) const [_Tp = mempool::shard_t[], _Dp = std::default_delete<mempool::shard_t[]>]: Assertion 'get() != pointer()' failed.
+ *** Caught signal (Aborted) **
+ in thread ffff7c3e1c80 thread_name:service
Updated by Bill Scales 10 months ago
I have not reproduced this yet, but I saw something similar in a bluestore unit test while making the change. In that case a multi-threaded test case was exiting while threads that were still running tried to access mempool: exit runs the destructors, including some mempool destructors, causing a use-after-free bug. Prior to my recent change adding a unique_ptr these silently passed; now they cause asserts. As this is a timing window between threads, it probably won't happen every run.
The crashes from crushtool happen at the end of the test as it exits. However, crushtool does all its work on the main thread; there are two threads created by global/common, but these should be idle.
Will investigate further
Updated by Bill Scales 10 months ago
It is a use-after-free bug.
src/common/ceph_context.cc has a CephContextServiceThread:
void *entry() override
{
  while (1) {
    std::unique_lock l(_lock);
    if (_exit_thread) {
      break;
    }
    if (_cct->_conf->heartbeat_interval) {
      auto interval = ceph::make_timespan(_cct->_conf->heartbeat_interval);
      _cond.wait_for(l, interval);
    } else
      _cond.wait(l);
    if (_exit_thread) {
      break;
    }
    if (_reopen_logs) {
      _cct->_log->reopen_log_file();
      _reopen_logs = false;
    }
    _cct->_heartbeat_map->check_touch_file();
    // refresh the perf counters
    _cct->_refresh_perf_values();
  }
  return NULL;
}
heartbeat_interval defaults to 5 seconds, so once every 5 seconds this thread calls _refresh_perf_values(), which updates all the mempool counters.
However, this thread is not terminated until the CephContext destructor is called, and that never happens in crushtool.
It is therefore possible that the heartbeat tick occurs and _refresh_perf_values() gets called just as the main thread in crushtool is exiting and running the mempool destructors. Presumably the timing of the arm64 make check tests makes it more likely that a test has been running for 5 seconds.
The logs above contain

in thread ffff7c3e1c80 thread_name:service

which confirms it is this thread that is asserting.
Why isn't the CephContext destructor being called?
This code in src/tools/crushtool.cc calls cct->get(), which blocks the destructor from being called. If this line is commented out, the destructor runs (confirmed with gdb).
  auto cct = global_init(NULL, empty_args, CEPH_ENTITY_TYPE_CLIENT,
                         CODE_ENVIRONMENT_UTILITY,
                         CINIT_FLAG_NO_DEFAULT_CONFIG_FILE);
  // crushtool times out occasionally when quits. so do not
  // release the g_ceph_context.
  cct->get();
  common_init_finish(g_ceph_context);
https://github.com/ceph/ceph/commit/d305cc51b18cbf4b2757bbacb5d43324461306a9 added this code, but provides no further explanation about the timeout/hang.
I don't really want to remove this without understanding why the timeout/hang occurred. We could stop this thread from being created by setting CINIT_FLAG_NO_DAEMON_ACTIONS on the global_init call; this feels like a bit of a hack but might be the safest fix.
Updated by Casey Bodley 10 months ago
Bill Scales wrote in #note-4:
heartbeat_interval defaults to 5 seconds so once every 5 seconds this thread calls _refresh_perf_values() which updates all the mempool counters.
However this thread doesn't get terminated until the CephContext destructor gets called, which never happens in crushtool. Therefore it is possible that the heartbeat tick occurs and _refresh_perf_values() gets called just as the main thread in crushtool is exiting and running the mempool destructors. Presumably the timing of the arm64 make check tests makes it more likely that the test has been running 5 seconds?
thanks Bill. do you know what controls the lifetime of these mempools? if they aren't owned by the CephContext, maybe they should be?
Don't really want to remove this without understanding why the timeout/hang occurred. We could stop this thread being created by setting CINIT_FLAG_NO_DAEMON_ACTIONS on the global_init call; this feels like a bit of a hack but might be the safest fix.
CINIT_FLAG_NO_DAEMON_ACTIONS does sound reasonable for command-line tools like this
Updated by Bill Scales 10 months ago
- Status changed from New to Fix Under Review
- Pull request ID set to 63092
- Affected Versions v20.0.0 added
- Component(RADOS) CRUSH added
do you know what controls the lifetime of these mempools? if they aren't owned by the CephContext, maybe they should be?
The mempools are global (shared between instances of CephContext), so they shouldn't be owned by the CephContext. We could change the unique_ptr to a shared_ptr and have CephContext keep a shared pointer to hold a reference count, but that starts to become a lot of effort just to ensure a small amount of memory is freed at program exit.
I've implemented the CINIT_FLAG_NO_DAEMON_ACTIONS fix. If we see another use-after-free bug with mempool, I suggest we get rid of the unique_ptr in src/common/mempool.cc and just accept a memory leak for the small amount of dynamic memory that is allocated for the lifetime of the executable.
Updated by Bill Scales 9 months ago
- Backport set to tentacle
The bug will exist in tentacle as well, but not in earlier releases.
Updated by Casey Bodley 9 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot 9 months ago
- Copied to Backport #71507: tentacle: [arm64] run-cli-tests failing on crushtool due to mempool-related assertion added
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 3bfb9d6c9744a7bd685c18abb9df93188cefae7c
- Fixed In set to v20.3.0-606-g3bfb9d6c974
- Upkeep Timestamp set to 2025-07-08T14:46:30+00:00
Updated by Matan Breizman 8 months ago
- Related to Bug #71478: rados_api_tests: Assertion 'get() != pointer()' failed added
Updated by Matan Breizman 8 months ago
Not sure if it's related, but we had a similar crash in Crimson main tests:
https://tracker.ceph.com/issues/71478#note-3
Updated by Upkeep Bot 7 months ago
- Fixed In changed from v20.3.0-606-g3bfb9d6c974 to v20.3.0-606-g3bfb9d6c9744
- Upkeep Timestamp changed from 2025-07-08T14:46:30+00:00 to 2025-07-14T15:20:45+00:00
Updated by Upkeep Bot 7 months ago
- Fixed In changed from v20.3.0-606-g3bfb9d6c9744 to v20.3.0-606-g3bfb9d6c97
- Upkeep Timestamp changed from 2025-07-14T15:20:45+00:00 to 2025-07-14T20:45:20+00:00
Updated by Upkeep Bot 7 months ago
- Status changed from Pending Backport to Resolved
- Upkeep Timestamp changed from 2025-07-14T20:45:20+00:00 to 2025-08-02T22:22:01+00:00