Bug #71027

[arm64] run-cli-tests failing on crushtool due to mempool-related assertion

Added by Casey Bodley 10 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Backport:
tentacle
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
CRUSH
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.3.0-606-g3bfb9d6c97
Released In:
Upkeep Timestamp:
2025-08-02T22:22:01+00:00

Description

occasionally failing in 'make check arm64'

ex https://jenkins.ceph.com/job/ceph-pull-requests-arm64/72746/

/home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/test-map-vary-r-0.t: failed
--- /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/test-map-vary-r-0.t
+++ /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/test-map-vary-r-0.t.err
@@ -3079,3 +3079,6 @@
   rule 3 (delltestrule) num_rep 4 result size == 1:\t27/1024 (esc)
   rule 3 (delltestrule) num_rep 4 result size == 2:\t997/1024 (esc)
   crushtool successfully built or modified map.  Use '-o <file>' to write it out.
+  /usr/lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:659: typename std::add_lvalue_reference<element_type>::type std::unique_ptr<mempool::shard_t[]>::operator[](size_t) const [_Tp = mempool::shard_t[], _Dp = std::default_delete<mempool::shard_t[]>]: Assertion 'get() != pointer()' failed.
+  *** Caught signal (Aborted) **
+   in thread f9ce09c0ec80 thread_name:service


Related issues 2 (1 open, 1 closed)

Related to crimson - Bug #71478: rados_api_tests: Assertion 'get() != pointer()' failed (Fix Under Review, Adam Emerson)

Copied to RADOS - Backport #71507: tentacle: [arm64] run-cli-tests failing on crushtool due to mempool-related assertion (Resolved, Bill Scales)
Actions #1

Updated by Casey Bodley 10 months ago

https://jenkins.ceph.com/job/ceph-pull-requests-arm64/73027/

/home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/reclassify.t: failed
--- /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/reclassify.t
+++ /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/reclassify.t.err
@@ -408,6 +408,9 @@
   rule 0 had 0/10240 mismatched mappings (0)
   rule 1 had 0/10240 mismatched mappings (0)
   maps appear equivalent
+  /usr/lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:659: typename std::add_lvalue_reference<element_type>::type std::unique_ptr<mempool::shard_t[]>::operator[](size_t) const [_Tp = mempool::shard_t[], _Dp = std::default_delete<mempool::shard_t[]>]: Assertion 'get() != pointer()' failed.
+  *** Caught signal (Aborted) **
+   in thread ffff7c3e1c80 thread_name:service

Actions #2

Updated by Casey Bodley 10 months ago

  • Assignee set to Bill Scales
Actions #3

Updated by Bill Scales 10 months ago

Not reproduced this yet, but I saw something similar in a bluestore unittest while making the change: in that case a multi-threaded test case was exiting while threads that were still running tried to access mempool. exit runs the destructors, including some mempool destructors, causing a use-after-free bug. Prior to my recent change to add a unique_ptr these silently passed; now they will cause asserts. As this is a timing window between threads, it probably won't happen on every run.

The crashes from crushtool are happening at the end of the test as it exits. However, crushtool does all its work on the main thread; there are two threads created by global/common, but these should be idle.

Will investigate further.

Actions #4

Updated by Bill Scales 10 months ago

It is a use after free bug:

src/common/ceph_context.cc has a CephContextServiceThread:

  void *entry() override
  {
    while (1) {
      std::unique_lock l(_lock);
      if (_exit_thread) {
        break;
      }

      if (_cct->_conf->heartbeat_interval) {
        auto interval = ceph::make_timespan(_cct->_conf->heartbeat_interval);
        _cond.wait_for(l, interval);
      } else
        _cond.wait(l);

      if (_exit_thread) {
        break;
      }

      if (_reopen_logs) {
        _cct->_log->reopen_log_file();
        _reopen_logs = false;
      }
      _cct->_heartbeat_map->check_touch_file();

      // refresh the perf counters
      _cct->_refresh_perf_values();
    }
    return NULL;
  }

heartbeat_interval defaults to 5 seconds so once every 5 seconds this thread calls _refresh_perf_values() which updates all the mempool counters.
However this thread doesn't get terminated until the CephContext destructor gets called, which never happens in crushtool.

Therefore it is possible that the heartbeat tick occurs and _refresh_perf_values() gets called just as the main thread in crushtool is exiting and running the mempool destructors. Presumably the timing of the arm64 make check tests makes it more likely that a test has been running for 5 seconds.

The logs above have

in thread ffff7c3e1c80 thread_name:service

which confirms it is this thread that is asserting.

Why isn't the CephContext destructor being called?

This code in src/tools/crushtool.cc calls cct->get(), which blocks the destructor from being called. If this line is commented out, then the destructor runs (confirmed with gdb).

  auto cct = global_init(NULL, empty_args, CEPH_ENTITY_TYPE_CLIENT,
                         CODE_ENVIRONMENT_UTILITY,
                         CINIT_FLAG_NO_DEFAULT_CONFIG_FILE);
  // crushtool times out occasionally when quits. so do not
  // release the g_ceph_context.
  cct->get();
  common_init_finish(g_ceph_context);

https://github.com/ceph/ceph/commit/d305cc51b18cbf4b2757bbacb5d43324461306a9 added this code, but provides no further explanation about the timeout/hang.

Don't really want to remove this without understanding why the timeout/hang occurred. We could stop this thread being created by setting
CINIT_FLAG_NO_DAEMON_ACTIONS on the global_init call; this feels like a bit of a hack but might be the safest fix.
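For reference, the suggested change would amount to OR-ing the extra flag into the existing global_init call quoted above (a sketch of the proposal, not a tested patch):

```cpp
  auto cct = global_init(NULL, empty_args, CEPH_ENTITY_TYPE_CLIENT,
                         CODE_ENVIRONMENT_UTILITY,
                         CINIT_FLAG_NO_DEFAULT_CONFIG_FILE |
                         CINIT_FLAG_NO_DAEMON_ACTIONS);  // skip CephContextServiceThread
```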

Actions #5

Updated by Casey Bodley 10 months ago

Bill Scales wrote in #note-4:

heartbeat_interval defaults to 5 seconds so once every 5 seconds this thread calls _refresh_perf_values() which updates all the mempool counters.
However this thread doesn't get terminated until the CephContext destructor gets called, which never happens in crushtool.

Therefore it is possible that the heartbeat tick occurs and _refresh_perf_values() gets called just as the main thread in crushtool is exiting and running the mempool destructors. Presumably the timing of the arm64 make check tests makes it more likely that a test has been running for 5 seconds.

thanks Bill. do you know what controls the lifetime of these mempools? if they aren't owned by the CephContext, maybe they should be?

Don't really want to remove this without understanding why the timeout/hang occurred. We could stop this thread being created by setting
CINIT_FLAG_NO_DAEMON_ACTIONS on the global_init call; this feels like a bit of a hack but might be the safest fix.

CINIT_FLAG_NO_DAEMON_ACTIONS does sound reasonable for command-line tools like this

Actions #6

Updated by Bill Scales 10 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 63092
  • Affected Versions v20.0.0 added
  • Component(RADOS) CRUSH added

do you know what controls the lifetime of these mempools? if they aren't owned by the CephContext, maybe they should be?

The mempools are global (shared between instances of CephContext) so shouldn't be owned by the CephContext. We could change the unique_ptr to a
shared_ptr and have CephContext keep a shared pointer to hold a reference count, but that starts becoming a lot of effort just to ensure a small amount of memory is freed at program exit.

I've implemented the CINIT_FLAG_NO_DAEMON_ACTIONS fix. If we see another use-after-free bug with mempool, then I suggest we get rid of the unique_ptr in src/common/mempool.cc and just have a memory leak for the small amount of dynamic memory that is being allocated for the lifetime of the executable.

Actions #7

Updated by Radoslaw Zarzynski 10 months ago

In QA.

Actions #8

Updated by Radoslaw Zarzynski 10 months ago

Still in QA.

Actions #9

Updated by Bill Scales 9 months ago

  • Backport set to tentacle

The bug will exist in tentacle as well, but not in earlier releases.

Actions #10

Updated by Casey Bodley 9 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #11

Updated by Upkeep Bot 9 months ago

  • Copied to Backport #71507: tentacle: [arm64] run-cli-tests failing on crushtool due to mempool-related assertion added
Actions #12

Updated by Upkeep Bot 9 months ago

  • Tags (freeform) set to backport_processed
Actions #13

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 3bfb9d6c9744a7bd685c18abb9df93188cefae7c
  • Fixed In set to v20.3.0-606-g3bfb9d6c974
  • Upkeep Timestamp set to 2025-07-08T14:46:30+00:00
Actions #14

Updated by Matan Breizman 8 months ago

  • Related to Bug #71478: rados_api_tests: Assertion 'get() != pointer()' failed added
Actions #15

Updated by Matan Breizman 8 months ago

Not sure if related or not, but we had a similar crash in Crimson main tests:
https://tracker.ceph.com/issues/71478#note-3

Actions #16

Updated by Upkeep Bot 7 months ago

  • Fixed In changed from v20.3.0-606-g3bfb9d6c974 to v20.3.0-606-g3bfb9d6c9744
  • Upkeep Timestamp changed from 2025-07-08T14:46:30+00:00 to 2025-07-14T15:20:45+00:00
Actions #17

Updated by Upkeep Bot 7 months ago

  • Fixed In changed from v20.3.0-606-g3bfb9d6c9744 to v20.3.0-606-g3bfb9d6c97
  • Upkeep Timestamp changed from 2025-07-14T15:20:45+00:00 to 2025-07-14T20:45:20+00:00
Actions #18

Updated by Upkeep Bot 7 months ago

  • Status changed from Pending Backport to Resolved
  • Upkeep Timestamp changed from 2025-07-14T20:45:20+00:00 to 2025-08-02T22:22:01+00:00