Bug #71027

[arm64] run-cli-tests failing on crushtool due to mempool-related assertion

Added by Casey Bodley 10 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Backport:
tentacle
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
CRUSH
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.3.0-606-g3bfb9d6c97
Released In:
Upkeep Timestamp:
2025-08-02T22:22:01+00:00

Description

occasionally failing in 'make check arm64'

ex https://jenkins.ceph.com/job/ceph-pull-requests-arm64/72746/

/home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/test-map-vary-r-0.t: failed
--- /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/test-map-vary-r-0.t
+++ /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/test-map-vary-r-0.t.err
@@ -3079,3 +3079,6 @@
   rule 3 (delltestrule) num_rep 4 result size == 1:\t27/1024 (esc)
   rule 3 (delltestrule) num_rep 4 result size == 2:\t997/1024 (esc)
   crushtool successfully built or modified map.  Use '-o <file>' to write it out.
+  /usr/lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:659: typename std::add_lvalue_reference<element_type>::type std::unique_ptr<mempool::shard_t[]>::operator[](size_t) const [_Tp = mempool::shard_t[], _Dp = std::default_delete<mempool::shard_t[]>]: Assertion 'get() != pointer()' failed.
+  *** Caught signal (Aborted) **
+   in thread f9ce09c0ec80 thread_name:service


Related issues 2 (1 open, 1 closed)

Related to crimson - Bug #71478: rados_api_tests: Assertion 'get() != pointer()' failed (Fix Under Review, Adam Emerson)

Copied to RADOS - Backport #71507: tentacle: [arm64] run-cli-tests failing on crushtool due to mempool-related assertion (Resolved, Bill Scales)
Actions #1

Updated by Casey Bodley 10 months ago

https://jenkins.ceph.com/job/ceph-pull-requests-arm64/73027/

/home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/reclassify.t: failed
--- /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/reclassify.t
+++ /home/jenkins-build/build/workspace/ceph-pull-requests-arm64/src/test/cli/crushtool/reclassify.t.err
@@ -408,6 +408,9 @@
   rule 0 had 0/10240 mismatched mappings (0)
   rule 1 had 0/10240 mismatched mappings (0)
   maps appear equivalent
+  /usr/lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:659: typename std::add_lvalue_reference<element_type>::type std::unique_ptr<mempool::shard_t[]>::operator[](size_t) const [_Tp = mempool::shard_t[], _Dp = std::default_delete<mempool::shard_t[]>]: Assertion 'get() != pointer()' failed.
+  *** Caught signal (Aborted) **
+   in thread ffff7c3e1c80 thread_name:service

Actions #2

Updated by Casey Bodley 10 months ago

  • Assignee set to Bill Scales
Actions #3

Updated by Bill Scales 10 months ago

Not reproduced this yet, but I saw something similar in a bluestore unittest while making the change: in that case a multi-threaded test case was exiting while threads that were still running tried to access mempool. exit runs the destructors, including some mempool destructors, causing a use-after-free bug. Prior to my recent change to add a unique_ptr these silently passed; now they will cause asserts. As this is a timing window between threads, it probably won't happen on every run.

The crashes from crushtool are happening at the end of the test as it exits. However, crushtool does all its work on the main thread; there are two threads created by global/common, but these should be idle.

Will investigate further.

Actions #4

Updated by Bill Scales 10 months ago

It is a use after free bug:

src/common/ceph_context.cc has a CephContextServiceThread:

  void *entry() override
  {
    while (1) {
      std::unique_lock l(_lock);
      if (_exit_thread) {
        break;
      }

      if (_cct->_conf->heartbeat_interval) {
        auto interval = ceph::make_timespan(_cct->_conf->heartbeat_interval);
        _cond.wait_for(l, interval);
      } else
        _cond.wait(l);

      if (_exit_thread) {
        break;
      }

      if (_reopen_logs) {
        _cct->_log->reopen_log_file();
        _reopen_logs = false;
      }
      _cct->_heartbeat_map->check_touch_file();

      // refresh the perf counters
      _cct->_refresh_perf_values();
    }
    return NULL;
  }

heartbeat_interval defaults to 5 seconds so once every 5 seconds this thread calls _refresh_perf_values() which updates all the mempool counters.
However this thread doesn't get terminated until the CephContext destructor gets called, which never happens in crushtool.

Therefore it is possible that the heartbeat tick occurs and _refresh_perf_values() gets called just as the main thread in crushtool is exiting and running the mempool destructors. Presumably the timing of the arm64 make check tests makes it more likely that a test has been running for 5 seconds.

The logs above have

in thread ffff7c3e1c80 thread_name:service

which confirms it is this thread that is asserting.

Why isn't the CephContext destructor being called?

This code in src/tools/crushtool.cc calls cct->get(), which blocks the destructor from being called. If this line is commented out, then the destructor runs (confirmed with gdb).

  auto cct = global_init(NULL, empty_args, CEPH_ENTITY_TYPE_CLIENT,
                         CODE_ENVIRONMENT_UTILITY,
                         CINIT_FLAG_NO_DEFAULT_CONFIG_FILE);
  // crushtool times out occasionally when quits. so do not
  // release the g_ceph_context.
  cct->get();
  common_init_finish(g_ceph_context);

https://github.com/ceph/ceph/commit/d305cc51b18cbf4b2757bbacb5d43324461306a9 added this code, but provides no further explanation about the timeout/hang.

Don't really want to remove this without understanding why the timeout/hang occurred. We could stop this thread being created by setting
CINIT_FLAG_NO_DAEMON_ACTIONS on the global_init call; this feels like a bit of a hack but might be the safest fix.
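For reference, the suggested change would amount to OR-ing the extra flag into the existing global_init call quoted above (a sketch of the proposal, not a tested patch):

```cpp
  auto cct = global_init(NULL, empty_args, CEPH_ENTITY_TYPE_CLIENT,
                         CODE_ENVIRONMENT_UTILITY,
                         CINIT_FLAG_NO_DEFAULT_CONFIG_FILE |
                         CINIT_FLAG_NO_DAEMON_ACTIONS);  // skip CephContextServiceThread
```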

Actions #5

Updated by Casey Bodley 10 months ago

Bill Scales wrote in #note-4:

heartbeat_interval defaults to 5 seconds so once every 5 seconds this thread calls _refresh_perf_values() which updates all the mempool counters.
However this thread doesn't get terminated until the CephContext destructor gets called, which never happens in crushtool.

Therefore it is possible that the heartbeat tick occurs and _refresh_perf_values() gets called just as the main thread in crushtool is exiting and running the mempool destructors. Presumably the timing of the arm64 make check tests makes it more likely that a test has been running for 5 seconds.

thanks Bill. do you know what controls the lifetime of these mempools? if they aren't owned by the CephContext, maybe they should be?

Don't really want to remove this without understanding why the timeout/hang occurred. We could stop this thread being created by setting
CINIT_FLAG_NO_DAEMON_ACTIONS on the global_init call; this feels like a bit of a hack but might be the safest fix.

CINIT_FLAG_NO_DAEMON_ACTIONS does sound reasonable for command-line tools like this

Actions #6

Updated by Bill Scales 10 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 63092
  • Affected Versions v20.0.0 added
  • Component(RADOS) CRUSH added

do you know what controls the lifetime of these mempools? if they aren't owned by the CephContext, maybe they should be?

The mempools are global (shared between instances of CephContext) so shouldn't be owned by the CephContext. We could change the unique_ptr to a
shared_ptr and have CephContext keep a shared pointer to hold a reference count, but that starts becoming a lot of effort just to ensure a small amount of memory is freed at program exit.

I've implemented the CINIT_FLAG_NO_DAEMON_ACTIONS fix. If we see another use-after-free bug with mempool, then I suggest we get rid of the unique_ptr in src/common/mempool.cc and just have a memory leak for the small amount of dynamic memory that is being allocated for the lifetime of the executable.

Actions #7

Updated by Radoslaw Zarzynski 10 months ago

In QA.

Actions #8

Updated by Radoslaw Zarzynski 10 months ago

Still in QA.

Actions #9

Updated by Bill Scales 9 months ago

  • Backport set to tentacle

The bug will exist in tentacle as well, but not in earlier releases.

Actions #10

Updated by Casey Bodley 9 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #11

Updated by Upkeep Bot 9 months ago

  • Copied to Backport #71507: tentacle: [arm64] run-cli-tests failing on crushtool due to mempool-related assertion added
Actions #12

Updated by Upkeep Bot 9 months ago

  • Tags (freeform) set to backport_processed
Actions #13

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 3bfb9d6c9744a7bd685c18abb9df93188cefae7c
  • Fixed In set to v20.3.0-606-g3bfb9d6c974
  • Upkeep Timestamp set to 2025-07-08T14:46:30+00:00
Actions #14

Updated by Matan Breizman 8 months ago

  • Related to Bug #71478: rados_api_tests: Assertion 'get() != pointer()' failed added
Actions #15

Updated by Matan Breizman 8 months ago

Not sure if related or not, but we had a similar crash in Crimson main tests:
https://tracker.ceph.com/issues/71478#note-3

Actions #16

Updated by Upkeep Bot 7 months ago

  • Fixed In changed from v20.3.0-606-g3bfb9d6c974 to v20.3.0-606-g3bfb9d6c9744
  • Upkeep Timestamp changed from 2025-07-08T14:46:30+00:00 to 2025-07-14T15:20:45+00:00
Actions #17

Updated by Upkeep Bot 7 months ago

  • Fixed In changed from v20.3.0-606-g3bfb9d6c9744 to v20.3.0-606-g3bfb9d6c97
  • Upkeep Timestamp changed from 2025-07-14T15:20:45+00:00 to 2025-07-14T20:45:20+00:00
Actions #18

Updated by Upkeep Bot 7 months ago

  • Status changed from Pending Backport to Resolved
  • Upkeep Timestamp changed from 2025-07-14T20:45:20+00:00 to 2025-08-02T22:22:01+00:00