Bug #71478
openrados_api_tests: Assertion 'get() != pointer()' failed
Description
description: crimson-rados/basic/{clusters/fixed-2 crimson-supported-all-distro/centos_latest
crimson_qa_overrides deploy/ceph objectstore/seastore/seastore-rbm tasks/rados_api_tests}
2025-05-27T12:35:10.657 INFO:tasks.workunit.client.0.gibba004.stderr:/opt/rh/gcc-toolset-13/root/usr/include/c++/13/bits/unique_ptr.h:724: typename std::add_lvalue_reference<_Tp>::type std::unique_ptr<_Tp [], _Dp>::operator[](std::size_t) const [with _Tp = mempool::shard_t; _Dp = std::default_delete<mempool::shard_t []>; typename std::add_lvalue_reference<_Tp>::type = mempool::shard_t&; std::size_t = long unsigned int]: Assertion 'get() != pointer()' failed.
2025-05-27T12:35:10.699 INFO:tasks.workunit.client.0.gibba004.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 55: 101643 Aborted (core dumped) ceph_test_neorados_$f
2025-05-27T12:35:10.699 INFO:tasks.workunit.client.0.gibba004.stderr:++ cleanup
2025-05-27T12:35:10.699 INFO:tasks.workunit.client.0.gibba004.stderr:++ pkill -P 36814
2025-05-27T12:35:10.749 INFO:tasks.workunit.client.0.gibba004.stderr:++ true
2025-05-27T12:35:10.749 INFO:tasks.workunit.client.0.gibba004.stderr:+ cleanup
2025-05-27T12:35:10.750 INFO:tasks.workunit.client.0.gibba004.stderr:+ pkill -P 36814
2025-05-27T12:35:10.753 DEBUG:teuthology.orchestra.run:got remote process result: 134
2025-05-27T12:35:10.754 INFO:tasks.workunit.client.0.gibba004.stderr:+ true
2025-05-27T12:35:10.754 INFO:tasks.workunit:Stopping ['rados/test.sh --crimson', 'rados/test_pool_quota.sh'] on client.0...
2025-05-27T12:35:10.755 DEBUG:teuthology.orchestra.run.gibba004:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2025-05-27T12:35:10.991 ERROR:teuthology.run_tasks:Saw exception from tasks.
Updated by Aishwarya Mathuria 9 months ago
/a/amathuri-2025-05-27_09:48:24-crimson-rados-main-distro-crimson-gibba/8297162
A similar assert failure: https://tracker.ceph.com/issues/71027
Updated by Matan Breizman 8 months ago
- Status changed from New to Closed
Aishwarya Mathuria wrote in #note-1:
/a/amathuri-2025-05-27_09:48:24-crimson-rados-main-distro-crimson-gibba/8297162
A similar assert failure: https://tracker.ceph.com/issues/71027
Possibly fixed by https://github.com/ceph/ceph/pull/63092 which was merged after the above run, let's reopen otherwise.
Updated by Matan Breizman 8 months ago · Edited
- Status changed from Closed to New
reopening:
2025-07-08T23:30:56.519 INFO:tasks.workunit.client.0.smithi046.stderr:+ ceph_test_neorados_read_operations
2025-07-08T23:30:56.545 INFO:tasks.workunit.client.0.smithi046.stdout:Running main() from gmock_main.cc
2025-07-08T23:30:56.545 INFO:tasks.workunit.client.0.smithi046.stdout:[==========] Running 15 tests from 1 test suite.
:
2025-07-08T23:31:41.587 INFO:tasks.workunit.client.0.smithi046.stdout:[----------] Global test environment tear-down
2025-07-08T23:31:41.587 INFO:tasks.workunit.client.0.smithi046.stdout:[==========] 15 tests from 1 test suite ran. (45041 ms total)
2025-07-08T23:31:41.588 INFO:tasks.workunit.client.0.smithi046.stdout:[ PASSED ] 15 tests.
2025-07-08T23:31:41.588 INFO:tasks.workunit.client.0.smithi046.stderr:/opt/rh/gcc-toolset-13/root/usr/include/c++/13/bits/unique_ptr.h:724: typename std::add_lvalue_reference<_Tp>::type std::unique_ptr<_Tp [], _Dp>::operator[](std::size_t) const [with _Tp = mempool::shard_t; _Dp = std::default_delete<mempool::shard_t []>; typename std::add_lvalue_reference<_Tp>::type = mempool::shard_t&; std::size_t = long unsigned int]: Assertion 'get() != pointer()' failed.
2025-07-08T23:31:42.306 INFO:tasks.workunit.client.0.smithi046.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 55: 85803 Aborted (core dumped) ceph_test_neorados_$f
- Update: from this teuthology log, my understanding is that the 15 tests from NeoRadosReadOps completed and passed; the next test to run in the sequence is actually snapshots:
2025-07-08T23:30:56.518 INFO:tasks.workunit.client.0.smithi046.stderr:+ for f in cls cmd handler_error io ec_io list ec_list misc pool read_operations snapshots watch_notify write_operations
will check with the other failure... might be wrong though
Updated by Matan Breizman 8 months ago
- Related to Bug #71027: [arm64] run-cli-tests failing on crushtool due to mempool-related assertion added
Updated by Bill Scales 7 months ago
- Assignee set to Bill Scales
This definitely won't be fixed by the crushtool fix; it appears to be ceph_test_neorados_read_operations that is crashing in these two cases. However, the cause of the crash is probably similar: in the crushtool case, the problem was that the tool was exiting (and hence running destructors) while it still had a RADOS context open, with active threads running and accessing the memory that was being destructed.
A quick experiment with ceph_test_neorados_read_operations showed that it had 192 threads running at the time it exited main(), which looked like 38 instances of a RADOS context plus two other threads. Each RADOS context has a service thread that periodically reads the mempool stats. One of these threads waking up and trying to grab the stats while the exit destructors are being called will trigger this assert.
The test case (probably the common infrastructure the test cases use) is leaking RADOS contexts, which needs fixing. We also need to look at what flags are used when the context is created; the service threads are almost certainly unneeded for clients, as they are really only intended for long-running daemon processes.
Updated by Matan Breizman 3 months ago · Edited
2025-11-11T21:43:33.590 INFO:tasks.workunit.client.0.smithi083.stderr:/opt/rh/gcc-toolset-13/root/usr/include/c++/13/bits/unique_ptr.h:724: constexpr typename std::add_lvalue_reference<_Tp>::type std::unique_ptr<_Tp [], _Dp>::operator[](std::size_t) const [with _Tp = mempool::shard_t; _Dp = std::default_delete<mempool::shard_t []>; typename std::add_lvalue_reference<_Tp>::type = mempool::shard_t&; std::size_t = long unsigned int]: Assertion 'get() != pointer()' failed.
2025-11-11T21:43:34.230 INFO:tasks.workunit.client.0.smithi083.stderr:timeout: the monitored command dumped core
2025-11-11T21:43:34.230 INFO:tasks.workunit.client.0.smithi083.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 136: 101373 Aborted timeout $timeout $executable
2025-11-11T21:43:34.230 INFO:tasks.workunit.client.0.smithi083.stderr:+ echo 'ERROR: Test snapshots timed out after 5400 seconds'
2025-11-11T21:43:34.231 INFO:tasks.workunit.client.0.smithi083.stdout:ERROR: Test snapshots timed out after 5400 seconds
2025-11-11T21:43:34.231 INFO:tasks.workunit.client.0.smithi083.stdout:Check the logs for failures in snapshots
2025-11-11T21:43:34.232 INFO:tasks.workunit.client.0.smithi083.stderr:+ echo 'Check the logs for failures in snapshots'
2025-11-11T21:43:34.232 INFO:tasks.workunit.client.0.smithi083.stderr:+ ret=1
I overlooked this earlier: the log indicates the failure occurred in snapshots, as mentioned above.
Updated by Matan Breizman 3 months ago
- Assignee set to Jose J Palacios Perez
Hey Jose, can you please take a look at this one? Thanks!
Updated by Jose J Palacios Perez 3 months ago
Hi Matan, sure, I will take a look asap. From the comments it seems this is a test case in teuthology that needs improving, right? I'll look at the documentation, since I am not quite sure what the test suite does, and will keep updating the tracker. I am looking at the related issues pointed out above, and will definitely ask some questions. Cheers
Updated by Jose J Palacios Perez 3 months ago · Edited
I started looking at /ceph/src/test/neorados/read_operations.cc, and test_neorados.cc
- I am mystified by the start_stop.cc main function: each {}-enclosed block seems to create an io_context_pool, which is then used for the make_with_cct RADOS object, and then puts the thread to sleep for a fixed time, with each block having its own specific sleep time. I need to check global_init() and common_init_finish() to figure out whether further threads are being created.
Plan:
- understand what the test does, and
- where it fails
Updated by Jose J Palacios Perez 3 months ago · Edited
- test/neorados/start_stop.cc: this contains the main() function (according to my understanding of the rules in CMakeLists.txt for this folder), so I am adding the same flag CINIT_FLAG_NO_DAEMON_ACTIONS to global_init(). Worth a shot. Unfortunately it might not help, since the actual failing test is snapshots; continue looking I am.
- Looking at qa/workunits/rados/test.sh: it seems a bit odd that the script uses the --crimson option only to skip the EC tests, but does not use that same flag to create a Crimson OSD cluster when invoked with --vstart, so it might need improving.
2025-07-08T23:31:42.340 INFO:tasks.workunit:Stopping ['rados/test.sh --crimson', 'rados/test_pool_quota.sh'] on client.0...
Looking at /ceph/src/test/neorados/{snapshot.cc,common_tests.h}
- Managed to build by running ccmake . and enabling WITH_TESTS, then ninja -j20 ceph_test_neorados_snapshots, then running the standalone:
# bin/ceph_test_neorados_snapshots --help
Running main() from gmock_main.cc
This program contains tests written using Google Test. You can use the following command line flags to control its behavior:
Test Selection:
  --gtest_list_tests
      List the names of all tests instead of running them. The name of TEST(Foo, Bar) is "Foo.Bar".
  --gtest_filter=POSITIVE_PATTERNS[-NEGATIVE_PATTERNS]
      Run only the tests whose name matches one of the positive patterns but none of the negative patterns. '?' matches any single character; '*' matches any substring; ':' separates two patterns.
  --gtest_also_run_disabled_tests
      Run all disabled tests too.
Test Execution:
  --gtest_repeat=[COUNT]
:
# bin/ceph_test_neorados_snapshots --gtest_list_tests
Running main() from gmock_main.cc
NeoRadosSnapshots.
  SnapList
  SnapRemove
  Rollback
  SnapGetName
  SnapCreateRemove
NeoRadosSelfManagedSnaps.
  Snap
  Rollback
  SnapOverlap
  Bug11677
  OrderSnap
  ReusePurgedSnap
@d9754c08030d:/ceph/build
[11:20:33]$ # bin/ceph_test_neorados_snapshots --gtest_break_on_failure
Running main() from gmock_main.cc
[==========] Running 11 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 5 tests from NeoRadosSnapshots
[ RUN ] NeoRadosSnapshots.SnapList
unknown file: Failure
C++ exception with description "Connection timed out [system:110]" thrown in the test body.
Trace/breakpoint trap (core dumped)
# ls -lht /var/lib/systemd/coredump/
total 109M
-rw-r-----. 1 root root 700K Nov 20 11:30 core.ceph_test_neora.0.fb62b985edb14266b3de32817af88195.647877.1763638223000000.zst
I'm guessing this needs a cluster running so the tests communicate with it?
Yes, this shows a clean run with a single OSD cluster, single reactor (default) backend Seastore:
# MDS=0 MON=1 OSD=1 MGR=1 taskset -ac '0-27,56-83' /ceph/src/vstart.sh --new -x --localhost --without-dashboard --redirect-output --seastore --osd-args "--seastore_max_concurrent_transactions=128 --seastore_cachepin_type=LRU" --seastore-devs /dev/nvme9n1p2 --crimson --no-restart
INFO 2025-11-20 11:49:24,505 [shard 0:main] osd - get_early_config: set --thread-affinity 0 --smp 1
start osd.0
osd 0 /ceph/build/bin/crimson-osd --seastore_max_concurrent_transactions=128 --seastore_cachepin_type=LRU -i 0 -c /ceph/build/ceph.conf
OSDs started
PID TTY STAT TIME COMMAND
1 pts/0 Ss 0:17 /bin/bash
2381608 pts/0 S 0:00 /bin/sh /ceph/src/ceph-run --no-restart /ceph/build/bin/ceph-mon -i a -c /ceph/build/ceph.conf -f
2381610 pts/0 Sl 0:00 \_ /ceph/build/bin/ceph-mon -i a -c /ceph/build/ceph.conf -f
2381748 pts/0 S 0:00 /bin/sh /ceph/src/ceph-run --no-restart /ceph/build/bin/ceph-mgr -i x -c /ceph/build/ceph.conf -f
2381752 pts/0 Sl 0:03 \_ /ceph/build/bin/ceph-mgr -i x -c /ceph/build/ceph.conf -f
2382055 pts/0 S 0:00 /bin/sh /ceph/src/ceph-run --no-restart /ceph/build/bin/crimson-osd --seastore_max_concurrent_transactions=128 --seastore_cachepin_type=LRU -i 0 -c /ceph/build/ceph.conf -f
2382058 pts/0 Sl 0:01 \_ /ceph/build/bin/crimson-osd --seastore_max_concurrent_transactions=128 --seastore_cachepin_type=LRU -i 0 -c /ceph/build/ceph.conf -f
@d9754c08030d:/ceph/build
[11:52:24]$ # taskset -acp 2382058
pid 2382058's current affinity list: 0-27,56-83
pid 2382064's current affinity list: 0-27,56-83
pid 2382065's current affinity list: 0-27,56-83
# ps -L -o pid,ppid,tid,comm,psr -p 2382058
PID PPID TID COMMAND PSR
2382058 2382055 2382058 crimson-osd 64
2382058 2382055 2382064 syscall-0 60
2382058 2382055 2382065 crimson-osd 61
# bin/ceph_test_neorados_snapshots --gtest_break_on_failure
Running main() from gmock_main.cc
[==========] Running 11 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 5 tests from NeoRadosSnapshots
[ RUN ] NeoRadosSnapshots.SnapList
[ OK ] NeoRadosSnapshots.SnapList (4198 ms)
[ RUN ] NeoRadosSnapshots.SnapRemove
[ OK ] NeoRadosSnapshots.SnapRemove (5015 ms)
[ RUN ] NeoRadosSnapshots.Rollback
[ OK ] NeoRadosSnapshots.Rollback (4014 ms)
[ RUN ] NeoRadosSnapshots.SnapGetName
[ OK ] NeoRadosSnapshots.SnapGetName (5017 ms)
[ RUN ] NeoRadosSnapshots.SnapCreateRemove
[ OK ] NeoRadosSnapshots.SnapCreateRemove (7026 ms)
[----------] 5 tests from NeoRadosSnapshots (25273 ms total)
[----------] 6 tests from NeoRadosSelfManagedSnaps
[ RUN ] NeoRadosSelfManagedSnaps.Snap
[ OK ] NeoRadosSelfManagedSnaps.Snap (5015 ms)
[ RUN ] NeoRadosSelfManagedSnaps.Rollback
[ OK ] NeoRadosSelfManagedSnaps.Rollback (6023 ms)
[ RUN ] NeoRadosSelfManagedSnaps.SnapOverlap
[ OK ] NeoRadosSelfManagedSnaps.SnapOverlap (8028 ms)
[ RUN ] NeoRadosSelfManagedSnaps.Bug11677
[ OK ] NeoRadosSelfManagedSnaps.Bug11677 (6025 ms)
[ RUN ] NeoRadosSelfManagedSnaps.OrderSnap
[ OK ] NeoRadosSelfManagedSnaps.OrderSnap (4016 ms)
[ RUN ] NeoRadosSelfManagedSnaps.ReusePurgedSnap
Deleting snap 3 in pool ReusePurgedSnapd9754c08030d-2382093-11.
Waiting for snaps to purge.
[ OK ] NeoRadosSelfManagedSnaps.ReusePurgedSnap (19198 ms)
[----------] 6 tests from NeoRadosSelfManagedSnaps (48308 ms total)
[----------] Global test environment tear-down
[==========] 11 tests from 2 test suites ran. (73582 ms total)
[ PASSED ] 11 tests.
Updated by Jose J Palacios Perez 3 months ago · Edited
rados/test.sh --crimson. Questions:
- if, as Bill pointed out, the previous test (read_operations), despite completing successfully, somehow manages to leak RADOS contexts, that could cause the cluster to fail, and hence the subsequent test (snapshots) fails with a core dump (which I simulated by running the snapshots test standalone without a cluster)
- how could the suite be extended to monitor the status of the cluster before each test, so as to attribute the failure correctly (unless I am misunderstanding this issue)? In other words, how do we know whether the failure occurs in the Ceph cluster, which in turn causes the (gtest) snapshots test to fail and dump a core?
Mhm, it's not the same: the gtest flag --gtest_break_on_failure above actually forced a core dump; without it, running without a cluster does not seem to drop a core:
# ../src/stop.sh --crimson
WARNING: crimson-osd still alive after 1 seconds
WARNING: crimson-osd still alive after 2 seconds
WARNING: crimson-osd still alive after 4 seconds
@d9754c08030d:/ceph/build
[14:41:17]$ # bin/ceph_test_neorados_snapshots # w/o break on failure
Running main() from gmock_main.cc
[==========] Running 11 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 5 tests from NeoRadosSnapshots
[ RUN ] NeoRadosSnapshots.SnapList
unknown file: Failure
C++ exception with description "Connection timed out [system:110]" thrown in the test body.
[ FAILED ] NeoRadosSnapshots.SnapList (300005 ms)
[ RUN ] NeoRadosSnapshots.SnapRemove
unknown file: Failure
C++ exception with description "Connection timed out [system:110]" thrown in the test body.
[ FAILED ] NeoRadosSnapshots.SnapRemove (300006 ms)
[ RUN ] NeoRadosSnapshots.Rollback
From the last teuthology run, need to look at the available core to find out more:
TEST_DIR="/a/jjperez-2025-11-11_12:44:02-crimson-rados-wip-perezjos-crimson-only-11-11-2025-PR65726-distro-crimson-debug-smithi/"
* plan is to use the ceph-debug-docker.sh script (crimson flavour?) with the shaman build hash caa2c644d1ac0c549abe7ca1411889b9a12f8da9 to examine the core above with gdb
$ x=8595603; ls $TEST_DIR/$x/remote/*/coredump;
1762897413.101374.core.gz ceph_test_neorados_snapshots
This is the way the script finds out about the build, so it's similar to the teuthology invocation:
api_url="https://shaman.ceph.com/api/search/?status=ready&project=ceph&flavor=${FLAVOR}&distros=${distro}/$(arch)&ref=${branch}&sha1=${sha}"
Updated by Jose J Palacios Perez 3 months ago · Edited
No luck:
bin/ceph-debug-docker.sh --no-cache --flavor debug crimson:caa2c644d1ac0c549abe7ca1411889b9a12f8da9 centos:stream9
branch: crimson
sha1: caa2c644d1ac0c549abe7ca1411889b9a12f8da9
env: centos:stream9
/tmp/tmp.NwzagYL81s ~
--2025-11-21 11:00:14-- https://shaman.ceph.com/api/search/?status=ready&project=ceph&flavor=debug&distros=centos/9/x86_64&ref=crimson&sha1=caa2c644d1ac0c549abe7ca1411889b9a12f8da9
Resolving shaman.ceph.com (shaman.ceph.com)... 158.69.76.207
Connecting to shaman.ceph.com (shaman.ceph.com)|158.69.76.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2 [application/json]
Saving to: ‘STDOUT’
- 100%[=================================================================================================================================================================>] 2 --.-KB/s in 0s
2025-11-21 11:00:14 (5.92 MB/s) - written to stdout [2/2]
--2025-11-21 11:00:14-- http://nullrepo/
Resolving nullrepo (nullrepo)... failed: Name or service not known.
wget: unable to resolve host address ‘nullrepo’
Updated by Jose J Palacios Perez 3 months ago · Edited
Managed to make it work; it's downloading the build and preparing the container:
bin/ceph-debug-docker.sh --no-cache --flavor crimson-debug wip-perezjos-crimson-only-11-11-2025-PR65726:caa2c644d1ac0c549abe7ca1411889b9a12f8da9 centos:stream9
branch: wip-perezjos-crimson-only-11-11-2025-PR65726
sha1: caa2c644d1ac0c549abe7ca1411889b9a12f8da9
env: centos:stream9
/tmp/tmp.rMt93BlVkk ~
--2025-11-21 12:07:34-- https://shaman.ceph.com/api/search/?status=ready&project=ceph&flavor=crimson-debug&distros=centos/9/x86_64&ref=wip-perezjos-crimson-only-11-11-2025-PR65726&sha1=caa2c644d1ac0c549abe7ca1411889b9a12f8da9
Resolving shaman.ceph.com (shaman.ceph.com)... 158.69.76.207
Connecting to shaman.ceph.com (shaman.ceph.com)|158.69.76.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
:
Complete!
COMMIT jjperez:ceph-ci-wip-perezjos-crimson-only-11-11-2025-PR65726-caa2c644d1ac0c549abe7ca1411889b9a12f8da9-centos-stream9
Successfully tagged localhost/jjperez:ceph-ci-wip-perezjos-crimson-only-11-11-2025-PR65726-caa2c644d1ac0c549abe7ca1411889b9a12f8da9-centos-stream9
c802e61d379c9c3226c16f6373079005dd59bdfe6801459fd3545831270afb01
real 6m4.164s
user 3m56.032s
sys 0m41.689s
~
built image jjperez:ceph-ci-wip-perezjos-crimson-only-11-11-2025-PR65726-caa2c644d1ac0c549abe7ca1411889b9a12f8da9-centos-stream9
podman run -ti -v /teuthology:/teuthology:ro jjperez:ceph-ci-wip-perezjos-crimson-only-11-11-2025-PR65726-caa2c644d1ac0c549abe7ca1411889b9a12f8da9-centos-stream9
[root@2505403cdc80 ~]#
Unfortunately, it did not produce a valid stacktrace:
# gunzip -c $TEST_DIR/1762897413.101374.core.gz > /tmp/core
[root@2505403cdc80 ~]# ls -lht /tmp/core
-rw-r--r--. 1 root root 589M Nov 21 12:27 /tmp/core
[root@2505403cdc80 ~]# gdb $TEST_DIR/ceph_test_neorados_snapshots /tmp/core
GNU gdb (CentOS Stream) 16.3-2.el9
Copyright (C) 2024 Free Software Foundation, Inc.
:
warning: File /usr/lib64/libstdc++.so.6.0.29 doesn't match build-id from core-file during file-backed mapping processing
:
warning: Could not load shared library symbols for 12 libraries, e.g. /lib64/libstdc++.so.6.
Downloading separate debug info for system-supplied DSO at 0x7fff4f583000
Core was generated by `/usr/bin/ceph_test_neorados_snapshots'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f453e68c0fc in ?? ()
[Current thread is 1 (LWP 101413)]
(gdb) bt
#0 0x00007f453e68c0fc in ?? ()
#1 0x0000000000000000 in ?? ()
(gdb) bt full
#0 0x00007f453e68c0fc in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
gutted ...
Updated by Adam Emerson 3 months ago
- Assignee changed from Jose J Palacios Perez to Adam Emerson
Updated by Adam Emerson 3 months ago
- Status changed from New to Fix Under Review
- Backport set to tentacle, squid