Bug #70700
closed
Make check fails to load libceph-common.so.2: file too short
Description
Make check failure:
https://jenkins.ceph.com/job/ceph-pull-requests/154764
243/310 Test #169: test_ceph_argparse.py ............................***Failed 33.22 sec
/home/jenkins-build/build/workspace/ceph-pull-requests/build/bin/get_command_descriptions: error while loading shared libraries: /home/jenkins-build/build/workspace/ceph-pull-requests/build/lib/libceph-common.so.2: file too short
Couldn't parse JSON : Expecting value: line 1 column 1 (char 0)
EInvalid command: missing required parameter entity(<string>)
...
ERROR: test_parse_json_funcsigs (__main__.ParseJsonFuncsigs)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/pybind/test_ceph_argparse.py", line 45, in test_parse_json_funcsigs
cmd_json = parse_json_funcsigs(commands, 'cli')
File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/ceph_argparse.py", line 1006, in parse_json_funcsigs
raise e
File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/ceph_argparse.py", line 1003, in parse_json_funcsigs
overall = json.loads(s)
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
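The JSONDecodeError above looks like a downstream symptom rather than a parser bug: get_command_descriptions could not load its shared library, so it presumably produced no usable JSON on stdout, and json.loads on empty input raises exactly this message. A minimal sketch, assuming empty output (the helper name is hypothetical and not taken from ceph_argparse.py):

```python
import json


def describe_json_failure(text):
    """Return the JSONDecodeError message for text, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return str(err)
    return None


# Empty output, as when the helper binary dies before printing anything.
print(describe_json_failure(""))
```

The same message appears for any output whose first character cannot start a JSON value, so the parse error on its own says little about the root cause.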
Updated by Casey Bodley 11 months ago
- Project changed from mgr to CI
- Tool set to Jenkins builders
i don't think this is just a json/mgr issue. a ton of tests there failed (35 tests failed out of 310), all with errors like this:
error while loading shared libraries: /home/jenkins-build/build/workspace/ceph-pull-requests/build/lib/libceph-common.so.2: file too short
another example with 95 tests failed out of 311 in https://jenkins.ceph.com/job/ceph-pull-requests/154914/
moving to CI project
Updated by Laura Flores 11 months ago
- Subject changed from Make check fails to parse JSON in test_ceph_argparse.py to Make check fails to load shared libraries
Updated by David Galloway 11 months ago
Why is the infrastructure being blamed for this? Does the exact same test pass on a different builder?
Updated by Casey Bodley 11 months ago
Why is the infrastructure being blamed for this? Does the exact same test pass on a different builder?
@David Galloway it's not just a single test failure, many of the tests can't run because the libceph-common.so.2 artifact is apparently corrupted. most PRs don't seem to be affected
i first saw this 2 days ago in https://github.com/ceph/ceph/pull/62515#issuecomment-2754261628 where it went away after re-running with 'jenkins test make check'
what the heck happened with 'make check'? https://jenkins.ceph.com/job/ceph-pull-requests/154687/
92 tests failed out of 310
error while loading shared libraries: /home/jenkins-build/build/workspace/ceph-pull-requests/build/lib/libceph-common.so.2: file too short
Updated by Kefu Chai 11 months ago
seen again at https://jenkins.ceph.com/job/ceph-pull-requests/154952/, on 172.21.5.31+adami01
Updated by Casey Bodley 11 months ago
34 tests failed out of 310
error while loading shared libraries: /home/jenkins-build/build/workspace/ceph-pull-requests/build/lib/libceph-common.so.2: file too short
Updated by Casey Bodley 11 months ago
- Subject changed from Make check fails to load shared libraries to Make check fails to load libceph-common.so.2: file too short
Updated by Matt Benjamin 11 months ago
https://github.com/ceph/ceph/pull/61878 failed make check with different files "too short"
Matt
Updated by David Galloway 11 months ago
libceph-common.so.2 is something that is built during the Jenkins job itself, right? Not something permanent on a builder?
Is it possible a recent change got merged that is occasionally causing this?
If the jobs were saying something about some `/usr/local/lib` local library, I'd understand why it'd look like the CI.
Updated by David Galloway 11 months ago
Could https://github.com/ceph/ceph/pull/62475 have caused it?
Updated by Bill Scales 11 months ago
Another instance. One of the tests, smoke.sh, times out at 2 hours (it is a victim of this issue). Its log shows that the library was fine at the start of the test (it's running ceph-mon and ceph commands without problems), but a couple of lines later the script fails to start ceph-mgr. This shows the library is being corrupted while the unit tests are running.
# Library is good here:
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/standalone/ceph-helpers.sh:550: run_mgr: ceph config set mgr mgr_pool false --force
//home/jenkins-build/build/workspace/ceph-pull-requests/qa/standalone/ceph-helpers.sh:552: run_mgr: get_asok_path
//home/jenkins-build/build/workspace/ceph-pull-requests/qa/standalone/ceph-helpers.sh:107: get_asok_path: local name=
//home/jenkins-build/build/workspace/ceph-pull-requests/qa/standalone/ceph-helpers.sh:108: get_asok_path: '[' -n '' ']'
///home/jenkins-build/build/workspace/ceph-pull-requests/qa/standalone/ceph-helpers.sh:111: get_asok_path: get_asok_dir
///home/jenkins-build/build/workspace/ceph-pull-requests/qa/standalone/ceph-helpers.sh:99: get_asok_dir: '[' -n '' ']'
///home/jenkins-build/build/workspace/ceph-pull-requests/qa/standalone/ceph-helpers.sh:102: get_asok_dir: echo /tmp/ceph-asok.271622
//home/jenkins-build/build/workspace/ceph-pull-requests/qa/standalone/ceph-helpers.sh:111: get_asok_path: echo '/tmp/ceph-asok.271622/$cluster-$name.asok'
//home/jenkins-build/build/workspace/ceph-pull-requests/qa/standalone/ceph-helpers.sh:552: run_mgr: realpath /home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr
# Library has been corrupted by here:
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/standalone/ceph-helpers.sh:552: run_mgr: ceph-mgr --id x --osd-failsafe-full-ratio=.99 --debug-mgr 20 --debug-objecter 20 --debug-ms 20 --debug-paxos 20 --chdir= --mgr-data=td/smoke/x '--log-file=td/smoke/$name.log' '--admin-socket=/tmp/ceph-asok.271622/$cluster-$name.asok' --run-dir=td/smoke '--pid-file=td/smoke/$name.pid' --mgr-module-path=/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr
ceph-mgr: error while loading shared libraries: /home/jenkins-build/build/workspace/ceph-pull-requests/build/lib/libceph-common.so.2: file too short
/home/jenkins-build/build/workspace/ceph-pull-requests/qa/standalone/ceph-helpers.sh:566: run_mgr: return 1
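For context, as far as I know the glibc loader reports "file too short" when it cannot read a complete ELF header from the file, which fits a library that has been truncated or is in the middle of being rewritten. A rough sketch of an equivalent check in Python; the function name and the 64-byte ELF64 header size are assumptions for illustration, not something taken from the Ceph tree:

```python
import os
import tempfile

ELF_MAGIC = b"\x7fELF"
ELF64_HEADER_SIZE = 64  # assumed: size of an ELF64 file header


def looks_truncated(path):
    """True if the file is too small to hold an ELF64 header or lacks the magic."""
    with open(path, "rb") as f:
        header = f.read(ELF64_HEADER_SIZE)
    return len(header) < ELF64_HEADER_SIZE or not header.startswith(ELF_MAGIC)


if __name__ == "__main__":
    # Demonstrate on a zero-length stand-in file rather than a real library.
    with tempfile.NamedTemporaryFile(suffix=".so.2", delete=False) as tmp:
        empty_path = tmp.name
    print(looks_truncated(empty_path))
    os.unlink(empty_path)
```

Polling a check like this against build/lib/libceph-common.so.2 while the tests run could narrow down when the file changes, though it would not say who changed it.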
Updated by Kefu Chai 11 months ago
David Galloway wrote in #note-10:
Could https://github.com/ceph/ceph/pull/62475 have caused it?
hi David, not likely, the worst consequence of the change is FTBFS, not a corrupted or truncated libceph-common.so.2.
Updated by Bill Scales 11 months ago
Failures started in the middle of check-generated.sh in this run. Of course, the tests that see the error are probably victims of whatever is truncating/overwriting this file rather than the culprit.
Updated by Vallari Agrawal 11 months ago
saw this twice today on same PR:
https://jenkins.ceph.com/job/ceph-pull-requests/155385/ on 172.21.2.17+braggi17
https://jenkins.ceph.com/job/ceph-pull-requests/155368/ on 172.21.3.233+irvingi02
Updated by Casey Bodley 11 months ago
Updated by Casey Bodley 11 months ago
100 tests failed out of 314
Updated by Bill Scales 11 months ago
What can we do to get more details about what is corrupting the library? I'm going to guess that this might not be the only library being corrupted (truncated to length 0, perhaps?); it's just the first library that everything is linked against. Can we add something to make check that does an ls -l and perhaps some shasums of the build/lib directory before and after trying to run the unit tests? For example, looking at libceph-common.so.2, librados.so.2, librbd.so.1 and libcephfs.so.2.
We need to try and work out whether this is a unit test somehow managing to destroy a library, or whether it's some kind of infrastructure problem: for example, a compile/link running at the same time as the unit tests, or a storage problem where the libraries can't be read.
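The before/after comparison suggested above could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, SHA-256 stands in for whichever shasum variant would be used, and nothing here is taken from run-make-check.sh.

```python
import hashlib
import os

# Library names from the comment above; the directory is supplied by the caller.
LIBS = ["libceph-common.so.2", "librados.so.2", "librbd.so.1", "libcephfs.so.2"]


def file_info(path):
    """Return (size in bytes, sha256 hex digest), reading the file in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def snapshot(lib_dir, names=LIBS):
    """Map each library name to its file_info, or None if the file is missing."""
    result = {}
    for name in names:
        path = os.path.join(lib_dir, name)
        result[name] = file_info(path) if os.path.isfile(path) else None
    return result


def changed(before, after):
    """Return the sorted names whose size or checksum differs between snapshots."""
    return sorted(name for name in before if before[name] != after.get(name))
```

Taking one snapshot of build/lib before the unit tests and another afterwards, then printing changed(before, after), would show whether other libraries are hit as well and whether the file ends up at length 0.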
Updated by Vallari Agrawal 11 months ago · Edited
I saw this 5 times on "make check" of same PR:
https://jenkins.ceph.com/job/ceph-pull-requests/155940/ 172.21.5.33+adami03
https://jenkins.ceph.com/job/ceph-pull-requests/155930/ 172.21.2.18+braggi18
https://jenkins.ceph.com/job/ceph-pull-requests/155989/ 172.21.5.31+adami01
https://jenkins.ceph.com/job/ceph-pull-requests/156095/ 172.21.5.35+adami05
https://jenkins.ceph.com/job/ceph-pull-requests/156114/ 172.21.2.17+braggi17
Updated by Casey Bodley 10 months ago
Bill Scales wrote in #note-20:
What can we do to get more details about what is corrupting the library? I'm going to guess that this might not be the only library being corrupted (truncated to length 0, perhaps?); it's just the first library that everything is linked against. Can we add something to make check that does an ls -l and perhaps some shasums of the build/lib directory before and after trying to run the unit tests? For example, looking at libceph-common.so.2, librados.so.2, librbd.so.1 and libcephfs.so.2.
We need to try and work out whether this is a unit test somehow managing to destroy a library, or whether it's some kind of infrastructure problem: for example, a compile/link running at the same time as the unit tests, or a storage problem where the libraries can't be read.
thanks @Bill Scales. so far i haven't seen this happen on backport PRs. has anyone else? were this an infrastructure problem, we'd expect to see failures on all release branches
if someone's able to reproduce this with a local 'make check', we could potentially try to bisect. but that's tough without a reliable reproducer
Updated by Casey Bodley 10 months ago
i opened https://github.com/ceph/ceph/pull/62832 with a change to run-make-check.sh that uses chmod to remove write permissions from lib/libceph-common.so.2 in the hopes that it can catch a test case in the act
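For readers unfamiliar with the trick, the effect of that chmod can be sketched in Python as below. This is not the change from the PR (which edits run-make-check.sh); the function name is hypothetical and the snippet only illustrates clearing the write bits.

```python
import os
import stat


def make_read_only(path):
    """Clear the user, group, and other write bits on path; return the new mode."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
    os.chmod(path, new_mode)
    return new_mode
```

One caveat: clearing the write bits only makes in-place writes fail. A process that deletes and recreates the file needs write permission on the directory, not on the file, so it would not be blocked by this alone.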
Updated by Casey Bodley 10 months ago
Casey Bodley wrote in #note-30:
i opened https://github.com/ceph/ceph/pull/62832 with a change to run-make-check.sh that uses chmod to remove write permissions from lib/libceph-common.so.2 in the hopes that it can catch a test case in the act
@Radoslaw Zarzynski those investigations in https://github.com/ceph/ceph/pull/62832#issuecomment-2809428355 point at the new unittest_decode_start_v_checker/unittest_decode_start_v_checker_expect_failure tests which can be seen recompiling libceph-common.so.2 and its dependencies. it's not clear why they need to rebuild, but maybe that target uses different compiler/linker flags?
Updated by Casey Bodley 10 months ago
i've proposed disabling these tests for now in https://github.com/ceph/ceph/pull/62902