qa/tasks: Include stderr on tasks badness check (#48539)
Conversation
I did a test run including this. It works properly; the only issue is that the majority of our tests end up getting marked failed, because if at any point during the test there was a health warning or error, it shows up in the log, even if its presence is actually fine and shouldn't be a problem for the test. E.g. during an upgrade test, 1 PG briefly being inactive is probably okay given the small number of hosts and OSDs in the test cluster being upgraded. Either we'd have to get it to exclude health warnings/errors in the check, only care about health warnings/errors that are still present at the end of the test, or clean everything up so that no health warnings are ever raised during the tests (which could be a big effort). Otherwise, at least for now, this is sort of unmergeable as it makes most of our teuthology tests get marked failed. I do like this, though, and will hopefully revisit it in the future. Just wanted to give an update on its current effects.
Agreed, there is much to consider with this PR. I did a similar exercise with the librbd API tests: the ignore list from the librbd API tests that we use in other tasks was tested against this PR, and it grew from 5 to 13 entries to pass. All of the additional entries are unrelated to the librbd API tests. I like your suggestions on how to address the problem. Another approach could be to only look for badness during the timestamps when the actual "test" is running. Currently, the rados suite is being tested against this PR to gauge the additional scope. These data points can be used to decide the best route forward.
idryomov left a comment
Either we'd have to get it to exclude health warnings/errors in the check
This is what was done for many years before cephadm appeared, and what is still done in qa/tasks/ceph.py-based suites, so I think it's the way to go. The cephadm suite and other suites that partially or fully switched to cephadm for deploying just need some allowlist updates.
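For illustration, allowlisting in teuthology suites is typically done with a log-ignorelist of regexes in the job YAML, matched against cluster log lines by the badness check. The entries below are only examples of the kind of transient warnings discussed in this thread (e.g. a PG briefly inactive during an upgrade), not the actual list from any suite:

```yaml
# Hypothetical job fragment; the entries are illustrative.
tasks:
- cephadm:
overrides:
  ceph:
    log-ignorelist:
      - \(OSDMAP_FLAGS\)
      - \(PG_AVAILABILITY\)
      - \(POOL_APP_NOT_ENABLED\)
```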
only care about health warnings/errors that are still present at the end of the test
No, such transient warnings/errors are exactly why this check greps the log file instead of just looking at ceph status output at the end of the test. We wanted these to fail the test if they aren't specifically allowed and I don't think that changed.
or clean everything up so that no health warnings are ever raised during the tests (which could be a big effort)
This is probably not realistic, particularly given that some tests do things that trigger warnings/errors on purpose.
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
@adk3798 Any progress on adjusting/cleaning up the cephadm suite here? Historically, all suites allowlisted health warnings that are generated as a result of the tests (i.e. on purpose), and the outcome of the CDM discussion was that cephadm needs to do the same.
Not a lot. I'll start putting some more focus on it. Didn't realize it was discussed in the CDM.
@adk3798 This has been tagged with
I did a couple of test runs and started making some ignorelist patches, but got distracted with work on nvme-of support at the time. I'll try to get the PR open for that some time this week and start testing with it.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
Make sure that first_in_ceph_log() doesn't return None (which is treated as success/"no badness" by the caller) if the cluster log file is missing.

Fixes: https://tracker.ceph.com/issues/57864
Co-authored-by: Ilya Dryomov <[email protected]>
Signed-off-by: Christopher Hoffman <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
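The idea behind the fix can be sketched as follows. This is an illustrative stand-in, not the actual qa/tasks/ceph.py code; only the function name is borrowed from the commit message, and the signature and patterns are assumptions:

```python
import os
import re

def first_in_ceph_log(bad_patterns, ignorelist, log_path):
    """Return the first cluster-log line that matches a bad pattern and
    is not covered by the ignorelist, or None if the log is clean.

    Sketch only: the key point is that a missing log file must raise
    instead of silently returning None ("no badness").
    """
    if not os.path.isfile(log_path):
        # Before the fix, this case fell through and looked like success.
        raise RuntimeError('cluster log file %s is missing' % log_path)
    with open(log_path) as f:
        for line in f:
            if any(re.search(p, line) for p in bad_patterns) and \
                    not any(re.search(i, line) for i in ignorelist):
                return line.rstrip('\n')
    return None
```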
Reviving this PR to preserve the change that would have prevented https://tracker.ceph.com/issues/63389.
jenkins test windows |
jenkins test make check arm64 |
jenkins test api |
Do we still need this along with #54312?
I'd say so. While this PR doesn't change anything for cephadm with #54312 in a visible way, it introduces a preventive measure. If cephadm (or the old-style ceph task, for that matter) stops logging, or the cluster log path changes, etc., all jobs would fail on the cluster log badness check. The approach isn't perfect (I'd rather explicitly check whether the log file exists and whether or not it's empty, and get grep's exit code instead of just propagating stderr like it's done here), but it's much better than nothing.
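The stricter variant described above could look something like this sketch, which interprets grep's documented exit statuses (0 = match, 1 = no match, 2+ = error) rather than relying on stderr. The function name, default pattern, and running grep locally via subprocess are all assumptions for illustration; the real task runs egrep on the remote host with its own patterns:

```python
import subprocess

def scan_cluster_log(log_path, pattern=r'\[ERR\]|\[WRN\]|\[SEC\]'):
    """Hypothetical sketch: distinguish grep's exit statuses explicitly.

    Returns the first "bad" line, None if the log exists and is clean,
    and raises if grep itself failed (e.g. the log file is missing).
    """
    proc = subprocess.run(['grep', '-E', pattern, log_path],
                          capture_output=True, text=True)
    if proc.returncode == 0:
        return proc.stdout.splitlines()[0]  # first matching line
    if proc.returncode == 1:
        return None  # log was read, nothing bad in it
    # returncode >= 2: grep failed before it could scan the log
    raise RuntimeError('grep failed: %s' % proc.stderr.strip())
```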
It used to, but I addressed that review comment: #48539 (comment) |
https://tracker.ceph.com/issues/63389 was missed because we didn't have this. |
okay, I'll tag this for an orch run at least and see what happens |
Nothing happening would be expected ;) I have done some targeted testing of my own, verifying that jobs without any "badness" in the cluster log pass, jobs with badness fail, and a bogus path to the cluster log file causes all jobs to fail. I'll link the runs in a moment.
yup, the run I did https://pulpito.ceph.com/adking-2025-01-16_04:32:29-orch:cephadm-wip-adk5-testing-2025-01-15-1437-distro-default-smithi/ looked pretty similar to an orch run on main right now
I created two dummy jobs where the ceph task is used for deploying, and two corresponding cephadm-based jobs. Here is the run: As expected, Here is a rerun after I corrupted the path to the cluster log file in As expected, all jobs failed -- this being the point of the PR.
The badness check silently ignores a missing file in the ceph, cephadm, and rook tasks. Due to this, cephadm uses an incorrect file for the check and gets a false positive.
Fixes: https://tracker.ceph.com/issues/57864
Signed-off-by: Christopher Hoffman [email protected]