
qa/tasks: Include stderr on tasks badness check.#48539

Merged
idryomov merged 1 commit into ceph:main from chrisphoffman:wip-57864
Jan 16, 2025

Conversation

@chrisphoffman
Contributor

The badness check silently ignores the case where the log file does not exist in the ceph, cephadm, and rook tasks. Because of this, the cephadm task runs the check against the wrong file and produces a false positive.

Fixes: https://tracker.ceph.com/issues/57864
Signed-off-by: Christopher Hoffman [email protected]


@adk3798
Contributor

adk3798 commented Dec 14, 2022

Did a test run including this. It works properly; the only issue is that the majority of our tests end up getting marked failed, because if at any point during the test there was a health warning or error, it shows up in the log, even if its presence is actually fine and shouldn't be a problem for the test. E.g.

"/var/log/ceph/ab3a2036-7bef-11ed-8443-001a4aab830c/ceph-mon.a.log:2022-12-14T20:47:45.406+0000 7f871cc27700 0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg inactive, 1 pg peering (PG_AVAILABILITY)" in cluster log

during an upgrade test, where 1 pg being briefly inactive during the upgrade is probably okay given the small number of hosts and OSDs in the test cluster being upgraded.

Either we'd have to get it to exclude health warnings/errors in the check, only care about health warnings/errors that are still present at the end of the test, or clean everything up so that no health warnings are ever raised during the tests (which could be a big effort). Otherwise, at least for now, this is sort of unmergeable, as it makes most of our teuthology tests get marked failed. I do like this, though, and will hopefully revisit it in the future. Just wanted to give an update on its current effects.

@chrisphoffman
Contributor Author

Either we'd have to get it to exclude health warnings/errors in the check, only care about health warnings/errors that are still present at the end of the test, or clean everything up so that no health warnings are ever raised during the tests (which could be a big effort).

Agreed, there is much to consider with this PR. I did a similar exercise with the librbd API tests: the ignore list from the librbd API tests that we use in other tasks was tested against this PR, and it grew from 5 entries to 13 to pass. None of the additional entries are related to the librbd API tests.

I like your suggestions on how to address the problem. Another approach could be only looking for badness within the time window when the actual "test" is running.

Currently, the rados suite is being tested against this PR to gauge the additional scope. Those data points can be used to decide on the best route forward.

@idryomov
Contributor

Either we'd have to get it to exclude health warnings/errors in the check

This is what was done for many years before cephadm appeared, and what is still done in the qa/tasks/ceph.py-based suites, so I think it's the way to go. The cephadm suite and other suites that have partially or fully switched to cephadm for deployment just need some allowlist updates.

only care about health warnings/errors that are still present at the end of the test

No, such transient warnings/errors are exactly why this check greps the log file instead of just looking at ceph status output at the end of the test. We wanted these to fail the test if they aren't specifically allowed and I don't think that changed.

or clean everything up so that no health warnings are ever raised during the tests (which could be a big effort)

This is probably not realistic, particularly given that some tests do things that trigger warnings/errors on purpose.
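The allowlisting approach described above can be sketched roughly as follows. This is illustrative Python only, not the actual qa/tasks/ceph.py code; the severity markers and allowlist entries are made-up examples (real suites carry their ignore lists in the teuthology job YAML):

```python
import re

# Illustrative allowlist of "expected" health warnings; the entries
# below are hypothetical examples, not the real suite ignore lists.
ALLOWLIST = [
    r'\(PG_AVAILABILITY\)',   # e.g. transient pg peering during upgrades
    r'\(OSDMAP_FLAGS\)',      # e.g. flags set on purpose by some tests
]

def is_badness(line, allowlist=ALLOWLIST):
    """A cluster log line counts as badness only if it carries a
    warning/error marker and matches none of the allowed patterns."""
    if not re.search(r'\[(WRN|ERR)\]', line):
        return False
    return not any(re.search(pat, line) for pat in allowlist)
```

Under a scheme like this, an upgrade suite would allowlist the transient PG_AVAILABILITY warning from the earlier comment instead of having it fail the job.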

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jun 27, 2023
@idryomov
Contributor

@adk3798 Any progress on adjusting/cleaning up the cephadm suite here? Historically all suites allow-listed health warnings that are generated as a result of the tests (i.e. on purpose) and the outcome of the CDM discussion was that cephadm needs to do the same.

@idryomov idryomov removed the stale label Jun 27, 2023
@adk3798
Contributor

adk3798 commented Jun 27, 2023

@adk3798 Any progress on adjusting/cleaning up the cephadm suite here? Historically all suites allow-listed health warnings that are generated as a result of the tests (i.e. on purpose) and the outcome of the CDM discussion was that cephadm needs to do the same.

Not a lot. I'll start putting some more focus on it. Didn't realize it was discussed in the CDM.

@github-actions github-actions bot added the stale label Aug 26, 2023
@idryomov idryomov removed the stale label Aug 27, 2023
@idryomov
Contributor

@adk3798 This has been tagged with wip-adk3-testing for a while, any update?

@adk3798
Contributor

adk3798 commented Aug 28, 2023

@adk3798 This has been tagged with wip-adk3-testing for a while, any update?

I did a couple test runs and started making some ignorelist patches but got distracted with work on nvme-of support at the time. I'll try to get the PR open for that some time this week and start testing with it.

@github-actions github-actions bot added the stale label Dec 15, 2024
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Jan 14, 2025
Make sure that first_in_ceph_log() doesn't return None (which is
treated as success/"no badness" by the caller) if the cluster log file
is missing.

Fixes: https://tracker.ceph.com/issues/57864
Co-authored-by: Ilya Dryomov <[email protected]>
Signed-off-by: Christopher Hoffman <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
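The guard the commit message describes could look something like the following sketch. This is hypothetical code, not the actual qa/tasks helper; the point it illustrates is that grep writing to stderr (e.g. because the log file is missing) must surface as an error instead of being silently treated as "no badness":

```python
import subprocess

def first_in_ceph_log(pattern, log_path):
    """Return the first matching 'badness' line, or None if the log is
    genuinely clean.  Raise instead of returning None when the log file
    could not be read at all, so a missing file can't pass the check."""
    proc = subprocess.run(
        ['grep', '-m1', pattern, log_path],
        capture_output=True, text=True)
    if proc.stderr:
        # e.g. "grep: /var/log/ceph/ceph.log: No such file or directory"
        raise RuntimeError(
            f'error scanning cluster log: {proc.stderr.strip()}')
    if proc.returncode == 0:
        return proc.stdout.splitlines()[0]
    return None  # no badness found
```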
@idryomov
Contributor

Reviving this PR to preserve the change that would have prevented https://tracker.ceph.com/issues/63389.

@idryomov idryomov reopened this Jan 14, 2025
@idryomov idryomov marked this pull request as ready for review January 14, 2025 21:19
@idryomov idryomov requested a review from a team as a code owner January 14, 2025 21:19
@idryomov idryomov removed the stale label Jan 14, 2025
@idryomov
Contributor

jenkins test windows

@idryomov
Contributor

jenkins test make check arm64

@idryomov
Contributor

jenkins test api

@adk3798
Contributor

adk3798 commented Jan 15, 2025

Reviving this PR to preserve the change that would have prevented https://tracker.ceph.com/issues/63389.

do we still need this along with #54312?

@idryomov
Contributor

do we still need this along with #54312?

I'd say so. While this PR doesn't change anything for cephadm with #54312 in a visible way, it introduces a preventive measure. If cephadm (or the old-style ceph task, for that matter) stops logging, or the cluster log path changes, etc., all jobs would fail the cluster log badness check. The approach isn't perfect (I'd rather explicitly check whether the log file exists and whether it's empty, and use grep's exit code instead of just propagating stderr as is done here), but it's much better than nothing.
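The "more explicit" variant described in the comment above (check that the file exists and is non-empty, then rely on grep's exit code) could look something like this sketch. All names here are illustrative; `remote_run` is a stand-in for teuthology's remote command execution and simply returns an exit code:

```python
def scan_cluster_log(remote_run, pattern, log_path):
    """Distinguish 'no badness' from 'log unreadable' explicitly,
    using grep's documented exit codes: 0 = match, 1 = no match,
    2 = error (e.g. unreadable file)."""
    # test -s: succeeds only if the file exists and is non-empty
    rc = remote_run(['test', '-s', log_path])
    if rc != 0:
        raise RuntimeError(f'{log_path} is missing or empty')
    rc = remote_run(['grep', '-q', pattern, log_path])
    if rc == 0:
        return True    # badness found
    if rc == 1:
        return False   # clean log
    raise RuntimeError(f'grep failed (exit {rc}) scanning {log_path}')
```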

@idryomov
Contributor

While this PR doesn't change anything for cephadm with #54312 in a visible way

It used to, but I addressed that review comment: #48539 (comment)

@idryomov
Contributor

If cephadm (or old-style ceph task, for that matter) stops logging, or the cluster log path changes, etc all jobs would fail on the cluster log badness check.

https://tracker.ceph.com/issues/63389 was missed because we didn't have this.

@adk3798
Contributor

adk3798 commented Jan 15, 2025

do we still need this along with #54312?

I'd say so. While this PR doesn't change anything for cephadm with #54312 in a visible way, it introduces a preventive measure. If cephadm (or the old-style ceph task, for that matter) stops logging, or the cluster log path changes, etc., all jobs would fail the cluster log badness check. The approach isn't perfect (I'd rather explicitly check whether the log file exists and whether it's empty, and use grep's exit code instead of just propagating stderr as is done here), but it's much better than nothing.

okay, I'll tag this for an orch run at least and see what happens

@idryomov
Contributor

idryomov commented Jan 15, 2025

okay, I'll tag this for an orch run at least and see what happens

Nothing happening would be expected ;)

I have done some targeted testing of my own, verifying that jobs without any "badness" in the cluster log pass, that jobs with badness fail, and that a bogus path to the cluster log file causes all jobs to fail. I'll link the runs in a moment.

@adk3798
Contributor

adk3798 commented Jan 16, 2025

okay, I'll tag this for an orch run at least and see what happens

Nothing happening would be expected ;)

yup, the run I did https://pulpito.ceph.com/adking-2025-01-16_04:32:29-orch:cephadm-wip-adk5-testing-2025-01-15-1437-distro-default-smithi/ looked pretty similar to an orch run on main right now

I have done some targeted testing of my own, verifying that jobs without any "badness" in the cluster log pass, that jobs with badness fail, and that a bogus path to the cluster log file causes all jobs to fail. I'll link the runs in a moment.

@idryomov
Contributor

I have done some targeted testing of my own

I created two dummy jobs where the ceph task is used for deploying:

tasks:
- install:
- ceph:
- exec:
    client.0:
      - sleep 10

and:

tasks:
- install:
- ceph:
- exec:
    client.0:
      - ceph osd down 0
      - ceph osd down 1
      - sleep 10

and two corresponding cephadm-based jobs. Here is the run:

https://pulpito.ceph.com/dis-2025-01-14_20:14:54-dummy-wip-dis-testing-distro-default-smithi/

As expected, 1-ceph-success and 3-cephadm-success passed, while 2-ceph-osd-down and 4-cephadm-osd-down failed with a '"cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)" in cluster log' error.

Here is a rerun after I corrupted the path to the cluster log file in qa/tasks/ceph.py and qa/tasks/cephadm.py:

https://pulpito.ceph.com/dis-2025-01-14_20:53:31-dummy-wip-dis-testing-distro-default-smithi/

As expected, all jobs failed -- this being the point of the PR.

@idryomov idryomov merged commit 34b3782 into ceph:main Jan 16, 2025
12 of 14 checks passed
