
qa/tasks: Include stderr on tasks badness check.#48539

Merged
idryomov merged 1 commit into ceph:main from chrisphoffman:wip-57864
Jan 16, 2025

Conversation

@chrisphoffman
Contributor

The badness check silently ignores the case where the log file does not exist in the ceph, cephadm, and rook tasks. Because of this, the cephadm task runs the check against the wrong file and produces a false positive.

Fixes: https://tracker.ceph.com/issues/57864
Signed-off-by: Christopher Hoffman [email protected]


@adk3798
Contributor

adk3798 commented Dec 14, 2022

Did a test run including this. It works properly; the only issue is that the majority of our tests end up getting marked failed, because if at any point during the test there was a health warning or error, it shows up in the log, even if its presence is actually fine and shouldn't be a problem for the test. E.g.

"/var/log/ceph/ab3a2036-7bef-11ed-8443-001a4aab830c/ceph-mon.a.log:2022-12-14T20:47:45.406+0000 7f871cc27700 0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 1 pg inactive, 1 pg peering (PG_AVAILABILITY)" in cluster log

during an upgrade test, where 1 pg being briefly inactive during the upgrade is probably okay given the small number of hosts and OSDs in the test cluster being upgraded.

Either we'd have to get it to exclude health warnings/errors in the check, only care about health warnings/errors that are still present at the end of the test, or clean everything up so that no health warnings are ever raised during the tests (which could be a big effort). Otherwise, at least for now, this is sort of unmergeable, as it makes most of our teuthology tests get marked failed. I do like this, though, and will hopefully revisit it in the future. Just wanted to give an update on its current effects.

@chrisphoffman
Contributor Author

Either we'd have to get it to exclude health warnings/errors in the check, only care about health warnings/errors that are still present at the end of the test, or clean everything up so that no health warnings are ever raised during the tests (which could be a big effort).

Agreed, there is much to consider with this PR. I did a similar exercise with the librbd API tests: the ignore list from the librbd API tests that we use in other tasks was tested against this PR, and it grew from 5 entries to 13 to pass. None of the additional entries are related to the librbd API tests.

I like your suggestions on how to address the problem. Another approach could be only looking for badness within the time window when the actual "test" is running.

Currently, the rados suite is being tested against this PR to gauge the additional scope. Those data points can be used to decide on the best route forward.

@idryomov
Contributor

Either we'd have to get it to exclude health warnings/errors in the check

This is what was done for many years before cephadm appeared, and what is still done in the qa/tasks/ceph.py-based suites, so I think it's the way to go. The cephadm suite and other suites that have partially or fully switched to cephadm for deployment just need some allowlist updates.

only care about health warnings/errors that are still present at the end of the test

No, such transient warnings/errors are exactly why this check greps the log file instead of just looking at ceph status output at the end of the test. We wanted these to fail the test if they aren't specifically allowed and I don't think that changed.

or clean everything up so that no health warnings are ever raised during the tests (which could be a big effort)

This is probably not realistic, particularly given that some tests do things that trigger warnings/errors on purpose.
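The allowlisting approach described above can be sketched roughly as follows. This is illustrative Python only, not the actual qa/tasks/ceph.py code; the severity markers and allowlist entries are made-up examples (real suites carry their ignore lists in the teuthology job YAML):

```python
import re

# Illustrative allowlist of "expected" health warnings; the entries
# below are hypothetical examples, not the real suite ignore lists.
ALLOWLIST = [
    r'\(PG_AVAILABILITY\)',   # e.g. transient pg peering during upgrades
    r'\(OSDMAP_FLAGS\)',      # e.g. flags set on purpose by some tests
]

def is_badness(line, allowlist=ALLOWLIST):
    """A cluster log line counts as badness only if it carries a
    warning/error marker and matches none of the allowed patterns."""
    if not re.search(r'\[(WRN|ERR)\]', line):
        return False
    return not any(re.search(pat, line) for pat in allowlist)
```

Under a scheme like this, an upgrade suite would allowlist the transient PG_AVAILABILITY warning from the earlier comment instead of having it fail the job.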

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jun 27, 2023
@idryomov
Contributor

@adk3798 Any progress on adjusting/cleaning up the cephadm suite here? Historically all suites allow-listed health warnings that are generated as a result of the tests (i.e. on purpose) and the outcome of the CDM discussion was that cephadm needs to do the same.

@idryomov idryomov removed the stale label Jun 27, 2023
@adk3798
Contributor

adk3798 commented Jun 27, 2023

@adk3798 Any progress on adjusting/cleaning up the cephadm suite here? Historically all suites allow-listed health warnings that are generated as a result of the tests (i.e. on purpose) and the outcome of the CDM discussion was that cephadm needs to do the same.

Not a lot. I'll start putting some more focus on it. Didn't realize it was discussed in the CDM.

@github-actions github-actions bot added the stale label Aug 26, 2023
@idryomov idryomov removed the stale label Aug 27, 2023
@idryomov
Contributor

@adk3798 This has been tagged with wip-adk3-testing for a while, any update?

@adk3798
Contributor

adk3798 commented Aug 28, 2023

@adk3798 This has been tagged with wip-adk3-testing for a while, any update?

I did a couple test runs and started making some ignorelist patches but got distracted with work on nvme-of support at the time. I'll try to get the PR open for that some time this week and start testing with it.

@github-actions github-actions bot added the stale label Dec 15, 2024
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Jan 14, 2025
Make sure that first_in_ceph_log() doesn't return None (which is
treated as success/"no badness" by the caller) if the cluster log file
is missing.

Fixes: https://tracker.ceph.com/issues/57864
Co-authored-by: Ilya Dryomov <[email protected]>
Signed-off-by: Christopher Hoffman <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
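The guard the commit message describes could look something like the following sketch. This is hypothetical code, not the actual qa/tasks helper; the point it illustrates is that grep writing to stderr (e.g. because the log file is missing) must surface as an error instead of being silently treated as "no badness":

```python
import subprocess

def first_in_ceph_log(pattern, log_path):
    """Return the first matching 'badness' line, or None if the log is
    genuinely clean.  Raise instead of returning None when the log file
    could not be read at all, so a missing file can't pass the check."""
    proc = subprocess.run(
        ['grep', '-m1', pattern, log_path],
        capture_output=True, text=True)
    if proc.stderr:
        # e.g. "grep: /var/log/ceph/ceph.log: No such file or directory"
        raise RuntimeError(
            f'error scanning cluster log: {proc.stderr.strip()}')
    if proc.returncode == 0:
        return proc.stdout.splitlines()[0]
    return None  # no badness found
```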
@idryomov
Contributor

Reviving this PR to preserve the change that would have prevented https://tracker.ceph.com/issues/63389.

@idryomov idryomov reopened this Jan 14, 2025
@idryomov idryomov marked this pull request as ready for review January 14, 2025 21:19
@idryomov idryomov requested a review from a team as a code owner January 14, 2025 21:19
@idryomov idryomov removed the stale label Jan 14, 2025
@idryomov
Contributor

jenkins test windows

@idryomov
Contributor

jenkins test make check arm64

@idryomov
Contributor

jenkins test api

@adk3798
Contributor

adk3798 commented Jan 15, 2025

Reviving this PR to preserve the change that would have prevented https://tracker.ceph.com/issues/63389.

do we still need this along with #54312?

@idryomov
Contributor

do we still need this along with #54312?

I'd say so. While this PR doesn't change anything for cephadm with #54312 in a visible way, it introduces a preventive measure. If cephadm (or the old-style ceph task, for that matter) stops logging, or the cluster log path changes, etc., all jobs would fail the cluster log badness check. The approach isn't perfect (I'd rather explicitly check whether the log file exists and whether it's empty, and use grep's exit code instead of just propagating stderr as is done here), but it's much better than nothing.
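The "more explicit" variant described in the comment above (check that the file exists and is non-empty, then rely on grep's exit code) could look something like this sketch. All names here are illustrative; `remote_run` is a stand-in for teuthology's remote command execution and simply returns an exit code:

```python
def scan_cluster_log(remote_run, pattern, log_path):
    """Distinguish 'no badness' from 'log unreadable' explicitly,
    using grep's documented exit codes: 0 = match, 1 = no match,
    2 = error (e.g. unreadable file)."""
    # test -s: succeeds only if the file exists and is non-empty
    rc = remote_run(['test', '-s', log_path])
    if rc != 0:
        raise RuntimeError(f'{log_path} is missing or empty')
    rc = remote_run(['grep', '-q', pattern, log_path])
    if rc == 0:
        return True    # badness found
    if rc == 1:
        return False   # clean log
    raise RuntimeError(f'grep failed (exit {rc}) scanning {log_path}')
```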

@idryomov
Contributor

While this PR doesn't change anything for cephadm with #54312 in a visible way

It used to, but I addressed that review comment: #48539 (comment)

@idryomov
Contributor

If cephadm (or old-style ceph task, for that matter) stops logging, or the cluster log path changes, etc all jobs would fail on the cluster log badness check.

https://tracker.ceph.com/issues/63389 was missed because we didn't have this.

@adk3798
Contributor

adk3798 commented Jan 15, 2025

do we still need this along with #54312?

I'd say so. While this PR doesn't change anything for cephadm with #54312 in a visible way, it introduces a preventive measure. If cephadm (or the old-style ceph task, for that matter) stops logging, or the cluster log path changes, etc., all jobs would fail the cluster log badness check. The approach isn't perfect (I'd rather explicitly check whether the log file exists and whether it's empty, and use grep's exit code instead of just propagating stderr as is done here), but it's much better than nothing.

okay, I'll tag this for an orch run at least and see what happens

@idryomov
Contributor

idryomov commented Jan 15, 2025

okay, I'll tag this for an orch run at least and see what happens

Nothing happening would be expected ;)

I have done some targeted testing of my own, verifying that jobs without any "badness" in the cluster log pass, that jobs with badness fail, and that a bogus path to the cluster log file causes all jobs to fail. I'll link the runs in a moment.

@adk3798
Contributor

adk3798 commented Jan 16, 2025

okay, I'll tag this for an orch run at least and see what happens

Nothing happening would be expected ;)

yup, the run I did https://pulpito.ceph.com/adking-2025-01-16_04:32:29-orch:cephadm-wip-adk5-testing-2025-01-15-1437-distro-default-smithi/ looked pretty similar to an orch run on main right now

I have done some targeted testing of my own, verifying that jobs without any "badness" in the cluster log pass, that jobs with badness fail, and that a bogus path to the cluster log file causes all jobs to fail. I'll link the runs in a moment.

@idryomov
Contributor

I have done some targeted testing of my own

I created two dummy jobs where the ceph task is used for deploying:

tasks:
- install:
- ceph:
- exec:
    client.0:
      - sleep 10

and:

tasks:
- install:
- ceph:
- exec:
    client.0:
      - ceph osd down 0
      - ceph osd down 1
      - sleep 10

and two corresponding cephadm-based jobs. Here is the run:

https://pulpito.ceph.com/dis-2025-01-14_20:14:54-dummy-wip-dis-testing-distro-default-smithi/

As expected, 1-ceph-success and 3-cephadm-success passed, while 2-ceph-osd-down and 4-cephadm-osd-down failed with a '"cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)" in cluster log' error.

Here is a rerun after I corrupted the path to the cluster log file in qa/tasks/ceph.py and qa/tasks/cephadm.py:

https://pulpito.ceph.com/dis-2025-01-14_20:53:31-dummy-wip-dis-testing-distro-default-smithi/

As expected, all jobs failed -- this being the point of the PR.

@idryomov idryomov merged commit 34b3782 into ceph:main Jan 16, 2025
12 of 14 checks passed
