Revert "qa/tasks/cephadm: enable mon_cluster_log_to_file" by ljflores · Pull Request #55498 · ceph/ceph

ljflores · 2024-02-08T15:20:00Z

This reverts commit 822e6b0. (#54312)

This commit causes a lot of rados/cephadm tests to fail from warnings in the cluster log that are actually expected in the test environment. Until we have a commit to whitelist these expected warnings, a lot of test batches are blocked.

Getting the commit re-merged along with a whitelisting commit will be a priority.

Fixes: https://tracker.ceph.com/issues/64343

Contribution Guidelines

To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e

This reverts commit 822e6b0. This commit causes a lot of `rados/cephadm` tests to fail from warnings in the cluster log that are actually expected in the test environment. Until we have a commit to whitelist these expected warnings, a lot of test batches are blocked. Fixes: https://tracker.ceph.com/issues/64343 Signed-off-by: Laura Flores <[email protected]>

idryomov

FWIW I disagree with the revert, and especially linking the revert as the fix for https://tracker.ceph.com/issues/64343. You wrote that it

Should be a relatively straightfoward fix from the rados team

in the original PR. How much time is expected to be saved by reverting instead of putting up a PR with appropriate ignorelists?

If there is a particularly pressing RADOS run that needs to go through, it can be scheduled with a local revert in the suite branch (no rebuild required). This is what we had to do when #47560 turned out to break all suites -- it didn't get reverted despite multiple asks.

markhpc · 2024-02-08T17:01:27Z

@idryomov Agreed. I don't think it's the right move to revert a proper fix because is made the QA suite turn red, even if it delays squid. It's better that we have a properly functioning QA suite than we get the release out the door on time.

Edit: also agree that in no way is this a fix. It's intentionally breaking QA again to make test suites green when they should be red. That's objectively not a fix.

ormandj · 2024-02-08T17:23:16Z

@idryomov @markhpc I also agree, I do not believe this should be reverted. I think whitelisting vs. fixing tests to work in all environments is an entirely different discussion, as well, but I do not think intentionally disabling the entirety of testing is correct either - it's the reason we discussed enabling this in the first place.

athanatos · 2024-02-08T19:31:20Z

There's no advantage in leaving this merged while the ignore-lists are fixed -- that work can proceed just as easily on a branch.

Leaving the suites broken in the mean time will cause one of two behaviors:

people will test with this change reverted in their test branch
people will ignore failures on the assumption that they are due to this change and not their own branch. This will lead to real bugs being missed.

Tests are never perfect, but tests that always fail due to something irrelevant are clearly less valuable to developers than tests might miss something.

athanatos

See my other comment. My vote, at least, is let the tests be useful while the ignore lists are fixed.

markhpc · 2024-02-08T20:08:23Z

There's no advantage in leaving this merged while the ignore-lists are fixed -- that work can proceed just as easily on a branch.

Leaving the suites broken in the mean time will cause one of two behaviors:

people will test with this change reverted in their test branch

people will ignore failures on the assumption that they are due to this change and not their own branch. This will lead to real bugs being missed.

Tests are never perfect, but tests that always fail due to something irrelevant are clearly less valuable to developers than tests might miss something.

@athanatos I might have agreed with you when this was first discovered, except that we left this broken for over a year and it resulted in a critical bug getting through into a release.

While it's not good that developers might choose to actively ignore the failures because there are too many false positives, I would argue that it at least it provides some incentive to fix the false positives. With this change reverted, bugs can (and demonstrably have) leaked through QA because there was no indication that there were problems in the first place.

neha-ojha · 2024-02-08T20:18:30Z

There's no advantage in leaving this merged while the ignore-lists are fixed -- that work can proceed just as easily on a branch.
Leaving the suites broken in the mean time will cause one of two behaviors:

people will test with this change reverted in their test branch

people will ignore failures on the assumption that they are due to this change and not their own branch. This will lead to real bugs being missed.

Tests are never perfect, but tests that always fail due to something irrelevant are clearly less valuable to developers than tests might miss something.

@athanatos I might have agreed with you when this was first discovered, except that we left this broken for over a year and it resulted in a critical bug getting through into a release.

While it's not good that developers might choose to actively ignore the failures because there are too many false positives, I would argue that it at least it provides some incentive to fix the false positives. With this change reverted, bugs can (and demonstrably have) leaked through QA because there was no indication that there were problems in the first place.

I don't see why this should be blocking the dev freeze of a new release like, squid. We now know and acknowledge the issue, the commitment to fix the side effects before doing another major release should be good enough to unblock this PR.

markhpc · 2024-02-08T20:31:59Z

Alright, we have:

Two leads who want to revert.
Two leads and two architects of major ceph deployments for Dan's PR (counting Dan since he wrote the original PR).

Shall we take this back to the leads meeting next Wednesday?

rzarzynski · 2024-02-08T20:40:20Z

I very like #54312; it definitely should be in while the suites simply need to be fixed.
What I dislike is getting it in just right before the freeze, when tons of runs happen and the generated noise lowers the QA capacity.

rzarzynski · 2024-02-08T21:19:30Z

Shall we take this back to the leads meeting next Wednesday?

My worry is about the impact on the capacity of QA in the hottest period, when it's needed the most – just before dev freeze. This freeze is planned on next week.

How about reverting now and bringing back as the very first PR, just after cutting squid?

markhpc · 2024-02-08T21:50:37Z

Shall we take this back to the leads meeting next Wednesday?

My worry is about the impact on the capacity of QA in the hottest period, when it's needed the most – just before dev freeze. This freeze is planned on next week.

How about reverting now and bringing back as the very first PR, just after cutting squid?

Ok, so the major driver here seems to be to get the squid freeze (and seemingly the release) out by a specific date regardless whether or not QA is working properly. I would propose that this is the wrong approach. If we are willing to give the community a Squid release that doesn't have properly functioning QA we should be upfront about it. If passing QA is still an important part of validating a Ceph release, we shouldn't intentionally make changes that we know can lead to critical bugs.

If each of the leads are willing to go through the QA logs and hand verify the results, ok. We shouldn't misrepresent the testing we are doing though. If we are allowing the QA suite to pass tests that we know could be secretly failing, we need to be upfront about it with our users.

ljflores · 2024-02-08T22:19:43Z

I raised #55507 as an alternative. Working on testing it now, as I would need to verify the yaml synatax.

idryomov · 2024-02-08T23:05:54Z

How about reverting now and bringing back as the very first PR, just after cutting squid?

I reacted more to how this PR was presented as a fix (Fixes: https://tracker.ceph.com/issues/64343 in the commit message and linked as such in the tracker as opposed to a comment noting a temporary workaround) than to the revert itself. Also, "until we have a commit to whitelist these expected warnings" doesn't really justify a revert given a history of PRs getting merged, breaking all suites in a similar fashion and NOT getting reverted.

I'd have expected to see something along the lines of "the effort to raise a PR with appropriate ignorelists that could be included in test batches instead of the revert is more than X hours because Y and Z health checks that popped up seem suspicious and need investigation" as a justification. Because if it's just MON_DOWN in tests that take down monitors on purpose, the revert churn isn't worth it IMO.

rzarzynski · 2024-02-09T15:05:41Z

Ok, so the major driver here seems to be to get the squid freeze (and seemingly the release) out by a specific date regardless whether or not QA is working properly.
(emphasis mine)

No, that's not the intention. Community will get the release tested properly, with the fixed QA – that's the goal everybody agree with. However, multiple paths to achieve this goal exist. Some of them have very negative impact on the dev freeze, while some are free of the unnecessary burden.

We could bring back the fix to main and backport to squid with-or-without extra ignorelists just after the freeze;
at the price of some extra coordination we could even bring the fix back as the last PR before cutting squid.

By the burden I do not mean fixing the suites – I'm perfectly with that. What I mean is that – because of the noise – some QA runs are being repeated, others require far more time to review. The net result is the same: decrease of the QA capacity at he time it's needed the most – just before the freeze.

To summarize: I very like the fix while I dislike the moment of the merge – the same effect (the squid release properly tested) could have been achieved without the impact on dev freeze.

rzarzynski · 2024-02-09T15:31:29Z

I reacted more to how this PR was presented as a fix (Fixes: https://tracker.ceph.com/issues/64343 in the commit message and linked as such in the tracker as opposed to a comment noting a temporary workaround) than to the revert itself.

I agree, the wording is unfortunate – temporary workaround would have fit much better as as it emphasizes the intention of having the fix back.

markhpc · 2024-02-09T19:18:41Z

@rzarzynski I'm far more OK with reverting this if we make re-merging it (prior to release!) a criteria for releasing Squid. I just thought based on your reply that you wanted to wait until after we cut Squid to re-merge it. Sorry if I misunderstood!

Having said that, I'm trying to help Laura get the whitelists in order. If her work is successful, I would favor that approach if other folks are in agreement.

rzarzynski · 2024-02-09T22:37:51Z

@markhpc: yeah, looks we've gotten a little misunderstanding here. Sorry for not being clear!

Having said that, I'm trying to help Laura get the whitelists in order. If her work is successful, I would favor that approach if other folks are in agreement.

Yes, if we can have the actual fix reasonably soon, then sure! In case of troubles – let's keep this as a plan B.

markhpc · 2024-02-10T04:53:00Z

Yes, if we can have the actual fix reasonably soon, then sure! In case of troubles – let's keep this as a plan B.

Please take a look at Laura's PR #55507, it looks like it's allowing the failing suites to pass now. Hopefully it should cover most (or all) of the false positives.

ljflores · 2024-02-12T15:43:56Z

Closing in favor of #55507.

github-actions bot added the tests label Feb 8, 2024

ljflores mentioned this pull request Feb 8, 2024

qa/tasks/cephadm: enable mon_cluster_log_to_file #54312

Merged

ljflores added the needs-qa label Feb 8, 2024

yuriw added the wip-yuri2-testing label Feb 8, 2024

idryomov reviewed Feb 8, 2024

View reviewed changes

athanatos self-requested a review February 8, 2024 19:31

athanatos approved these changes Feb 8, 2024

View reviewed changes

neha-ojha approved these changes Feb 8, 2024

View reviewed changes

ljflores mentioned this pull request Feb 8, 2024

qa/suites/orch: whitelist warnings that are expected in test environments #55507

Merged

14 tasks

ljflores closed this Feb 12, 2024

Conversation

ljflores commented Feb 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Contribution Guidelines

Checklist

Uh oh!

idryomov left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

markhpc commented Feb 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ormandj commented Feb 8, 2024

Uh oh!

athanatos commented Feb 8, 2024

Uh oh!

athanatos left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

markhpc commented Feb 8, 2024

Uh oh!

neha-ojha commented Feb 8, 2024

Uh oh!

markhpc commented Feb 8, 2024

Uh oh!

rzarzynski commented Feb 8, 2024

Uh oh!

rzarzynski commented Feb 8, 2024

Uh oh!

markhpc commented Feb 8, 2024

Uh oh!

ljflores commented Feb 8, 2024

Uh oh!

idryomov commented Feb 8, 2024

Uh oh!

rzarzynski commented Feb 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rzarzynski commented Feb 9, 2024

Uh oh!

markhpc commented Feb 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rzarzynski commented Feb 9, 2024

Uh oh!

markhpc commented Feb 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ljflores commented Feb 12, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

ljflores commented Feb 8, 2024 •

edited

Loading

idryomov left a comment •

edited

Loading

markhpc commented Feb 8, 2024 •

edited

Loading

athanatos left a comment •

edited

Loading

rzarzynski commented Feb 9, 2024 •

edited

Loading

markhpc commented Feb 9, 2024 •

edited

Loading

markhpc commented Feb 10, 2024 •

edited

Loading