Skip to content

Revert "qa/tasks/cephadm: enable mon_cluster_log_to_file"#55498

Closed
ljflores wants to merge 1 commit intoceph:mainfrom
ljflores:wip-revert-pr-54312
Closed

Revert "qa/tasks/cephadm: enable mon_cluster_log_to_file"#55498
ljflores wants to merge 1 commit intoceph:mainfrom
ljflores:wip-revert-pr-54312

Conversation

@ljflores
Copy link
Member

@ljflores ljflores commented Feb 8, 2024

This reverts commit 822e6b0. (#54312)

This commit causes a lot of rados/cephadm tests to fail from warnings in the cluster log that are actually expected in the test environment. Until we have a commit to whitelist these expected warnings, a lot of test batches are blocked.

Getting the commit re-merged along with a whitelisting commit will be a priority.

Fixes: https://tracker.ceph.com/issues/64343

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

This reverts commit 822e6b0.

This commit causes a lot of `rados/cephadm` tests to fail from warnings in
the cluster log that are actually expected in the test environment.
Until we have a commit to whitelist these expected warnings, a lot of test batches
are blocked.

Fixes: https://tracker.ceph.com/issues/64343
Signed-off-by: Laura Flores <[email protected]>
Copy link
Contributor

@idryomov idryomov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I disagree with the revert, and especially linking the revert as the fix for https://tracker.ceph.com/issues/64343. You wrote that it

Should be a relatively straightfoward fix from the rados team

in the original PR. How much time is expected to be saved by reverting instead of putting up a PR with appropriate ignorelists?

If there is a particularly pressing RADOS run that needs to go through, it can be scheduled with a local revert in the suite branch (no rebuild required). This is what we had to do when #47560 turned out to break all suites -- it didn't get reverted despite multiple asks.

@markhpc
Copy link
Member

markhpc commented Feb 8, 2024

@idryomov Agreed. I don't think it's the right move to revert a proper fix because is made the QA suite turn red, even if it delays squid. It's better that we have a properly functioning QA suite than we get the release out the door on time.

Edit: also agree that in no way is this a fix. It's intentionally breaking QA again to make test suites green when they should be red. That's objectively not a fix.

@ormandj
Copy link
Contributor

ormandj commented Feb 8, 2024

@idryomov @markhpc I also agree, I do not believe this should be reverted. I think whitelisting vs. fixing tests to work in all environments is an entirely different discussion, as well, but I do not think intentionally disabling the entirety of testing is correct either - it's the reason we discussed enabling this in the first place.

@athanatos
Copy link
Contributor

There's no advantage in leaving this merged while the ignore-lists are fixed -- that work can proceed just as easily on a branch.

Leaving the suites broken in the mean time will cause one of two behaviors:

  1. people will test with this change reverted in their test branch
  2. people will ignore failures on the assumption that they are due to this change and not their own branch. This will lead to real bugs being missed.

Tests are never perfect, but tests that always fail due to something irrelevant are clearly less valuable to developers than tests might miss something.

@athanatos athanatos self-requested a review February 8, 2024 19:31
Copy link
Contributor

@athanatos athanatos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my other comment. My vote, at least, is let the tests be useful while the ignore lists are fixed.

@markhpc
Copy link
Member

markhpc commented Feb 8, 2024

There's no advantage in leaving this merged while the ignore-lists are fixed -- that work can proceed just as easily on a branch.

Leaving the suites broken in the mean time will cause one of two behaviors:

  1. people will test with this change reverted in their test branch
  2. people will ignore failures on the assumption that they are due to this change and not their own branch. This will lead to real bugs being missed.

Tests are never perfect, but tests that always fail due to something irrelevant are clearly less valuable to developers than tests might miss something.

@athanatos I might have agreed with you when this was first discovered, except that we left this broken for over a year and it resulted in a critical bug getting through into a release.

While it's not good that developers might choose to actively ignore the failures because there are too many false positives, I would argue that it at least it provides some incentive to fix the false positives. With this change reverted, bugs can (and demonstrably have) leaked through QA because there was no indication that there were problems in the first place.

@neha-ojha
Copy link
Member

There's no advantage in leaving this merged while the ignore-lists are fixed -- that work can proceed just as easily on a branch.
Leaving the suites broken in the mean time will cause one of two behaviors:

  1. people will test with this change reverted in their test branch
  2. people will ignore failures on the assumption that they are due to this change and not their own branch. This will lead to real bugs being missed.

Tests are never perfect, but tests that always fail due to something irrelevant are clearly less valuable to developers than tests might miss something.

@athanatos I might have agreed with you when this was first discovered, except that we left this broken for over a year and it resulted in a critical bug getting through into a release.

While it's not good that developers might choose to actively ignore the failures because there are too many false positives, I would argue that it at least it provides some incentive to fix the false positives. With this change reverted, bugs can (and demonstrably have) leaked through QA because there was no indication that there were problems in the first place.

I don't see why this should be blocking the dev freeze of a new release like, squid. We now know and acknowledge the issue, the commitment to fix the side effects before doing another major release should be good enough to unblock this PR.

@markhpc
Copy link
Member

markhpc commented Feb 8, 2024

Alright, we have:

  1. Two leads who want to revert.
  2. Two leads and two architects of major ceph deployments for Dan's PR (counting Dan since he wrote the original PR).

Shall we take this back to the leads meeting next Wednesday?

@rzarzynski
Copy link
Contributor

I very like #54312; it definitely should be in while the suites simply need to be fixed.
What I dislike is getting it in just right before the freeze, when tons of runs happen and the generated noise lowers the QA capacity.

@rzarzynski
Copy link
Contributor

Shall we take this back to the leads meeting next Wednesday?

My worry is about the impact on the capacity of QA in the hottest period, when it's needed the most – just before dev freeze. This freeze is planned on next week.

How about reverting now and bringing back as the very first PR, just after cutting squid?

@markhpc
Copy link
Member

markhpc commented Feb 8, 2024

Shall we take this back to the leads meeting next Wednesday?

My worry is about the impact on the capacity of QA in the hottest period, when it's needed the most – just before dev freeze. This freeze is planned on next week.

How about reverting now and bringing back as the very first PR, just after cutting squid?

Ok, so the major driver here seems to be to get the squid freeze (and seemingly the release) out by a specific date regardless whether or not QA is working properly. I would propose that this is the wrong approach. If we are willing to give the community a Squid release that doesn't have properly functioning QA we should be upfront about it. If passing QA is still an important part of validating a Ceph release, we shouldn't intentionally make changes that we know can lead to critical bugs.

If each of the leads are willing to go through the QA logs and hand verify the results, ok. We shouldn't misrepresent the testing we are doing though. If we are allowing the QA suite to pass tests that we know could be secretly failing, we need to be upfront about it with our users.

@ljflores
Copy link
Member Author

ljflores commented Feb 8, 2024

I raised #55507 as an alternative. Working on testing it now, as I would need to verify the yaml synatax.

@idryomov
Copy link
Contributor

idryomov commented Feb 8, 2024

How about reverting now and bringing back as the very first PR, just after cutting squid?

I reacted more to how this PR was presented as a fix (Fixes: https://tracker.ceph.com/issues/64343 in the commit message and linked as such in the tracker as opposed to a comment noting a temporary workaround) than to the revert itself. Also, "until we have a commit to whitelist these expected warnings" doesn't really justify a revert given a history of PRs getting merged, breaking all suites in a similar fashion and NOT getting reverted.

I'd have expected to see something along the lines of "the effort to raise a PR with appropriate ignorelists that could be included in test batches instead of the revert is more than X hours because Y and Z health checks that popped up seem suspicious and need investigation" as a justification. Because if it's just MON_DOWN in tests that take down monitors on purpose, the revert churn isn't worth it IMO.

@rzarzynski
Copy link
Contributor

rzarzynski commented Feb 9, 2024

Ok, so the major driver here seems to be to get the squid freeze (and seemingly the release) out by a specific date regardless whether or not QA is working properly.
(emphasis mine)

No, that's not the intention. Community will get the release tested properly, with the fixed QA – that's the goal everybody agree with. However, multiple paths to achieve this goal exist. Some of them have very negative impact on the dev freeze, while some are free of the unnecessary burden.

  1. We could bring back the fix to main and backport to squid with-or-without extra ignorelists just after the freeze;
  2. at the price of some extra coordination we could even bring the fix back as the last PR before cutting squid.

By the burden I do not mean fixing the suites – I'm perfectly with that. What I mean is that – because of the noise – some QA runs are being repeated, others require far more time to review. The net result is the same: decrease of the QA capacity at he time it's needed the most – just before the freeze.

To summarize: I very like the fix while I dislike the moment of the merge – the same effect (the squid release properly tested) could have been achieved without the impact on dev freeze.

@rzarzynski
Copy link
Contributor

I reacted more to how this PR was presented as a fix (Fixes: https://tracker.ceph.com/issues/64343 in the commit message and linked as such in the tracker as opposed to a comment noting a temporary workaround) than to the revert itself.

I agree, the wording is unfortunate – temporary workaround would have fit much better as as it emphasizes the intention of having the fix back.

@markhpc
Copy link
Member

markhpc commented Feb 9, 2024

@rzarzynski I'm far more OK with reverting this if we make re-merging it (prior to release!) a criteria for releasing Squid. I just thought based on your reply that you wanted to wait until after we cut Squid to re-merge it. Sorry if I misunderstood!

Having said that, I'm trying to help Laura get the whitelists in order. If her work is successful, I would favor that approach if other folks are in agreement.

@rzarzynski
Copy link
Contributor

@markhpc: yeah, looks we've gotten a little misunderstanding here. Sorry for not being clear!

Having said that, I'm trying to help Laura get the whitelists in order. If her work is successful, I would favor that approach if other folks are in agreement.

Yes, if we can have the actual fix reasonably soon, then sure! In case of troubles – let's keep this as a plan B.

@markhpc
Copy link
Member

markhpc commented Feb 10, 2024

Yes, if we can have the actual fix reasonably soon, then sure! In case of troubles – let's keep this as a plan B.

Please take a look at Laura's PR #55507, it looks like it's allowing the failing suites to pass now. Hopefully it should cover most (or all) of the false positives.

@ljflores
Copy link
Member Author

Closing in favor of #55507.

@ljflores ljflores closed this Feb 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants