Revert "qa/tasks/cephadm: enable mon_cluster_log_to_file" #55498
Conversation
This reverts commit 822e6b0. This commit causes a lot of `rados/cephadm` tests to fail from warnings in the cluster log that are actually expected in the test environment. Until we have a commit to whitelist these expected warnings, a lot of test batches are blocked.

Fixes: https://tracker.ceph.com/issues/64343
Signed-off-by: Laura Flores <[email protected]>
FWIW I disagree with the revert, and especially with linking the revert as the fix for https://tracker.ceph.com/issues/64343. You wrote that it
Should be a relatively straightforward fix from the rados team
in the original PR. How much time is expected to be saved by reverting instead of putting up a PR with appropriate ignorelists?
If there is a particularly pressing RADOS run that needs to go through, it can be scheduled with a local revert in the suite branch (no rebuild required). This is what we had to do when #47560 turned out to break all suites -- it didn't get reverted despite multiple asks.
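For context, the "appropriate ignorelists" mentioned above are plain YAML overrides in Ceph's qa tree: a suite fragment adds expected warning patterns under `log-ignorelist` so they do not fail the run. A minimal sketch of what such an override could look like, noting that the file path and the specific warning patterns here are illustrative assumptions, not the actual entries from #55507:

```yaml
# Hypothetical fragment, e.g. somewhere under qa/suites/rados/cephadm/.
# The real entries would come from the warnings observed in the failing
# rados/cephadm runs; these patterns are placeholders.
overrides:
  ceph:
    log-ignorelist:
      - \(MON_DOWN\)
      - \(OSD_DOWN\)
      - \(OSDMAP_FLAGS\)
```

Each entry is a regular expression matched against cluster log lines, which is why the parenthesized health-check codes are escaped.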
@idryomov Agreed. I don't think it's the right move to revert a proper fix because it made the QA suite turn red, even if it delays squid. It's better to have a properly functioning QA suite than to get the release out the door on time. Edit: also agree that in no way is this a fix. It's intentionally breaking QA again to make test suites green when they should be red. That's objectively not a fix.
@idryomov @markhpc I also agree, I do not believe this should be reverted. I think whitelisting vs. fixing tests to work in all environments is an entirely different discussion, as well, but I do not think intentionally disabling the entirety of testing is correct either - it's the reason we discussed enabling this in the first place.
There's no advantage in leaving this merged while the ignore-lists are fixed -- that work can proceed just as easily on a branch. Leaving the suites broken in the meantime will cause one of two behaviors:
Tests are never perfect, but tests that always fail due to something irrelevant are clearly less valuable to developers than tests that might miss something.
@athanatos I might have agreed with you when this was first discovered, except that we left this broken for over a year and it resulted in a critical bug getting through into a release. While it's not good that developers might choose to actively ignore the failures because there are too many false positives, I would argue that at least it provides some incentive to fix the false positives. With this change reverted, bugs can leak (and demonstrably have leaked) through QA because there was no indication that there were problems in the first place.
I don't see why this should be blocking the dev freeze of a new release like squid. We now know and acknowledge the issue; the commitment to fix the side effects before doing another major release should be good enough to unblock this PR.
Alright, we have:
Shall we take this back to the leads meeting next Wednesday?
I like #54312 very much; it definitely should be in, while the suites simply need to be fixed.
My worry is about the impact on the capacity of QA in the hottest period, when it's needed the most – just before dev freeze. This freeze is planned for next week. How about reverting now and bringing it back as the very first PR, just after cutting squid?
Ok, so the major driver here seems to be to get the squid freeze (and seemingly the release) out by a specific date regardless of whether or not QA is working properly. I would propose that this is the wrong approach. If we are willing to give the community a Squid release that doesn't have properly functioning QA, we should be upfront about it. If passing QA is still an important part of validating a Ceph release, we shouldn't intentionally make changes that we know can lead to critical bugs. If each of the leads is willing to go through the QA logs and hand-verify the results, ok. We shouldn't misrepresent the testing we are doing, though. If we are allowing the QA suite to pass tests that we know could be secretly failing, we need to be upfront about it with our users.
I raised #55507 as an alternative. Working on testing it now, as I would need to verify the yaml syntax.
I reacted more to how this PR was presented as a fix. I'd have expected to see something along the lines of "the effort to raise a PR with appropriate ignorelists that could be included in test batches instead of the revert is more than X hours because Y and Z health checks that popped up seem suspicious and need investigation" as a justification. Because if it's just
No, that's not the intention. The community will get the release tested properly, with the fixed QA – that's the goal everybody agrees with. However, multiple paths to achieve this goal exist. Some of them have a very negative impact on the dev freeze, while some are free of the unnecessary burden.
By the burden I do not mean fixing the suites – I'm perfectly fine with that. What I mean is that – because of the noise – some QA runs are being repeated, and others require far more time to review. The net result is the same: a decrease of the QA capacity at the time it's needed the most – just before the freeze. To summarize: I like the fix very much, while I dislike the moment of the merge – the same effect (the squid release properly tested) could have been achieved without the impact on dev freeze.
I agree, the wording is unfortunate – "temporary workaround" would have fit much better, as it emphasizes the intention of having the fix back.
@rzarzynski I'm far more OK with reverting this if we make re-merging it (prior to release!) a criterion for releasing Squid. I just thought based on your reply that you wanted to wait until after we cut Squid to re-merge it. Sorry if I misunderstood! Having said that, I'm trying to help Laura get the whitelists in order. If her work is successful, I would favor that approach if other folks are in agreement.
@markhpc: yeah, it looks like we've gotten a little misunderstanding here. Sorry for not being clear!
Yes, if we can have the actual fix reasonably soon, then sure! In case of trouble – let's keep this as a plan B.
Please take a look at Laura's PR #55507, it looks like it's allowing the failing suites to pass now. Hopefully it should cover most (or all) of the false positives. |
Closing in favor of #55507. |
This reverts commit 822e6b0. (#54312)
This commit causes a lot of `rados/cephadm` tests to fail from warnings in the cluster log that are actually expected in the test environment. Until we have a commit to whitelist these expected warnings, a lot of test batches are blocked. Getting the commit re-merged along with a whitelisting commit will be a priority.
Fixes: https://tracker.ceph.com/issues/64343
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an `x` between the brackets: `[x]`. Spaces and capitalization matter when checking off items this way.

Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e