mon, qa: issue pool application warning even if pool is empty#47560
mon, qa: issue pool application warning even if pool is empty#47560
Conversation
|
jenkins retest this please |
vumrao
left a comment
There was a problem hiding this comment.
Good catch! thanks, Prashant.
|
jenkins retest this please |
1 similar comment
|
jenkins retest this please |
@tchaikov this is new. I converted all the Jenkins builders to Ubuntu 22.04 now. Do I need to rebuild boost making sure that python3.8 is installed or something? |
|
Pretty sure I fixed the |
|
jenkins retest this please |
|
The |
ahh, I missed checking this time. Thanks for pointing it out, David. @pdvian FYI |
Thanks. Yes, I should have removed this variable. My bad! Corrected it. |
|
Too many teuthology job failures due to this PR : https://pulpito.ceph.com/yuriw-2022-08-16_22:04:30-rados-wip-yuri5-testing-2022-08-16-0859-distro-default-smithi/ |
@neha-ojha What would be right approach to handle RADOS suite failures here ? Should we use |
|
@pdvian does this need a full rados suite? |
Yes @ljflores. I have to go through multiple iterations to fix the relevant teuthology RADOS suite testscases to add POOL_APP_NOT_ENABLED to log-ignorelist. I am working on this in the background and have additional fixes to push to PR. It should be ready by next week. |
|
@pdvian cool, I saw you have your own testing tag on this PR, so I wasn't sure if you were doing your own testing. But just let me know when you want me to get this in one of Yuri's testing batches! |
Ceph status fail to report pool application warning if the pool is empty. Report pool application warning even if pool has 0 objects stored in it. Add POOL_APP_NOT_ENABLED cluster warnings to log-ignorelist to fix rados suite. Fixes: https://tracker.ceph.com/issues/57097 Signed-off-by: Prashant D <[email protected]>
|
jenkins test make check |
|
@cloudbehl @nizamial09 Kindly review the qa/tasks/mgr/dashboard/test_pool.py changes. |
|
@pdvian looks like it's ready to go! |
|
this caused several new |
|
i'll track this in https://tracker.ceph.com/issues/62504. hopefully we can include the rgw fix in your existing reef/quincy backports, or they'll fail rgw qa |
note that one of the new failures (from https://pulpito.ceph.com/cbodley-2023-08-17_21:47:04-rgw-pr-52673-distro-default-smithi/7372307/) doesn't match our existing |
| // required release | ||
| if (sum.num_objects > 0 && pool.application_metadata.empty() && | ||
| !pool.is_tier()) { | ||
| if (pool.application_metadata.empty() && !pool.is_tier()) { |
There was a problem hiding this comment.
So this has obviously created havoc for the other component suites. I think the right thing to do here is:
- revert most of the qa/ suite changes
- add a developer config to exclude pools that are empty which is turned on for all qa tests, probably in
qa/tasks/ceph.conf/template
That config can be selectively turned back on for a dedicated set of tests for this particular change.
There was a problem hiding this comment.
- add a developer config to exclude pools that are empty which is turned on for all qa tests, probably in
qa/tasks/ceph.conf/templateThat config can be selectively turned back on for a dedicated set of tests for this particular change.
@batrick I don't see how that is substantially different from just ignoring the health alert -- instead of the "ignoring" part being conditional, you would be making the raising of the health alert conditional, that's it.
IMO the right way to address this is to make RADOS not raise a bogus health alert independent of the configuration, so that it's not seen by regular users/operators too. It shouldn't be just developers running their tests with a special knob set who get the sane behavior. I suggested the following on the dev list:
[...] perhaps we could treat each new
POOL_APP_NOT_ENABLEDas muted behind the scenes for a couple of minutes? Essentially just delay raising the alert to give admins enough time to run "ceph osd pool application enable" or our own tools to callrados_application_enable()API.
There was a problem hiding this comment.
- add a developer config to exclude pools that are empty which is turned on for all qa tests, probably in
qa/tasks/ceph.conf/templateThat config can be selectively turned back on for a dedicated set of tests for this particular change.
@batrick I don't see how that is substantially different from just ignoring the health alert -- instead of the "ignoring" part being conditional, you would be making the raising of the health alert conditional, that's it.
The difference is we don't need to play whack-a-mole by adding the warning to ignorelist for dozens of suites. As this PR does. Of course, playing that game means we miss things which is why we're having this discussion :)
IMO the right way to address this is to make RADOS not raise a bogus health alert independent of the configuration, so that it's not seen by regular users/operators too. It shouldn't be just developers running their tests with a special knob set who get the sane behavior. I suggested the following on the dev list:
[...] perhaps we could treat each new
POOL_APP_NOT_ENABLEDas muted behind the scenes for a couple of minutes? Essentially just delay raising the alert to give admins enough time to run "ceph osd pool application enable" or our own tools to callrados_application_enable()API.
That also seems reasonable but I think it'll still break many QA tests.
There was a problem hiding this comment.
The difference is we don't need to play whack-a-mole by adding the warning to ignorelist for dozens of suites.
I agree that the config opt would be basically an ingorelist at the C++ level. I would prefer to not put extra, QA-driven complexity here.
If ignoring is the way, then we should consider sharing log-ignorelist across the suites. At first glance it looks like a change in the ceph.py task.
There was a problem hiding this comment.
How big the impact really is? Noise in suites, right? Aren't the failures pretty schematic (like "POOL_APP_NOT_ENABLED has been found"), and thus easy to discriminate?
I'm asking these questions as the patch fixes a real issue that popped up at a user's site.
There was a problem hiding this comment.
How big the impact really is? Noise in suites, right? Aren't the failures pretty schematic (like "
POOL_APP_NOT_ENABLEDhas been found"), and thus easy to discriminate?I'm asking these questions as the patch fixes a real issue that popped up at a user's site.
The patch can be resubmitted. I think there are two main issues here:
- This PR broke a lot of QA tests. That should have blocked merge and is justification enough to revert.
- The QA changes are widespread and should have been consolidated into e.g. ceph.py as you mentioned (as one possibility) Those need reverted even if this change is not reverted. I think the entire PR should be reverted so it can be re-[done correctly.
There was a problem hiding this comment.
[...] perhaps we could treat each new
POOL_APP_NOT_ENABLEDas muted behind the scenes for a couple of minutes? Essentially just delay raising the alert to give admins enough time to run "ceph osd pool application enable" or our own tools to callrados_application_enable()API.That also seems reasonable but I think it'll still break many QA tests.
@batrick Can you elaborate on the "many"? The way I see it, the tests that don't set an application on the pools they create but then write to them would already have POOL_APP_NOT_ENABLED in the ignorelist. The tests that don't write might break -- assuming the pools in question live long enough (i.e. not removed within X minutes before the alert is raised). But this latter variety should be rare?
There was a problem hiding this comment.
[...] perhaps we could treat each new
POOL_APP_NOT_ENABLEDas muted behind the scenes for a couple of minutes? Essentially just delay raising the alert to give admins enough time to run "ceph osd pool application enable" or our own tools to callrados_application_enable()API.That also seems reasonable but I think it'll still break many QA tests.
@batrick Can you elaborate on the "many"? The way I see it, the tests that don't set an application on the pools they create but then write to them would already have
POOL_APP_NOT_ENABLEDin the ignorelist. The tests that don't write might break -- assuming the pools in question live long enough (i.e. not removed within X minutes before the alert is raised). But this latter variety should be rare?
I think you're right. I retract my statement.
This reverts commit 93c326f, reversing changes made to d1a0d8d. This patchset caused wide-spread component failures. With the increased scrutiny, there is a desire to simplify this [1] by either: (a) Creating a config that is turned on for all suites to turn off this behavior. (b) Add an ignorelist entry globally in the ceph.py|cephadm.py|rook.py tasks. (c) Fix the warning to only be generated after the new pool has no application for some amount of time. Option (c) is the most robust. (b) is probably the most straightforward. [1] ceph#47560 (review) Signed-off-by: Patrick Donnelly <[email protected]>
Adds release notes for the fix added in ceph#47560 Signed-off-by: Prashant D <[email protected]>
Adds release notes for the fix added in ceph#47560 Signed-off-by: Prashant D <[email protected]>
Adds release notes for the fix added in ceph#47560 Signed-off-by: Prashant D <[email protected]>
Adds release notes for the fix added in ceph#47560 Signed-off-by: Prashant D <[email protected]>
Adds release notes for the fix added in ceph#47560 Signed-off-by: Prashant D <[email protected]>
Ceph status fail to report pool application warning if
the pool is empty. Report pool application warning
even if pool has 0 objects stored in it.
Add POOL_APP_NOT_ENABLED cluster warnings to
log-ignorelist to fix rados suite.
Fixes: https://tracker.ceph.com/issues/57097
Signed-off-by: Prashant D [email protected]
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this pleasejenkins test classic perfjenkins test crimson perfjenkins test signedjenkins test make checkjenkins test make check arm64jenkins test submodulesjenkins test dashboardjenkins test dashboard cephadmjenkins test apijenkins test docsjenkins render docsjenkins test ceph-volume alljenkins test ceph-volume toxjenkins test windows