
mon, qa: issue pool application warning even if pool is empty#47560

Merged
ljflores merged 1 commit into ceph:main from pdvian:wip-pool-app
Aug 15, 2023
Conversation

@pdvian
Contributor

@pdvian pdvian commented Aug 11, 2022

Ceph status fails to report the pool application warning if
the pool is empty. Report the pool application warning
even if the pool has 0 objects stored in it.

Add POOL_APP_NOT_ENABLED cluster warnings to
log-ignorelist to fix rados suite.

Fixes: https://tracker.ceph.com/issues/57097

Signed-off-by: Prashant D [email protected]

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@pdvian pdvian requested a review from a team as a code owner August 11, 2022 12:49
@neha-ojha
Member

jenkins retest this please

@vumrao vumrao self-requested a review August 15, 2022 18:57
Contributor

@vumrao vumrao left a comment


Good catch! Thanks, Prashant.

@vumrao
Contributor

vumrao commented Aug 15, 2022

jenkins retest this please

1 similar comment
@ljflores
Member

jenkins retest this please

@djgalloway
Contributor

CMake Error at /opt/ceph/lib/x86_64-linux-gnu/cmake/Boost-1.79.0/BoostConfig.cmake:141 (find_package):
  Found package configuration file:

    /opt/ceph/lib/x86_64-linux-gnu/cmake/boost_python-1.79.0/boost_python-config.cmake

  but it set boost_python_FOUND to FALSE so package "boost_python" is
  considered to be NOT FOUND.  Reason given by package:

  No suitable build variant has been found.

  The following variants have been tried and rejected:

  * libboost_python38.so.1.79.0 (3.8, Boost_PYTHON_VERSION=3.10)

  * libboost_python38.a (3.8, Boost_PYTHON_VERSION=3.10)

Call Stack (most recent call first):
  /opt/ceph/lib/x86_64-linux-gnu/cmake/Boost-1.79.0/BoostConfig.cmake:262 (boost_find_component)
  cmake/modules/FindBoost.cmake:597 (find_package)
  CMakeLists.txt:644 (find_package)


-- Configuring incomplete, errors occurred!

@tchaikov this is new. I converted all the Jenkins builders to Ubuntu 22.04 now. Do I need to rebuild boost making sure that python3.8 is installed or something?

@djgalloway
Contributor

Pretty sure I fixed the make check failure by rebuilding boost on Ubuntu 22.04 with python3.10 installed. I retriggered make check for this job but Jenkins is a bit backed up.

@vumrao
Contributor

vumrao commented Aug 17, 2022

jenkins retest this please

@djgalloway
Contributor

The make check failure is legitimately due to the code change.

/home/jenkins-build/build/workspace/ceph-pull-requests/src/mon/PGMap.cc:3348:32: error: unused variable 'sum' [-Werror,-Wunused-variable]
      const object_stat_sum_t& sum = pstat->stats.sum;
                               ^
1 error generated.

@vumrao
Contributor

vumrao commented Aug 17, 2022

The make check failure is legitimately due to the code change.

/home/jenkins-build/build/workspace/ceph-pull-requests/src/mon/PGMap.cc:3348:32: error: unused variable 'sum' [-Werror,-Wunused-variable]
      const object_stat_sum_t& sum = pstat->stats.sum;
                               ^
1 error generated.

ahh, I missed checking this time. Thanks for pointing it out, David. @pdvian FYI

@pdvian
Contributor Author

pdvian commented Aug 18, 2022

The make check failure is legitimately due to the code change.

/home/jenkins-build/build/workspace/ceph-pull-requests/src/mon/PGMap.cc:3348:32: error: unused variable 'sum' [-Werror,-Wunused-variable]
      const object_stat_sum_t& sum = pstat->stats.sum;
                               ^
1 error generated.

ahh, I missed checking this time. Thanks for pointing it out, David. @pdvian FYI

Thanks. Yes, I should have removed this variable. My bad! Corrected it.

@pdvian
Contributor Author

pdvian commented Aug 18, 2022

Too many teuthology job failures due to this PR: https://pulpito.ceph.com/yuriw-2022-08-16_22:04:30-rados-wip-yuri5-testing-2022-08-16-0859-distro-default-smithi/

@neha-ojha What would be the right approach to handle the RADOS suite failures here? Should we set mon_warn_on_pool_no_app=false in <ceph>/qa/tasks/ceph.conf.template, or add POOL_APP_NOT_ENABLED to the log-ignorelist for the affected jobs?
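For context, suppressing a known warning in a teuthology suite is typically done with an override fragment along these lines (a sketch; the exact placement varies per suite yaml):

```yaml
overrides:
  ceph:
    # Tell the log scraper not to fail the job on this cluster warning.
    log-ignorelist:
      - POOL_APP_NOT_ENABLED
```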

@pdvian pdvian changed the title from "mon: issue pool application warning even if pool is empty" to "mon, qa: issue pool application warning even if pool is empty" on Sep 7, 2022
@ljflores
Member

@pdvian does this need a full rados suite?

@pdvian
Contributor Author

pdvian commented Oct 13, 2022

@pdvian does this need a full rados suite?

Yes @ljflores. I have to go through multiple iterations to fix the relevant teuthology RADOS suite test cases to add POOL_APP_NOT_ENABLED to the log-ignorelist. I am working on this in the background and have additional fixes to push to the PR. It should be ready by next week.

@ljflores
Member

@pdvian cool, I saw you have your own testing tag on this PR, so I wasn't sure if you were doing your own testing. But just let me know when you want me to get this in one of Yuri's testing batches!

@pdvian pdvian requested review from cloudbehl and nizamial09 and removed request for a team July 31, 2023 20:00
@github-actions github-actions bot added the tests label Jul 31, 2023
Ceph status fails to report the pool application warning if
the pool is empty. Report the pool application warning
even if the pool has 0 objects stored in it.

Add POOL_APP_NOT_ENABLED cluster warnings to log-ignorelist
to fix rados suite.

Fixes: https://tracker.ceph.com/issues/57097

Signed-off-by: Prashant D <[email protected]>
@pdvian
Contributor Author

pdvian commented Aug 2, 2023

jenkins test make check

@pdvian
Contributor Author

pdvian commented Aug 2, 2023

@cloudbehl @nizamial09 Kindly review the qa/tasks/mgr/dashboard/test_pool.py changes.

@ljflores
Member

@pdvian looks like it's ready to go!

@cbodley
Contributor

cbodley commented Aug 21, 2023

this caused several new POOL_APP_NOT_ENABLED failures in the rgw suite. i'll plan to add that to our log-ignorelist, unless someone suspects an actual issue that needs to be fixed

@cbodley
Contributor

cbodley commented Aug 21, 2023

i'll track this in https://tracker.ceph.com/issues/62504. hopefully we can include the rgw fix in your existing reef/quincy backports, or they'll fail rgw qa

@cbodley
Contributor

cbodley commented Aug 21, 2023

"2023-08-18T00:20:00.000166+0000 mon.a (mon.0) 226 : cluster [WRN] overall HEALTH_WARN 1 pool(s) do not have an application enabled" in cluster log

note that one of the new failures (from https://pulpito.ceph.com/cbodley-2023-08-17_21:47:04-rgw-pr-52673-distro-default-smithi/7372307/) doesn't match our existing POOL_APP_NOT_ENABLED whitelist. that may show up in the rados suite too

     // required release
-    if (sum.num_objects > 0 && pool.application_metadata.empty() &&
-        !pool.is_tier()) {
+    if (pool.application_metadata.empty() && !pool.is_tier()) {
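In effect, the change drops the object-count condition: the warning now applies to any non-tier pool without an application, empty or not. A minimal before/after sketch of the predicate (Python, purely illustrative of the C++ above):

```python
def pool_app_warning(num_objects, application_metadata, is_tier):
    """Return (old, new): whether POOL_APP_NOT_ENABLED would fire
    before and after the change, for a pool with the given state."""
    no_app = not application_metadata and not is_tier
    old = num_objects > 0 and no_app  # old behavior: empty pools were skipped
    new = no_app                      # new behavior: object count is ignored
    return old, new
```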
Member

So this has obviously created havoc for the other component suites. I think the right thing to do here is:

  • revert most of the qa/ suite changes
  • add a developer config to exclude pools that are empty, which is turned on for all qa tests, probably in qa/tasks/ceph.conf.template

That config can be selectively turned back on for a dedicated set of tests for this particular change.

Member

cc @cbodley @idryomov @vshankar

I also strongly dislike ignoring warnings. This should not be a routine way to fix these types of issues even if this warning is particularly unhelpful for QA.

Contributor

@idryomov idryomov Sep 6, 2023

  • add a developer config to exclude pools that are empty, which is turned on for all qa tests, probably in qa/tasks/ceph.conf.template

That config can be selectively turned back on for a dedicated set of tests for this particular change.

@batrick I don't see how that is substantially different from just ignoring the health alert -- instead of the "ignoring" part being conditional, you would be making the raising of the health alert conditional, that's it.

IMO the right way to address this is to make RADOS not raise a bogus health alert independent of the configuration, so that it's not seen by regular users/operators too. It shouldn't be just developers running their tests with a special knob set who get the sane behavior. I suggested the following on the dev list:

[...] perhaps we could treat each new POOL_APP_NOT_ENABLED as muted behind the scenes for a couple of minutes? Essentially just delay raising the alert to give admins enough time to run "ceph osd pool application enable" or our own tools to call rados_application_enable() API.
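That suggestion could be sketched roughly as follows (a hypothetical helper, not the actual mon code; the grace period and bookkeeping are assumptions):

```python
import time

GRACE_SECONDS = 300  # assumed grace period before the alert is raised

class PoolAppAlerts:
    """Raise POOL_APP_NOT_ENABLED only after a pool has gone without
    an enabled application for GRACE_SECONDS (a delayed-alert sketch)."""

    def __init__(self, now=time.monotonic):
        self.now = now
        self.first_seen = {}  # pool name -> first time seen without an app

    def check(self, pools):
        """pools maps pool name -> set of enabled applications.
        Returns the pool names whose warning should now be raised."""
        t = self.now()
        # forget pools that were deleted or had an application enabled
        for name in list(self.first_seen):
            if name not in pools or pools[name]:
                del self.first_seen[name]
        warn = []
        for name, apps in pools.items():
            if apps:
                continue
            start = self.first_seen.setdefault(name, t)
            if t - start >= GRACE_SECONDS:
                warn.append(name)
        return warn
```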

Member

  • add a developer config to exclude pools that are empty, which is turned on for all qa tests, probably in qa/tasks/ceph.conf.template

That config can be selectively turned back on for a dedicated set of tests for this particular change.

@batrick I don't see how that is substantially different from just ignoring the health alert -- instead of the "ignoring" part being conditional, you would be making the raising of the health alert conditional, that's it.

The difference is we don't need to play whack-a-mole by adding the warning to the ignorelist for dozens of suites, as this PR does. Of course, playing that game means we miss things, which is why we're having this discussion :)

IMO the right way to address this is to make RADOS not raise a bogus health alert independent of the configuration, so that it's not seen by regular users/operators too. It shouldn't be just developers running their tests with a special knob set who get the sane behavior. I suggested the following on the dev list:

[...] perhaps we could treat each new POOL_APP_NOT_ENABLED as muted behind the scenes for a couple of minutes? Essentially just delay raising the alert to give admins enough time to run "ceph osd pool application enable" or our own tools to call rados_application_enable() API.

That also seems reasonable but I think it'll still break many QA tests.

Contributor

The difference is we don't need to play whack-a-mole by adding the warning to ignorelist for dozens of suites.

I agree that the config opt would basically be an ignorelist at the C++ level. I would prefer not to put extra, QA-driven complexity here.

If ignoring is the way, then we should consider sharing log-ignorelist across the suites. At first glance it looks like a change in the ceph.py task.

Contributor

How big is the impact, really? Noise in the suites, right? Aren't the failures pretty schematic (like "POOL_APP_NOT_ENABLED has been found"), and thus easy to discriminate?

I'm asking these questions because the patch fixes a real issue that popped up at a user's site.

Member

How big is the impact, really? Noise in the suites, right? Aren't the failures pretty schematic (like "POOL_APP_NOT_ENABLED has been found"), and thus easy to discriminate?

I'm asking these questions because the patch fixes a real issue that popped up at a user's site.

The patch can be resubmitted. I think there are two main issues here:

  • This PR broke a lot of QA tests. That should have blocked the merge and is justification enough to revert.
  • The QA changes are widespread and should have been consolidated into e.g. ceph.py, as you mentioned (as one possibility). Those need to be reverted even if this change is not. I think the entire PR should be reverted so it can be redone correctly.

Contributor

@idryomov idryomov Sep 6, 2023

[...] perhaps we could treat each new POOL_APP_NOT_ENABLED as muted behind the scenes for a couple of minutes? Essentially just delay raising the alert to give admins enough time to run "ceph osd pool application enable" or our own tools to call rados_application_enable() API.

That also seems reasonable but I think it'll still break many QA tests.

@batrick Can you elaborate on the "many"? The way I see it, the tests that don't set an application on the pools they create but then write to them would already have POOL_APP_NOT_ENABLED in the ignorelist. The tests that don't write might break -- assuming the pools in question live long enough (i.e. not removed within X minutes before the alert is raised). But this latter variety should be rare?

Member

[...] perhaps we could treat each new POOL_APP_NOT_ENABLED as muted behind the scenes for a couple of minutes? Essentially just delay raising the alert to give admins enough time to run "ceph osd pool application enable" or our own tools to call rados_application_enable() API.

That also seems reasonable but I think it'll still break many QA tests.

@batrick Can you elaborate on the "many"? The way I see it, the tests that don't set an application on the pools they create but then write to them would already have POOL_APP_NOT_ENABLED in the ignorelist. The tests that don't write might break -- assuming the pools in question live long enough (i.e. not removed within X minutes before the alert is raised). But this latter variety should be rare?

I think you're right. I retract my statement.


batrick added a commit to batrick/ceph that referenced this pull request Sep 8, 2023
This reverts commit 93c326f, reversing
changes made to d1a0d8d.

This patchset caused widespread component failures. With the increased scrutiny, there is a
desire to simplify this [1] by either:

(a) Creating a config that is turned on for all suites to turn off this behavior.

(b) Adding an ignorelist entry globally in the ceph.py|cephadm.py|rook.py tasks.

(c) Fixing the warning so that it is only generated after the new pool has had no application for some amount of time.

Option (c) is the most robust; (b) is probably the most straightforward.

[1] ceph#47560 (review)

Signed-off-by: Patrick Donnelly <[email protected]>
pdvian pushed a commit to pdvian/ceph that referenced this pull request Nov 13, 2023
Adds release notes for the fix added in ceph#47560

Signed-off-by: Prashant D <[email protected]>
pdvian pushed a commit to pdvian/ceph that referenced this pull request Nov 14, 2023
Adds release notes for the fix added in ceph#47560

Signed-off-by: Prashant D <[email protected]>
pdvian pushed a commit to pdvian/ceph that referenced this pull request Nov 16, 2023
Adds release notes for the fix added in ceph#47560

Signed-off-by: Prashant D <[email protected]>
aaSharma14 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Dec 6, 2023
Adds release notes for the fix added in ceph#47560

Signed-off-by: Prashant D <[email protected]>
aaSharma14 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Dec 26, 2023
Adds release notes for the fix added in ceph#47560

Signed-off-by: Prashant D <[email protected]>