
mon, qa: issue pool application warning even if pool is empty#47560

Merged
ljflores merged 1 commit into ceph:main from pdvian:wip-pool-app
Aug 15, 2023
Conversation

@pdvian
Contributor

@pdvian pdvian commented Aug 11, 2022

Ceph status fails to report the pool application warning if
the pool is empty. Report the pool application warning
even if the pool has 0 objects stored in it.

Add POOL_APP_NOT_ENABLED cluster warnings to
log-ignorelist to fix rados suite.

Fixes: https://tracker.ceph.com/issues/57097

Signed-off-by: Prashant D [email protected]

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@pdvian pdvian requested a review from a team as a code owner August 11, 2022 12:49
@neha-ojha
Member

jenkins retest this please

@vumrao vumrao self-requested a review August 15, 2022 18:57
Contributor

@vumrao vumrao left a comment


Good catch! Thanks, Prashant.

@vumrao
Contributor

vumrao commented Aug 15, 2022

jenkins retest this please

1 similar comment
@ljflores
Member

jenkins retest this please

@djgalloway
Contributor

CMake Error at /opt/ceph/lib/x86_64-linux-gnu/cmake/Boost-1.79.0/BoostConfig.cmake:141 (find_package):
  Found package configuration file:

    /opt/ceph/lib/x86_64-linux-gnu/cmake/boost_python-1.79.0/boost_python-config.cmake

  but it set boost_python_FOUND to FALSE so package "boost_python" is
  considered to be NOT FOUND.  Reason given by package:

  No suitable build variant has been found.

  The following variants have been tried and rejected:

  * libboost_python38.so.1.79.0 (3.8, Boost_PYTHON_VERSION=3.10)

  * libboost_python38.a (3.8, Boost_PYTHON_VERSION=3.10)

Call Stack (most recent call first):
  /opt/ceph/lib/x86_64-linux-gnu/cmake/Boost-1.79.0/BoostConfig.cmake:262 (boost_find_component)
  cmake/modules/FindBoost.cmake:597 (find_package)
  CMakeLists.txt:644 (find_package)


-- Configuring incomplete, errors occurred!

@tchaikov this is new. I converted all the Jenkins builders to Ubuntu 22.04 now. Do I need to rebuild boost making sure that python3.8 is installed or something?

@djgalloway
Contributor

Pretty sure I fixed the make check failure by rebuilding boost on Ubuntu 22.04 with python3.10 installed. I retriggered make check for this job but Jenkins is a bit backed up.

@vumrao
Contributor

vumrao commented Aug 17, 2022

jenkins retest this please

@djgalloway
Contributor

The make check failure is legitimately due to the code change.

/home/jenkins-build/build/workspace/ceph-pull-requests/src/mon/PGMap.cc:3348:32: error: unused variable 'sum' [-Werror,-Wunused-variable]
      const object_stat_sum_t& sum = pstat->stats.sum;
                               ^
1 error generated.

@vumrao
Contributor

vumrao commented Aug 17, 2022

The make check failure is legitimately due to the code change.

/home/jenkins-build/build/workspace/ceph-pull-requests/src/mon/PGMap.cc:3348:32: error: unused variable 'sum' [-Werror,-Wunused-variable]
      const object_stat_sum_t& sum = pstat->stats.sum;
                               ^
1 error generated.

ahh, I missed checking this time. Thanks for pointing it out, David. @pdvian FYI

@pdvian
Contributor Author

pdvian commented Aug 18, 2022

The make check failure is legitimately due to the code change.

/home/jenkins-build/build/workspace/ceph-pull-requests/src/mon/PGMap.cc:3348:32: error: unused variable 'sum' [-Werror,-Wunused-variable]
      const object_stat_sum_t& sum = pstat->stats.sum;
                               ^
1 error generated.

ahh, I missed checking this time. Thanks for pointing it out, David. @pdvian FYI

Thanks. Yes, I should have removed this variable. My bad! Corrected it.

@pdvian
Contributor Author

pdvian commented Aug 18, 2022

Too many teuthology job failures due to this PR: https://pulpito.ceph.com/yuriw-2022-08-16_22:04:30-rados-wip-yuri5-testing-2022-08-16-0859-distro-default-smithi/

@neha-ojha What would be the right approach to handle the RADOS suite failures here? Should we set mon_warn_on_pool_no_app=false in <ceph>/qa/tasks/ceph.conf.template, or add POOL_APP_NOT_ENABLED to the log-ignorelist for the affected jobs?
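For context, suppressing a known warning in a teuthology suite is typically done with an override fragment along these lines (a sketch; the exact placement varies per suite yaml):

```yaml
overrides:
  ceph:
    # Tell the log scraper not to fail the job on this cluster warning.
    log-ignorelist:
      - POOL_APP_NOT_ENABLED
```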

@pdvian pdvian changed the title from "mon: issue pool application warning even if pool is empty" to "mon, qa: issue pool application warning even if pool is empty" on Sep 7, 2022
@ljflores
Member

@pdvian does this need a full rados suite?

@pdvian
Contributor Author

pdvian commented Oct 13, 2022

@pdvian does this need a full rados suite?

Yes @ljflores. I have to go through multiple iterations to fix the relevant teuthology RADOS suite test cases to add POOL_APP_NOT_ENABLED to the log-ignorelist. I am working on this in the background and have additional fixes to push to the PR. It should be ready by next week.

@ljflores
Member

@pdvian cool, I saw you have your own testing tag on this PR, so I wasn't sure if you were doing your own testing. But just let me know when you want me to get this in one of Yuri's testing batches!

@pdvian pdvian requested review from cloudbehl and nizamial09 and removed request for a team July 31, 2023 20:00
@github-actions github-actions bot added the tests label Jul 31, 2023
Ceph status fails to report the pool application warning if
the pool is empty. Report the pool application warning
even if the pool has 0 objects stored in it.

Add POOL_APP_NOT_ENABLED cluster warnings to log-ignorelist
to fix rados suite.

Fixes: https://tracker.ceph.com/issues/57097

Signed-off-by: Prashant D <[email protected]>
@pdvian
Contributor Author

pdvian commented Aug 2, 2023

jenkins test make check

@pdvian
Contributor Author

pdvian commented Aug 2, 2023

@cloudbehl @nizamial09 Kindly review the qa/tasks/mgr/dashboard/test_pool.py changes.

@ljflores
Member

@pdvian looks like it's ready to go!

@cbodley
Contributor

cbodley commented Aug 21, 2023

this caused several new POOL_APP_NOT_ENABLED failures in the rgw suite. i'll plan to add that to our log-ignorelist, unless someone suspects an actual issue that needs to be fixed

@cbodley
Contributor

cbodley commented Aug 21, 2023

i'll track this in https://tracker.ceph.com/issues/62504. hopefully we can include the rgw fix in your existing reef/quincy backports, or they'll fail rgw qa

@cbodley
Contributor

cbodley commented Aug 21, 2023

"2023-08-18T00:20:00.000166+0000 mon.a (mon.0) 226 : cluster [WRN] overall HEALTH_WARN 1 pool(s) do not have an application enabled" in cluster log

note that one of the new failures (from https://pulpito.ceph.com/cbodley-2023-08-17_21:47:04-rgw-pr-52673-distro-default-smithi/7372307/) doesn't match our existing POOL_APP_NOT_ENABLED whitelist. that may show up in the rados suite too

     // required release
-    if (sum.num_objects > 0 && pool.application_metadata.empty() &&
-        !pool.is_tier()) {
+    if (pool.application_metadata.empty() && !pool.is_tier()) {
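In effect, the change drops the object-count condition: the warning now applies to any non-tier pool without an application, empty or not. A minimal before/after sketch of the predicate (Python, purely illustrative of the C++ above):

```python
def pool_app_warning(num_objects, application_metadata, is_tier):
    """Return (old, new): whether POOL_APP_NOT_ENABLED would fire
    before and after the change, for a pool with the given state."""
    no_app = not application_metadata and not is_tier
    old = num_objects > 0 and no_app  # old behavior: empty pools were skipped
    new = no_app                      # new behavior: object count is ignored
    return old, new
```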
Member

So this has obviously created havoc for the other component suites. I think the right thing to do here is:

  • revert most of the qa/ suite changes
  • add a developer config to exclude pools that are empty, which is turned on for all qa tests, probably in qa/tasks/ceph.conf.template

That config can be selectively turned back on for a dedicated set of tests for this particular change.

Member

cc @cbodley @idryomov @vshankar

I also strongly dislike ignoring warnings. This should not be a routine way to fix these types of issues even if this warning is particularly unhelpful for QA.

Contributor

@idryomov idryomov Sep 6, 2023

  • add a developer config to exclude pools that are empty, which is turned on for all qa tests, probably in qa/tasks/ceph.conf.template

That config can be selectively turned back on for a dedicated set of tests for this particular change.

@batrick I don't see how that is substantially different from just ignoring the health alert -- instead of the "ignoring" part being conditional, you would be making the raising of the health alert conditional, that's it.

IMO the right way to address this is to make RADOS not raise a bogus health alert independent of the configuration, so that it's not seen by regular users/operators too. It shouldn't be just developers running their tests with a special knob set who get the sane behavior. I suggested the following on the dev list:

[...] perhaps we could treat each new POOL_APP_NOT_ENABLED as muted behind the scenes for a couple of minutes? Essentially just delay raising the alert to give admins enough time to run "ceph osd pool application enable" or our own tools to call rados_application_enable() API.
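That suggestion could be sketched roughly as follows (a hypothetical helper, not the actual mon code; the grace period and bookkeeping are assumptions):

```python
import time

GRACE_SECONDS = 300  # assumed grace period before the alert is raised

class PoolAppAlerts:
    """Raise POOL_APP_NOT_ENABLED only after a pool has gone without
    an enabled application for GRACE_SECONDS (a delayed-alert sketch)."""

    def __init__(self, now=time.monotonic):
        self.now = now
        self.first_seen = {}  # pool name -> first time seen without an app

    def check(self, pools):
        """pools maps pool name -> set of enabled applications.
        Returns the pool names whose warning should now be raised."""
        t = self.now()
        # forget pools that were deleted or had an application enabled
        for name in list(self.first_seen):
            if name not in pools or pools[name]:
                del self.first_seen[name]
        warn = []
        for name, apps in pools.items():
            if apps:
                continue
            start = self.first_seen.setdefault(name, t)
            if t - start >= GRACE_SECONDS:
                warn.append(name)
        return warn
```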

Member

  • add a developer config to exclude pools that are empty, which is turned on for all qa tests, probably in qa/tasks/ceph.conf.template

That config can be selectively turned back on for a dedicated set of tests for this particular change.

@batrick I don't see how that is substantially different from just ignoring the health alert -- instead of the "ignoring" part being conditional, you would be making the raising of the health alert conditional, that's it.

The difference is we don't need to play whack-a-mole by adding the warning to the ignorelist for dozens of suites, as this PR does. Of course, playing that game means we miss things, which is why we're having this discussion :)

IMO the right way to address this is to make RADOS not raise a bogus health alert independent of the configuration, so that it's not seen by regular users/operators too. It shouldn't be just developers running their tests with a special knob set who get the sane behavior. I suggested the following on the dev list:

[...] perhaps we could treat each new POOL_APP_NOT_ENABLED as muted behind the scenes for a couple of minutes? Essentially just delay raising the alert to give admins enough time to run "ceph osd pool application enable" or our own tools to call rados_application_enable() API.

That also seems reasonable but I think it'll still break many QA tests.

Contributor

The difference is we don't need to play whack-a-mole by adding the warning to ignorelist for dozens of suites.

I agree that the config opt would basically be an ignorelist at the C++ level. I would prefer not to put extra, QA-driven complexity here.

If ignoring is the way, then we should consider sharing log-ignorelist across the suites. At first glance it looks like a change in the ceph.py task.

Contributor

How big is the impact, really? Noise in the suites, right? Aren't the failures pretty schematic (like "POOL_APP_NOT_ENABLED has been found"), and thus easy to discriminate?

I'm asking these questions because the patch fixes a real issue that popped up at a user's site.

Member

How big is the impact, really? Noise in the suites, right? Aren't the failures pretty schematic (like "POOL_APP_NOT_ENABLED has been found"), and thus easy to discriminate?

I'm asking these questions because the patch fixes a real issue that popped up at a user's site.

The patch can be resubmitted. I think there are two main issues here:

  • This PR broke a lot of QA tests. That should have blocked the merge and is justification enough to revert.
  • The QA changes are widespread and should have been consolidated into e.g. ceph.py, as you mentioned (as one possibility). Those need to be reverted even if this change is not. I think the entire PR should be reverted so it can be redone correctly.

Contributor

@idryomov idryomov Sep 6, 2023

[...] perhaps we could treat each new POOL_APP_NOT_ENABLED as muted behind the scenes for a couple of minutes? Essentially just delay raising the alert to give admins enough time to run "ceph osd pool application enable" or our own tools to call rados_application_enable() API.

That also seems reasonable but I think it'll still break many QA tests.

@batrick Can you elaborate on the "many"? The way I see it, the tests that don't set an application on the pools they create but then write to them would already have POOL_APP_NOT_ENABLED in the ignorelist. The tests that don't write might break -- assuming the pools in question live long enough (i.e. not removed within X minutes before the alert is raised). But this latter variety should be rare?

Member

[...] perhaps we could treat each new POOL_APP_NOT_ENABLED as muted behind the scenes for a couple of minutes? Essentially just delay raising the alert to give admins enough time to run "ceph osd pool application enable" or our own tools to call rados_application_enable() API.

That also seems reasonable but I think it'll still break many QA tests.

@batrick Can you elaborate on the "many"? The way I see it, the tests that don't set an application on the pools they create but then write to them would already have POOL_APP_NOT_ENABLED in the ignorelist. The tests that don't write might break -- assuming the pools in question live long enough (i.e. not removed within X minutes before the alert is raised). But this latter variety should be rare?

I think you're right. I retract my statement.


batrick added a commit to batrick/ceph that referenced this pull request Sep 8, 2023
This reverts commit 93c326f, reversing
changes made to d1a0d8d.

This patchset caused widespread component failures. With the increased scrutiny, there is a
desire to simplify this [1] by either:

(a) Creating a config that is turned on for all suites to turn off this behavior.

(b) Adding an ignorelist entry globally in the ceph.py|cephadm.py|rook.py tasks.

(c) Fixing the warning so that it is only generated after the new pool has had no application for some amount of time.

Option (c) is the most robust; (b) is probably the most straightforward.

[1] ceph#47560 (review)

Signed-off-by: Patrick Donnelly <[email protected]>
pdvian pushed a commit to pdvian/ceph that referenced this pull request Nov 13, 2023
Adds release notes for the fix added in ceph#47560

Signed-off-by: Prashant D <[email protected]>
pdvian pushed a commit to pdvian/ceph that referenced this pull request Nov 14, 2023
Adds release notes for the fix added in ceph#47560

Signed-off-by: Prashant D <[email protected]>
pdvian pushed a commit to pdvian/ceph that referenced this pull request Nov 16, 2023
Adds release notes for the fix added in ceph#47560

Signed-off-by: Prashant D <[email protected]>
aaSharma14 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Dec 6, 2023
Adds release notes for the fix added in ceph#47560

Signed-off-by: Prashant D <[email protected]>
aaSharma14 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Dec 26, 2023
Adds release notes for the fix added in ceph#47560

Signed-off-by: Prashant D <[email protected]>