Revert "Merge pull request #47560 from pdvian/wip-pool-app" #53336

Closed

batrick wants to merge 2 commits into ceph:main from batrick:revert-47560

Conversation

@batrick (Member) commented Sep 8, 2023

This reverts commit 93c326f, reversing changes made to d1a0d8d.

This patchset caused widespread component failures. With the increased scrutiny, there is a desire to simplify this [1] by either:

(a) Creating a config, turned on for all suites, that disables this behavior.

(b) Adding an ignorelist entry globally in the ceph.py|cephadm.py|rook.py tasks.

(c) Fixing the warning so it is only generated after the new pool has had no application for some amount of time.

Option (c) is the most robust; (b) is probably the most straightforward.

[1] #47560 (review)
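Purely as an illustration of option (c) — this is not the actual monitor health-check code, and every name below is hypothetical — a grace-period check might look like:

```python
import time

# Hypothetical sketch of option (c): only flag POOL_APP_NOT_ENABLED once a
# pool has lacked an application tag for longer than a grace period.
# None of these names come from the Ceph source; this is illustrative only.

POOL_APP_GRACE_SECONDS = 300  # assumed grace period

def pools_to_warn(pools, now=None):
    """Return names of pools that have had no application enabled for
    longer than the grace period.

    `pools` maps pool name -> {'applications': set, 'create_time': float}.
    """
    now = time.time() if now is None else now
    warn = []
    for name, info in pools.items():
        if info['applications']:
            continue  # an application is enabled; nothing to warn about
        if now - info['create_time'] >= POOL_APP_GRACE_SECONDS:
            warn.append(name)
    return warn
```

A freshly created pool would then produce no warning until the grace period elapses, which is what would quiet the test fallout without suppressing the warning for long-lived untagged pools.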


This reverts commit 93c326f, reversing changes made to d1a0d8d.

Signed-off-by: Patrick Donnelly <[email protected]>
@batrick (Member Author) commented Sep 8, 2023

I did not include a revert of 4293d9b because it had other changes that should perhaps be kept.

cc @cbodley

overrides:
  ceph:
    log-ignorelist:
      - \(POOL_APP_NOT_ENABLED\)
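For context on what such an entry does: teuthology treats each ignorelist item as a regular expression matched against cluster log lines. A minimal sketch of that kind of filtering (function and variable names are assumptions, not teuthology's actual API):

```python
import re

# Minimal sketch of applying a log-ignorelist: a cluster log line is not
# treated as a test failure if any ignorelist regex matches it.
# Names here are illustrative, not taken from teuthology.

def is_ignored(line, ignorelist):
    """True if any ignorelist pattern matches the log line."""
    return any(re.search(pattern, line) for pattern in ignorelist)

ignorelist = [r'\(POOL_APP_NOT_ENABLED\)']
line = ('cluster [WRN] Health check failed: 1 pool(s) do not have an '
        'application enabled (POOL_APP_NOT_ENABLED)')
# A line matching the ignorelist would be skipped by the log scraper.
```

The backslash-escaped parentheses in the yaml entry are needed because the entry is a regex, not a literal string.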
Contributor:
Are you sure that this was added strictly to mop up the fallout and not because crimson, being a RADOS suite in nature, has grown a legitimate need?

Member Author:

Given it was merged 5 days ago, I think it's fallout mop up.

@idryomov (Contributor) commented Sep 8, 2023

> This patchset caused widespread component failures. With the increased scrutiny, there is a desire to simplify this [1] by either:
>
> (a) Creating a config, turned on for all suites, that disables this behavior.
>
> (b) Adding an ignorelist entry globally in the ceph.py|cephadm.py|rook.py tasks.
>
> (c) Fixing the warning so it is only generated after the new pool has had no application for some amount of time.
>
> Option (c) is the most robust; (b) is probably the most straightforward.

I think the revert would be justified only if we go with (c), which would also cover regular users/operators. If the intention is to proceed with (a) or (b), I doubt the revert churn is worth it -- individual suites have already been adjusted for the most part.

@rzarzynski (Contributor) left a comment

Thanks @batrick for bringing the commits. I talked with @pdvian yesterday about approaches for the warning suppression. Removing POOL_APP_NOT_ENABLED from all the yamls will be useful in any case.

What I don't know yet is whether this revert should be merged immediately, or whether new commits can be appended and merged in one go, ensuring the fix remains in place without interruption.

The questions from #47560 (comment) are still not entirely answered. I would like to know how painful the fallout from the patch really is. Since the discussion there, I see that "many" has been replaced with "widespread"; however, neither term has been quantified.

Another uncovered aspect is how much effort is needed to discriminate the noise. IIUC, failures due to POOL_APP_NOT_ENABLED are schematic and possible to filter out at first glance.

My intention here is to weigh which is the lesser evil: 1) leaving some (how many? ;-) tests failing and treating this as an incentive to quickly clean up the fallout, or 2) reverting the imperfect-but-existing fix and potentially stepping into a long discussion on how to warn in the proper way. Personally, I'm afraid something is missing in the design (hence using emptiness as a proxy) and there is no proper way without touching the interface.

…-ignore-pool-app"

This reverts commit a089373, reversing changes made to 6dda3e2.

Signed-off-by: Patrick Donnelly <[email protected]>
@batrick (Member Author) commented Sep 8, 2023

> Thanks @batrick for bringing the commits. I talked with @pdvian yesterday about approaches for the warning suppression. Removing POOL_APP_NOT_ENABLED from all the yamls will be useful in any case.
>
> What I don't know yet is whether this revert should be merged immediately, or whether new commits can be appended and merged in one go, ensuring the fix remains in place without interruption.

> The questions from #47560 (comment) are still not entirely answered. I would like to know how painful the fallout from the patch really is. Since the discussion there, I see that "many" has been replaced with "widespread"; however, neither term has been quantified.

There are several tickets now indicating that virtually every component was affected by this PR.

> Another uncovered aspect is how much effort is needed to discriminate the noise. IIUC, failures due to POOL_APP_NOT_ENABLED are schematic and possible to filter out at first glance.

The larger concern is that these warnings may be shadowing true failures.

> My intention here is to weigh which is the lesser evil: 1) leaving some (how many? ;-) tests failing and treating this as an incentive to quickly clean up the fallout, or 2) reverting the imperfect-but-existing fix and potentially stepping into a long discussion on how to warn in the proper way. Personally, I'm afraid something is missing in the design (hence using emptiness as a proxy) and there is no proper way without touching the interface.

The widespread proliferation of filters for a legitimate warning should be very concerning. My position has always been that ignoring a warning across an entire suite is a band-aid that will always come back to bite us. For example:

https://github.com/ceph/ceph/blob/main/qa/rgw/ignore-pg-availability.yaml

The entire rgw suite is ignoring PG_DEGRADED warnings. A whole class of RADOS errors we might hope to find in component testing is now squelched.

I personally like @idryomov's option (c). The PR should go back and address this. Selectively reverting most of the changeset by removing the warnings after a legitimate fix is even messier, in my opinion, than a revert plus a proper fix.

So yes, I think this should be merged as-is immediately.

@rzarzynski (Contributor) commented Sep 8, 2023

> There are several tickets now indicating that virtually every component was affected by this PR.

Well, are they severely impacted? I'm not arguing about the lack of fallout. I'm arguing that reverting the existing fix without bringing an alternative may be more problematic than the fallout itself.

> The larger concern is that these warnings may be shadowing true failures.

True. The tests must be fixed in a reasonable time. However, this doesn't imply reverting the fix immediately.

> (c) Fixing the warning so it is only generated after the new pool has had no application for some amount of time.

This would only change the proxy, the heuristic for guessing the information we miss: from emptiness to time.

@rzarzynski (Contributor) commented:

> My position has always been that ignoring a warning across an entire suite is a band-aid that will always come back to bite us.

Global suppression versus guessing at the missing info: both are stinky, and now we're stepping into a long discussion ;-) about which one is worse. This is precisely why I'm afraid of reverting without an accepted alternative.

@pdvian pdvian requested a review from cbodley September 11, 2023 22:29
@batrick (Member Author) commented Sep 12, 2023

Closing this because the urgency is gone and @pdvian has been asked to fix this in a follow-up PR.

@batrick batrick closed this Sep 12, 2023
@batrick batrick deleted the revert-47560 branch September 12, 2023 15:06