Skip to content

Comments

osd: fix require_min_compat_client handling for msr rules#59474

Merged
ljflores merged 4 commits intoceph:mainfrom
athanatos:sjust/for-review/wip-67755-fix-msr-feature
Sep 5, 2024
Merged

osd: fix require_min_compat_client handling for msr rules#59474
ljflores merged 4 commits intoceph:mainfrom
athanatos:sjust/for-review/wip-67755-fix-msr-feature

Conversation

@athanatos
Copy link
Contributor

This PR fixes 3 problems:

  1. check require_min_compat_client when creating a crush rule while creating a pool with an ec profile
  2. consider a crushmap to require CRUSH_MSR if any rules are msr rules, even if not attached to a pool
  3. use CEPH_FEATUREMASK_CRUSH_MSR rather than CEPH_FEATURE_CRUSH_MSR

See commit messages for details.

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

…pool

OSDMap::get_features is used by
OSDMonitor::validate_crush_against_features via
OSDMap::get_min_compat_client() to check whether changes to the crushmap
will require newer features than the existing require_min_compat_client
field.

Monitor commands which create rules from ec profiles may result in msr
rules.  While it might be harmless to allow msr rules to exist as long
as there aren't any pools actually using the rule, it's probably simpler
to disallow their creation in the first place until
require_min_compat_client is updated.

Signed-off-by: Samuel Just <[email protected]>
This was simply a mistake:

uint64_t features = 0;
features |= CEPH_FEATURE_CRUSH_MSR;
ceph_assert(HAVE_FEATURE(features, CRUSH_MSR));

will fail the assert as CEPH_FEATURE_CRUSH_MSR lacks the mask.  In
basically all cases, CEPH_FEATUREMASK_* is the correct choice.

Signed-off-by: Samuel Just <[email protected]>
Also adjusts validate_crush_against_features to use an ostream&
rather than a stringstream&.

Fixes: https://tracker.ceph.com/issues/67755
Signed-off-by: Samuel Just <[email protected]>
@athanatos athanatos requested a review from ljflores August 28, 2024 04:56
@athanatos athanatos requested a review from a team as a code owner August 28, 2024 04:56
@ljflores
Copy link
Member

jenkins test api

@ljflores
Copy link
Member

jenkins test make check

Copy link
Member

@ljflores ljflores left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I tested this using the following steps:

Start a vstart cluster with 7 OSDs:

$ OSD=7 ../src/vstart.sh --debug --new -x --localhost --bluestore

Create MSR ec profile:

$ ./bin/ceph osd erasure-code-profile set ec86-profile k=8 m=6 crush-failure-domain=host crush-osds-per-failure-domain=4 crush-num-failure-domains=4

Try to create an ec pool with it. It correctly fails since require-min-compat-client is luminous (< squid):

$ ./bin/ceph osd pool create ec86-pool erasure ec86-profile
Error EINVAL: new crush map requires client version squid but require_min_compat_client is luminous

Set require-min-compat-client to something else, still < squid:

$ ./bin/ceph osd set-require-min-compat-client reef
set require_min_compat_client to reef

Try to create an ec pool with the MSR ec profile. It still fails since reef < squid:

$ ./bin/ceph osd pool create ec86-pool erasure ec86-profile
Error EINVAL: new crush map requires client version squid but require_min_compat_client is reef

Try to change the crushmap manually with an old crush map with an MSR rule. It fails since crushmaps with MSR rules should not be accepted if require-min-compat-client < squid:

$ ./bin/ceph osd setcrushmap -i ~/bz_2302230/crush.map.bin
Error EINVAL: new crush map requires client version squid but require_min_compat_client is reef

Set require-min-compat-client to squid:

$ ./bin/ceph osd set-require-min-compat-client squid
set require_min_compat_client to squid

Try to create a pool with the MSR ec profile. It succeeds this time since require-min-compat-client = squid:

$ ./bin/ceph osd pool create ec86-pool erasure ec86-profile
pool 'ec86-pool' created

Try to set require-min-compat-client lower than squid. This fails since we have an MSR rule in use, which uses squid features:

$ ./bin/ceph osd set-require-min-compat-client reef
Error EPERM: osdmap current utilizes features that require squid; cannot set require_min_compat_client below that to reef

Extract the crushmap and try to reapply it via CLI. This succeeds, as require-min-compat-client = squid.

$ ./bin/ceph osd getcrushmap > crush.map.bin
16
$ ./bin/ceph osd setcrushmap -i crush.map.bin
17

@ljflores
Copy link
Member

ljflores commented Aug 29, 2024

@athanatos so far I found one type of failure related to this PR. I'm still reviewing the upgrade suite, so full results pending.

Tested here: https://tracker.ceph.com/issues/67783

/a/yuriw-2024-08-28_23:20:36-rados-wip-yuri4-testing-2024-08-28-1359-distro-default-smithi/7879335
/a/yuriw-2024-08-28_23:20:36-rados-wip-yuri4-testing-2024-08-28-1359-distro-default-smithi/7879444
/a/yuriw-2024-08-28_23:20:36-rados-wip-yuri4-testing-2024-08-28-1359-distro-default-smithi/7879549
2024-08-29T05:22:39.061 INFO:teuthology.orchestra.run.smithi070.stderr:noscrub is set
2024-08-29T05:22:39.076 INFO:tasks.rados:starting run 0 out of 1
2024-08-29T05:22:39.076 INFO:tasks.ceph.ceph_manager.ceph:creating pool_name unique_pool_0
2024-08-29T05:22:39.076 DEBUG:teuthology.orchestra.run.smithi070:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd pool create unique_pool_0 16 16 erasure jerasure21profile
2024-08-29T05:22:39.094 DEBUG:teuthology.orchestra.run.smithi096:> sudo adjust-ulimits ceph-objectstore-tool --err-to-stderr --no-mon-config '--log-file=/var/log/ceph/objectstore_tool.$pid.log' --data-path /var/lib/ceph/osd/ceph-6 --journal-path /var/lib/ceph/osd/ceph-6/journal --op export --pgid 2.6 --file /var/log/ceph/exp.2.6.6
2024-08-29T05:22:39.160 INFO:tasks.ceph.osd.2.smithi070.stderr:2024-08-29T05:22:39.158+0000 7fdfc6d87640 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 2  (PID: 22460) UID: 0
2024-08-29T05:22:39.167 INFO:teuthology.orchestra.run.smithi070.stdout:ERROR: (22) Invalid argument
2024-08-29T05:22:39.167 INFO:teuthology.orchestra.run.smithi070.stdout:op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
2024-08-29T05:22:39.177 DEBUG:teuthology.orchestra.run.smithi070:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok dump_ops_in_flight
2024-08-29T05:22:39.257 INFO:tasks.ceph.osd.8.smithi130.stderr:2024-08-29T05:22:39.256+0000 7f030f3e8640 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 8  (PID: 22908) UID: 0
2024-08-29T05:22:39.324 INFO:teuthology.orchestra.run.smithi070.stdout:ERROR: (22) Invalid argument
2024-08-29T05:22:39.324 INFO:teuthology.orchestra.run.smithi070.stdout:op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
2024-08-29T05:22:39.333 DEBUG:teuthology.orchestra.run.smithi070:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok dump_blocked_ops
2024-08-29T05:22:39.358 INFO:tasks.ceph.osd.7.smithi096.stderr:2024-08-29T05:22:39.357+0000 7fac60427640 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 7  (PID: 21960) UID: 0
2024-08-29T05:22:39.422 INFO:teuthology.orchestra.run.smithi070.stderr:Error EINVAL: new crush map requires client version squid but require_min_compat_client is luminous

@github-actions github-actions bot added the tests label Aug 30, 2024
@athanatos
Copy link
Contributor Author

@athanatos
Copy link
Contributor Author

@ljflores Looks like my commit above fixed that failure.

@ljflores
Copy link
Member

ljflores commented Sep 3, 2024

jenkins test make check

@ljflores
Copy link
Member

ljflores commented Sep 3, 2024

@ljflores
Copy link
Member

ljflores commented Sep 3, 2024

@ljflores
Copy link
Member

ljflores commented Sep 4, 2024

jenkins test make check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants