rgw: replace null_yields with optional ones #50206

Merged: cbodley merged 8 commits into ceph:main from TRYTOBE8TME:wip-rgw-yield-work, Jun 21, 2023
Conversation


@cbodley cbodley changed the title Wip rgw yield work rgw: replace null_yields with optional ones Feb 22, 2023
Contributor

@cbodley cbodley left a comment


looks great overall! 'make check' is catching some unit tests under src/test/rgw/ that need updating also

  const uint64_t queue_size = cct->_conf->rgw_gc_max_queue_size, num_deferred_entries = cct->_conf->rgw_gc_max_deferred;
  gc_log_init2(op, queue_size, num_deferred_entries);
- store->gc_operate(this, obj_names[i], &op);
+ store->gc_operate(this, obj_names[i], &op, null_yield);

you added an optional_yield y argument to RGWGC::initialize(), so you can pass that here. same with RGWGC::send_split_chain() and several other functions in rgw_gc.cc

@dang
Contributor

dang commented Feb 22, 2023

BTW, you can "make tests" (or "ninja tests") to build the tests that make check uses without running them.

  auto& ref = obj.get_ref();
  int ret = cls_timeindex_trim_repeat(dpp, ref, oid, utime_t(start_time), utime_t(end_time),
-                                     from_marker, to_marker);
+                                     from_marker, to_marker, null_yield);

You added an optional_yield to objexp_hint_trim so you don't need a null_yield here.


  int ret = exp_store.objexp_hint_trim(dpp, shard, rt_from, rt_to,
-                                      from_marker, to_marker);
+                                      from_marker, to_marker, null_yield);

Same here, there's an optional_yield added to trim_chunk.


  if (need_trim) {
-   trim_chunk(dpp, shard, last_run, round_start, marker, out_marker);
+   trim_chunk(dpp, shard, last_run, round_start, marker, out_marker, null_yield);

Same here, there's an optional_yield added to process_single_shard.

  ldpp_dout(dpp, 20) << "processing shard = " << shard << dendl;

- if (! process_single_shard(dpp, shard, last_run, round_start)) {
+ if (! process_single_shard(dpp, shard, last_run, round_start, null_yield)) {

Same here, there's an optional_yield added to inspect_all_shards.

  int RGWRados::try_refresh_bucket_info(RGWBucketInfo& info,
                                        ceph::real_time *pmtime,
-                                       const DoutPrefixProvider *dpp,
+                                       const DoutPrefixProvider *dpp, optional_yield y,

@cbodley It's a nit, but there are dozens of instances in this PR where the optional_yield argument in a function definition is added on the same line as another argument even though every other argument in the function is on its own line. I think these optional_yields should go on their own line. What do you think?


i'm generally fine with arguments on the same line as long as it doesn't go over ~80 columns

but if doubling up arguments looks inconsistent with other nearby function definitions, you're welcome to request changes

                           bucket->get_info().owner, obj_ctx,
                           dest_obj->clone(), olh_epoch, tag,
-                          dpp, null_yield);
+                          dpp, y);

@kapandya @cbodley just a heads up I added the comment on the lines above this AtomicObjectProcessor initialization saying that changing this null_yield to a y causes ragweed tests to fail. I wouldn't be surprised if it happens when we run this PR through teuthology again.

    break;
  }
- res = s->bucket->try_refresh_info(s, nullptr);
+ res = s->bucket->try_refresh_info(s, nullptr, null_yield);

you can use s->yield here and anywhere else there is a req_state which is commonly called s.

  s->bucket->get_info().set_sync_policy(std::move(sync_policy));

- int ret = s->bucket->put_info(this, false, real_time());
+ int ret = s->bucket->put_info(this, false, real_time(), null_yield);

This function takes a y.

  s->bucket->get_info().set_sync_policy(std::move(sync_policy));

- int ret = s->bucket->put_info(this, false, real_time());
+ int ret = s->bucket->put_info(this, false, real_time(), null_yield);

This function takes a y.

  s->bucket->get_info().has_website = true;
  s->bucket->get_info().website_conf = website_conf;
- op_ret = s->bucket->put_info(this, false, real_time());
+ op_ret = s->bucket->put_info(this, false, real_time(), null_yield);

This function takes a y.

  s->bucket->get_info().has_website = false;
  s->bucket->get_info().website_conf = RGWBucketWebsiteConf();
- op_ret = s->bucket->put_info(this, false, real_time());
+ op_ret = s->bucket->put_info(this, false, real_time(), null_yield);

This function takes a y.

  op_ret = retry_raced_bucket_write(this, s->bucket.get(), [this] {
    s->bucket->get_info().obj_lock = obj_lock;
-   op_ret = s->bucket->put_info(this, false, real_time());
+   op_ret = s->bucket->put_info(this, false, real_time(), null_yield);

This function takes a y.

@alimaredia
Contributor

@kapandya great start to this work! On top of the other comments that were left, do you have a list of which ops no longer produce librados blocking warnings and which still do or are untested? What happens with a full run of s3-tests?

@github-actions

github-actions bot commented Mar 1, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

kapandya added 4 commits June 20, 2023 03:18
Passing optional_yield to rgw_rados.cc

Signed-off-by: Kalpesh Pandya <[email protected]>
Changing null_yield in rgw_reshard.cc

Signed-off-by: Kalpesh Pandya <[email protected]>
Changing null_yield in rgw_sal_rados.cc

Signed-off-by: Kalpesh Pandya <[email protected]>
@mattbenjamin
Contributor

we must be very close here, thanks for the rebase, @TRYTOBE8TME :)

@mattbenjamin
Contributor

I see that make check failed:

[ RUN ] Queue.SpawnAsyncRequest
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/rgw/test_rgw_dmclock_scheduler.cc:425: Failure
Value of: context.stopped()
Actual: false
Expected: true
[ FAILED ] Queue.SpawnAsyncRequest (9 ms)
[----------] 8 tests from Queue (42 ms total)

@TRYTOBE8TME
Author

jenkins test make check

@TRYTOBE8TME
Author

jenkins test api

@TRYTOBE8TME
Author

I see that make check failed:

[ RUN ] Queue.SpawnAsyncRequest
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/rgw/test_rgw_dmclock_scheduler.cc:425: Failure
Value of: context.stopped()
Actual: false
Expected: true
[ FAILED ] Queue.SpawnAsyncRequest (9 ms)
[----------] 8 tests from Queue (42 ms total)

These went away, but now I see the ceph API tests check failing, and it doesn't show any reason.

@cbodley
Contributor

cbodley commented Jun 21, 2023

jenkins test api

@cbodley cbodley merged commit 7c94864 into ceph:main Jun 21, 2023
@cbodley
Contributor

cbodley commented Jun 22, 2023

thanks @TRYTOBE8TME! are there more null_yields that we need to catch? @adamemerson had found some in #52036

as a next step, i'd like to find a way to add test coverage in teuthology to make sure that no requests are blocking. that might involve a config variable (something like rgw_debug_require_beast_async) that causes us to assert instead of just logging WARNING: blocking librados call. what do you think?

@mattbenjamin
Contributor

@cbodley does this PR need to block on adding tests which find more blocking calls we could fix? I think we want to get this one in when it's complete and stable, no?

@yuvalif
Contributor

yuvalif commented Jun 22, 2023

thanks @TRYTOBE8TME! are there more null_yields that we need to catch? @adamemerson had found some in #52036

as a next step, i'd like to find a way to add test coverage in teuthology to make sure that no requests are blocking. that might involve a config variable (something like rgw_debug_require_beast_async) that causes us to assert instead of just logging WARNING: blocking librados call. what do you think?

maybe there should be a compilation flag to compile out all the blocking rados calls, and compile the RGW with this flag?

@TRYTOBE8TME
Author

TRYTOBE8TME commented Jun 22, 2023

@cbodley yes, there are still some places where null_yield needs to be replaced; basically all of them are part of ops whose conversion remains to be done.

@cbodley
Contributor

cbodley commented Jun 22, 2023

@cbodley does this PR need to block on adding tests which find more blocking calls we could fix? I think we want to get this one in when it's complete and stable, no?

@mattbenjamin this PR has merged, sir! i think we're close to the finish line here, so i'd like to add some regression testing to catch any remaining blocking calls, and keep us from adding new ones. then we can do some performance measurements to figure out how to size the thread pool optimally

@cbodley
Contributor

cbodley commented Jun 22, 2023

maybe there should be a compilation flag to compile out all the blocking rados calls, and compile the RGW with this flag?

@yuvalif we still need the blocking mode for radosgw-admin calls, librgw requests, and background threads. it's just the asio threads (like the beast frontend and rgw::notify::Manager) that should enforce async-only

@yuvalif
Contributor

yuvalif commented Jun 22, 2023

maybe there should be a compilation flag to compile out all the blocking rados calls, and compile the RGW with this flag?

@yuvalif we still need the blocking mode for radosgw-admin calls, librgw requests, and background threads. it's just the asio threads (like the beast frontend and rgw::notify::Manager) that should enforce async-only

so, can we force an explicit null_yield in these cases?
or do you want to catch beast+null_yield cases as well?

cfsnyder added a commit to cfsnyder/ceph that referenced this pull request Jul 18, 2023
Also clears stale pending entries prior to attempting to
clear OLH index entries and remove the OLH object.

Signed-off-by: Cory Snyder <[email protected]>
(cherry picked from commit 3437897)

Conflicts:
	src/rgw/driver/rados/rgw_rados.cc
	src/rgw/driver/rados/rgw_rados.h

Cherry-pick notes:
- conflicts due to ceph#50206 on main but not reef
cfsnyder added a commit to cfsnyder/ceph that referenced this pull request Jul 18, 2023
…ror scenarios

Makes request threads clean up after themselves and remove their pending
xattrs from OLH objects before they return an error from set_olh.

Fixes: https://tracker.ceph.com/issues/59663
Signed-off-by: Cory Snyder <[email protected]>
(cherry picked from commit 38133e5)

Conflicts:
	src/rgw/driver/rados/rgw_rados.cc
	src/rgw/driver/rados/rgw_rados.h

Cherry-pick notes:
- conflicts due to ceph#50206 on main but not reef
cfsnyder added a commit to cfsnyder/ceph that referenced this pull request Jul 18, 2023
When pending_removal is true and the log is trimmed prior to
clearing olh, it allows for the possibility that the olh is not
cleared (e.g. due to new pending xattrs on the olh obj), but the
olh log entries are removed.

Signed-off-by: Cory Snyder <[email protected]>
(cherry picked from commit dfbe4f9)

Conflicts:
	src/rgw/driver/rados/rgw_rados.cc

Cherry-pick notes:
- conflicts due to ceph#50206 on main