cql_server::connection: process: rebounce msg if needed#17309

Closed
bhalevy wants to merge 6 commits intoscylladb:masterfrom
bhalevy:safe-process_on_shard

Conversation

@bhalevy
Member

@bhalevy bhalevy commented Feb 13, 2024

Re-bounce the msg to another shard if needed,
e.g. in the case of tablet migration.

Fixes #15465

@bhalevy bhalevy requested review from piodul and tgrabiec February 13, 2024 11:23
Contributor

@piodul piodul left a comment


Would it be hard to write an automated unit test (looks like it might not be easy)?

Comment thread transport/server.cc Outdated
Comment thread transport/server.cc Outdated
@bhalevy bhalevy force-pushed the safe-process_on_shard branch from 97fed73 to 0f588c5 Compare February 14, 2024 01:00
@bhalevy
Member Author

bhalevy commented Feb 14, 2024

In v2 (0f588c5):

  • rewrote the coroutines to turn co-recursion into a while loop
  • did some preliminary refactoring to organize the code churn:
    • cql_server: move process_fn_return_type to class definition
    • cql_server: connection: process: add template concept for process_fn

@bhalevy
Member Author

bhalevy commented Feb 14, 2024

CI failed in debug mode due to:

ERROR 2024-02-14 03:42:28,855 [shard 0:stmt] seastar - shared_ptr accessed on non-owner cpu, at: 0xd3088ca /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x5685f9c /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x5685cf7 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x5688075 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x5688acf /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x44bb40e /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x44bbe9e 0xfd32766 0xfdd73c5 0xfd7d42d 0xfdd4beb 0xfdd6430 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x4715728 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x471dce8 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x4721d80 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x471fa03 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x4258cda /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x425627a 0xd3a2289 0xd39f16f /lib64/libc.so.6+0x27b89 /lib64/libc.so.6+0x27c4a 0xd2c31a4

@bhalevy bhalevy force-pushed the safe-process_on_shard branch from 0f588c5 to 13d0242 Compare February 14, 2024 13:06
@bhalevy
Member Author

bhalevy commented Feb 14, 2024

In v3 (13d0242):

  • rebased
  • wrapped bounce_to_shard result in foreign_ptr so it can be safely returned from another shard
    • and defined a type for that: result_with_bounce_to_shard
  • cleaned up series:
    • cql_server: move process_fn_return_type to class definition
    • cql_server: connection: process: add template concept for process_fn
    • transport: server: pass bounce_to_shard as foreign shared ptr
    • cql_server: connection: process_on_shard: drop permit parameter
      • this change coroutinizes process()
      • cql_server: connection: process: fixup indentation
    • cql_server::connection do_process_on_shard: rebounce msg if needed
      • complete coroutinization also of process_on_shard
      • bounce in a while loop as @avikivity suggested

@bhalevy bhalevy requested review from avikivity and piodul February 14, 2024 13:10
@bhalevy
Member Author

bhalevy commented Feb 14, 2024

> Would it be hard to write an automated unit test (looks like it might not be easy)?

This path is exercised in a number of unit tests, as evident in #17309 (comment).
But I agree there's no specific test for chained rebouncing with a tablet migration in between (it might need careful fault injection to trigger the migration at exactly the right time).

Comment thread transport/server.cc Outdated
Comment thread transport/server.cc
future<cql_server::result_with_foreign_response_ptr>
cql_server::connection::process_on_shard(::shared_ptr<messages::result_message::bounce_to_shard> bounce_msg, uint16_t stream, fragmented_temporary_buffer::istream is,
service::client_state& cs, service_permit permit, tracing::trace_state_ptr trace_state, Process process_fn) {
return _server.container().invoke_on(*bounce_msg->move_to_shard(), _server._config.bounce_request_smp_service_group,
Member


Please explain in the changelog why bounce_to_shard now needs to move between shards.

Member Author


sure

Member Author


done in v4

Comment thread transport/server.cc Outdated
(cql_server::process_fn_return_type msg) mutable {
auto msg = co_await process_fn(client_state, _server._query_processor, in, stream,
_version, permit, trace_state, true, {});
// FIXME: indentation
Member


How is this related to dropping permit?

Member Author


Since we don't pass the permit down to process_on_shard we need to hold on to the permit passed to us.
Making this function a coroutine achieves that.
I can do the coroutinization in a preliminary patch if needed.

Member


At least explain it in the changelog

Member Author


done in v4

Comment thread transport/server.cc
template<typename Process>
requires std::same_as<std::invoke_result_t<Process, service::client_state&, distributed<cql3::query_processor>&, request_reader,
uint16_t, cql_protocol_version_type, service_permit, tracing::trace_state_ptr, bool, cql3::computed_function_values>, future<cql_server::process_fn_return_type>>
future<cql_server::result_with_foreign_response_ptr>
Member


> Re-bounce the msg to another shard if needed,
> e.g. in the case of tablet migration.

Please give an example of how tablet migration requires this.

Member Author


Okay. @tgrabiec, if you have an example ready I'll be happy to adopt it.

Contributor


Bouncing happens when executing LWT statement in modification_statement::execute_with_condition by returning a special result message kind. The code assumes that after jumping to the shard from the bounce request, the result message is the regular one and not yet another bounce. There is no problem with vnodes, because shards don't change. With tablets, they can change at run time on migration.

Member Author


Added the above to the commit log in v4

Comment thread transport/server.cc
auto sg = _server._config.bounce_request_smp_service_group;
auto gcs = cs.move_to_other_shard();
auto gt = tracing::global_trace_state_ptr(std::move(trace_state));
co_return co_await _server.container().invoke_on(shard, sg, [&, cached_vals = std::move(cached_vals)] (cql_server& server) mutable -> future<process_fn_return_type> {
Member


It's possible to simplify it a little - pass cached_vals by value, and here capture it by reference.

Member Author


Funny, because this is exactly how I implemented it the first time, but I changed it to use an rvalue reference to indicate that it's supposed to be passed down the stack to process_fn, and it matches take_cached_pk_function_calls(), which returns a temporary object.

Member Author


This did not change in v4.

@avikivity
Member

> Re-bounce the msg to another shard if needed, e.g. in the case of tablet migration.
>
> Fixes #15465

Need a deeper explanation about why tablets caused this to be needed.

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Container Test
✅ - dtest
❌ - Unit Tests

Failed Tests (5/22148):

Build Details:

  • Duration: 4 hr 5 min
  • Builder: i-0e2244f9eea571cd6 (m5ad.12xlarge)

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 2 min
  • Builder: spider6.cloudius-systems.com

@bhalevy
Member Author

bhalevy commented Feb 15, 2024

@yaronkaikov / @benipeled CI failed in #17309 (comment) and passed when retried, in #17309 (comment).
I see that the first instance was much slower than spider6. Do you see that correlation elsewhere?

@yaronkaikov
Contributor

> @yaronkaikov / @benipeled CI failed in #17309 (comment) and passed when retried, in #17309 (comment). I see that the first instance was much slower than spider6. Do you see that correlation elsewhere?

@bhalevy It seems like a flaky test; I see failures in only one mode (out of 9 runs for each). I'm working on adding some flaky-test logic to Gating/CI: in case a test fails only once, we will re-run it, and if the re-run passes (10 times) we will mark the build as SUCCESS and open an issue automatically (WIP https://github.com/scylladb/scylla-pkg/pull/3804)

So it can be used for a template concept in the next patch.

Signed-off-by: Benny Halevy <[email protected]>
Quoting Avi Kivity:
> Out of scope: we should consider detemplating this.

As a follow-up we should consider that and pass
a function object as process_fn, just make sure
there are no drawbacks.

Signed-off-by: Benny Halevy <[email protected]>
So it can be safely passed between shards, as will be needed
in the following patch that handles a (re)bounce_to_shard result
from process_fn that's called by `process_on_shard` on the
`move_to_shard`.

With that in mind, pass the `bounce_to_shard` payload
to `process_on_shard` rather than the foreign shared ptr
since the latter grabs what it needs from it on entry
and the shared_ptr can be released on the calling shard.

Signed-off-by: Benny Halevy <[email protected]>
It is currently unused in `process_on_shard`, which
generates an empty service_permit.

The next patch may call process_on_shard in a loop,
so it can't simply move the permit to the callee;
it's better to hold on to it until processing completes.

`cql_server::connection::process` was turned into
a coroutine in this patch to hold on to the permit parameter
in a simple way. This is a preliminary step to changing
`if (bounce_msg)` to `while (bounce_msg)` that will allow
rebouncing the message in case it moved yet again when
yielding in `process_on_shard`.

Signed-off-by: Benny Halevy <[email protected]>
Rebounce the msg to another shard if needed,
e.g. in the case of tablet migration.

An example for that, as given by Tomasz Grabiec:
> Bouncing happens when executing LWT statement in
> modification_statement::execute_with_condition by returning a
> special result message kind. The code assumes that after
> jumping to the shard from the bounce request, the result
> message is the regular one and not yet another bounce.
> There is no problem with vnodes, because shards don't change.
> With tablets, they can change at run time on migration.

Fixes scylladb#15465

Signed-off-by: Benny Halevy <[email protected]>
@bhalevy bhalevy force-pushed the safe-process_on_shard branch from 13d0242 to 7cc1866 Compare February 16, 2024 08:32
@bhalevy
Member Author

bhalevy commented Feb 16, 2024

In v4 (7cc1866):

  • improved commit log messages
  • fixed patch shortlog for "cql_server::connection: process: rebounce msg if needed"

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 1 hr 59 min
  • Builder: spider1.cloudius-systems.com

@bhalevy bhalevy changed the title cql_server::connection do_process_on_shard: rebounce msg if needed cql_server::connection: process: rebounce msg if needed Feb 18, 2024
Comment thread transport/server.cc

template<typename Process>
requires std::same_as<std::invoke_result_t<Process, service::client_state&, distributed<cql3::query_processor>&, request_reader,
uint16_t, cql_protocol_version_type, service_permit, tracing::trace_state_ptr, bool, cql3::computed_function_values>, future<cql_server::process_fn_return_type>>
Member


Note, a real constraint should start with std::invocable, so that std::invoke_result_t is known to evaluate to something.

Member Author


Oh, I wasn't aware of this; I thought it was part of the concept.
But indeed this is the way it's defined in https://en.cppreference.com/w/cpp/types/result_of

> the return type of the [Callable](https://en.cppreference.com/w/cpp/named_req/Callable) type F if invoked with the arguments ArgTypes.... Only defined if F can be called with the arguments ArgTypes... in unevaluated context. (since C++14)

@avikivity
Member

The series looks good wrt what it's trying to achieve, but now I have doubts about what it's trying to achieve.

  1. Nothing says the coordinator is co-located on a replica. The shard can even be out-of-bounds if the replica has more shards than the coordinator and the shard number happens to be too large.
  2. There is no correct shard. Each tablet replica has its own shard number. That's also true for vnodes, but only rarely.
  3. The original motivation for bouncing (in d28dd49) is unclear. Is it to avoid smp calls? Is it to attempt to consolidate locking to a single shard (and so reduce contention?)

I think we should revert d28dd49 instead. Or, if the goal is to reduce contention, we should use the vnode-style static sharding, since at least that's self-consistent.

@gleb-cloudius

@denesb
Contributor

denesb commented Feb 19, 2024

  1. Nothing says the coordinator is co-located on a replica. The shard can even be out-of-bounds if the replica has more shards than the coordinator and the shard number happens to be too large.

Yes, note that in the case of multi-page queries, replicas are sticky for the duration of the query. If the driver decides to hop between coordinators, in certain pages, the coordinator will not be a replica.

@gleb-cloudius
Contributor

> The series looks good wrt what it's trying to achieve, but now I have doubts about what it's trying to achieve.
>
> 1. Nothing says the coordinator is co-located on a replica. The shard can even be out-of-bounds if the replica has more shards than the coordinator and the shard number happens to be too large.
> 2. There is no correct shard. Each tablet replica has its own shard number. That's also true for vnodes, but only rarely.
> 3. The original motivation for bouncing (in [d28dd49](https://github.com/scylladb/scylladb/commit/d28dd4957b930af4e202b4776a993cd5474f2ed8)) is unclear. Is it to avoid smp calls? Is it to attempt to consolidate locking to a single shard (and so reduce contention)?

It is to avoid a lot of smp calls. Taking a lock on the correct shard is a matter of jumping to it one more time.

> I think we should revert d28dd49 instead.

And do what? A lot of smp calls? LWT is not performant as it is, so let's kill it even more? The whole bounce story was introduced because our statement code is an unbelievable piece of crap: something that is supposed to be completely const cannot be passed to another shard for processing (or at least that was the case back then). Otherwise, instead of bouncing, we would just jump to the correct shard when it becomes known down the stack.

But I question the premise for the series. Do we support moving a tablet from shard to shard at all? How do we do double writes to the same node in this case? Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?

@avikivity
Member

> > 1. Nothing says the coordinator is co-located on a replica. The shard can even be out-of-bounds if the replica has more shards than the coordinator and the shard number happens to be too large.
>
> Yes, note that in the case of multi-page queries, replicas are sticky for the duration of the query. If the driver decides to hop between coordinators, in certain pages, the coordinator will not be a replica.

This is LWT, so single-partition RMW.

@avikivity
Member

> It is to avoid a lot of smp calls. Taking a lock on the correct shard is a matter of jumping to it one more time.

Is this a coordinator side lock?

> And do what? A lot of smp calls? LWT is not performant as it is, so let's kill it even more? The whole bounce story was introduced because our statement code is an unbelievable piece of crap: something that is supposed to be completely const cannot be passed to another shard for processing (or at least that was the case back then). Otherwise, instead of bouncing, we would just jump to the correct shard when it becomes known down the stack.

How is "correct shard" defined here? Arbitrary (but stable) shard per coordinator? Or is it only defined if the coordinator is co-located with a replica?

> But I question the premise for the series. Do we support moving a tablet from shard to shard at all?

Currently not, but it is planned. Note currently a tablet can be migrated to another node and returned to the original node but on another shard.

Note holding the mildly named effective_replication_map_ptr prevents migrations.

> How do we do double writes to the same node in this case?

Still unknown.

> Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?

It cannot change if we hold the lock correctly.

@gleb-cloudius
Contributor

> Is this a coordinator side lock?

Why are you so locked on the lock? You need to unlock yourself from it. Besides the lock, there is paxos state in a local table, co-located with the shard the data is on.

> How is "correct shard" defined here? Arbitrary (but stable) shard per coordinator? Or is it only defined if the coordinator is co-located with a replica?

The correct shard is the shard the data belongs to. Why are you trying to distinguish between a coordinator and a replica? If the shard is not correct on the coordinator, each paxos/data table access will need cross-shard access.

> Currently not, but it is planned. Note currently a tablet can be migrated to another node and returned to the original node but on another shard.
>
> Note holding the mildly named effective_replication_map_ptr prevents migrations.

FWIW the bounce is handled too high in the stack for that, unfortunately. But we can introduce something else.

> > How do we do double writes to the same node in this case?
>
> Still unknown.
>
> > Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?
>
> It cannot change if we hold the lock correctly.

We should work on that :)

@avikivity
Member

> > Is this a coordinator side lock?
>
> Why are you so locked on the lock? You need to unlock yourself from it.

I'm trying to understand. Please point to the relevant code.

> Besides the lock, there is paxos state in a local table, co-located with the shard the data is on.

Only if the coordinator happens to be co-located with a replica, yes?

> The correct shard is the shard the data belongs to.

There is no shard that the data belongs to on the coordinator. On replicas, it's a different shard on each replica. This code executes on the coordinator.

> Why are you trying to distinguish between a coordinator and a replica?

I'm trying to understand the original code and trying to understand if the new code does something positive.

> If the shard is not correct on the coordinator, each paxos/data table access will need cross-shard access.

To understand if the shard is correct, I need to understand the definition of "correct shard". It can't be "the shard the data belongs to" because that returns different results on different nodes.

Possible definitions:

1. An arbitrary shard, but consistent across calls with the same token on the same node.
2. If the coordinator is co-located with a replica, then the shard that owns the data on that replica, otherwise an arbitrary shard.
3. A combination of 1 and 2 (first 2, then 1).

1 is needed to reduce smp calls for a coordinator-side lock. Does it exist?
2 is needed to reduce smp calls to the paxos table etc.

> FWIW the bounce is handled too high in the stack for that, unfortunately. But we can introduce something else.

Or we can push some of the handling lower down to storage_proxy. I think it would be cleaner, but probably more work.

> We should work on that :)

I agree holding the lock is better here. But I need answers to my questions.

@gleb-cloudius
Contributor

> There is no shard that the data belongs to on the coordinator. On replicas, it's a different shard on each replica. This code executes on the coordinator.

For vnodes, given that all nodes in the cluster have the same shard count, data will be on the same shard on all replicas. Regardless of the fact that the coordinator does not have any data, if the coordinator code runs on the correct shard, the reads and writes it sends to the replicas will be executed on the correct shard (rpc magic) without an additional smp call, and unlike a regular query there will be quite a few of them. There is also key locking, which will not be as efficient if done per shard, but that's not the only reason.

> > Why are you trying to distinguish between a coordinator and a replica?
>
> I'm trying to understand the original code and trying to understand if the new code does something positive.

For tablets, given that they can be on different shards on different nodes, the only benefit would indeed be locking; but if we do not preserve the same shard on all replicas in the tablet load balancer (at least as a final state), then we harm performance not only for LWT but for regular queries as well. LWT will just suffer much more.

> > If the shard is not correct on the coordinator, each paxos/data table access will need cross-shard access.
>
> To understand if the shard is correct, I need to understand the definition of "correct shard". It can't be "the shard the data belongs to" because that returns different results on different nodes.

Not for vnodes. And we are discussing the existing code here (you asked me about a specific old commit, no?) because I cannot see why we need this patch either (at least for now).

> Possible definitions:
>
> 1. An arbitrary shard, but consistent across calls with the same token on the same node.
> 2. If the coordinator is co-located with a replica, then the shard that owns the data on that replica, otherwise an arbitrary shard.
> 3. A combination of 1 and 2 (first 2, then 1).

I am sure you remember that for vnodes you can determine the shard the data will belong to on a replica (given the same number of shards) even if the node itself is not a replica. There is dht::sharder::shard_of(token). Whatever that call returns is the definition.

1 is needed to reduce smp calls for a coordinator side lock. Does it exist?

Yes.

2 is needed to reduce smp calls to the paxos table etc.

Yes. But for that the coordinator does not have to be co-located with a replica.

But I question the premise for the series. Do we support moving a tablet from shard to shard at all?

Currently not, but it is planned. Note that currently a tablet can be migrated to another node and then returned to the original node, but on another shard.
Note that holding the mildly named effective_replication_map_ptr prevents migrations.

FWIW the bounce is handled too high in the stack for that, unfortunately. But we can introduce something else.

Or we can push some of the handling lower down to storage_proxy. I think it would be cleaner, but probably more work.

That is how it should work, but it was deemed impossible with our statement code.

How do we do double writes to the same node in this case?

Still unknown.

Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?

It cannot change if we hold the lock correctly.

We should work on that :)

I agree holding the lock is better here. But I need answers to my questions.

Hope I provided some.

@nyh
Contributor

nyh commented Feb 19, 2024

How is "correct shard" defined here? Arbitrary (but stable) shard per coordinator? Or is it only defined if the coordinator is co-located with a replica?

I'm also completely baffled by this, especially in the context of Alternator (I opened an Alternator version of this issue in #17399): In Alternator the drivers are not topology aware, so most requests will arrive on a node that doesn't have any replica of the requested token. So on such a coordinator, what is even the "correct shard" where I need to run the LWT request?

I still remember when LWT coordinator requests could be run on any node on any shard. At some point you guys decided it must be run on a specific shard (determined by the "sharding function" on the token), but still any node of the cluster. Is it now necessary to run LWT requests only on a node that actually holds a replica of this token, to allow us to choose the correct shard?

@avikivity
Member

The series looks good wrt what it's trying to achieve, but now I have doubts about what it's trying to achieve.

1. Nothing says the coordinator is co-located with a replica. The shard can even be out of bounds if the replica has more shards than the coordinator and the shard number happens to be too large.

2. There is no correct shard. Each tablet replica has its own shard number. That's also true for vnodes, but only rarely.

3. The original motivation for bouncing (in [d28dd49](https://github.com/scylladb/scylladb/commit/d28dd4957b930af4e202b4776a993cd5474f2ed8)) is unclear. Is it to avoid smp calls? Is it to attempt to consolidate locking to a single shard (and so reduce contention)?

It is to avoid a lot of smp calls. Taking a lock on the correct shard is a matter of jumping to it one more time.

Is this a coordinator side lock?

Why are you so locked on the lock? You need to unlock yourself from it.

I'm trying to understand. Please point to the relevant code.

Besides the lock there is a paxos state that is in the local table co-located with the shard data is on.

Only if the coordinator happens to be co-located with a replica, yes?

I think we should revert d28dd49 instead.

And do what? A lot of smp calls? LWT is not performant as it is, so let's kill it even more. The whole bounce story was introduced because our statement code is an unbelievable piece of crap: something that is supposed to be completely const cannot be passed to another shard for processing (or at least that was the case back then). Otherwise, instead of bouncing, we would just jump to the correct shard when it becomes known down the stack.

How is "correct shard" defined here? Arbitrary (but stable) shard per coordinator? Or is it only defined if the coordinator is co-located with a replica?

The correct shard is the shard the data belongs to.

There is no shard that the data belongs to on the coordinator. On replicas, it's a different shard on each replica. This code executes on the coordinator.

For vnodes, given that all nodes in the cluster have the same shard count, data will be on the same shard on all replicas. Even though the coordinator does not have any data, if the coordinator code runs on the correct shard, the reads and writes it sends to replicas will be executed on the correct shard there (rpc magic) without additional smp calls, and unlike a regular query there will be quite a few of them. There is also key locking that will not be as efficient if it is done per shard, but that is not the only reason.

Ok. So for coordinator-only role, the answers are:

  • reduce smp calls for the lock (requires a consistent shard, but no requirement for co-location with actual data shards)
  • optimize rpc in case we have similar shard-counts across the cluster.

The lock in question is paxos_state::_paxos_table_lock.

Why are you trying to distinguish between a coordinator and replica?

I'm trying to understand the original code, and whether the new code does something positive.

For tablets, given that they can be on different shards on different nodes, the only benefit would indeed be locking, but if we do not preserve the same shard on all replicas in the tablet load balancer (at least as a final state), then we just harm performance not only for LWT but for regular queries as well. LWT will just suffer much more.

Indeed there is a requirement that the load balancer try to place tablets on the same shard, but it isn't implemented yet. We should make it work when the tablets are on different shards but preserve the optimization when they are.

If the shard is not correct on the coordinator, each paxos/data table access will need a cross-shard call.

To understand if the shard is correct, I need to understand the definition of "correct shard". It can't be "the shard data belongs to" because it returns different results on different nodes.

Not for vnodes. And we are discussing the existing code here (you asked me about a specific old commit, no?), because I cannot see why we need this patch either (at least for now).

Possible definitions:

1. An arbitrary shard, but consistent across calls with the same token on the same node

2. If the coordinator is co-located with a replica, then the shard that owns the data on that replica, otherwise an arbitrary shard.

3. A combination of 1 and 2 (first 2, then 1).

I am sure you remember that for vnodes you can determine the shard the data will belong to on a replica (given the same number of shards) even if the node itself is not a replica. There is dht::sharder::shard_of(token). Whatever that call returns is the definition.

But it's different for tablets.

1 is needed to reduce smp calls for a coordinator side lock. Does it exist?

Yes.

2 is needed to reduce smp calls to the paxos table etc.

Yes. But for that the coordinator does not have to be co-located with a replica.

But I question the premise for the series. Do we support moving a tablet from shard to shard at all?

Currently not, but it is planned. Note that currently a tablet can be migrated to another node and then returned to the original node, but on another shard.
Note that holding the mildly named effective_replication_map_ptr prevents migrations.

FWIW the bounce is handled too high in the stack for that, unfortunately. But we can introduce something else.

Or we can push some of the handling lower down to storage_proxy. I think it would be cleaner, but probably more work.

That is how it should work, but it was deemed impossible with our statement code.

That's really sad.

How do we do double writes to the same node in this case?

Still unknown.

Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?

It cannot change if we hold the lock correctly.

We should work on that :)

I agree holding the lock is better here. But I need answers to my questions.

Hope I provided some.

Yes. So I think the patch is wrong and this is how it should work:

  1. For vnodes, keep the current implementation
  2. For tablets, non-replica coordinator, select a consistent shard (can be token % smp::count)
  3. For tablets, replica coordinator, select the shard that corresponds to the tablet replica on this node

In addition we should hold effective_replication_map_ptr to avoid migration during the operation.

@gleb-cloudius
Contributor

There is no shard that the data belongs to on the coordinator. On replicas, it's a different shard on each replica. This code executes on the coordinator.

For vnodes, given that all nodes in the cluster have the same shard count, data will be on the same shard on all replicas. Even though the coordinator does not have any data, if the coordinator code runs on the correct shard, the reads and writes it sends to replicas will be executed on the correct shard there (rpc magic) without additional smp calls, and unlike a regular query there will be quite a few of them. There is also key locking that will not be as efficient if it is done per shard, but that is not the only reason.

Ok. So for coordinator-only role, the answers are:

* reduce smp calls for the lock (requires a consistent shard, but no requirement for co-location with actual data shards)

* optimize rpc in case we have similar shard-counts across the cluster.

The lock in question is paxos_state::_paxos_table_lock.

It is paxos_state::_coordinator_lock; _paxos_table_lock is replica-side.

If the shard is not correct on the coordinator, each paxos/data table access will need a cross-shard call.

To understand if the shard is correct, I need to understand the definition of "correct shard". It can't be "the shard data belongs to" because it returns different results on different nodes.

Not for vnodes. And we are discussing the existing code here (you asked me about a specific old commit, no?), because I cannot see why we need this patch either (at least for now).

Possible definitions:

1. An arbitrary shard, but consistent across calls with the same token on the same node

2. If the coordinator is co-located with a replica, then the shard that owns the data on that replica, otherwise an arbitrary shard.

3. A combination of 1 and 2 (first 2, then 1).

I am sure you remember that for vnodes you can determine the shard the data will belong to on a replica (given the same number of shards) even if the node itself is not a replica. There is dht::sharder::shard_of(token). Whatever that call returns is the definition.

But it's different for tablets.

For tablets, a coordinator that is not also a replica needs to pick the shard of one of the replicas in a deterministic way (for instance, the replica with the lowest host id). This way, if all replicas are on the same shard, we still get the rpc optimization. In case the coordinator is co-located with a replica, we need to use the local shard, since the code assumes it. I do not see how to handle the two-shards-on-the-same-node case, though, without substantial changes to the coordinator code.

How do we do double writes to the same node in this case?

Still unknown.

Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?

It cannot change if we hold the lock correctly.

We should work on that :)

I agree holding the lock is better here. But I need answers to my questions.

Hope I provided some.

Yes. So I think the patch is wrong and this is how it should work:

1. For vnodes, keep the current implementation

2. For tablets, non-replica coordinator, select a consistent shard (can be token % smp::count)

3. For tablets, replica coordinator, select the shard that corresponds to the tablet replica on this node

Agreed, except for 2. I explained above what I think is correct in this case.

@bhalevy bhalevy added area/tablets backport/none Backport is not required labels Mar 20, 2024
@kbr-scylla
Contributor

@bhalevy my team will take over #15465 and this PR, if you don't mind.

@bhalevy
Member Author

bhalevy commented Aug 26, 2024

@bhalevy my team will take over #15465 and this PR, if you don't mind.

Be my guest!

@kbr-scylla
Contributor

Replaced by #20493, closing

@kbr-scylla kbr-scylla closed this Sep 10, 2024

Labels

area/tablets backport/none Backport is not required


Development

Successfully merging this pull request may close these issues.

tablets: Shard bouncing should not assume that shards don't change