cql_server::connection: process: rebounce msg if needed#17309

Closed
bhalevy wants to merge 6 commits intoscylladb:masterfrom
bhalevy:safe-process_on_shard

Conversation

@bhalevy
Member

@bhalevy bhalevy commented Feb 13, 2024

Re-bounce the msg to another shard if needed,
e.g. in the case of tablet migration.

Fixes #15465

@bhalevy bhalevy requested review from piodul and tgrabiec February 13, 2024 11:23
Contributor

@piodul piodul left a comment


Would it be hard to write an automated unit test (looks like it might not be easy)?

Comment thread transport/server.cc Outdated
Comment thread transport/server.cc Outdated
@bhalevy bhalevy force-pushed the safe-process_on_shard branch from 97fed73 to 0f588c5 Compare February 14, 2024 01:00
@bhalevy
Member Author

bhalevy commented Feb 14, 2024

In v2 (0f588c5):

  • rewrote the coroutines to turn co-recursion into a while loop
  • did some preliminary refactoring to organize the code churn:
    • cql_server: move process_fn_return_type to class definition
    • cql_server: connection: process: add template concept for process_fn

@bhalevy
Member Author

bhalevy commented Feb 14, 2024

CI failed in debug mode due to:

ERROR 2024-02-14 03:42:28,855 [shard 0:stmt] seastar - shared_ptr accessed on non-owner cpu, at: 0xd3088ca /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x5685f9c /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x5685cf7 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x5688075 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x5688acf /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x44bb40e /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x44bbe9e 0xfd32766 0xfdd73c5 0xfd7d42d 0xfdd4beb 0xfdd6430 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x4715728 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x471dce8 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x4721d80 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x471fa03 /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x4258cda /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x425627a 0xd3a2289 0xd39f16f /lib64/libc.so.6+0x27b89 /lib64/libc.so.6+0x27c4a 0xd2c31a4

@bhalevy bhalevy force-pushed the safe-process_on_shard branch from 0f588c5 to 13d0242 Compare February 14, 2024 13:06
@bhalevy
Member Author

bhalevy commented Feb 14, 2024

In v3 (13d0242):

  • rebased
  • wrapped bounce_to_shard result in foreign_ptr so it can be safely returned from another shard
    • and defined a type for that: result_with_bounce_to_shard
  • cleaned up series:
    • cql_server: move process_fn_return_type to class definition
    • cql_server: connection: process: add template concept for process_fn
    • transport: server: pass bounce_to_shard as foreign shared ptr
    • cql_server: connection: process_on_shard: drop permit parameter
      • this change coroutinizes process()
      • cql_server: connection: process: fixup indentation
    • cql_server::connection do_process_on_shard: rebounce msg if needed
      • complete coroutinization also of process_on_shard
      • bounce in a while loop as @avikivity suggested

@bhalevy bhalevy requested review from avikivity and piodul February 14, 2024 13:10
@bhalevy
Member Author

bhalevy commented Feb 14, 2024

> Would it be hard to write an automated unit test (looks like it might not be easy)?

This path is exercised in a number of unit tests, as evident in #17309 (comment).
But I agree there's no specific test for chained rebouncing with a tablet migration in between (it might need careful fault injection to trigger the migration at exactly the right time).

Comment thread transport/server.cc Outdated
Comment thread transport/server.cc
future<cql_server::result_with_foreign_response_ptr>
cql_server::connection::process_on_shard(::shared_ptr<messages::result_message::bounce_to_shard> bounce_msg, uint16_t stream, fragmented_temporary_buffer::istream is,
service::client_state& cs, service_permit permit, tracing::trace_state_ptr trace_state, Process process_fn) {
return _server.container().invoke_on(*bounce_msg->move_to_shard(), _server._config.bounce_request_smp_service_group,
Member


Please explain in the changelog why bounce_to_shard now needs to move between shards.

Member Author


sure

Member Author


done in v4

Comment thread transport/server.cc Outdated
(cql_server::process_fn_return_type msg) mutable {
auto msg = co_await process_fn(client_state, _server._query_processor, in, stream,
_version, permit, trace_state, true, {});
// FIXME: indentation
Member


How is this related to dropping permit?

Member Author


Since we don't pass the permit down to process_on_shard we need to hold on to the permit passed to us.
Making this function a coroutine achieves that.
I can do the coroutinization in a preliminary patch if needed.

Member


At least explain it in the changelog

Member Author


done in v4

Comment thread transport/server.cc
template<typename Process>
requires std::same_as<std::invoke_result_t<Process, service::client_state&, distributed<cql3::query_processor>&, request_reader,
uint16_t, cql_protocol_version_type, service_permit, tracing::trace_state_ptr, bool, cql3::computed_function_values>, future<cql_server::process_fn_return_type>>
future<cql_server::result_with_foreign_response_ptr>
Member


> Re-bounce the msg to another shard if needed,
> e.g. in the case of tablet migration.

Please give an example of how tablet migration requires this.

Member Author


Okay. @tgrabiec, if you have an example ready I'll be happy to adopt it.

Contributor


Bouncing happens when executing LWT statement in modification_statement::execute_with_condition by returning a special result message kind. The code assumes that after jumping to the shard from the bounce request, the result message is the regular one and not yet another bounce. There is no problem with vnodes, because shards don't change. With tablets, they can change at run time on migration.

Member Author


Added the above to the commit log in v4

Comment thread transport/server.cc
auto sg = _server._config.bounce_request_smp_service_group;
auto gcs = cs.move_to_other_shard();
auto gt = tracing::global_trace_state_ptr(std::move(trace_state));
co_return co_await _server.container().invoke_on(shard, sg, [&, cached_vals = std::move(cached_vals)] (cql_server& server) mutable -> future<process_fn_return_type> {
Member


It's possible to simplify it a little - pass cached_vals by value, and here capture it by reference.

Member Author


Funny, because this is exactly how I implemented it the first time, but I changed it to use an rvalue reference to indicate that it's supposed to be passed down the stack to process_fn, and it matches take_cached_pk_function_calls(), which returns a temporary object.

Member Author


This did not change in v4.

@avikivity
Member

> Re-bounce the msg to another shard if needed, e.g. in the case of tablet migration.
>
> Fixes #15465

Need a deeper explanation about why tablets caused this to be needed.

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Container Test
✅ - dtest
❌ - Unit Tests

Failed Tests (5/22148):

Build Details:

  • Duration: 4 hr 5 min
  • Builder: i-0e2244f9eea571cd6 (m5ad.12xlarge)

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 2 hr 2 min
  • Builder: spider6.cloudius-systems.com

@bhalevy
Member Author

bhalevy commented Feb 15, 2024

@yaronkaikov / @benipeled CI failed in #17309 (comment) and passed when retried, in #17309 (comment).
I see that the first instance was much slower than spider6. Do you see that correlation elsewhere?

@yaronkaikov
Contributor

> @yaronkaikov / @benipeled CI failed in #17309 (comment) and passed when retried, in #17309 (comment). I see that the first instance was much slower than spider6. Do you see that correlation elsewhere?

@bhalevy It seems like a flaky test; I see failures in only one mode (out of 9 runs for each). I'm working on adding some flaky-test logic to Gating/CI: in case a test fails only once, we will re-run it, and if the re-run passes (10 times) we will mark the build as SUCCESS and open an issue automatically (WIP https://github.com/scylladb/scylla-pkg/pull/3804)

So it can be used for a template concept in the next patch.

Signed-off-by: Benny Halevy <[email protected]>
Quoting Avi Kivity:
> Out of scope: we should consider detemplating this.

As a follow-up we should consider that and pass
a function object as process_fn, just make sure
there are no drawbacks.

Signed-off-by: Benny Halevy <[email protected]>
So it can be safely passed between shards, as will be needed
in the following patch that handles a (re)bounce_to_shard result
from process_fn that's called by `process_on_shard` on the
`move_to_shard`.

With that in mind, pass the `bounce_to_shard` payload
to `process_on_shard` rather than the foreign shared ptr
since the latter grabs what it needs from it on entry
and the shared_ptr can be released on the calling shard.

Signed-off-by: Benny Halevy <[email protected]>
It is currently unused in `process_on_shard`, which
generates an empty service_permit.

The next patch may call process_on_shard in a loop,
so it can't simply move the permit to the callee;
it's better to hold on to it until processing completes.

`cql_server::connection::process` was turned into
a coroutine in this patch to hold on to the permit parameter
in a simple way. This is a preliminary step to changing
`if (bounce_msg)` to `while (bounce_msg)` that will allow
rebouncing the message in case it moved yet again when
yielding in `process_on_shard`.

Signed-off-by: Benny Halevy <[email protected]>
Rebounce the msg to another shard if needed,
e.g. in the case of tablet migration.

An example for that, as given by Tomasz Grabiec:
> Bouncing happens when executing LWT statement in
> modification_statement::execute_with_condition by returning a
> special result message kind. The code assumes that after
> jumping to the shard from the bounce request, the result
> message is the regular one and not yet another bounce.
> There is no problem with vnodes, because shards don't change.
> With tablets, they can change at run time on migration.

Fixes scylladb#15465

Signed-off-by: Benny Halevy <[email protected]>
@bhalevy bhalevy force-pushed the safe-process_on_shard branch from 13d0242 to 7cc1866 Compare February 16, 2024 08:32
@bhalevy
Member Author

bhalevy commented Feb 16, 2024

In v4 (7cc1866):

  • improved commit log messages
  • fixed patch shortlog for "cql_server::connection: process: rebounce msg if needed"

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 1 hr 59 min
  • Builder: spider1.cloudius-systems.com

@bhalevy bhalevy changed the title cql_server::connection do_process_on_shard: rebounce msg if needed cql_server::connection: process: rebounce msg if needed Feb 18, 2024
Comment thread transport/server.cc

template<typename Process>
requires std::same_as<std::invoke_result_t<Process, service::client_state&, distributed<cql3::query_processor>&, request_reader,
uint16_t, cql_protocol_version_type, service_permit, tracing::trace_state_ptr, bool, cql3::computed_function_values>, future<cql_server::process_fn_return_type>>
Member


Note, a real constraint should start with std::invocable, so that std::invoke_result_t is known to evaluate to something.

Member Author


Oh, I wasn't aware of this; I thought it was part of the concept.
But indeed this is the way it's defined in https://en.cppreference.com/w/cpp/types/result_of

> the return type of the [Callable](https://en.cppreference.com/w/cpp/named_req/Callable) type F if invoked with the arguments ArgTypes.... Only defined if F can be called with the arguments ArgTypes... in unevaluated context. (since C++14)

@avikivity
Member

The series looks good wrt what it's trying to achieve, but now I have doubts about what it's trying to achieve.

  1. Nothing says the coordinator is co-located on a replica. The shard can even be out-of-bounds if the replica has more shards than the coordinator and the shard number happens to be too large.
  2. There is no correct shard. Each tablet replica has its own shard number. That's also true for vnodes, but only rarely.
  3. The original motivation for bouncing (in d28dd49) is unclear. Is it to avoid smp calls? Is it to attempt to consolidate locking to a single shard (and so reduce contention?)

I think we should revert d28dd49 instead. Or, if the goal is to reduce contention, we should use the vnode-style static sharding, since at least that's self-consistent.

@gleb-cloudius

@denesb
Contributor

denesb commented Feb 19, 2024

  1. Nothing says the coordinator is co-located on a replica. The shard can even be out-of-bounds if the replica has more shards than the coordinator and the shard number happens to be too large.

Yes, note that in the case of multi-page queries, replicas are sticky for the duration of the query. If the driver decides to hop between coordinators, in certain pages, the coordinator will not be a replica.

@gleb-cloudius
Contributor

> The series looks good wrt what it's trying to achieve, but now I have doubts about what it's trying to achieve.
>
> 1. Nothing says the coordinator is co-located on a replica. The shard can even be out-of-bounds if the replica has more shards than the coordinator and the shard number happens to be too large.
> 2. There is no correct shard. Each tablet replica has its own shard number. That's also true for vnodes, but only rarely.
> 3. The original motivation for bouncing (in [d28dd49](https://github.com/scylladb/scylladb/commit/d28dd4957b930af4e202b4776a993cd5474f2ed8)) is unclear. Is it to avoid smp calls? Is it to attempt to consolidate locking to a single shard (and so reduce contention)?

It is to avoid a lot of smp calls. Taking a lock on the correct shard is a matter of jumping to it one more time.

> I think we should revert d28dd49 instead.

And do what? A lot of smp calls? LWT is not performant as it is, so let's kill it even more? The whole bounce story was introduced because our statement code is an unbelievable piece of crap: something that is supposed to be completely const cannot be passed to another shard for processing (or at least that was the case back then). Otherwise, instead of bouncing, we would just jump to the correct shard when it becomes known down the stack.

But I question the premise for the series. Do we support moving a tablet from shard to shard at all? How do we do double writes to the same node in this case? Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?

@avikivity
Member

> > 1. Nothing says the coordinator is co-located on a replica. The shard can even be out-of-bounds if the replica has more shards than the coordinator and the shard number happens to be too large.
>
> Yes, note that in the case of multi-page queries, replicas are sticky for the duration of the query. If the driver decides to hop between coordinators, in certain pages, the coordinator will not be a replica.

This is LWT, so single-partition RMW.

@avikivity
Member

> It is to avoid a lot of smp calls. Taking a lock on the correct shard is a matter of jumping to it one more time.

Is this a coordinator side lock?

> And do what? A lot of smp calls? LWT is not performant as it is, so let's kill it even more? The whole bounce story was introduced because our statement code is an unbelievable piece of crap: something that is supposed to be completely const cannot be passed to another shard for processing (or at least that was the case back then). Otherwise, instead of bouncing, we would just jump to the correct shard when it becomes known down the stack.

How is "correct shard" defined here? Arbitrary (but stable) shard per coordinator? Or is it only defined if the coordinator is co-located with a replica?

> But I question the premise for the series. Do we support moving a tablet from shard to shard at all?

Currently not, but it is planned. Note currently a tablet can be migrated to another node and returned to the original node but on another shard.

Note holding the mildly named effective_replication_map_ptr prevents migrations.

> How do we do double writes to the same node in this case?

Still unknown.

> Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?

It cannot change if we hold the lock correctly.

@gleb-cloudius
Contributor

> Is this a coordinator side lock?

Why are you so locked on the lock? You need to unlock yourself from it. Besides the lock, there is paxos state in a local table, co-located with the shard the data is on.

> How is "correct shard" defined here? Arbitrary (but stable) shard per coordinator? Or is it only defined if the coordinator is co-located with a replica?

The correct shard is the shard the data belongs to. Why are you trying to distinguish between a coordinator and a replica? If the shard is not correct on the coordinator, each paxos/data table access will need cross-shard access.

> Currently not, but it is planned. Note currently a tablet can be migrated to another node and returned to the original node but on another shard.
>
> Note holding the mildly named effective_replication_map_ptr prevents migrations.

FWIW the bounce is handled too high in the stack for that, unfortunately. But we can introduce something else.

> > How do we do double writes to the same node in this case?
>
> Still unknown.
>
> > Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?
>
> It cannot change if we hold the lock correctly.

We should work on that :)

@avikivity
Member

> > Is this a coordinator side lock?
>
> Why are you so locked on the lock? You need to unlock yourself from it.

I'm trying to understand. Please point to the relevant code.

> Besides the lock, there is paxos state in a local table, co-located with the shard the data is on.

Only if the coordinator happens to be co-located with a replica, yes?

> The correct shard is the shard the data belongs to.

There is no shard that the data belongs to on the coordinator. On replicas, it's a different shard on each replica. This code executes on the coordinator.

> Why are you trying to distinguish between a coordinator and a replica?

I'm trying to understand the original code and trying to understand if the new code does something positive.

> If the shard is not correct on the coordinator, each paxos/data table access will need cross-shard access.

To understand if the shard is correct, I need to understand the definition of "correct shard". It can't be "the shard the data belongs to" because that returns different results on different nodes.

Possible definitions:

1. An arbitrary shard, but consistent across calls with the same token on the same node.
2. If the coordinator is co-located with a replica, then the shard that owns the data on that replica, otherwise an arbitrary shard.
3. A combination of 1 and 2 (first 2, then 1).

1 is needed to reduce smp calls for a coordinator-side lock. Does it exist?
2 is needed to reduce smp calls to the paxos table etc.

> FWIW the bounce is handled too high in the stack for that, unfortunately. But we can introduce something else.

Or we can push some of the handling lower down to storage_proxy. I think it would be cleaner, but probably more work.

> We should work on that :)

I agree holding the lock is better here. But I need answers to my questions.

@gleb-cloudius
Contributor

> There is no shard that the data belongs to on the coordinator. On replicas, it's a different shard on each replica. This code executes on the coordinator.

For vnodes, given that all nodes in the cluster have the same shard count, data will be on the same shard on all replicas. Regardless of the fact that the coordinator does not have any data, if the coordinator code runs on the correct shard, the reads and writes it sends to the replicas will be executed on the correct shard (rpc magic) without an additional smp call, and unlike a regular query there will be quite a few of them. There is also key locking, which will not be as efficient if done per shard, but that's not the only reason.

> > Why are you trying to distinguish between a coordinator and a replica?
>
> I'm trying to understand the original code and trying to understand if the new code does something positive.

For tablets, given that they can be on different shards on different nodes, the only benefit would indeed be locking; but if we do not preserve the same shard on all replicas in the tablet load balancer (at least as a final state), then we harm performance not only for LWT but for regular queries as well. LWT will just suffer much more.

> > If the shard is not correct on the coordinator, each paxos/data table access will need cross-shard access.
>
> To understand if the shard is correct, I need to understand the definition of "correct shard". It can't be "the shard the data belongs to" because that returns different results on different nodes.

Not for vnodes. And we are discussing the existing code here (you asked me about a specific old commit, no?) because I cannot see why we need this patch either (at least for now).

> Possible definitions:
>
> 1. An arbitrary shard, but consistent across calls with the same token on the same node.
> 2. If the coordinator is co-located with a replica, then the shard that owns the data on that replica, otherwise an arbitrary shard.
> 3. A combination of 1 and 2 (first 2, then 1).

I am sure you remember that for vnodes you can determine the shard the data will belong to on a replica (given the same number of shards) even if the node itself is not a replica. There is dht::sharder::shard_of(token). Whatever that call returns is the definition.

1 is needed to reduce smp calls for a coordinator side lock. Does it exist?

Yes.

2 is needed to reduce smp calls to the paxos table etc.

Yes. But for that the coordinator does not have to be co-located with a replica.

But I question the premise for the series. Do we support moving a tablet from shard to shard at all?

Currently not, but it is planned. Note that currently a tablet can be migrated to another node and then returned to the original node, but on another shard.
Note that holding the mildly named effective_replication_map_ptr prevents migrations.

FWIW the bounce is handled too high in the stack for that, unfortunately. But we can introduce something else.

Or we can push some of the handling lower down to storage_proxy. I think it would be cleaner, but probably more work.

That is how it should work, but it was deemed impossible with our statement code.

How do we do double writes to the same node in this case?

Still unknown.

Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?

It cannot change if we hold the lock correctly.

We should work on that :)

I agree holding the lock is better here. But I need answers to my questions.

Hope I provided some.

@nyh
Contributor

nyh commented Feb 19, 2024

How is "correct shard" defined here? Arbitrary (but stable) shard per coordinator? Or is it only defined if the coordinator is co-located with a replica?

I'm also completely baffled by this, especially in the context of Alternator (I opened an Alternator version of this issue in #17399): In Alternator the drivers are not topology aware, so most requests will arrive on a node that doesn't have any replica of the requested token. So on such a coordinator, what is even the "correct shard" where I need to run the LWT request?

I still remember when LWT coordinator requests could be run on any node on any shard. At some point you guys decided it must be run on a specific shard (determined by the "sharding function" on the token), but still any node of the cluster. Is it now necessary to run LWT requests only on a node that actually holds a replica of this token, to allow us to choose the correct shard?

@avikivity
Member

The series looks good wrt what it's trying to achieve, but now I have doubts about what it's trying to achieve.

1. Nothing says the coordinator is co-located with a replica. The shard can even be out of bounds if the replica has more shards than the coordinator and the shard number happens to be too large.

2. There is no correct shard. Each tablet replica has its own shard number. That's also true for vnodes, but only rarely.

3. The original motivation for bouncing (in [d28dd49](https://github.com/scylladb/scylladb/commit/d28dd4957b930af4e202b4776a993cd5474f2ed8)) is unclear. Is it to avoid smp calls? Is it to attempt to consolidate locking to a single shard (and so reduce contention)?

It is to avoid a lot of smp calls. Taking a lock on the correct shard is a matter of jumping to it one more time.

Is this a coordinator side lock?

Why are you so locked on the lock? You need to unlock yourself from it.

I'm trying to understand. Please point to the relevant code.

Besides the lock there is a paxos state that is in the local table co-located with the shard data is on.

Only if the coordinator happens to be co-located with a replica, yes?

I think we should revert d28dd49 instead.

And do what? A lot of smp calls? LWT is not performant as it is, so let's kill it even more. The whole bounce story was introduced because our statement code is an unbelievable piece of crap: something that is supposed to be completely const cannot be passed to another shard for processing (or at least that was the case back then). Otherwise, instead of bouncing, we would just jump to the correct shard when it becomes known down the stack.

How is "correct shard" defined here? Arbitrary (but stable) shard per coordinator? Or is it only defined if the coordinator is co-located with a replica?

The correct shard is the shard the data belongs to.

There is no shard that the data belongs to on the coordinator. On replicas, it's a different shard on each replica. This code executes on the coordinator.

For vnodes, given that all nodes in the cluster have the same shard count, data will be on the same shard on all replicas. Even though the coordinator does not have any data, if the coordinator code runs on the correct shard, the reads and writes it sends to replicas will be executed on the correct shard there (rpc magic) without additional smp calls, and unlike a regular query there will be quite a few of them. There is also key locking that will not be as efficient if it is done per shard, but that is not the only reason.

Ok. So for coordinator-only role, the answers are:

  • reduce smp calls for the lock (requires a consistent shard, but no requirement for co-location with actual data shards)
  • optimize rpc in case we have similar shard-counts across the cluster.

The lock in question is paxos_state::_paxos_table_lock.

Why are you trying to distinguish between a coordinator and replica?

I'm trying to understand the original code, and whether the new code does something positive.

For tablets, given that they can be on different shards on different nodes, the only benefit would indeed be locking, but if we do not preserve the same shard on all replicas in the tablet load balancer (at least as a final state), then we just harm performance not only for LWT but for regular queries as well. LWT will just suffer much more.

Indeed there is a requirement that the load balancer try to place tablets on the same shard, but it isn't implemented yet. We should make it work when the tablets are on different shards but preserve the optimization when they are.

If the shard is not correct on the coordinator, each paxos/data table access will need a cross-shard call.

To understand if the shard is correct, I need to understand the definition of "correct shard". It can't be "the shard data belongs to" because it returns different results on different nodes.

Not for vnodes. And we are discussing the existing code here (you asked me about a specific old commit, no?), because I cannot see why we need this patch either (at least for now).

Possible definitions:

1. An arbitrary shard, but consistent across calls with the same token on the same node

2. If the coordinator is co-located with a replica, then the shard that owns the data on that replica, otherwise an arbitrary shard.

3. A combination of 1 and 2 (first 2, then 1).

I am sure you remember that for vnodes you can determine the shard the data will belong to on a replica (given the same number of shards) even if the node itself is not a replica. There is dht::sharder::shard_of(token). Whatever that call returns is the definition.

But it's different for tablets.

1 is needed to reduce smp calls for a coordinator side lock. Does it exist?

Yes.

2 is needed to reduce smp calls to the paxos table etc.

Yes. But for that the coordinator does not have to be co-located with a replica.

But I question the premise for the series. Do we support moving a tablet from shard to shard at all?

Currently not, but it is planned. Note that currently a tablet can be migrated to another node and then returned to the original node, but on another shard.
Note that holding the mildly named effective_replication_map_ptr prevents migrations.

FWIW the bounce is handled too high in the stack for that, unfortunately. But we can introduce something else.

Or we can push some of the handling lower down to storage_proxy. I think it would be cleaner, but probably more work.

That is how it should work, but it was deemed impossible with our statement code.

That's really sad.

How do we do double writes to the same node in this case?

Still unknown.

Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?

It cannot change if we hold the lock correctly.

We should work on that :)

I agree holding the lock is better here. But I need answers to my questions.

Hope I provided some.

Yes. So I think the patch is wrong and this is how it should work:

  1. For vnodes, keep the current implementation
  2. For tablets, non-replica coordinator, select a consistent shard (can be token % smp::count)
  3. For tablets, replica coordinator, select the shard that corresponds to the tablet replica on this node

In addition we should hold effective_replication_map_ptr to avoid migration during the operation.

@gleb-cloudius
Contributor

There is no shard that the data belongs to on the coordinator. On replicas, it's a different shard on each replica. This code executes on the coordinator.

For vnodes, given that all nodes in the cluster have the same shard count, data will be on the same shard on all replicas. Even though the coordinator does not have any data, if the coordinator code runs on the correct shard, the reads and writes it sends to replicas will be executed on the correct shard there (rpc magic) without additional smp calls, and unlike a regular query there will be quite a few of them. There is also key locking that will not be as efficient if it is done per shard, but that is not the only reason.

Ok. So for coordinator-only role, the answers are:

* reduce smp calls for the lock (requires a consistent shard, but no requirement for co-location with actual data shards)

* optimize rpc in case we have similar shard-counts across the cluster.

The lock in question is paxos_state::_paxos_table_lock.

It is paxos_state::_coordinator_lock; _paxos_table_lock is replica-side.

If the shard is not correct on the coordinator, each paxos/data table access will need a cross-shard call.

To understand if the shard is correct, I need to understand the definition of "correct shard". It can't be "the shard data belongs to" because it returns different results on different nodes.

Not for vnodes. And we are discussing the existing code here (you asked me about a specific old commit, no?), because I cannot see why we need this patch either (at least for now).

Possible definitions:

1. An arbitrary shard, but consistent across calls with the same token on the same node

2. If the coordinator is co-located with a replica, then the shard that owns the data on that replica, otherwise an arbitrary shard.

3. A combination of 1 and 2 (first 2, then 1).

I am sure you remember that for vnodes you can determine the shard the data will belong to on a replica (given the same number of shards) even if the node itself is not a replica. There is dht::sharder::shard_of(token). Whatever that call returns is the definition.

But it's different for tablets.

For tablets, a coordinator that is not also a replica needs to pick the shard of one of the replicas in a deterministic way (for instance, the replica with the lowest host id). This way, if all replicas are on the same shard, we still get the rpc optimization. In case the coordinator is co-located with a replica, we need to use the local shard, since the code assumes it. I do not see how to handle the two-shards-on-the-same-node case, though, without substantial changes to the coordinator code.

How do we do double writes to the same node in this case?

Still unknown.

Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?

It cannot change if we hold the lock correctly.

We should work on that :)

I agree holding the lock is better here. But I need answers to my questions.

Hope I provided some.

Yes. So I think the patch is wrong and this is how it should work:

1. For vnodes, keep the current implementation

2. For tablets, non-replica coordinator, select a consistent shard (can be token % smp::count)

3. For tablets, replica coordinator, select the shard that corresponds to the tablet replica on this node

Agreed, except for 2. I explained above what I think is correct in this case.

@bhalevy bhalevy added area/tablets backport/none Backport is not required labels Mar 20, 2024
@kbr-scylla
Contributor

@bhalevy my team will take over #15465 and this PR, if you don't mind.

@bhalevy
Member Author

bhalevy commented Aug 26, 2024

@bhalevy my team will take over #15465 and this PR, if you don't mind.

Be my guest!

@kbr-scylla
Contributor

Replaced by #20493, closing

@kbr-scylla kbr-scylla closed this Sep 10, 2024

Labels

area/tablets backport/none Backport is not required


Development

Successfully merging this pull request may close these issues.

tablets: Shard bouncing should not assume that shards don't change