cql_server::connection: process: rebounce msg if needed #17309
bhalevy wants to merge 6 commits into scylladb:master
Conversation
piodul left a comment:
Would it be hard to write an automated unit test (looks like it might not be easy)?
🔴 CI State: FAILURE (✅ Build; Failed Tests: 1695/23664)
Force-pushed from 97fed73 to 0f588c5
In v2 (0f588c5):
🔴 CI State: FAILURE (✅ Build; Failed Tests: 2136/24307)
CI failed in debug mode due to:
Force-pushed from 0f588c5 to 13d0242
In v3 (13d0242):
This path is exercised in a number of unit tests, as evident in #17309 (comment).
future<cql_server::result_with_foreign_response_ptr>
cql_server::connection::process_on_shard(::shared_ptr<messages::result_message::bounce_to_shard> bounce_msg, uint16_t stream, fragmented_temporary_buffer::istream is,
        service::client_state& cs, service_permit permit, tracing::trace_state_ptr trace_state, Process process_fn) {
    return _server.container().invoke_on(*bounce_msg->move_to_shard(), _server._config.bounce_request_smp_service_group,
Please explain in the changelog why bounce_to_shard now needs to move between shards.
(cql_server::process_fn_return_type msg) mutable {
    auto msg = co_await process_fn(client_state, _server._query_processor, in, stream,
            _version, permit, trace_state, true, {});
    // FIXME: indentation
How is this related to dropping permit?
Since we don't pass the permit down to process_on_shard we need to hold on to the permit passed to us.
Making this function a coroutine achieves that.
I can do the coroutinization in a preliminary patch if needed.
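A minimal sketch of why the coroutinization helps here (the types and the do_process helper below are placeholders, not the real ScyllaDB signatures): a coroutine's by-value parameters live in the coroutine frame, so the permit stays held until processing completes even though it is never passed down.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/coroutine.hh>

// Hypothetical stand-ins for the real request/response/permit types.
struct request {};
struct response {};
struct service_permit_like {};

seastar::future<response> do_process(request);  // assumed helper, defined elsewhere

// Because process() is a coroutine, the by-value `permit` parameter is copied
// into the coroutine frame and stays alive across the co_await even though it
// is never forwarded; it is released only when the coroutine completes.
seastar::future<response> process(request req, service_permit_like permit) {
    response r = co_await do_process(std::move(req));
    co_return r;
}
```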
At least explain it in the changelog
template<typename Process>
requires std::same_as<std::invoke_result_t<Process, service::client_state&, distributed<cql3::query_processor>&, request_reader,
        uint16_t, cql_protocol_version_type, service_permit, tracing::trace_state_ptr, bool, cql3::computed_function_values>, future<cql_server::process_fn_return_type>>
future<cql_server::result_with_foreign_response_ptr>
> Re-bounce the msg to another shard if needed,
> e.g. in the case of tablet migration.
Please give an example of how tablet migration requires this.
Okay. @tgrabiec, if you have an example ready I'll be happy to adopt it.
Bouncing happens when executing LWT statement in modification_statement::execute_with_condition by returning a special result message kind. The code assumes that after jumping to the shard from the bounce request, the result message is the regular one and not yet another bounce. There is no problem with vnodes, because shards don't change. With tablets, they can change at run time on migration.
Added the above to the commit log in v4.
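In other words, the core change can be modeled as turning a single dispatch into a loop. A sketch under the assumptions below (toy result/run_on_shard placeholders; the real code deals with messages::result_message::bounce_to_shard and process_on_shard):

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/coroutine.hh>
#include <optional>

// Toy model of the control flow only.
struct result {
    std::optional<unsigned> bounce_to_shard;  // engaged => must re-dispatch
};

seastar::future<result> run_on_shard(unsigned shard);  // assumed helper

seastar::future<result> dispatch(result msg) {
    // Previously a single `if (bounce_msg)` sufficed: one bounce was assumed
    // to be final. With tablets the owning shard can move again while we
    // yield, so keep re-dispatching until we get a regular (non-bounce) result.
    while (msg.bounce_to_shard) {
        msg = co_await run_on_shard(*msg.bounce_to_shard);
    }
    co_return msg;
}
```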
auto sg = _server._config.bounce_request_smp_service_group;
auto gcs = cs.move_to_other_shard();
auto gt = tracing::global_trace_state_ptr(std::move(trace_state));
co_return co_await _server.container().invoke_on(shard, sg, [&, cached_vals = std::move(cached_vals)] (cql_server& server) mutable -> future<process_fn_return_type> {
It's possible to simplify it a little - pass cached_vals by value, and here capture it by reference.
Funny, because this is exactly how I implemented it the first time, but I changed it to use an rvalue reference to indicate that it's supposed to be passed down the stack to process_fn, and it matches take_cached_pk_function_calls(), which returns a temporary object.
No change in this respect in v4.
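For reference, a sketch of the capture style suggested above (illustrative types, not the actual process_on_shard signature): take cached_vals by value so the coroutine frame owns it, and let the cross-shard lambda capture it by reference; the frame stays alive until invoke_on()'s future resolves. The alternative kept in the patch moves cached_vals into the lambda instead, which also works.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sharded.hh>
#include <vector>

struct server {};  // stand-in for cql_server

// `cached_vals` is a placeholder for cql3::computed_function_values.
seastar::future<> call_on_shard(seastar::sharded<server>& container, unsigned shard,
                                std::vector<int> cached_vals) {
    co_await container.invoke_on(shard, [&cached_vals] (server&) {
        // read cached_vals here; it lives in the caller's coroutine frame,
        // which is kept alive until this cross-shard call completes
        (void)cached_vals;
    });
}
```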
Need a deeper explanation of why tablets made this necessary.
🔴 CI State: FAILURE (✅ Build; Failed Tests: 5/22148)
🟢 CI State: SUCCESS (✅ Build)
@yaronkaikov / @benipeled CI failed in #17309 (comment) and passed when retried, in #17309 (comment).
@bhalevy It seems like a flaky test; I see failures only in one mode (out of 9 runs for each). I'm working on adding some flaky-test logic to Gating/CI: if a test fails only once, we will re-run it, and if the re-run passes (10 times) we will mark the build as SUCCESS and open an issue automatically (WIP https://github.com/scylladb/scylla-pkg/pull/3804).
So it can be used for a template concept in the next patch. Signed-off-by: Benny Halevy <[email protected]>
Quoting Avi Kivity: > Out of scope: we should consider detemplating this. As a follow-up we should consider that and pass a function object as process_fn, just make sure there are no drawbacks. Signed-off-by: Benny Halevy <[email protected]>
So it can be safely passed between shards, as will be needed in the following patch that handles a (re)bounce_to_shard result from the process_fn that's called by `process_on_shard` on the `move_to_shard` target shard. With that in mind, pass the `bounce_to_shard` payload to `process_on_shard` rather than the foreign shared ptr, since the latter grabs what it needs from it on entry and the shared_ptr can be released on the calling shard. Signed-off-by: Benny Halevy <[email protected]>
It is currently unused in `process_on_shard`, which generates an empty service_permit. The next patch may call process_on_shard in a loop, so it can't simply move the permit to the callee and should instead hold on to it until processing completes. `cql_server::connection::process` was turned into a coroutine in this patch to hold on to the permit parameter in a simple way. This is a preliminary step to changing `if (bounce_msg)` to `while (bounce_msg)`, which will allow rebouncing the message in case it moved yet again while yielding in `process_on_shard`. Signed-off-by: Benny Halevy <[email protected]>
Signed-off-by: Benny Halevy <[email protected]>
Rebounce the msg to another shard if needed, e.g. in the case of tablet migration. An example for that, as given by Tomasz Grabiec:
> Bouncing happens when executing LWT statement in modification_statement::execute_with_condition by returning a special result message kind. The code assumes that after jumping to the shard from the bounce request, the result message is the regular one and not yet another bounce. There is no problem with vnodes, because shards don't change. With tablets, they can change at run time on migration.
Fixes scylladb#15465
Signed-off-by: Benny Halevy <[email protected]>
Force-pushed from 13d0242 to 7cc1866
In v4 (7cc1866):
🟢 CI State: SUCCESS (✅ Build)
template<typename Process>
requires std::same_as<std::invoke_result_t<Process, service::client_state&, distributed<cql3::query_processor>&, request_reader,
        uint16_t, cql_protocol_version_type, service_permit, tracing::trace_state_ptr, bool, cql3::computed_function_values>, future<cql_server::process_fn_return_type>>
Note, a real constraint should start with std::invocable, so that std::invoke_result_t is known to evaluate to something.
Oh, I wasn't aware of this; I thought it was part of the concept.
But indeed this is the way it's defined in https://en.cppreference.com/w/cpp/types/result_of:
> the return type of the [Callable](https://en.cppreference.com/w/cpp/named_req/Callable) type F if invoked with the arguments ArgTypes.... Only defined if F can be called with the arguments ArgTypes... in unevaluated context. (since C++14)
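A small illustration of the ordering (generic placeholder types, not the actual cql_server signature): the std::invocable constraint comes first, so std::invoke_result_t is only instantiated for callables that can actually be invoked with those arguments.

```cpp
#include <concepts>
#include <type_traits>

// Placeholder argument/result types for illustration.
struct arg {};
struct result {};

// std::invocable guards the std::same_as check, so std::invoke_result_t is
// known to name a type before it is evaluated.
template <typename Process>
requires std::invocable<Process, arg> &&
         std::same_as<std::invoke_result_t<Process, arg>, result>
result run(Process process, arg a) {
    return process(std::move(a));
}
```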
The series looks good wrt what it's trying to achieve, but now I have doubts about what it's trying to achieve.
I think we should revert d28dd49 instead. Or, if the goal is to reduce contention, we should use the vnode-style static sharding, since at least that's self-consistent.
Yes, note that in the case of multi-page queries, replicas are sticky for the duration of the query. If the driver decides to hop between coordinators, then for certain pages the coordinator will not be a replica.
It is to avoid a lot of smp calls. Taking a lock on the correct shard is a matter of jumping to it one more time.
And do what? A lot of smp calls? LWT is not performant as it is, so let's kill it even more. The whole bounce story was introduced because our statement code is an unbelievable piece of crap: something that is supposed to be completely const cannot be passed to the other shard for processing (or at least that was the case back then). Otherwise, instead of bouncing, we would just jump to the correct shard when it becomes known down the stack. But I question the premise of the series. Do we support moving a tablet from shard to shard at all? How do we do double writes to the same node in this case? Even if we do, the tablet management code is supposed to wait for all old queries to complete before the movement, so how can the shard change in the middle of a query?
This is LWT, so single partition RMW.
Is this a coordinator side lock?
How is "correct shard" defined here? Arbitrary (but stable) shard per coordinator? Or is it only defined if the coordinator is co-located with a replica?
Currently not, but it is planned. Note that currently a tablet can be migrated to another node and returned to the original node but on another shard. Note that holding the mildly named effective_replication_map_ptr prevents migrations.
Still unknown.
It cannot change if we hold the lock correctly.
Why are you so locked on the lock? You need to unlock yourself from it. Besides the lock, there is paxos state in a local table, co-located with the shard the data is on.
The correct shard is the shard the data belongs to. Why are you trying to distinguish between a coordinator and a replica? If the shard is not correct on the coordinator, each paxos/data table access will need a cross-shard access.
FWIW, the bounce is handled too high in the stack for that, unfortunately. But we can introduce something else.
We should work on that :)
I'm trying to understand. Please point to the relevant code.
Only if the coordinator happens to be co-located with a replica, yes?
There is no shard that the data belongs to on the coordinator. On replicas, it's a different shard on each replica. This code executes on the coordinator.
I'm trying to understand the original code and trying to understand if the new code does something positive.
To understand if the shard is correct, I need to understand the definition of "correct shard". It can't be "the shard data belongs to" because it returns different results on different nodes. Possible definitions:
1 is needed to reduce smp calls for a coordinator side lock. Does it exist?
Or we can push some of the handling lower down to storage_proxy. I think it would be cleaner, but probably more work.
I agree holding the lock is better here. But I need answers to my questions.
For vnodes, given that all nodes in the cluster have the same shard count, data will be on the same shard on all replicas. Regardless of the fact that the coordinator does not have any data, if the coordinator code runs on the correct shard, the reads and writes it sends to replicas will be executed on the correct shard (rpc magic) without an additional smp call, and unlike a regular query there will be quite a few of them. There is also key locking, which will not be as efficient if it is done per shard, but that's not the only reason.
For tablets, given that they can be on different shards on different nodes, the only benefit would be locking indeed, but if we do not preserve the same shard on all replicas in the tablet load balancer (at least as a final state), then we just harm performance, not only for LWT but for regular queries as well. LWT will just suffer much more.
Not for vnode. And we are discussing existing code here (you asked me about a specific old commit, no?) because I cannot see why we need this patch as well (at least for now).
I am sure you remember that for vnode you can determine the shard the data will belong to on the replica (given the same number of shards) even if a node itself is not a replica. There is a
Yes.
Yes. But for that the coordinator does not have to be co-located with a replica.
That is how it should work, but it was deemed impossible with our statement code.
Hope I provided some.
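To make the vnode point concrete: since all nodes have the same shard count, any node can compute which shard a token maps to on the replicas by applying the same deterministic function. A toy sketch (this is not ScyllaDB's actual sharding algorithm, which also involves the sharding_ignore_msb bits):

```cpp
#include <cstdint>

// Toy sharding function: deterministically map a token to one of `shard_count`
// shards. Every node evaluating this for the same token and shard count gets
// the same answer, which is what makes coordinator-side shard selection and
// shard-aware RPC possible with vnodes.
unsigned shard_of(int64_t token, unsigned shard_count) {
    // Shift the signed token into unsigned space, then reduce. Any fixed
    // deterministic reduction would do for the purposes of this argument.
    uint64_t biased = static_cast<uint64_t>(token) + (uint64_t(1) << 63);
    return static_cast<unsigned>(biased % shard_count);
}
```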
I'm also completely baffled by this, especially in the context of Alternator (I opened an Alternator version of this issue in #17399): in Alternator the drivers are not topology aware, so most requests will arrive at a node that doesn't have any replica of the requested token. So on such a coordinator, what is even the "correct shard" where I need to run the LWT request? I still remember when LWT coordinator requests could be run on any node, on any shard. At some point you guys decided they must be run on a specific shard (determined by the "sharding function" on the token), but still on any node of the cluster. Is it now necessary to run LWT requests only on a node that actually holds a replica of this token, to allow us to choose the correct shard?
Ok. So for coordinator-only role, the answers are:
The lock in question is paxos_state::_paxos_table_lock.
Indeed there is a requirement that the load balancer try to place tablets on the same shard, but it isn't implemented yet. We should make it work when the tablets are on different shards but preserve the optimization when they are on the same shard.
But it's different for tablets.
That's really sad.
Yes. So I think the patch is wrong and this is how it should work:
In addition we should hold effective_replication_map_ptr to avoid migration during the operation.
For tablets, a coordinator that is not also a replica needs to get a shard of one of the replicas in a deterministic way (for instance, the one with the lowest host id). This way, if all replicas are on the same shard, we will get the rpc optimization. In case the coordinator is co-located with a replica, we need to use the local shard since the code assumes it. I do not see how to handle the two-shards-on-the-same-node case, though, without substantial changes to the coordinator code.
Agree, except 2. I explained above what I think is correct in this case.
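A sketch of the selection rule described above (illustrative types; the real code works with locator::host_id and tablet replica sets): when the coordinator is itself a replica, use the local shard; otherwise pick a replica deterministically, e.g. by lowest host id, and use that replica's shard.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative stand-in for a tablet replica entry.
struct replica {
    uint64_t host_id;  // placeholder for locator::host_id
    unsigned shard;    // shard owning the tablet on that replica
};

unsigned pick_coordinator_shard(const std::vector<replica>& replicas,  // assumed non-empty
                                bool coordinator_is_replica,
                                unsigned local_shard_of_tablet) {
    if (coordinator_is_replica) {
        // The existing code assumes the local shard when the coordinator holds a replica.
        return local_shard_of_tablet;
    }
    // Otherwise choose deterministically (lowest host id here) so every
    // coordinator agrees; if all replicas keep the tablet on the same shard,
    // shard-aware RPC still pays off.
    auto it = std::min_element(replicas.begin(), replicas.end(),
        [] (const replica& a, const replica& b) { return a.host_id < b.host_id; });
    return it->shard;
}
```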
Replaced by #20493, closing.
Re-bounce the msg to another shard if needed,
e.g. in the case of tablet migration.
Fixes #15465