Allow clients to subscribe to slot migrations#10358
Allow clients to subscribe to slot migrations#10358zuiderkwast wants to merge 2 commits intoredis:unstablefrom
Conversation
The cluster bus doesn't distinguish between migrations and other actions. This commit detects a migration by checking if exactly one slot has a new owner. Only one moved slot per cluster bus message or command is notified. This is to avoid flooding the clients. In other cases, such as failovers, clients will need to handle MOVED redirects and update the slot mapping. A special pubsub channel "__redis__:moved" is used (naming style as in client-side caching). The message is a string on the form "MOVED slot endpoint:port" very similar to MOVED redirects. Some special logic added to avoid creating the pubsub message if there are no subscribers.
|
@PingXie @hpatro @madolson @yossigo, please have a look. This is a simple message which helps clients avoid reloading the slot mapping only in the slot migration scenario. Later, we can add another message (same or separate channel) to inform the client that it needs to reload the full slot mapping, in cases like failovers and added/removed replicas. What do you think about this approach? |
@zuiderkwast,the implementation in general looks good to me. I will take another pass next.
This sounds like a reasonable approach. |
|
|
||
| /* Notify clients subscribed to moved slot events. To avoid flooding the | ||
| * clients, we only publish the moved slot message if exactly one slot has | ||
| * been migrated. In cases like failover, clients will receive -MOVED |
There was a problem hiding this comment.
Can we consider sending the MOVED_SLOT_MULTIPLE (-2) message anyways? This would give clients a chance to refresh the cluster topology outside the data serving path while not adding too much overhead.
There was a problem hiding this comment.
We could. We could also notify clients about changed replicas (no slots moved). Mostly we need to decide if we want to send it in the same channel or a different one and the format of the message. WDYT?
There was a problem hiding this comment.
Separate channels sound cleaner to me as it would allow easy filtering. For instance, I wouldn't imagine all clients care about the addition or removal of replicas.
| @@ -4224,7 +4232,7 @@ void clusterBeforeSleep(void) { | |||
| * including replicas. */ | |||
| if (moved_slot >= 0) { | |||
There was a problem hiding this comment.
remove the if (moved_slots >= 0) check? I thought we agreed on sending the special multiple-slots-moved message?
There was a problem hiding this comment.
No we discussed how it should be sent and if it should be in a separate channel. (We can't send "MOVED -2 ip:port". If I just remove the if, the code will crash.)
There was a problem hiding this comment.
If I just remove the if, the code will crash
Curious - was it Redis that crashed or the test? Any challenge/risk in fixing this crash?
it should be in a separate channel
That sounds a bit counter-intuitive to me, if I as a client need to subscribe to two different channels to get the same type of information. The difference between the two channels sounds more like an implementation detail than something conceptual. I was under the impression that the separate channel would be used for different types of events, such as the arrival/departure of a replica.
There was a problem hiding this comment.
If I just remove the if, the code will crash
Curious - was it Redis that crashed or the test? Any challenge/risk in fixing this crash?
clusterNotifyMovedSlot accesses server.cluster->slots[moved_slot]. With a negative index, it's read outside the allocation. Without the if >= 0, it'd also try with -1 when there is no slot moved. I must assume you misunderstand. It needs more implementation if we want to notify for multiple slots and in particular, it needs to be decided what the notification should look like.
it should be in a separate channel
That sounds a bit counter-intuitive to me, if I as a client need to subscribe to two different channels to get the same type of information. The difference between the two channels sounds more like an implementation detail than something conceptual. I was under the impression that the separate channel would be used for different types of events, such as the arrival/departure of a replica.
I thought you preferred a separate channel in this comment where we talked about multiple moved and also other events: #10358 (comment)
There was a problem hiding this comment.
So, exactly what behaviour do you suggest for multiple moved slots? This often means failover but the cluster bus doesn't specify that.
There was a problem hiding this comment.
I misunderstood your comment. I thought it was meant for different types of events.
I think there should be one channel for each type of messages and the type discussed here is "slot ownership". The reason why the slot ownership changes isn't that important to the client (be it slot migration or failover) IMO. The use case I have in mind is to facilitate faster/cheaper routing information update.
There was a problem hiding this comment.
OK, now we sorted out the misunderstandings. :-)
In the case of failover, we might have something like 5000 slots moved at once. Should we send 5000 messages to the subscribers in that case? It is flooding the clients with too much data if we do it in the naive way. We need to send a more compact message which includes a list of slot ranges.
.... or we can just send a message like "RELOAD" (or "INVALIDATE" or "RESHARDED") which tells the client to reload the slot mapping using CLUSTER SLOTS, SHARDS or NODES, whatever the client prefers.
... or we can send no message at all, which will result in a -MOVED redirect sooner or later, which triggers the client to reload the slot mapping. This is my initial choice in this PR, because it is minimalistic and extensible.
|
@zuiderkwast It is not explicitly stated, but I assume the main goals here are:
Am I missing something else here? |
|
@yossigo That's right. Not necessarily a connection pool, but long-running connections, yes. We also get lower latency for the client when redirects are avoided. Clients that reload the complete slot mapping on -MOVED redirects (multiple clients => latency spike) will not have to do that if only a single slot has been moved. Slot migration is one slot at a time. When moving a lot of slots, the slot mapping is updated one slot at a time and without this feature clients would need to reload the full slot mapping many times. If we decide how to notify about multiple slots moved and other changes, I can as well implement it now. |
There was a problem hiding this comment.
I agree with @yossigo that I think we need a broader idea of all the types of topology messages we want to send, so we plan for it. The implementation feels a bit off to me, in that we are sending "moved redirect" messages to clients. It seems like we should be able to send something richer with more information.
| server.cluster->todo_before_sleep = 0; | ||
| server.cluster->moved_slot_since_sleep = MOVED_SLOT_NONE; | ||
| server.cluster->moved_slot_channel = | ||
| createObject(OBJ_STRING, sdsnew("__redis__:moved")); |
There was a problem hiding this comment.
I would suggest something like, __redis__:cluster_moved. We might want to include other cluster notifications like failovers and state changes (like preferred endpoints) in the future, and they should probably be prefixed the same way. (Or perhaps we could also have something like__redis__:cluster_all that includes all notifications as well?
There was a problem hiding this comment.
how about cluster_slot_movedcluster_slots_moved?
| connGetType(c->conn) != CONN_TYPE_TLS; | ||
| } | ||
|
|
||
| /* For redirects, verb must start with a dash, e.g. "-ASK" or "-MOVED". */ |
There was a problem hiding this comment.
We might consider just adding the dash from the calling site instead of documenting it here. The dash is to indicate that it's an error response. (We don't actually even need it I think, since we should append - on errors regardless)
There was a problem hiding this comment.
It is added in the calling site. I can delete this comment if that's what you mean. (If we don't add id, "-ERR " is prepended by the addErrorReply functions.)
|
+1 on defining a complete message type system. As far as the moved channel is concerned what specific information do you have in mind @madolson? If we agree on the use case of facilitating the routing information update, mimicking -ASK/-MOVED makes sense to me. |
@madolson Thx for reviewing. Sounds good. In the cluster bus, a node sends its full info, so if we want to send partial info to the client, we would have to compute some kind of diff. Perhaps it's enough to send the full state for a node. The same with slots. If we don't send all slots for a node, we'd need to compute a diff to find out which slots are moved. (Detecting a single moved slot is simple though.) We could structure a message like a partial CLUSTER SHARDS reply but it would become quite large in some cases, unless we detect events like replica migration (master id changed and no other info changed). The events I think we'd need to detect are failover, replica migration, added node, removed node, state change, slot migration.
I agree. The three reasons I picked the redirects format are 1) pubsub messages are always strings so far, 2) clients are familiar with redirects and already have code for parsing them and 3) the message is very compact. We can of course introduce a RESP-structured pubsub message here. |
|
Coming back to this one. The reason I want to subscribe specifically to the single-slot-migration scenario: When a bunch of slots are migrated, this is done one slot at a time and it can be quite slow. If there are some very active clients, they will run into MOVED redirect after each migrated slot, which typically causes them to reload the slots map. When the next slot is migrated, they will run into another MOVED redirect, and so on. If CLUSTER SLOTS is O(N), the total work done by each client will be O(N*S) where S is the number of slots being moved, if they call CLUSTER SLOTS on each MOVED redirect, after each migrated slot. All the other topology changes (failover, etc.) normally cause the clients to reload the slots mapping only once. In this case, waiting for a redirect before reloading the slot mapping is totally fine IMO. |
|
|
The cluster bus doesn't distinguish between migrations and other actions.
This commit detects a migration by checking if exactly one slot has a new
owner.
Only one moved slot per cluster bus message or command is notified. This
is to avoid flooding the clients. In other cases, such as failovers,
clients will need to handle MOVED redirects and update the slot mapping
as before.
A special pubsub channel
__redis__:movedis used (naming style as inclient-side caching). The message is a string on the form "MOVED slot
endpoint:port" very similar to MOVED redirects.
Some special logic added to avoid creating the pubsub message if there
are no subscribers.
Fixes #10150.