Allow clients to subscribe to slot migrations by zuiderkwast · Pull Request #10358 · redis/redis

zuiderkwast · 2022-02-28T10:46:18Z

The cluster bus doesn't distinguish between migrations and other actions.
This commit detects a migration by checking if exactly one slot has a new
owner.

Only one moved slot per cluster bus message or command is notified. This
is to avoid flooding the clients. In other cases, such as failovers,
clients will need to handle MOVED redirects and update the slot mapping
as before.

A special pubsub channel __redis__:moved is used (naming style as in
client-side caching). The message is a string on the form "MOVED slot
endpoint:port" very similar to MOVED redirects.

Some special logic added to avoid creating the pubsub message if there
are no subscribers.

Fixes #10150.

The cluster bus doesn't distinguish between migrations and other actions. This commit detects a migration by checking if exactly one slot has a new owner. Only one moved slot per cluster bus message or command is notified. This is to avoid flooding the clients. In other cases, such as failovers, clients will need to handle MOVED redirects and update the slot mapping. A special pubsub channel "__redis__:moved" is used (naming style as in client-side caching). The message is a string on the form "MOVED slot endpoint:port" very similar to MOVED redirects. Some special logic added to avoid creating the pubsub message if there are no subscribers.

zuiderkwast · 2022-02-28T15:08:56Z

@PingXie @hpatro @madolson @yossigo, please have a look.

This is a simple message which helps clients avoid reloading the slot mapping only in the slot migration scenario.

Later, we can add another message (same or separate channel) to inform the client that it needs to reload the full slot mapping, in cases like failovers and added/removed replicas. What do you think about this approach?

PingXie · 2022-03-02T03:33:23Z

@PingXie @hpatro @madolson @yossigo, please have a look.

This is a simple message which helps clients avoid reloading the slot mapping only in the slot migration scenario.

@zuiderkwast,the implementation in general looks good to me. I will take another pass next.

Later, we can add another message (same or separate channel) to inform the client that it needs to repload the full slot mapping, in cases like failovers and added/removed replicas. What do you think about this approach?

This sounds like a reasonable approach.

src/cluster.c

PingXie · 2022-03-03T07:08:36Z

src/cluster.c


+    /* Notify clients subscribed to moved slot events. To avoid flooding the
+     * clients, we only publish the moved slot message if exactly one slot has
+     * been migrated. In cases like failover, clients will receive -MOVED


Can we consider sending the MOVED_SLOT_MULTIPLE (-2) message anyways? This would give clients a chance to refresh the cluster topology outside the data serving path while not adding too much overhead.

We could. We could also notify clients about changed replicas (no slots moved). Mostly we need to decide if we want to send it in the same channel or a different one and the format of the message. WDYT?

Separate channels sound cleaner to me as it would allow easy filtering. For instance, I wouldn't imagine all clients care about the addition or removal of replicas.

src/cluster.c

PingXie · 2022-03-06T04:08:11Z

src/cluster.c

@@ -4224,7 +4232,7 @@ void clusterBeforeSleep(void) {
     * including replicas. */
    if (moved_slot >= 0) {


remove the if (moved_slots >= 0) check? I thought we agreed on sending the special multiple-slots-moved message?

No we discussed how it should be sent and if it should be in a separate channel. (We can't send "MOVED -2 ip:port". If I just remove the if, the code will crash.)

If I just remove the if, the code will crash

Curious - was it Redis that crashed or the test? Any challenge/risk in fixing this crash?

it should be in a separate channel

That sounds a bit counter-intuitive to me, if I as a client need to subscribe to two different channels to get the same type of information. The difference between the two channels sounds more like an implementation detail than something conceptual. I was under the impression that the separate channel would be used for different types of events, such as the arrival/departure of a replica.

If I just remove the if, the code will crash

Curious - was it Redis that crashed or the test? Any challenge/risk in fixing this crash?

clusterNotifyMovedSlot accesses server.cluster->slots[moved_slot]. With a negative index, it's read outside the allocation. Without the if >= 0, it'd also try with -1 when there is no slot moved. I must assume you misunderstand. It needs more implementation if we want to notify for multiple slots and in particular, it needs to be decided what the notification should look like.

it should be in a separate channel

That sounds a bit counter-intuitive to me, if I as a client need to subscribe to two different channels to get the same type of information. The difference between the two channels sounds more like an implementation detail than something conceptual. I was under the impression that the separate channel would be used for different types of events, such as the arrival/departure of a replica.

I thought you preferred a separate channel in this comment where we talked about multiple moved and also other events: #10358 (comment)

So, exactly what behaviour do you suggest for multiple moved slots? This often means failover but the cluster bus doesn't specify that.

I misunderstood your comment. I thought it was meant for different types of events.

I think there should be one channel for each type of messages and the type discussed here is "slot ownership". The reason why the slot ownership changes isn't that important to the client (be it slot migration or failover) IMO. The use case I have in mind is to facilitate faster/cheaper routing information update.

OK, now we sorted out the misunderstandings. :-)

In the case of failover, we might have something like 5000 slots moved at once. Should we send 5000 messages to the subscribers in that case? It is flooding the clients with too much data if we do it in the naive way. We need to send a more compact message which includes a list of slot ranges.

.... or we can just send a message like "RELOAD" (or "INVALIDATE" or "RESHARDED") which tells the client to reload the slot mapping using CLUSTER SLOTS, SHARDS or NODES, whatever the client prefers.

... or we can send no message at all, which will result in a -MOVED redirect sooner or later, which triggers the client to reload the slot mapping. This is my initial choice in this PR, because it is minimalistic and extensible.

yossigo · 2022-03-07T15:35:05Z

@zuiderkwast It is not explicitly stated, but I assume the main goals here are:

Avoid potential latency spike with lazy processing of -MOVED (as a redirect, or as a trigger for CLUSTER SLOTS)
Optimize clients that maintain a connection pool and can use a single connection for this kind of signalling.

Am I missing something else here?
Regarding additional messages or channels, I think we should define it now even if the implementation is partial because it takes time for clients to catch up.

zuiderkwast · 2022-03-07T17:08:02Z

@yossigo That's right. Not necessarily a connection pool, but long-running connections, yes.

We also get lower latency for the client when redirects are avoided.

Clients that reload the complete slot mapping on -MOVED redirects (multiple clients => latency spike) will not have to do that if only a single slot has been moved. Slot migration is one slot at a time. When moving a lot of slots, the slot mapping is updated one slot at a time and without this feature clients would need to reload the full slot mapping many times.

If we decide how to notify about multiple slots moved and other changes, I can as well implement it now.

madolson

I agree with @yossigo that I think we need a broader idea of all the types of topology messages we want to send, so we plan for it. The implementation feels a bit off to me, in that we are sending "moved redirect" messages to clients. It seems like we should be able to send something richer with more information.

madolson · 2022-03-08T20:35:12Z

src/cluster.c

    server.cluster->todo_before_sleep = 0;
+    server.cluster->moved_slot_since_sleep = MOVED_SLOT_NONE;
+    server.cluster->moved_slot_channel =
+        createObject(OBJ_STRING, sdsnew("__redis__:moved"));


I would suggest something like, __redis__:cluster_moved. We might want to include other cluster notifications like failovers and state changes (like preferred endpoints) in the future, and they should probably be prefixed the same way. (Or perhaps we could also have something like__redis__:cluster_all that includes all notifications as well?

how about ~~cluster_slot_moved~~ cluster_slots_moved?

madolson · 2022-03-08T20:56:19Z

src/cluster.c

+        connGetType(c->conn) != CONN_TYPE_TLS;
+}
+
+/* For redirects, verb must start with a dash, e.g. "-ASK" or "-MOVED". */


We might consider just adding the dash from the calling site instead of documenting it here. The dash is to indicate that it's an error response. (We don't actually even need it I think, since we should append - on errors regardless)

It is added in the calling site. I can delete this comment if that's what you mean. (If we don't add id, "-ERR " is prepended by the addErrorReply functions.)

PingXie · 2022-03-09T03:32:52Z

+1 on defining a complete message type system. As far as the moved channel is concerned what specific information do you have in mind @madolson? If we agree on the use case of facilitating the routing information update, mimicking -ASK/-MOVED makes sense to me.

zuiderkwast · 2022-03-09T09:08:06Z

I agree with @yossigo that I think we need a broader idea of all the types of topology messages we want to send, so we plan for it.

@madolson Thx for reviewing. Sounds good. In the cluster bus, a node sends its full info, so if we want to send partial info to the client, we would have to compute some kind of diff. Perhaps it's enough to send the full state for a node. The same with slots. If we don't send all slots for a node, we'd need to compute a diff to find out which slots are moved. (Detecting a single moved slot is simple though.)

We could structure a message like a partial CLUSTER SHARDS reply but it would become quite large in some cases, unless we detect events like replica migration (master id changed and no other info changed). The events I think we'd need to detect are failover, replica migration, added node, removed node, state change, slot migration.

The implementation feels a bit off to me, in that we are sending "moved redirect" messages to clients. It seems like we should be able to send something richer with more information.

I agree. The three reasons I picked the redirects format are 1) pubsub messages are always strings so far, 2) clients are familiar with redirects and already have code for parsing them and 3) the message is very compact.

We can of course introduce a RESP-structured pubsub message here.

zuiderkwast · 2022-08-23T09:38:46Z

Coming back to this one.

The reason I want to subscribe specifically to the single-slot-migration scenario: When a bunch of slots are migrated, this is done one slot at a time and it can be quite slow. If there are some very active clients, they will run into MOVED redirect after each migrated slot, which typically causes them to reload the slots map. When the next slot is migrated, they will run into another MOVED redirect, and so on.

If CLUSTER SLOTS is O(N), the total work done by each client will be O(N*S) where S is the number of slots being moved, if they call CLUSTER SLOTS on each MOVED redirect, after each migrated slot.

All the other topology changes (failover, etc.) normally cause the clients to reload the slots mapping only once. In this case, waiting for a redirect before reloading the slot mapping is totally fine IMO.

CLAassistant · 2024-03-24T23:14:34Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

PingXie reviewed Mar 3, 2022

View reviewed changes

PingXie mentioned this pull request Mar 3, 2022

Lock down "__redis__:*" system channels for Redis use only #10379

Closed

Address comments and some cleanup

5bfed1f

zuiderkwast marked this pull request as ready for review March 5, 2022 06:52

PingXie reviewed Mar 6, 2022

View reviewed changes

madolson reviewed Mar 8, 2022

View reviewed changes

slorello89 mentioned this pull request Apr 12, 2022

"-READONLY You can't write against a read only replica." on ScriptEvaluate(...) StackExchange/StackExchange.Redis#2074

Closed

zuiderkwast mentioned this pull request Aug 23, 2022

Modules: Events for cluster topology changes #11121

Open

zuiderkwast mentioned this pull request Jun 28, 2023

replica redirect read&write to master in standalone mode #12192

Closed

madolson mentioned this pull request Aug 17, 2023

Redis Cluster Specification as a universal future interface #12493

Open

		@@ -4224,7 +4232,7 @@ void clusterBeforeSleep(void) {
		* including replicas. */
		if (moved_slot >= 0) {

Conversation

zuiderkwast commented Feb 28, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zuiderkwast commented Feb 28, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PingXie commented Mar 2, 2022

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zuiderkwast Mar 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zuiderkwast Mar 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yossigo commented Mar 7, 2022

Uh oh!

zuiderkwast commented Mar 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

madolson left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PingXie Mar 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PingXie commented Mar 9, 2022

Uh oh!

zuiderkwast commented Mar 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zuiderkwast commented Aug 23, 2022

Uh oh!

CLAassistant commented Mar 24, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

zuiderkwast commented Feb 28, 2022 •

edited

Loading

zuiderkwast commented Feb 28, 2022 •

edited

Loading

zuiderkwast Mar 3, 2022 •

edited

Loading

zuiderkwast Mar 7, 2022 •

edited

Loading

zuiderkwast commented Mar 7, 2022 •

edited

Loading

madolson left a comment •

edited

Loading

PingXie Mar 9, 2022 •

edited

Loading

zuiderkwast commented Mar 9, 2022 •

edited

Loading