Skip to content

Allow clients to subscribe to slot migrations#10358

Open
zuiderkwast wants to merge 2 commits intoredis:unstablefrom
zuiderkwast:subscribe-slots
Open

Allow clients to subscribe to slot migrations#10358
zuiderkwast wants to merge 2 commits intoredis:unstablefrom
zuiderkwast:subscribe-slots

Conversation

@zuiderkwast
Copy link
Contributor

@zuiderkwast zuiderkwast commented Feb 28, 2022

The cluster bus doesn't distinguish between migrations and other actions.
This commit detects a migration by checking if exactly one slot has a new
owner.

Only one moved slot per cluster bus message or command is notified. This
is to avoid flooding the clients. In other cases, such as failovers,
clients will need to handle MOVED redirects and update the slot mapping
as before.

A special pubsub channel __redis__:moved is used (naming style as in
client-side caching). The message is a string on the form "MOVED slot
endpoint:port" very similar to MOVED redirects.

Some special logic added to avoid creating the pubsub message if there
are no subscribers.

Fixes #10150.

The cluster bus doesn't distinguish between migrations and other actions.
This commit detects a migration by checking if exactly one slot has a new
owner.

Only one moved slot per cluster bus message or command is notified. This
is to avoid flooding the clients. In other cases, such as failovers,
clients will need to handle MOVED redirects and update the slot mapping.

A special pubsub channel "__redis__:moved" is used (naming style as in
client-side caching). The message is a string on the form "MOVED slot
endpoint:port" very similar to MOVED redirects.

Some special logic added to avoid creating the pubsub message if there
are no subscribers.
@zuiderkwast
Copy link
Contributor Author

zuiderkwast commented Feb 28, 2022

@PingXie @hpatro @madolson @yossigo, please have a look.

This is a simple message which helps clients avoid reloading the slot mapping only in the slot migration scenario.

Later, we can add another message (same or separate channel) to inform the client that it needs to reload the full slot mapping, in cases like failovers and added/removed replicas. What do you think about this approach?

@PingXie
Copy link
Contributor

PingXie commented Mar 2, 2022

@PingXie @hpatro @madolson @yossigo, please have a look.

This is a simple message which helps clients avoid reloading the slot mapping only in the slot migration scenario.

@zuiderkwast,the implementation in general looks good to me. I will take another pass next.

Later, we can add another message (same or separate channel) to inform the client that it needs to repload the full slot mapping, in cases like failovers and added/removed replicas. What do you think about this approach?

This sounds like a reasonable approach.


/* Notify clients subscribed to moved slot events. To avoid flooding the
* clients, we only publish the moved slot message if exactly one slot has
* been migrated. In cases like failover, clients will receive -MOVED
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we consider sending the MOVED_SLOT_MULTIPLE (-2) message anyways? This would give clients a chance to refresh the cluster topology outside the data serving path while not adding too much overhead.

Copy link
Contributor Author

@zuiderkwast zuiderkwast Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could. We could also notify clients about changed replicas (no slots moved). Mostly we need to decide if we want to send it in the same channel or a different one and the format of the message. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate channels sound cleaner to me as it would allow easy filtering. For instance, I wouldn't imagine all clients care about the addition or removal of replicas.

@zuiderkwast zuiderkwast marked this pull request as ready for review March 5, 2022 06:52
@@ -4224,7 +4232,7 @@ void clusterBeforeSleep(void) {
* including replicas. */
if (moved_slot >= 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the if (moved_slots >= 0) check? I thought we agreed on sending the special multiple-slots-moved message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we discussed how it should be sent and if it should be in a separate channel. (We can't send "MOVED -2 ip:port". If I just remove the if, the code will crash.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I just remove the if, the code will crash

Curious - was it Redis that crashed or the test? Any challenge/risk in fixing this crash?

it should be in a separate channel

That sounds a bit counter-intuitive to me, if I as a client need to subscribe to two different channels to get the same type of information. The difference between the two channels sounds more like an implementation detail than something conceptual. I was under the impression that the separate channel would be used for different types of events, such as the arrival/departure of a replica.

Copy link
Contributor Author

@zuiderkwast zuiderkwast Mar 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I just remove the if, the code will crash

Curious - was it Redis that crashed or the test? Any challenge/risk in fixing this crash?

clusterNotifyMovedSlot accesses server.cluster->slots[moved_slot]. With a negative index, it's read outside the allocation. Without the if >= 0, it'd also try with -1 when there is no slot moved. I must assume you misunderstand. It needs more implementation if we want to notify for multiple slots and in particular, it needs to be decided what the notification should look like.

it should be in a separate channel

That sounds a bit counter-intuitive to me, if I as a client need to subscribe to two different channels to get the same type of information. The difference between the two channels sounds more like an implementation detail than something conceptual. I was under the impression that the separate channel would be used for different types of events, such as the arrival/departure of a replica.

I thought you preferred a separate channel in this comment where we talked about multiple moved and also other events: #10358 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, exactly what behaviour do you suggest for multiple moved slots? This often means failover but the cluster bus doesn't specify that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misunderstood your comment. I thought it was meant for different types of events.

I think there should be one channel for each type of messages and the type discussed here is "slot ownership". The reason why the slot ownership changes isn't that important to the client (be it slot migration or failover) IMO. The use case I have in mind is to facilitate faster/cheaper routing information update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, now we sorted out the misunderstandings. :-)

In the case of failover, we might have something like 5000 slots moved at once. Should we send 5000 messages to the subscribers in that case? It is flooding the clients with too much data if we do it in the naive way. We need to send a more compact message which includes a list of slot ranges.

.... or we can just send a message like "RELOAD" (or "INVALIDATE" or "RESHARDED") which tells the client to reload the slot mapping using CLUSTER SLOTS, SHARDS or NODES, whatever the client prefers.

... or we can send no message at all, which will result in a -MOVED redirect sooner or later, which triggers the client to reload the slot mapping. This is my initial choice in this PR, because it is minimalistic and extensible.

@yossigo
Copy link
Collaborator

yossigo commented Mar 7, 2022

@zuiderkwast It is not explicitly stated, but I assume the main goals here are:

  • Avoid potential latency spike with lazy processing of -MOVED (as a redirect, or as a trigger for CLUSTER SLOTS)
  • Optimize clients that maintain a connection pool and can use a single connection for this kind of signalling.

Am I missing something else here?
Regarding additional messages or channels, I think we should define it now even if the implementation is partial because it takes time for clients to catch up.

@zuiderkwast
Copy link
Contributor Author

zuiderkwast commented Mar 7, 2022

@yossigo That's right. Not necessarily a connection pool, but long-running connections, yes.

We also get lower latency for the client when redirects are avoided.

Clients that reload the complete slot mapping on -MOVED redirects (multiple clients => latency spike) will not have to do that if only a single slot has been moved. Slot migration is one slot at a time. When moving a lot of slots, the slot mapping is updated one slot at a time and without this feature clients would need to reload the full slot mapping many times.

If we decide how to notify about multiple slots moved and other changes, I can as well implement it now.

Copy link
Contributor

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @yossigo that I think we need a broader idea of all the types of topology messages we want to send, so we plan for it. The implementation feels a bit off to me, in that we are sending "moved redirect" messages to clients. It seems like we should be able to send something richer with more information.

server.cluster->todo_before_sleep = 0;
server.cluster->moved_slot_since_sleep = MOVED_SLOT_NONE;
server.cluster->moved_slot_channel =
createObject(OBJ_STRING, sdsnew("__redis__:moved"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest something like, __redis__:cluster_moved. We might want to include other cluster notifications like failovers and state changes (like preferred endpoints) in the future, and they should probably be prefixed the same way. (Or perhaps we could also have something like__redis__:cluster_all that includes all notifications as well?

Copy link
Contributor

@PingXie PingXie Mar 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about cluster_slot_moved cluster_slots_moved?

connGetType(c->conn) != CONN_TYPE_TLS;
}

/* For redirects, verb must start with a dash, e.g. "-ASK" or "-MOVED". */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might consider just adding the dash from the calling site instead of documenting it here. The dash is to indicate that it's an error response. (We don't actually even need it I think, since we should append - on errors regardless)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is added in the calling site. I can delete this comment if that's what you mean. (If we don't add id, "-ERR " is prepended by the addErrorReply functions.)

@PingXie
Copy link
Contributor

PingXie commented Mar 9, 2022

+1 on defining a complete message type system. As far as the moved channel is concerned what specific information do you have in mind @madolson? If we agree on the use case of facilitating the routing information update, mimicking -ASK/-MOVED makes sense to me.

@zuiderkwast
Copy link
Contributor Author

zuiderkwast commented Mar 9, 2022

I agree with @yossigo that I think we need a broader idea of all the types of topology messages we want to send, so we plan for it.

@madolson Thx for reviewing. Sounds good. In the cluster bus, a node sends its full info, so if we want to send partial info to the client, we would have to compute some kind of diff. Perhaps it's enough to send the full state for a node. The same with slots. If we don't send all slots for a node, we'd need to compute a diff to find out which slots are moved. (Detecting a single moved slot is simple though.)

We could structure a message like a partial CLUSTER SHARDS reply but it would become quite large in some cases, unless we detect events like replica migration (master id changed and no other info changed). The events I think we'd need to detect are failover, replica migration, added node, removed node, state change, slot migration.

The implementation feels a bit off to me, in that we are sending "moved redirect" messages to clients. It seems like we should be able to send something richer with more information.

I agree. The three reasons I picked the redirects format are 1) pubsub messages are always strings so far, 2) clients are familiar with redirects and already have code for parsing them and 3) the message is very compact.

We can of course introduce a RESP-structured pubsub message here.

@zuiderkwast
Copy link
Contributor Author

Coming back to this one.

The reason I want to subscribe specifically to the single-slot-migration scenario: When a bunch of slots are migrated, this is done one slot at a time and it can be quite slow. If there are some very active clients, they will run into MOVED redirect after each migrated slot, which typically causes them to reload the slots map. When the next slot is migrated, they will run into another MOVED redirect, and so on.

If CLUSTER SLOTS is O(N), the total work done by each client will be O(N*S) where S is the number of slots being moved, if they call CLUSTER SLOTS on each MOVED redirect, after each migrated slot.

All the other topology changes (failover, etc.) normally cause the clients to reload the slots mapping only once. In this case, waiting for a redirect before reloading the slot mapping is totally fine IMO.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CLUSTER SUBSCRIBE SLOTS (topology changes)

5 participants