MSC3944: Dropping stale send-to-device messages #3944

uhoreg wants to merge 4 commits into matrix-org:main
Conversation
This looks great, and would help us out over at Beeper, I think. We've been seeing clients (especially Element Web) generate a fair amount of room key chatter, and mobile apps that aren't actively syncing at the time end up with a large backlog of requests and cancellations when they do sync again. Since Synapse also limits the number of to_device events to 100 per response, this results in us having to do several syncs back-to-back until we can consider the app "caught up".
| - `m.room_key_request` with `action: request`, if there is a corresponding
|   `m.room_key_request` message with `action: request_cancellation` and the same
|   `request_id` and `requesting_device_id` fields, sent by the same user. If
|   the request message is dropped, the cancellation message is dropped as well.
| - `m.secret.request` with `action: request`, if there is a corresponding
|   `m.secret.request` with `action: request_cancellation` and the same
|   `request_id` and `requesting_device_id` fields, sent by the same user. If
|   the request message is dropped, the cancellation message is dropped as well.
| - `m.key.verification.request`, if there is a corresponding
|   `m.key.verification.cancel` with the same `transaction_id` field, sent by the
|   same user. If the request message is dropped, the cancellation message is
|   dropped as well.
| - `m.key.verification.request` when the `timestamp` is more than 10 minutes in
|   the past. Since clients are to ignore verification requests that are older
|   than this, they can be safely dropped. The server may wish to remember the
|   `transaction_id` and sender for some time, so that it can also drop the
|   `m.key.verification.cancel` message with the same `transaction_id` and
|   sender, if such a message gets sent.
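The rules quoted above all follow the same pattern: pair each request with a later cancellation (or an expiry) and drop both sides of the pair. A purely illustrative sketch of how a server might apply them when flushing a user's pending to-device queue is below; it is not Synapse code, the event shape (a dict with `type`, `sender` and `content`) is an assumption, and it only matches cancellations that are already sitting in the same queue.

```python
import time

VERIFICATION_MAX_AGE_MS = 10 * 60 * 1000  # clients ignore older verification requests


def _request_key(ev):
    c = ev["content"]
    return (ev["sender"], ev["type"], c.get("request_id"), c.get("requesting_device_id"))


def prune_pending(events, now_ms=None):
    """Return the subset of pending to-device events still worth delivering."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms

    # First pass: index cancellations so requests can be matched against them.
    cancelled_requests = {
        _request_key(ev)
        for ev in events
        if ev["type"] in ("m.room_key_request", "m.secret.request")
        and ev["content"].get("action") == "request_cancellation"
    }
    cancelled_verifications = {
        (ev["sender"], ev["content"].get("transaction_id"))
        for ev in events
        if ev["type"] == "m.key.verification.cancel"
    }

    kept = []
    dropped_requests = set()       # keys of request messages dropped below
    dropped_verifications = set()  # (sender, transaction_id) of dropped verification requests
    for ev in events:
        c = ev["content"]
        if ev["type"] in ("m.room_key_request", "m.secret.request"):
            key = _request_key(ev)
            if c.get("action") == "request" and key in cancelled_requests:
                dropped_requests.add(key)
                continue  # request has a matching cancellation: drop it
            if c.get("action") == "request_cancellation" and key in dropped_requests:
                continue  # its request was dropped, so drop the cancellation too
        elif ev["type"] == "m.key.verification.request":
            key = (ev["sender"], c.get("transaction_id"))
            expired = now_ms - c.get("timestamp", now_ms) > VERIFICATION_MAX_AGE_MS
            if key in cancelled_verifications or expired:
                dropped_verifications.add(key)
                continue  # cancelled or stale: clients would ignore it anyway
        elif ev["type"] == "m.key.verification.cancel":
            if (ev["sender"], c.get("transaction_id")) in dropped_verifications:
                continue  # its request was dropped, so drop the cancel too
        kept.append(ev)
    return kept
```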
It's a bit frustrating that we need to set out special rules for each type of to-device message, rather than laying out a generic framework for cancellable and expiring to-device messages. It feels like something of a layering violation, and feels like we're always going to be extending this list (with consequent extra work for server implementations).
| the `requesting_device_id` (in the case of `m.room_key_request` and
| `m.secret.request`), or `device_id` (in the case of
| `m.key.verification.request`) refers to a device that no longer exists.
I would extend this to include semantically equivalent to-device messages. There's real-world evidence that some clients will send request cancellations with the exact same request ID / requesting device ID many (>40) times.
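A hypothetical extension along these lines (same assumed event shape as the sketch above) could collapse repeated, semantically identical requests and cancellations before delivery, so a device never sees the same cancellation dozens of times:

```python
def dedupe_identical(events):
    """Drop repeated copies of semantically identical request/cancellation messages."""
    seen = set()
    deduped = []
    for ev in events:
        c = ev["content"]
        if ev["type"] in ("m.room_key_request", "m.secret.request"):
            key = (
                ev["sender"],
                ev["type"],
                c.get("action"),
                c.get("request_id"),
                c.get("requesting_device_id"),
            )
            if key in seen:
                continue  # an identical request/cancellation is already queued
            seen.add(key)
        deduped.append(ev)
    return deduped
```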
| @@ -0,0 +1,70 @@
| # MSC3944: Dropping stale send-to-device messages
I'm going to implement this in the sliding sync proxy because it is aligned with the core bandwidth / speed goals of MSC3575. Before implementing this though, I did some analysis on the size/kinds of to-device events that stack up on the proxy for clients:
- The largest stack had a huge 237852 to-device events pending. This is uncommon, and rapidly drops off. The median is somewhere in the 100s.
- Breaking down by event type, the vast majority by several orders of magnitude are
  `m.room_key_request` events:

   event_type                 | count
  ----------------------------+--------
   m.room_key_request         | 121067
   m.room.encrypted           |    256
   m.secret.request           |     51
   m.key.verification.cancel  |     16
   m.key.verification.request |     13
- By following the semantics in this MSC, I found that:
  - 28089 to-device messages
  - 14495 unique device|req_id tuples
  - 1564 outstanding requests
  - 12931 cancelled requests
  - 260 multi-cancelled requests
  - potential savings: removing 26525 messages or 94.431984050696 %
- "multi-cancelled" requests are identical request cancellation events for the same request ID / requesting device ID.
- The 94% figure is consistent on large (100,000s events) accounts as well as mid-range (10,000s events) accounts. The smaller the stack, the less the benefit. On a device with 204 pending events, this MSC would remove 99 of them (48%).
There is real and immediate benefit to be had by implementing this MSC everywhere, particularly for `m.room_key_request` events, which is the event type I did this analysis on.
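For reference, here is a rough sketch of the kind of bookkeeping that could reproduce the counts above. The event shape is the same assumption as in the earlier sketches, only `m.room_key_request` is considered, and the counting rules are my reading of the figures rather than the actual script used for the analysis.

```python
from collections import defaultdict


def summarise_room_key_requests(events):
    """Group m.room_key_request traffic by (requesting_device_id, request_id)."""
    by_request = defaultdict(lambda: {"requests": 0, "cancels": 0})
    total = 0
    for ev in events:
        if ev["type"] != "m.room_key_request":
            continue
        c = ev["content"]
        key = (c.get("requesting_device_id"), c.get("request_id"))
        if c.get("action") == "request":
            by_request[key]["requests"] += 1
        elif c.get("action") == "request_cancellation":
            by_request[key]["cancels"] += 1
        total += 1

    outstanding = sum(1 for v in by_request.values() if v["cancels"] == 0)
    cancelled = sum(1 for v in by_request.values() if v["cancels"] > 0)
    multi_cancelled = sum(1 for v in by_request.values() if v["cancels"] > 1)
    # Everything except one message per still-outstanding request can be dropped.
    savable = total - outstanding
    print(f"{total} to-device messages")
    print(f"{len(by_request)} unique device|req_id tuples")
    print(f"{outstanding} outstanding requests")
    print(f"{cancelled} cancelled requests")
    print(f"{multi_cancelled} multi-cancelled requests")
    print(f"potential savings: removing {savable} messages "
          f"or {100 * savable / max(total, 1)} %")
```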
From poking around, it seems like this was never implemented; is that correct? The tickets above got a little mangled in the migration from matrix-org/synapse to element-hq/synapse, so it's not super clear what the state is. Is this MSC still relevant in a sliding-sync world?
Yes, it's still relevant. It landed in the sliding sync proxy but didn't land in the native Synapse implementation.