Skip to content

Support delay trimming slots after finishing migrating slots#14567

Merged
ShooterIT merged 22 commits intoredis:unstablefrom
ShooterIT:trim-delay
Dec 16, 2025
Merged

Support delay trimming slots after finishing migrating slots#14567
ShooterIT merged 22 commits intoredis:unstablefrom
ShooterIT:trim-delay

Conversation

@ShooterIT
Copy link
Member

@ShooterIT ShooterIT commented Nov 25, 2025

This PR introduces a mechanism that allows a module to temporarily disable trimming after an ASM migration operation so it can safely finish ongoing asynchronous jobs that depend on keys in migrating (and about to be trimmed) slots.

  1. ClusterDisableTrim/ClusterEnableTrim
    We introduce ClusterDisableTrim/ClusterEnableTrim Module APIs to allow module to disable/enable slot migration

    /* Disable automatic slot trimming. */
    int RM_ClusterDisableTrim(RedisModuleCtx *ctx)
    
    /* Enable automatic slot trimming */
    int RM_ClusterEnableTrim(RedisModuleCtx *ctx)
    

    Please notice: Redis will not start any subsequent import or migrate ASM operations while slot trimming is disabled, so modules must re-enable trimming immediately after completing their pending work.

    The only valid and meaningful time for a module to disable trimming appears to be after the MIGRATE_COMPLETED event.

  2. REDISMODULE_OPEN_KEY_ACCESS_TRIMMED
    Added REDISMODULE_OPEN_KEY_ACCESS_TRIMMED to RM_OpenKey() so that module can operate with these keys in the unowned slots after trim is paused.

    And now we don't delete the key if it is in trim job when we access it. And expireIfNeeded returns KEY_VALID if EXPIRE_ALLOW_ACCESS_TRIMMED is set, otherwise, returns KEY_TRIMMED without deleting key.

  3. REDISMODULE_CTX_FLAGS_TRIM_IN_PROGRESS
    We also extend RM_GetContextFlags() to include a flag REDISMODULE_CTX_FLAGS_TRIM_IN_PROGRESS indicating whether a trimming job is pending (due to trim pause) or in progress. Modules could periodically poll this flag to synchronize their internal state, e.g., if a trim job was delayed or if the module incorrectly assumed trimming was still active.

Bugfix: RM_SetClusterFlags could not clear a flag after enabling it first

/* On migrating side, the replica is unaware of the slot migration, so it
* doesn't know when the migration is completed (start having unowned keys),
* We delay to fire the event until the TRIMSLOTS is propagated to the replica.
* Maybe optimize it in the future. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking out loud, let's say migrating side replica becomes master before receiving TRIMSLOTS.

  1. If it happens after replica receives config update via bus, then the new master will be unaware of unowned keys.
  2. If it happens before replica receives config update via bus, then, on the config update, the new master will do a blocking trim as there is no asm task to associate.

I assume we don't have these issues on importing side. Also, I wonder what we can do to improve these two.

@ShooterIT ShooterIT changed the title Trim delay Support delay trimming slots after finishing migrating slots Nov 27, 2025
@ShooterIT
Copy link
Member Author

ShooterIT commented Dec 1, 2025

Hi @oranagra This PR introduce two APIs:

/* Acquire a server capability that changes the normal behavior of Redis.
 * Capabilities are reference-counted: a capability may be acquired
 * multiple times concurrently by multiple modules.
 *
 * Currently supported capabilities:
 *   - REDISMODULE_SERVER_CAPA_NO_TRIM:
 *       Prevent the server from trimming keys after atomic slot migration.
 *
 * Returns REDISMODULE_OK on success, REDISMODULE_ERR otherwise. */
int RM_AcquireServerCapability(RedisModuleCtx *ctx, int capa) 

/* Release a previously acquired server capability.
 *
 * This function must be called once for each successful call to
 * RM_AcquireServerCapability() with the same capability. The underlying
 * behavior is restored when the reference count for that capability
 * reaches zero.
 *
 * Returns REDISMODULE_OK on success, REDISMODULE_ERR otherwise. */
int RM_ReleaseServerCapability(RedisModuleCtx *ctx, int capa)

RedisModule_AcquireServerCapability acquires a server capability that alters the default behavior of the Redis server. Capabilities are reference-counted: the same capability may be acquired multiple times concurrently by multiple modules.

RedisModule_ReleaseServerCapability releases a previously acquired capability. The underlying behavior is restored only when the last reference to that capability is released.

At the moment, the only supported capability is CLUSTER_MODULE_POLICY_NO_TRIM, which prevents key trimming after atomic slot migration.

These APIs are intentionally generic so that additional server capabilities can be introduced in the future, instead of providing dedicated functions such as disableTrim and enableTrim.

Do you think these APIs make sense?

@ShooterIT ShooterIT added the release-notes indication that this issue needs to be mentioned in the release notes label Dec 1, 2025
@oranagra
Copy link
Member

oranagra commented Dec 1, 2025

LGTM except for the terminology. both the "acquire" and more importantly the "capability" terms don't seem good to me.
maybe the key focus is on this (taken from the doc comment): "changes the normal behavior"
how about it could be "DisableServerMechanism" or alike, but i suppose in some cases it'll be to enable something that's disabled by default.
so maybe "ChangeServerBehavior"?
or maybe we're working too hard to prepare for the future, and we should just add an explicit API for just this mechanism (i.e. RM_DisableReshardTriming"

@ShooterIT
Copy link
Member Author

Thank you @oranagra

we're working too hard to prepare for the future

yes, I had a long discussion with chatpgt, and did not get satisfactory conclusion, the reason why I use acquire and release is that we use a counter to describe it.

maybe a dedicated API is not a bad idea, and I even prefer it, The word “reshard” may be confused with our internal resharding approach, how about RM_DisableSlotMigrationTrimming ?

and WDYT? @tezc @JoanFM

@tezc
Copy link
Collaborator

tezc commented Dec 2, 2025

I was the one mainly insisted on generic API as I thought we could need something similar in future. Though, now I feel like it causes more confusion and each one of us has different taste :) So, I vote for dedicated API as well. I think all of the cluster functions include cluster in it, maybe we can all it something like RM_ClusterDisableTrim()? I don't have a strong opinion though.

@JoanFM
Copy link
Collaborator

JoanFM commented Dec 2, 2025

Thank you @oranagra

we're working too hard to prepare for the future

yes, I had a long discussion with chatpgt, and did not get satisfactory conclusion, the reason why I use acquire and release is that we use a counter to describe it.

maybe a dedicated API is not a bad idea, and I even prefer it, The word “reshard” may be confused with our internal resharding approach, how about RM_DisableSlotMigrationTrimming ?

and WDYT? @tezc @JoanFM

So I would use whatever, but I agree that this generic API name can feel a little confusing and an overkill trying to generalize for now.

So dedicated API seems best to me

@ShooterIT
Copy link
Member Author

Thank you all, let's use dedicated APIs, RM_ClusterDisableTrim looks good to me

@ShooterIT ShooterIT requested a review from tezc December 15, 2025 13:24
Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API LGTM

@ShooterIT ShooterIT merged commit 33391a7 into redis:unstable Dec 16, 2025
19 checks passed
@ShooterIT ShooterIT deleted the trim-delay branch December 16, 2025 08:31
@sundb sundb added this to Redis 8.6 Dec 17, 2025
@sundb sundb moved this to Done in Redis 8.6 Dec 17, 2025
YaacovHazan pushed a commit that referenced this pull request Dec 25, 2025
This PR introduces a mechanism that allows a module to temporarily
disable trimming after an ASM migration operation so it can safely
finish ongoing asynchronous jobs that depend on keys in migrating (and
about to be trimmed) slots.

1. **ClusterDisableTrim/ClusterEnableTrim**
We introduce `ClusterDisableTrim/ClusterEnableTrim` Module APIs to allow
module to disable/enable slot migration
    ```
    /* Disable automatic slot trimming. */
    int RM_ClusterDisableTrim(RedisModuleCtx *ctx)

    /* Enable automatic slot trimming */
    int RM_ClusterEnableTrim(RedisModuleCtx *ctx)
    ```

**Please notice**: Redis will not start any subsequent import or migrate
ASM operations while slot trimming is disabled, so modules must
re-enable trimming immediately after completing their pending work.

The only valid and meaningful time for a module to disable trimming
appears to be after the MIGRATE_COMPLETED event.

2. **REDISMODULE_OPEN_KEY_ACCESS_TRIMMED**
Added REDISMODULE_OPEN_KEY_ACCESS_TRIMMED to RM_OpenKey() so that module
can operate with these keys in the unowned slots after trim is paused.

And now we don't delete the key if it is in trim job when we access it.
And `expireIfNeeded` returns `KEY_VALID` if
`EXPIRE_ALLOW_ACCESS_TRIMMED` is set, otherwise, returns `KEY_TRIMMED`
without deleting key.

3. **REDISMODULE_CTX_FLAGS_TRIM_IN_PROGRESS**
We also extend RM_GetContextFlags() to include a flag
REDISMODULE_CTX_FLAGS_TRIM_IN_PROGRESS indicating whether a trimming job
is pending (due to trim pause) or in progress. Modules could
periodically poll this flag to synchronize their internal state, e.g.,
if a trim job was delayed or if the module incorrectly assumed trimming
was still active.

Bugfix: RM_SetClusterFlags could not clear a flag after enabling it first.

---------

Co-authored-by: Ozan Tezcan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants