fix(cluster): Cluster reconnect sharded subscribers #2060

Merged
PavelPashov merged 2 commits into redis:main from PavelPashov:feat/cluster-reconnect-on-refresh-failure
Jan 15, 2026


@PavelPashov PavelPashov commented Jan 13, 2026

When all sharded subscriber connections fail and the subsequent slots cache refresh returns ClusterAllFailedError, the cluster now enters the reconnecting state instead of becoming a zombie.

This occurs when the cluster topology changes and all nodes are replaced with new IPs: the subscriber connections fail, which triggers a slots refresh via the -node event handler. If this refresh also fails (e.g., the duplicated connection for CLUSTER SLOTS times out or closes), the cluster stays stuck in the ready state with no working connections, because normal pool connections use lazyConnect: true and never emit end events to trigger the drain->close->reconnect cycle.

Subscriber-triggered refreshSlotsCache() calls now use a dedicated callback that detects ClusterAllFailedError and calls disconnect(true) to force reconnection, preventing the zombie state.
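
For readers who want to see the shape of the fix, here is a minimal, self-contained sketch of the idea described above. It does not use ioredis itself: `MiniCluster`, `handleSubscriberFailure`, and the injected `doRefresh` function are hypothetical names invented for illustration, the local `ClusterAllFailedError` class is a simplified stand-in for the real one, and jumping straight to a `reconnecting` status is a simplification of whatever state transitions the real `disconnect(true)` goes through.

```typescript
// Simplified stand-in for the library's ClusterAllFailedError (assumption).
class ClusterAllFailedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ClusterAllFailedError";
  }
}

type RefreshCallback = (err?: Error | null) => void;

// Hypothetical miniature of a cluster client; only the parts needed here.
class MiniCluster {
  status: "ready" | "reconnecting" = "ready";

  // doRefresh simulates the CLUSTER SLOTS refresh and reports its outcome.
  constructor(private readonly doRefresh: (cb: RefreshCallback) => void) {}

  refreshSlotsCache(cb?: RefreshCallback): void {
    this.doRefresh((err) => {
      if (cb) cb(err);
    });
  }

  disconnect(reconnect = false): void {
    // Simplification: go directly to "reconnecting" when asked to reconnect.
    if (reconnect) this.status = "reconnecting";
  }

  // Called when a sharded subscriber connection has failed.
  handleSubscriberFailure(): void {
    this.refreshSlotsCache((err) => {
      // Dedicated callback: only an all-nodes-failed refresh forces a reconnect.
      if (err && err.name === "ClusterAllFailedError") {
        this.disconnect(true);
      }
    });
  }
}

// Refresh fails on every node: the client leaves "ready" and reconnects.
const stuck = new MiniCluster((cb) =>
  cb(new ClusterAllFailedError("Failed to refresh slots cache."))
);
stuck.handleSubscriberFailure();
console.log(stuck.status); // "reconnecting"

// Refresh succeeds: nothing changes.
const healthy = new MiniCluster((cb) => cb(null));
healthy.handleSubscriberFailure();
console.log(healthy.status); // "ready"
```

The sketch checks the error by `name` rather than `instanceof` only so that it behaves the same under older compile targets; how the merged code identifies the error is not shown in this conversation.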

PavelPashov force-pushed the feat/cluster-reconnect-on-refresh-failure branch from 6b76f15 to 00f3d4b on January 14, 2026 16:02

jit-ci bot commented Jan 14, 2026

❌ Security scan failed

Security scan failed: Branch feat/cluster-reconnect-on-refresh-failure does not exist in the remote repository


💡 Need to bypass this check? Comment @sera bypass to override.

PavelPashov changed the title from "feat(cluster): add reconnectOnRefreshFailure option" to "feat(cluster): Cluster reconnect sharded subscribers" on Jan 14, 2026
@PavelPashov PavelPashov marked this pull request as ready for review January 14, 2026 17:02

jit-ci bot commented Jan 14, 2026

❌ Security scan failed

Security scan failed: Branch feat/cluster-reconnect-on-refresh-failure does not exist in the remote repository


💡 Need to bypass this check? Comment @sera bypass to override.

PavelPashov force-pushed the feat/cluster-reconnect-on-refresh-failure branch from aac0707 to 5e703f6 on January 15, 2026 09:08

jit-ci bot commented Jan 15, 2026

❌ Security scan failed

Security scan failed: Branch feat/cluster-reconnect-on-refresh-failure does not exist in the remote repository


💡 Need to bypass this check? Comment @sera bypass to override.

PavelPashov changed the title from "feat(cluster): Cluster reconnect sharded subscribers" to "fix(cluster): Cluster reconnect sharded subscribers" on Jan 15, 2026
@PavelPashov PavelPashov merged commit def9804 into redis:main Jan 15, 2026
11 checks passed
@PavelPashov PavelPashov deleted the feat/cluster-reconnect-on-refresh-failure branch January 15, 2026 11:16
github-actions bot pushed a commit that referenced this pull request Jan 15, 2026
## [5.9.2](v5.9.1...v5.9.2) (2026-01-15)

### Bug Fixes

* **cluster:** Cluster reconnect sharded subscribers ([#2060](#2060)) ([def9804](def9804))
* preserve replica slots on MOVED in pipelines ([#2059](#2059)) ([a1c3e9d](a1c3e9d))

### Reverts

* Revert "fix: preserve replica slots on MOVED in pipelines (#2059)" (#2062) ([517b932](517b932)), closes [#2059](#2059) [#2062](#2062)
github-actions bot commented

🎉 This PR is included in version 5.9.2 🎉

The release is available on:

Your semantic-release bot 📦🚀

PavelPashov added a commit to PavelPashov/ioredis that referenced this pull request Jan 15, 2026
* fix: trigger reconnect when sharded subscriber slots refresh fails

When all sharded subscriber connections fail and the subsequent slots cache
refresh returns ClusterAllFailedError, the cluster now properly enters
reconnecting state instead of becoming zombied. This occurs when the cluster
topology changes and all nodes are replaced with new IPs - the subscriber
connections fail, triggering a slots refresh via the "-node" event handler.
If this refresh fails (e.g., the duplicated connection for CLUSTER SLOTS
times out or closes), the cluster becomes stuck in "ready" state with no
working connections because normal pool connections use lazyConnect: true
and never emit "end" events to trigger the drain->close->reconnect cycle.
Now subscriber-triggered refreshSlotsCache() calls use a dedicated callback
that detects ClusterAllFailedError and calls disconnect(true) to force
reconnection, preventing the zombie state.

* test: ensure reconnect after sharded subscriber failure
This was referenced Feb 8, 2026