
Handle -READONLY as a redirection signal for Redis Cluster (AWS ElastiCache support) #1656

Merged
vladvildanov merged 3 commits into predis:main from NivekNK:feature/readonly-handler
Mar 23, 2026
Conversation

@NivekNK
Contributor

@NivekNK NivekNK commented Mar 18, 2026

This PR introduces explicit handling for the -READONLY error response within the RedisCluster connection class.

The Problem:
In AWS ElastiCache (Redis OSS mode), when using the Configuration Endpoint (DNS round-robin), a client might occasionally connect to a replica. If a Lua script (with KEYS[]) is executed against this replica, the server returns the error "-READONLY You can't write against a read only replica" instead of a -MOVED redirection.

Currently, Predis treats this as a generic ServerException. This can lead to intermittent failures: the internal slot map is neither marked as stale nor updated when this specific protocol error is received, so the retry may hit the same node.

The Solution:

  1. Modified onErrorResponse to intercept the READONLY prefix.
  2. Implemented onReadOnlyResponse which:
    • Disconnects the current faulty connection.
    • Triggers askSlotMap() to refresh the cluster topology.
    • Re-executes the command, allowing the distributor to pick the correct Master node based on the updated map.
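The steps above can be sketched as a small toy model. This is not the Predis code and it is written in Python rather than PHP; `FakeNode`, `FakeCluster`, and their method names are hypothetical stand-ins that only mirror the names used in the list (`onErrorResponse`, `onReadOnlyResponse`, `askSlotMap`). It assumes that a topology refresh reports the correct primary for the slot.

```python
class FakeNode:
    """Hypothetical stand-in for one cluster node connection."""

    def __init__(self, name, is_replica):
        self.name = name
        self.is_replica = is_replica
        self.connected = True

    def execute(self, command):
        # A replica rejects writes with a READONLY error reply.
        if self.is_replica:
            return "-READONLY You can't write against a read only replica"
        return "+OK"

    def disconnect(self):
        self.connected = False


class FakeCluster:
    """Toy model of the flow described above; not the Predis class."""

    def __init__(self, stale_owner, real_primary):
        self.slot_owner = stale_owner        # stale slot map entry
        self._real_primary = real_primary    # what a refresh would report
        self.refreshes = 0

    def ask_slot_map(self):
        # Stand-in for askSlotMap(): re-learn which node owns the slot.
        self.refreshes += 1
        self.slot_owner = self._real_primary

    def execute(self, command):
        node = self.slot_owner
        reply = node.execute(command)
        if reply.startswith("-"):
            return self.on_error_response(node, command, reply)
        return reply

    def on_error_response(self, node, command, reply):
        # Dispatch on the error prefix, the first word after the "-".
        prefix = reply[1:].split(" ", 1)[0]
        if prefix == "READONLY":
            return self.on_read_only_response(node, command)
        raise RuntimeError(reply)

    def on_read_only_response(self, node, command):
        node.disconnect()         # 1. drop the connection to the replica
        self.ask_slot_map()       # 2. refresh the cluster topology
        return self.execute(command)  # 3. re-run against the new owner


replica = FakeNode("replica", is_replica=True)
primary = FakeNode("primary", is_replica=False)
cluster = FakeCluster(stale_owner=replica, real_primary=primary)
result = cluster.execute("EVAL ...")  # placeholder command string
```

In this model the re-execution succeeds only because the refreshed map points at a primary; a real implementation would also need a bound on re-execution in case the refreshed map is still wrong.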

This approach follows the existing reactive discovery pattern in Predis and ensures high availability in AWS ElastiCache environments without requiring a new configuration option.

@vladvildanov
Contributor

@NivekNK I still don't understand why we need this specific post-retry exception handling, since the READONLY error is wrapped in a ServerException and handling that exception already enables a slot map update.

Please provide a unit test case that shows the behaviour when a READONLY error wrapped in a ServerException is thrown. I want to understand whether we already retry and update the topology internally and, if we do, why we need another topology update after all retries have happened.

@NivekNK
Contributor Author

NivekNK commented Mar 19, 2026

@vladvildanov Hi! I've added the unit test you requested to demonstrate the exact behavior when a READONLY error is wrapped inside a ServerException.

As the new test shows, when ServerException is thrown, Predis does catch it and triggers the retry mechanism. However, it does not trigger a topology update (askSlotMap() and disconnect() are never called) because onFailCallback() only evaluates and handles connection-level errors (ConnectionException). As a result, the automated retry keeps hitting the very same broken node until it reaches the retry limit, and the exception then bubbles up to the end user.
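The difference described above can be illustrated with a deliberately simplified retry loop. This is a toy model in Python, not the Predis retry code or the unit test from this PR; `run_with_retries`, the `slot_map` dict, and the `refresh_on_readonly` flag are all hypothetical names, and the "refresh" is assumed to yield the correct primary.

```python
def run_with_retries(slot_map, retry_limit, refresh_on_readonly):
    """Try a write once plus up to `retry_limit` retries.

    Returns (result, nodes_tried). `slot_map` is a dict whose "owner"
    entry names the node the client believes owns the slot.
    """
    tried = []
    for _ in range(1 + retry_limit):
        node = slot_map["owner"]
        tried.append(node)
        if node == "primary":
            return "OK", tried
        # The node is a replica, so the write is rejected with READONLY.
        if refresh_on_readonly:
            # Stand-in for disconnect() + askSlotMap().
            slot_map["owner"] = "primary"
    # Retry budget exhausted; the error reaches the caller.
    return "READONLY", tried


# Without a topology refresh every attempt lands on the same replica.
without_refresh = run_with_retries({"owner": "replica"}, 3, False)

# With a refresh the second attempt is routed to the primary.
with_refresh = run_with_retries({"owner": "replica"}, 3, True)
```

Here `without_refresh` ends with the error after four attempts against the replica, while `with_refresh` succeeds on the second attempt.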

This is exactly why my explicit handling in onErrorResponse and onReadOnlyResponse is strictly necessary to cleanly survive AWS ElastiCache OSS failover events. By manually catching the -READONLY keyword inside ErrorResponse and explicitly triggering askSlotMap() + disconnect(), we actively force the topology to refresh and successfully re-route the failed command to the newly promoted primary node.

Let me know what you think and if you need any further adjustments.

I see the CI failed on testClusterExecutePipeline throwing a MOVED ServerException. Since my code exclusively touches the READONLY case in onErrorResponse(), it is unrelated to this flaky pipeline test failure. Could you please re-run the CI jobs?

@vladvildanov
Contributor

@NivekNK Thanks for the detailed explanation, no more objections from my side!

@vladvildanov vladvildanov merged commit bb37322 into predis:main Mar 23, 2026
65 of 66 checks passed