Cae 633 add functional tests notifications by kiryazovi-redis · Pull Request #3357 · redis/lettuce

kiryazovi-redis · 2025-07-11T15:14:57Z

Make sure that:

You have read the contribution guidelines.
You have created a feature request first to discuss your contribution intent. Please reference the feature request ticket number in the pull request.
You applied code formatting rules using the mvn formatter:format target. Don’t submit any formatting related changes.
You submit test cases (unit or integration tests) that back your changes.

CAE-633: Add Functional Tests for Redis Enterprise Push Notifications

Overview

This PR adds comprehensive functional tests to validate Redis Enterprise push notifications during maintenance events, ensuring the Lettuce client correctly captures and processes RESP3 push messages for cluster state changes.

What's Added

Complete test suite for all Redis Enterprise push notification types:
- MOVING - Endpoint migration notifications
- MIGRATING - Data migration in progress
- MIGRATED - Data migration completed
- FAILING_OVER - Failover initiation
- FAILED_OVER - Failover completion
Dynamic cluster discovery with real-time adaptation to cluster state changes
Proper RESP3 push message capture using Lettuce's PushListener API
Robust test isolation with state recording and restoration between tests

Technical Implementation

Push Notification Capture:

Replaced flawed ping-based approach with proper PushListener API integration
Added ByteBuffer decoding to extract actual IP addresses from push messages
Successfully hooks into Lettuce's protocol layer to capture RESP3 push messages

Cluster State Management:

Implemented RedisEnterpriseConfigDiscovery with cache clearing for fresh master/slave lists
Added dynamic node selection for failover operations
Real-time cluster configuration discovery prevents stale state issues

Fault Injection:

Enhanced FaultInjectionClient with proper validation and retry logic
Added support for migration operations with appropriate timeouts (300s)
Fixed infinite retry loops in validation

Critical Bug Fixes Applied

Double-Prefixing Bug: Fixed malformed shard IDs (redis:redis:1 → redis:1) in state recording
Migration Timeouts: Increased timeout from 120s to 300s for data migration operations
Restoration Logic: Fixed cluster state restoration to only failover master shards (Redis Enterprise constraint)

Test Results

Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
Total time: 09:05 min

All notification types successfully validated with proper push message capture and cluster state management.

Files Changed

MaintenanceNotificationTest.java - Main test suite with 5 comprehensive tests
FaultInjectionClient.java - Enhanced fault injection with proper timeouts
RedisEnterpriseConfigDiscovery.java - Dynamic cluster discovery implementation
RedisEnterpriseConfig.java - Cluster configuration models
Additional utility classes for test infrastructure

Benefits

Production readiness: Validates real-world Redis Enterprise maintenance scenarios
Proactive monitoring: Ensures applications can react to cluster state changes
Regression prevention: Comprehensive test coverage prevents future push notification issues
Dynamic adaptation: Tests work with any Redis Enterprise cluster configuration

Testing

Tests validated against live Redis Enterprise clusters with real maintenance operations. The test suite adapts dynamically to different cluster configurations and provides comprehensive coverage of all push notification scenarios.

This PR establishes a solid foundation for Redis Enterprise push notification testing, ensuring the Lettuce client maintains robust support for maintenance event handling in production environments.

* DOC-4758 added landing page examples * DOC-4758 added Path page examples * DOC-4758 work around arrAppend issue * DOC-4758 applied formatting

.vscode/settings.json

tishun

Good, but I think we can refactor the code a bit to make it more readable and easier to maintain

src/test/java/io/lettuce/scenario/FaultInjectionClient.java

src/test/java/io/lettuce/scenario/RedisEnterpriseConfigDiscovery.java

- Fixed receiveMovingPushNotificationTest to properly capture MOVING messages - Switched from flawed ping-based approach to proper PushListener API - Added ByteBuffer decoding to extract actual IP addresses from push messages - Enhanced two-step rebind process (migrate + bind) working correctly - Successfully capturing RESP3 push messages: MOVING, MIGRATING, MIGRATED, etc. - Test now passes and captures real MOVING notifications during endpoint rebind The test now properly hooks into Lettuce's protocol layer to capture RESP3 push messages that come before command responses, solving the core issue where we were only seeing PONG responses.

…h notification tests - Fixed stale cluster configuration issue in RedisEnterpriseConfigDiscovery - Added proper cache clearing in parseShards() method to get fresh master/slave lists - Added getNodeWithMasterShards() method for proper node selection in failover tests - Fixed infinite retry loops in FaultInjectionClient validation - Updated MaintenanceNotificationTest to use dynamic node selection for both FAILING_OVER and FAILED_OVER tests - Added CONTEXT_MOVING_PUSH_NOTIFICATIONS.md to .gitignore - All push notification tests now working with real-time dynamic cluster discovery

- Fix double-prefixing bug in shard ID state recording * originalShardRoles.put(masterShard, 'master') instead of 'redis:' + masterShard * Prevents malformed redis:redis:1 IDs that broke state restoration - Increase migration timeout from 120s to 300s in FaultInjectionClient * Migration operations move actual data and need more time than failovers * Fixes receiveMigratedPushNotificationTest timeout failures - Fix cluster state restoration logic to only failover master shards * Redis Enterprise constraint: only master shards can be failed over * Prevents 'ERROR: No such master shard id' when trying to failover slaves * Only targets current masters that should become slaves - Complete test suite now passing: 5/5 tests successful * MOVING, MIGRATING, MIGRATED, FAILING_OVER, FAILED_OVER notifications * Dynamic cluster discovery with proper state isolation * Real-time adaptation to cluster state changes between test runs Tests run: 5, Failures: 0, Errors: 0, Skipped: 0

…numbers - Remove all legacy methods with hardcoded defaults from FaultInjectionClient: * triggerShardMigration(String, String) - 2-parameter version * triggerShardFailover(String, String) - 2-parameter version * triggerShardFailover(String, String, String) - 3-parameter version * triggerMovingNotification(String, String, String) - 3-parameter version * validateRladminCommandCompletion() - useless method - Replace magic numbers with meaningful constants across scenario folder: * Use Duration.ofMinutes() for long timeouts (300s->5min, 180s->3min, etc.) * Add descriptive constants for all timeout values * Improve code readability and maintainability - Update tests to use 4-parameter dynamic discovery methods - Fix Jackson import to use proper import instead of full package name - Update maintenance operation system to handle missing legacy methods All tests pass and compilation successful.

… tracking - Increase StepVerifier timeout for long-running operations (migrations/failovers) - Change from .verifyComplete() to .expectComplete().verify(LONG_OPERATION_TIMEOUT) - All tests now complete successfully without timing out - Remove .vscode/settings.json from git tracking

Copilot

Pull Request Overview

This PR adds comprehensive functional tests for Redis Enterprise push notifications during maintenance events, enabling the Lettuce client to properly capture and validate RESP3 push messages for cluster state changes. The implementation includes full test coverage for all notification types (MOVING, MIGRATING, MIGRATED, FAILING_OVER, FAILED_OVER) with dynamic cluster discovery and proper state management.

Complete test suite validating all Redis Enterprise push notification types during maintenance operations
Dynamic cluster configuration discovery with real-time adaptation to cluster state changes
Enhanced fault injection client with proper timeout management and validation for Redis Enterprise operations

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

File	Description
MaintenanceNotificationTest.java	Main test suite with 5 comprehensive tests for all push notification types
RedisEnterpriseConfig.java	Dynamic cluster discovery and configuration management with validation
FaultInjectionClient.java	Enhanced fault injection with proper timeouts, validation, and Redis Enterprise operation support
FaultInjectionClientUnitTest.java	Unit tests for fault injection client validation and basic functionality

Comments suppressed due to low confidence (2)

Copilot · 2025-07-22T15:49:01Z

src/test/java/io/lettuce/scenario/RedisEnterpriseConfig.java

+     * Extract the numeric part of shard ID from full format "redis:X" -> "X"
+     */
+    private String extractNumericShardId(String fullShardId) {
+        if (fullShardId == null) {
+            return null;
+        }
+        if (fullShardId.contains(":")) {
+            return fullShardId.split(":")[1];
+        }
+        return fullShardId;
+    }
+
+    /**
+     * Extract the endpoint ID part from full format "endpoint:X:Y" -> "X:Y"
+     */
+    private String extractEndpointId(String fullEndpointId) {
+        if (fullEndpointId == null) {
+            return null;
+        }
+        if (fullEndpointId.startsWith("endpoint:")) {
+            return fullEndpointId.substring("endpoint:".length());
+        }
+        return fullEndpointId;


This method and extractNumericNodeId() have similar logic but handle null differently. Consider creating a generic extract method or ensuring consistent null handling patterns.

Suggested change

* Extract the numeric part of shard ID from full format "redis:X" -> "X"

*/

private String extractNumericShardId(String fullShardId) {

if (fullShardId == null) {

return null;

}

if (fullShardId.contains(":")) {

return fullShardId.split(":")[1];

}

return fullShardId;

}

/**

* Extract the endpoint ID part from full format "endpoint:X:Y" -> "X:Y"

*/

private String extractEndpointId(String fullEndpointId) {

if (fullEndpointId == null) {

return null;

}

if (fullEndpointId.startsWith("endpoint:")) {

return fullEndpointId.substring("endpoint:".length());

}

return fullEndpointId;

* Extract a specific part of a string based on a prefix and optional delimiter.

*/

private String extractPart(String input, String prefix, String delimiter) {

if (input == null) {

return null;

}

if (prefix != null && input.startsWith(prefix)) {

input = input.substring(prefix.length());

}

if (delimiter != null && input.contains(delimiter)) {

return input.split(delimiter)[1];

}

return input;

}

/**

* Extract the numeric part of shard ID from full format "redis:X" -> "X"

*/

private String extractNumericShardId(String fullShardId) {

return extractPart(fullShardId, "redis:", ":");

}

/**

* Extract the endpoint ID part from full format "endpoint:X:Y" -> "X:Y"

*/

private String extractEndpointId(String fullEndpointId) {

return extractPart(fullEndpointId, "endpoint:", null);

Copilot · 2025-07-22T15:49:02Z

src/test/java/io/lettuce/scenario/MaintenanceNotificationTest.java

+                List<String> currentShards = new ArrayList<>();
+
+                // Get current shards (already in "redis:X" format)
+                currentShards.addAll(currentConfig.getShardsForNode(nodeId));


Unnecessary ArrayList creation when currentShards is immediately populated with addAll(). Consider direct assignment: List currentShards = currentConfig.getShardsForNode(nodeId);

Suggested change

List<String> currentShards = new ArrayList<>();

// Get current shards (already in "redis:X" format)

currentShards.addAll(currentConfig.getShardsForNode(nodeId));

// Get current shards (already in "redis:X" format)

List<String> currentShards = currentConfig.getShardsForNode(nodeId);

Copilot · 2025-07-22T15:49:02Z

src/test/java/io/lettuce/scenario/MaintenanceNotificationTest.java

+                // Only failover shards that are currently masters but should be slaves
+                if ("slave".equals(originalRole) && currentConfig.getMasterShardIds().contains(shardId)) {
+                    // Should be slave but is currently master - failover this master
+                    mastersToFailover.add(shardId.replace("redis:", ""));


String replacement logic is duplicated from RedisEnterpriseConfig.extractNumericShardId(). Consider using the existing utility method or creating a shared utility.

Suggested change

mastersToFailover.add(shardId.replace("redis:", ""));

mastersToFailover.add(RedisEnterpriseConfig.extractNumericShardId(shardId));

Copilot · 2025-07-22T15:49:02Z

src/test/java/io/lettuce/scenario/FaultInjectionClient.java

            return Mono.just(false);
-        }).repeatWhenEmpty(repeat -> repeat.delayElements(checkInterval).timeout(timeout)
-                .doOnError(e -> log.error("Timeout waiting for action to complete", e)));
+        }).repeatWhenEmpty(repeat -> repeat.delayElements(checkInterval)).timeout(timeout)


The timeout is applied after repeatWhenEmpty, which means the timeout applies to the entire repeat sequence rather than individual checks. This could cause premature timeouts. Consider moving timeout inside the repeat or using takeUntil with a timer.

Suggested change

}).repeatWhenEmpty(repeat -> repeat.delayElements(checkInterval)).timeout(timeout)

}).repeatWhenEmpty(repeat -> repeat.delayElements(checkInterval))

.takeUntil(Mono.delay(timeout))

src/test/java/io/lettuce/scenario/FaultInjectionClient.java

…ions

kiryazovi-redis · 2025-07-28T08:18:49Z

will squash and merge, in order to continue with other testing, but all comments left will be taken care of no matter what.
The code is vibecoded, so no matter what happens we have to maintain its... less-maintanable parts.

…ample.java

* [Hitless Upgrades] React to maintenance events (#3345) * v0.1 * Simple reconnect now working * Bind address from message is now considered * Self-register the handler * Format code * Filter push messages in a more stable way * (very hacky) Relax comand expire timers globbaly * Configure if timeout relaxing should be applied * Proper way to close channel * Configure the timneout relaxing * Sequential handover implemented * Did not address formatting * Prolong the rebind windwow for relaxed tiemouts * PubSub no longer required; CommandExpiryWriter is now channel aware; Polishing * Use the new MOVING push message from the RE server * Unit test was not chaining delgates in the same way that the RedisClient/RedisClusterClient was * Fix REBIND message validation * Fixed the expiry mechanism * Polishing * Fix NPE. Seems like AttributeMap.attr is not accurate and actually return's null causing some unit test failures. * Add support for MIGRATING/MIGRATED message handling in command expiry This commit adds the ability to listen for MIGRATING and MIGRATED messages and trigger extended command expiry timeouts during Redis shard migration. Key changes: - Enhanced RebindAwareConnectionWatchdog to detect MIGRATING/MIGRATED messages - RebindAwareExpiryWriter to trigger timeout relaxation whenever MIGRATING message is received This feature allows commands to have relaxed timeouts during shard migration operations, preventing unnecessary timeouts when Redis is temporarily busy with migration tasks. * formating * Fix Disabling relaxTimeouts after upgrade can interfere with an ongoing one from re-bind * Additional fix for timeout relaxing disabled * Fix push message listener registered multiple times after rebind. * Fix: Report correct command timeout when relaxTimeout is configured * Disable relaxedTimeout after configured grace period - Introduce a delay before disabling relaxedTimeout - Grace period duration is provided via push notification * Code clean up - Remove reading from pub/sub chanel and relay only on push notifications * Add FAILING_OVER/FAILED_OVER * Polishing : Rename components to use the word 'maintenace' --------- Co-authored-by: Igor Malinovskiy <[email protected]> Co-authored-by: ggivo <[email protected]> # Conflicts: # src/main/java/io/lettuce/core/ClientOptions.java * [Hitless upgrades] Add unit tests for the newly introduced classes #3355 (#3356) * Unit tests for the maintanence aware classes * Did not format properly * Proper license * Cae 633 add functional tests notifications (#3357) - excluding JsonExample.java * Cae 1130 timeout tests (#3377) * initial WIP, with lots of debugging, and some non-working tests * debug * more attemtps at debugging * Refactor: Move cluster state management methods from MaintenanceNotificationTest to RedisEnterpriseConfig - Moved refreshClusterConfig, recordOriginalClusterState, and restoreOriginalClusterState methods - Updated call sites in MaintenanceNotificationTest to use RedisEnterpriseConfig methods - Added required imports and static variables to RedisEnterpriseConfig - Maintained existing functionality while improving code organization Improvements to RelaxedTimeoutConfigurationTest: - Simplified traffic generation logic by removing complex multi-phase testing - Streamlined BLPOP command execution with better timeout detection - Added relaxed timeout detection and recording during maintenance events - Improved logging and error handling for timeout analysis - Enhanced test assertions to focus on relaxed timeout detection rather than success counts - Added MOVING operation duration tracking for better test analysis * Improve test reliability and cleanup: Add @AfterEach cleanup, enhance endpoint tracking, and improve logging * add un-relaxed tests. will investigate further why they got broken at some point via diff * CAE-1130: Update timeout configuration test and watchdog implementation * Reset MaintenanceAwareConnectionWatchdog.java and log4j2-test.xml to upstream versions * Clean up debug info and outdated comments from timeout tests - Remove large debug block with reflection-based debugging in RelaxedTimeoutConfigurationTest - Simplify excessive debug logging and verbose markers (*** and ===) - Clean up maintenance notification test logging - Improve push notification monitor message formatting - Maintain all test functionality while improving code readability * Refactor: Move inline comments above code and fix string comparisons - Move all inline comments to be above the code they reference in: * RelaxedTimeoutConfigurationTest.java * RedisEnterpriseConfig.java * MaintenanceNotificationTest.java - Replace string != comparisons with .equals() for proper string comparison - Apply code formatting via Maven formatter This improves code readability and follows Java best practices for string comparison. * [Hitless Upgrades] Zero-server-side configuration with client-side opt-in (#3380) * Support for Client-side opt-in A client can tell the server if it wants to receive maintenance push notifications via the following command: CLIENT MAINT_NOTIFICATIONS <ON | OFF> [parameter value parameter value ...] * update maintenance events to latest format - MIGRATING <seq_number> <time> <shard_id-s>: A shard migration is going to start within <time> seconds. - MIGRATED <seq_number> <shard_id-s>: A shard migration ended. - FAILING_OVER <seq_number> <time> <shard_id-s>: A shard failover of a healthy shard started. - FAILED_OVER <seq_number> <shard_id-s>: A shard failover of a healthy shard ended. - MOVING <seq_number> <time> <endpoint>: A specific endpoint is going to move to another node within <time> seconds * clean up * Update FAILED_OVER & MIGRATED to include additional time field * update is private reserver check & add unit tests update is private reserver check * add unit tests for handshake with enabled maintenance events * add missing copyrights/docs * format * address review comments * Revert address after rebind operation expires * Update event's validation spec - MIGRATING <seq_number> <time> <shard_id-s>: A shard migration is going to start within <time> seconds. - MIGRATED <seq_number> <shard_id-s>: A shard migration ended. - FAILING_OVER <seq_number> <time> <shard_id-s>: A shard failover of a healthy shard started. - FAILED_OVER <seq_number> <shard_id-s>: A shard failover of a healthy shard ended. - MOVING <seq_number> <time> <endpoint>: A specific endpoint is going to move to another node within <time> seconds * rebase * format after rebase * Apply suggestions from code review Co-authored-by: Tihomir Krasimirov Mateev <[email protected]> * javadoc updated * Update src/main/java/io/lettuce/core/internal/NetUtils.java Co-authored-by: Tihomir Krasimirov Mateev <[email protected]> --------- Co-authored-by: Tihomir Krasimirov Mateev <[email protected]> * [hitless upgrade] Support for none moving-endpoint-type in maintenance event notifications (CAE-1285) (#3396) * support MOVING with none none indicates that the MOVING message doesn’t need to contain an endpoint. In such a case, the client is expected to schedule a graceful reconnect to its currently configured endpoint after half of the grace period that was communicated by the server is over. * formatting * Apply suggestions from code review Co-authored-by: Copilot <[email protected]> * Fix NPE * Add test to cover null rebindAddress null for rebind adress can be returned as part of MOVING notification if client is connected using 'moving-endpoint-type=none' * Add java docs to RebindAwareAddressSupplier --------- Co-authored-by: Copilot <[email protected]> * [Hitless Upgrades] Enable maintenance events support by default (CAE-1303) (#3415) * set default relaxed timeout to 10s * Enable maintenance events support by default * Enable maintenance events support by default * fix tests - ensure MaintenanceAwareExpiryWriter is registered for events when wathcdog is created - command timeout was not applied * fix tests - sporadic failure because of timeout of 50ms RedisHandshakeUnitTests.handshakeDelayedCredentialProvider:153 » ConditionTimeout - new command introduced during handshake, increase the timeout to 100ms * resolve errors after rebase on main - reset() - removed with issue#3328 - remove deprecated code from issue#907 (#3395) * Remove LettuceMaintenanceEventsDemo.java - no longer needed. --------- Co-authored-by: Igor Malinovskiy <[email protected]> Co-authored-by: ggivo <[email protected]> Co-authored-by: kiryazovi-redis <[email protected]> Co-authored-by: Copilot <[email protected]>

DOC-4758 async JSON doc examples (redis#3335)

371beb0

* DOC-4758 added landing page examples * DOC-4758 added Path page examples * DOC-4758 work around arrAppend issue * DOC-4758 applied formatting

kiryazovi-redis requested review from Copilot, ggivo and tishun July 11, 2025 15:14

kiryazovi-redis self-assigned this Jul 11, 2025

This comment was marked as outdated.

Sign in to view

tishun reviewed Jul 14, 2025

View reviewed changes

.vscode/settings.json Outdated Show resolved Hide resolved

tishun added the type: task A general task label Jul 14, 2025

tishun added this to the Async milestone Jul 14, 2025

tishun reviewed Jul 14, 2025

View reviewed changes

kiryazovi-redis added 5 commits July 22, 2025 14:53

WIP: staging changes before rebase to lettuce-observability

624eb1a

kiryazovi-redis force-pushed the CAE-633-add-functional-tests-notifications branch from 505ca71 to 4d87ba7 Compare July 22, 2025 11:53

kiryazovi-redis requested a review from Copilot July 22, 2025 11:54

This comment was marked as outdated.

Sign in to view

kiryazovi-redis requested a review from Copilot July 22, 2025 15:47

Copilot AI reviewed Jul 22, 2025

View reviewed changes

kiryazovi-redis requested a review from tishun July 22, 2025 15:51

Merge branch 'redis:main' into CAE-633-add-functional-tests-notificat…

8a07c8a

…ions

kiryazovi-redis merged commit 7f3c59b into redis:feature/maintenance-events Jul 28, 2025
2 checks passed

kiryazovi-redis added a commit that referenced this pull request Jul 28, 2025

Cae 633 add functional tests notifications (#3357) - excluding JsonEx…

a133028

…ample.java

ggivo pushed a commit that referenced this pull request Aug 29, 2025

Cae 633 add functional tests notifications (#3357) - excluding JsonEx…

9980399

…ample.java

	mastersToFailover.add(shardId.replace("redis:", ""));
	mastersToFailover.add(RedisEnterpriseConfig.extractNumericShardId(shardId));

	}).repeatWhenEmpty(repeat -> repeat.delayElements(checkInterval)).timeout(timeout)
	}).repeatWhenEmpty(repeat -> repeat.delayElements(checkInterval))
	.takeUntil(Mono.delay(timeout))

Conversation

kiryazovi-redis commented Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CAE-633: Add Functional Tests for Redis Enterprise Push Notifications

Overview

What's Added

Technical Implementation

Critical Bug Fixes Applied

Test Results

Files Changed

Benefits

Testing

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

tishun left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment was marked as outdated.

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Jul 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jul 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jul 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jul 22, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kiryazovi-redis commented Jul 28, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

kiryazovi-redis commented Jul 11, 2025 •

edited

Loading