Redis Enterprise Maintenance Events: Comprehensive Functional Testing by kiryazovi-redis · Pull Request #3461 · redis/lettuce

kiryazovi-redis · 2025-10-01T17:02:00Z

Summary

This PR adds comprehensive functional testing for Redis Enterprise maintenance event handling in Lettuce, including connection handoff, timeout relaxation, and notification protocol validation. The test suite validates proper client behavior during database migrations and failover scenarios.

Key Changes

New Test Infrastructure

MaintenanceNotificationConnectionTest
- Comprehensive connection lifecycle testing during maintenance events
- Tests for all 5 notification types: MOVING, MIGRATING, MIGRATED, FAILING_OVER, FAILED_OVER
- Validates connection rebinding with different endpoint types (EXTERNAL_IP, EXTERNAL_FQDN, NONE)
- Connection handoff verification with dual-connection scenarios
- Traffic resumption testing with async GET/SET operations
- BLPOP timeout unblocking during MOVING events with connection closure
- EventBus monitoring for connection state transitions
ConnectionEventBusMonitoringUtil
- Utility for tracking connection lifecycle events (connected, disconnected, activated, deactivated)
- Supports connection handoff detection and verification
- Provides waiting mechanisms for connection state transitions
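
The waiting mechanism described above could look roughly like the following. This is a minimal plain-Java sketch, not the class added in this PR: `ConnectionEventLog`, its `State` enum, and the "DISCONNECTED followed by CONNECTED" definition of a handoff are all assumptions made for illustration, and a real monitor would feed `record(...)` from the client's EventBus subscription.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for an EventBus-backed connection monitor. */
final class ConnectionEventLog {

    enum State { CONNECTED, DISCONNECTED, ACTIVATED, DEACTIVATED }

    private final List<State> seen = new ArrayList<>();

    /** Records an observed lifecycle event and wakes up any waiting test thread. */
    synchronized void record(State state) {
        seen.add(state);
        notifyAll();
    }

    /** Blocks until the given state has been recorded or the timeout elapses. */
    synchronized boolean awaitState(State expected, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            while (!seen.contains(expected)) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false;
                }
                TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Assumed handoff definition: a disconnect that is later followed by a connect. */
    synchronized boolean handoffObserved() {
        int disconnectedAt = seen.indexOf(State.DISCONNECTED);
        return disconnectedAt >= 0
                && seen.subList(disconnectedAt + 1, seen.size()).contains(State.CONNECTED);
    }
}
```

A test would call `awaitState(...)` with a bounded timeout instead of sleeping, then assert on `handoffObserved()`.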

Enhanced Tests

RelaxedTimeoutConfigurationTest
- Comprehensive 4-phase MOVING timeout test: relaxed during MIGRATING → unrelaxed after MIGRATED → relaxed during MOVING → unrelaxed after MOVING completion
- Failover timeout test: relaxed during FAILING_OVER → unrelaxed after FAILED_OVER
- Optimized test execution time by 50%
- Connection handoff testing integrated
ConnectionTestUtil
- Added support for MaintenanceAwareExpiryWriter delegation
- Improved channel extraction for maintenance-aware connections
RedisEnterpriseConfig
- Removed hardcoded configurations
- Support for 6 nodes and multiple databases
- Dynamic cluster configuration refresh

Removed

MaintenanceNotificationTest - All functionality migrated to MaintenanceNotificationConnectionTest with improved test structure
FaultInjectionClientUnitTest - Consolidated into integration tests

Test Coverage

The test suite validates:

Connection Rebinding - Proper connection handoff to new endpoints during maintenance
Endpoint Type Support - Correct handling of EXTERNAL_IP, EXTERNAL_FQDN, and NONE endpoint configurations
Traffic Continuity - Traffic resumes correctly after MOVING events with async operations
Notification Protocol - Validation of push notification format for all 5 event types
Timeout Relaxation - Dynamic timeout adjustment during maintenance windows
Timeout De-relaxation - Proper timeout restoration after maintenance completion
Connection Lifecycle - EventBus monitoring of connection state transitions
BLPOP Handling - Blocking operations correctly unblocked during connection transitions
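
The relax/unrelax phases listed in this description can be modelled as a small state tracker. The sketch below is illustrative only: `RelaxedTimeoutTracker` and `onMovingCompleted()` are hypothetical names, and treating the relaxed timeout as "base plus an addition" is an assumption suggested by the `RELAXED_TIMEOUT_ADDITION` constant in the test, not a statement about how the client computes it.

```java
import java.time.Duration;

/** Hypothetical model of which notifications relax or restore the command timeout. */
final class RelaxedTimeoutTracker {

    enum Notification { MOVING, MIGRATING, MIGRATED, FAILING_OVER, FAILED_OVER }

    private final Duration baseTimeout;
    private final Duration relaxedAddition;
    private boolean relaxed;

    RelaxedTimeoutTracker(Duration baseTimeout, Duration relaxedAddition) {
        this.baseTimeout = baseTimeout;
        this.relaxedAddition = relaxedAddition;
    }

    void onNotification(Notification notification) {
        switch (notification) {
            case MOVING:
            case MIGRATING:
            case FAILING_OVER:
                relaxed = true;  // maintenance in progress: tolerate slower replies
                break;
            case MIGRATED:
            case FAILED_OVER:
                relaxed = false; // maintenance finished: restore the base timeout
                break;
        }
    }

    /** Assumed hook for "MOVING completion"; how completion is detected is not modelled here. */
    void onMovingCompleted() {
        relaxed = false;
    }

    Duration effectiveTimeout() {
        return relaxed ? baseTimeout.plus(relaxedAddition) : baseTimeout;
    }
}
```

Walking the tracker through MIGRATING, MIGRATED, MOVING, and MOVING completion reproduces the 4-phase sequence the timeout test checks.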

Performance Improvements

Reduced test execution time from about 2 hours to roughly 20 minutes through:
- Removal of unnecessary cleanup operations
- Elimination of hardcoded sleeps in favor of functional await conditions
- Optimized test setup and teardown

Testing

All tests pass successfully with proper logging enabled for debugging purposes.

Copilot

Pull Request Overview

This PR adds comprehensive functional testing for Redis Enterprise maintenance event handling in Lettuce, with a focus on connection lifecycle management, notification validation, and timeout relaxation. The test suite validates proper client behavior during database migrations and failover scenarios.

Key changes:

New comprehensive test class MaintenanceNotificationConnectionTest for testing all 5 maintenance notification types with connection management
Enhanced timeout configuration tests with 4-phase validation and optimized execution
Complete removal of legacy test classes and their consolidation into improved integration tests

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.


File	Description
src/test/resources/log4j2-test.xml	Enhanced logging configuration for maintenance-aware components
src/test/java/io/lettuce/test/ConnectionTestUtil.java	Added MaintenanceAwareExpiryWriter support and connection verification utilities
src/test/java/io/lettuce/scenario/RelaxedTimeoutConfigurationTest.java	Comprehensive 4-phase timeout testing with optimized execution and connection handoff
src/test/java/io/lettuce/scenario/RedisEnterpriseConfig.java	Dynamic cluster configuration with 6-node support and endpoint-aware operations
src/test/java/io/lettuce/scenario/MaintenancePushNotificationMonitor.java	Simplified monitoring without periodic pings and correct notification format parsing
src/test/java/io/lettuce/scenario/MaintenanceNotificationTest.java	Removed - functionality migrated to MaintenanceNotificationConnectionTest
src/test/java/io/lettuce/scenario/MaintenanceNotificationConnectionTest.java	New comprehensive test suite for all maintenance scenarios with connection lifecycle validation
src/test/java/io/lettuce/scenario/FaultInjectionClientUnitTest.java	Removed - consolidated into integration tests
src/test/java/io/lettuce/scenario/FaultInjectionClient.java	Enhanced with endpoint-aware operations and improved status checking
src/test/java/io/lettuce/scenario/ConnectionEventBusMonitoringUtil.java	New utility for monitoring connection state transitions via EventBus

Comments suppressed due to low confidence (1)

src/test/java/io/lettuce/scenario/MaintenanceNotificationConnectionTest.java:1

Using fixed Thread.sleep() delays in tests can lead to flaky behavior and unnecessarily slow test execution. Consider using condition-based waiting or shorter, more appropriate delays for these specific scenarios.

package io.lettuce.scenario;


src/test/java/io/lettuce/scenario/RelaxedTimeoutConfigurationTest.java

Copilot · 2025-10-01T17:05:56Z

src/test/java/io/lettuce/scenario/RelaxedTimeoutConfigurationTest.java

+            log.info("Testing unrelaxed timeouts after MIGRATED notification...");
+            log.info("Waiting for grace period ({}ms) to allow timeout un-relaxation to take effect...",
+                    2 * RELAXED_TIMEOUT_ADDITION.toMillis());
+            Thread.sleep(2 * RELAXED_TIMEOUT_ADDITION.toMillis()); // Wait for grace period + 100ms buffer


Using Thread.sleep() for timing in tests can lead to flaky behavior. Consider using Awaitility or similar polling mechanisms to wait for the actual condition (timeout un-relaxation) rather than relying on fixed sleep durations.
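
For illustration, a condition-based wait can be written in a few lines of plain Java when a library such as Awaitility is not available. This is a sketch under stated assumptions: `ConditionWait.waitUntil` is a hypothetical helper, not something in this PR, and the condition passed in would need to observe whatever signal indicates that the timeout has been un-relaxed.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; returns as soon as the condition holds. */
final class ConditionWait {

    /**
     * Polls the condition until it is true or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
                return false;
            }
        }
    }
}
```

Compared with a fixed sleep, the test proceeds as soon as the condition holds and fails with a clear timeout if it never does.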

Copilot · 2025-10-01T17:05:56Z

src/test/java/io/lettuce/scenario/RelaxedTimeoutConfigurationTest.java

+            log.info("Waiting for grace period ({}ms) to allow timeout un-relaxation to take effect...",
+                    2 * RELAXED_TIMEOUT_ADDITION.toMillis());
+            Thread.sleep(2 * RELAXED_TIMEOUT_ADDITION.toMillis()); // Wait for grace period + 100ms buffer
+            log.info("Grace period expired, starting traffic to verify unrelaxed timeouts...");


Another instance of Thread.sleep() that could make tests flaky. This should be replaced with a condition-based wait mechanism to ensure the timeout un-relaxation has actually taken effect before proceeding with verification.

Suggested change

-            log.info("Waiting for grace period ({}ms) to allow timeout un-relaxation to take effect...",
-                    2 * RELAXED_TIMEOUT_ADDITION.toMillis());
-            Thread.sleep(2 * RELAXED_TIMEOUT_ADDITION.toMillis()); // Wait for grace period + 100ms buffer
-            log.info("Grace period expired, starting traffic to verify unrelaxed timeouts...");
+            log.info("Waiting for timeout un-relaxation to take effect (up to {}ms)...",
+                    2 * RELAXED_TIMEOUT_ADDITION.toMillis());
+            waitForTimeoutUnrelaxed(context, 2 * RELAXED_TIMEOUT_ADDITION.toMillis());
+            log.info("Timeout un-relaxation detected, starting traffic to verify unrelaxed timeouts...");

Copilot · 2025-10-01T17:05:57Z

src/test/java/io/lettuce/scenario/MaintenanceNotificationConnectionTest.java

+                        }
+
+                        // Throttle traffic to ~1000 ops/sec to avoid memory pressure
+                        Thread.sleep(1);


Using Thread.sleep(1) in a tight loop for traffic throttling is inefficient and imprecise. Consider using a proper rate limiter or a more accurate timing mechanism for controlling traffic generation at ~1000 ops/sec.
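
One way to pace a loop more accurately is to schedule each operation against an absolute deadline, so that oversleeping on one iteration does not accumulate across the run. The sketch below is a plain-Java illustration with a hypothetical `PacedLoop` helper; a library rate limiter would be a reasonable alternative.

```java
import java.util.concurrent.locks.LockSupport;

/** Hypothetical pacing helper that targets a fixed number of operations per second. */
final class PacedLoop {

    /**
     * Runs the task count times, spacing the start of each run by 1/opsPerSecond seconds.
     * Returns the elapsed time in nanoseconds; stops early if the thread is interrupted.
     */
    static long run(int count, int opsPerSecond, Runnable task) {
        long intervalNanos = 1_000_000_000L / opsPerSecond;
        long start = System.nanoTime();
        long next = start; // absolute time at which the next operation may begin
        for (int i = 0; i < count; i++) {
            long wait;
            while ((wait = next - System.nanoTime()) > 0) {
                if (Thread.currentThread().isInterrupted()) {
                    return System.nanoTime() - start;
                }
                LockSupport.parkNanos(wait); // may wake early, hence the loop
            }
            task.run();
            next += intervalNanos;
        }
        return System.nanoTime() - start;
    }
}
```

Because `next` advances by a fixed interval rather than "now plus interval", a slow iteration is followed by shorter waits until the loop is back on schedule.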

tishun

LGTM, small notes only

tishun · 2025-10-02T10:16:38Z

src/test/java/io/lettuce/scenario/ConnectionEventBusMonitoringUtil.java

+import io.netty.util.internal.logging.InternalLogger;
+import io.netty.util.internal.logging.InternalLoggerFactory;
+
+public class ConnectionEventBusMonitoringUtil {


Let's avoid the "Util" pattern. Utility classes quickly become swiss-army-knives and violate the single responsibility principle.

Could we just name it ConnectionEventBusMonitor or something similar?

tishun · 2025-10-02T10:17:29Z

src/test/java/io/lettuce/scenario/ConnectionEventBusMonitoringUtil.java

+        } catch (Exception e) {
+            if (event instanceof ConnectedEvent) {
+                return "connected-" + ((ConnectedEvent) event).remoteAddress().toString();
+            } else if (event instanceof DisconnectedEvent) {
+                return "disconnected-" + ((DisconnectedEvent) event).remoteAddress().toString();
+            } else {
+                return event.getClass().getSimpleName() + "-" + System.currentTimeMillis();
+            }
+        }


Instead of if-else and instanceof can we use multiple catch statements?

tishun · 2025-10-02T10:33:48Z

src/test/java/io/lettuce/scenario/MaintenanceNotificationConnectionTest.java

+                            expectedEndpoint, cleanCurrentEndpoint).isTrue();
+                }
+            } else {
+                log.warn("⚠ Could not verify endpoint - currentRemoteAddress: {}, expectedEndpoint: {}", currentRemoteAddress,


Should this fail the test?
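
If the answer is yes, the warning branch could be turned into a hard failure. The following is a minimal sketch, not the test's actual code: `EndpointCheck.requireEndpoint` is a hypothetical helper, and the substring comparison is an assumption, since the real comparison logic is not shown in this hunk.

```java
/** Hypothetical helper that fails instead of logging when the endpoint cannot be verified. */
final class EndpointCheck {

    static void requireEndpoint(String currentRemoteAddress, String expectedEndpoint) {
        if (currentRemoteAddress == null || expectedEndpoint == null) {
            // Previously only logged as a warning; here it fails the test.
            throw new AssertionError("Could not verify endpoint - currentRemoteAddress: "
                    + currentRemoteAddress + ", expectedEndpoint: " + expectedEndpoint);
        }
        // Assumed comparison: the remote address should contain the expected endpoint.
        if (!currentRemoteAddress.contains(expectedEndpoint)) {
            throw new AssertionError("Connection is bound to " + currentRemoteAddress
                    + " but expected " + expectedEndpoint);
        }
    }
}
```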

tishun · 2025-10-02T10:38:23Z

src/test/java/io/lettuce/scenario/MaintenanceNotificationConnectionTest.java

+                log.info("Second connection created and monitoring setup completed");
+
+            } catch (Exception e) {
+                log.error("Failed to create second connection: {}", e.getMessage(), e);


Same question here - should the test fail?

tishun · 2025-10-02T10:52:10Z

src/test/java/io/lettuce/scenario/RedisEnterpriseConfig.java

 /**
 * Configuration holder for dynamically discovered Redis Enterprise cluster information.
 */
 public class RedisEnterpriseConfig {


RedisEnterpriseConfig: a "Config" object is typically a POJO that holds configuration details. In contrast, this class seems to be configuring the RS.
So I'd give the name a different suffix, such as "Configurer" or similar.

kiryazovi-redis · 2025-10-02T14:07:03Z

I will take care of Tisho's comments on this topic in the next PR.
It will likely include more comments from Ivo as well.

…redis#3461)

* feat(CAE-1130): Add comprehensive connection testing for Redis Enterprise maintenance events
  - Add ConnectionTesting class with 9 test scenarios for maintenance handoff behavior
  - Test old connection graceful shutdown during MOVING operations
  - Validate traffic resumption with autoconnect after handoff
  - Verify maintenance notifications only work with RESP3 protocol
  - Test new connection establishment during migration and bind phases
  - Add memory leak validation for multiple concurrent connections
  - Include TLS support testing for maintenance events
  - Replace .supportMaintenanceEvents(true) with MaintenanceEventsOptions.enabled()
  - Add comprehensive monitoring and validation of connection lifecycle
  Tests cover CAE-1130 requirements for Redis Enterprise maintenance event handling including connection draining, autoconnect behavior, and notification delivery.
* Add comprehensive maintenance events tests for CLIENT MAINT_NOTIFICATIONS
  - connectionHandshakeIncludesEnablingNotificationsTest: Verifies all 5 notification types (MOVING, MIGRATING, MIGRATED, FAILING_OVER, FAILED_OVER) are received when maintenance events are enabled
  - disabledDontReceiveNotificationsTest: Verifies no notifications received when maintenance events are disabled
  - clientHandshakeWithEndpointTypeTest: Tests CLIENT MAINT_NOTIFICATIONS with 'none' endpoint type (nil IP scenario)
  - clientMaintenanceNotificationInfoTest: Verifies CLIENT MAINT_NOTIFICATIONS configuration with moving-endpoint-type
  Based on CLIENT MAINT_NOTIFICATIONS implementation from commit bd408cf
* Update Redis Enterprise maintenance event notification protocol
  - Update push notification patterns to include sequence numbers (4-element format)
  - Fix MOVING notification parsing to handle new address format with sequence and time
  - Update MIGRATING, MIGRATED, FAILING_OVER, and FAILED_OVER patterns with sequence numbers
  - Improve FaultInjectionClient status handling: change from 'pending' to 'running' checks
  - Enhance JSON response parsing with better output field handling and debugging
  - Remove deprecated maintenance sequence functionality and associated unit test
  - Add test phase isolation to prevent cleanup notification interference
  - Extend monitoring timeout from 2 to 5 minutes for longer maintenance operations
  - Add @AfterEach cleanup to restore cluster state between tests
  - Remove hardcoded optimal node selection logic in RedisEnterpriseConfig
  This aligns with the updated Redis Enterprise maintenance events specification and improves test reliability by handling the new notification protocol format.
* Fix moving tests for timeout de-relaxation after moving
* fix notification capture logic and several tests.
* fix up resp2 test, and add proper test for None, will rebase to master
* Fix None test
* Fix several tests related to handling. 5 tests left to fix up.
* fix up new connection test and connection leak tests
* fix up traffic test and remove un-needed code.
* fix more tests, remove more un-needed code
* revert log changes
* revert the re-throw change, to be discussed
* remove resp3 test after offline discussion
* change endpoint name
* temporarely reduce number of tests
* add more tests
* reduce test execution time by 50%
* remove hardcoded target config and enable working with 6 nodes and multiple dbs
* fix up relaxedtimeoutconfig to use newest functions and add connection handoff test
* add 1 more handoff test, add more logging, fix some issues that were raised during review
* fix some bugs and remove the un-needed clean-up of testing, to speed up tests by 50%
* Merge pull request redis#1 from kiryazovi-redis/CI-fix-functional-handoff-and-connection-testing-of-maint-events-2 Ci fix functional handoff and connection testing of maint events 2
* optimise relaxed timeoutest, fix compilation issues after renaming done by Ivo
* remove MaintenanceNotificationTest, as all functionality is covered now, also implement more fixes and improvements
* renamed memoryleak infra, refactored several tests, implemented more fixes from review, started removin comments
* remove any useless sleeps, refactor several tests again to be more functionally correct, enable logging for testing

kiryazovi-redis added 27 commits September 29, 2025 21:15

Fix moving tests for timeout de-relaxation after moving

8c1d42c

fix notification capture logic and several tests.

8407a37

fix up resp2 test, and add proper test for None, will rebase to master

c83b3ee

Fix None test

33da573

Fix several tests related to handling. 5 tests left to fix up.

10427a6

fix up new connection test and connection leak tests

9efaef8

fix up traffic test and remove un-needed code.

b89e024

fix more tests, remove more un-needed code

d0e80cd

revert log changes

2dceb16

revert the re-throw change, to be discussed

cff4604

remove resp3 test after offline discussion

0847f8b

change endpoint name

15957c4

temporarely reduce number of tests

5aea378

add more tests

6e6972b

reduce test execution time by 50%

fb06019

remove hardcoded target config and enable working with 6 nodes and multiple dbs

a5dd659

fix up relaxedtimeoutconfig to use newest functions and add connection handoff test

94ecd1c

add 1 more handoff test, add more logging, fix some issues that were raised during review

9686f24

fix some bugs and remove the un-needed clean-up of testing, to speed up tests by 50%

5f52dff

Merge pull request #1 from kiryazovi-redis/CI-fix-functional-handoff-and-connection-testing-of-maint-events-2 Ci fix functional handoff and connection testing of maint events 2

e960dc1

optimise relaxed timeoutest, fix compilation issues after renaming done by Ivo

f6c227f

remove MaintenanceNotificationTest, as all functionality is covered now, also implement more fixes and improvements

7ed1b6b

renamed memoryleak infra, refactored several tests, implemented more fixes from review, started removin comments

57a35ab

remove any useless sleeps, refactor several tests again to be more functionally correct, enable logging for testing

3fe9c3f

kiryazovi-redis requested review from Copilot and removed request for Copilot October 1, 2025 17:04

kiryazovi-redis requested review from Copilot, ggivo and tishun October 1, 2025 17:04

Copilot AI reviewed Oct 1, 2025

tishun approved these changes Oct 2, 2025

kiryazovi-redis merged commit e29bff2 into redis:main Oct 2, 2025
11 checks passed

ggivo added the type: task A general task label Oct 23, 2025

4 participants