I set sessionTimeoutMs to 1 day (86400000 ms), but the effective value is 500654 ms.
Testing Details
server conf:
tickTime=2000
initLimit=10
syncLimit=5
minSessionTimeout=7200000
maxSessionTimeout=86400000
curator client conf:
CuratorFrameworkFactory.builder()
.connectString(zkQuorum)
.sessionTimeoutMs(86400000)
.connectionTimeoutMs(15000)
.simulatedSessionExpirationPercent(100)
.retryPolicy(new ExponentialBackoffRetry(5000, 24))
.namespace("xxx")
.aclProvider(aclProvider);
There are 3 ZooKeeper servers. Kill 2 of them to simulate a long-term ZooKeeper outage.
The Curator client enters the SUSPENDED state once the leader is unavailable. It is expected to enter the LOST state after 1 day, but it actually enters LOST after about 8 minutes.
Related logs:
2025-02-21 18:55:12,181 [main-EventThread] DEBUG org.apache.flink.shaded.curator5.org.apache.curator.ConnectionState - Negotiated session timeout: 86400000
2025-02-21 19:03:33,443 [Curator-ConnectionStateManager-0] WARN org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager - Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 500654. Adjusted session timeout ms: 500654
Root cause
The expression (useSessionTimeoutMs * sessionExpirationPercent) overflows a 32-bit int, so the adjusted session timeout is far smaller than configured and CuratorZookeeperClient is reset unexpectedly:
int useSessionTimeoutMs = getUseSessionTimeoutMs();

private int getUseSessionTimeoutMs() {
    int lastNegotiatedSessionTimeoutMs = client.getZookeeperClient().getLastNegotiatedSessionTimeoutMs();
    int useSessionTimeoutMs =
            (lastNegotiatedSessionTimeoutMs > 0) ? lastNegotiatedSessionTimeoutMs : sessionTimeoutMs;
    useSessionTimeoutMs = sessionExpirationPercent > 0 && startOfSuspendedEpoch != 0
            ? (useSessionTimeoutMs * sessionExpirationPercent) / 100
            : useSessionTimeoutMs;
    return useSessionTimeoutMs;
}
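The arithmetic can be reproduced in isolation. With a negotiated timeout of 86400000 ms and a percent of 100, the product is 8640000000, which wraps to 50065408 in 32-bit int arithmetic; dividing by 100 gives 500654, the value seen in the log. With percent 100, any timeout above 21474836 ms (just under 6 hours) would overflow the same way. The sketch below is a standalone reproduction, not Curator code: the class and method names are hypothetical, and the long-based variant is only one possible fix, not necessarily the patch the project would adopt.

```java
public class SessionTimeoutOverflowDemo {
    // Mirrors the expression from the report: the product is evaluated
    // in 32-bit int arithmetic and can wrap around before the division.
    static int adjustedWithIntMath(int sessionTimeoutMs, int percent) {
        return (sessionTimeoutMs * percent) / 100;
    }

    // Possible fix (an assumption, not the project's actual patch):
    // widen to long before multiplying, then narrow the result.
    static int adjustedWithLongMath(int sessionTimeoutMs, int percent) {
        return (int) ((long) sessionTimeoutMs * percent / 100);
    }

    public static void main(String[] args) {
        int oneDayMs = 86_400_000;
        System.out.println(adjustedWithIntMath(oneDayMs, 100));  // 500654
        System.out.println(adjustedWithLongMath(oneDayMs, 100)); // 86400000
    }
}
```

Since percent is at most 100 in the configuration above, the widened result never exceeds the original int timeout, so the narrowing cast back to int is safe in that case.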
Source references (commit f0646f9):
curator/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateManager.java, line 284
curator/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateManager.java, lines 320 to 328