CuratorZookeeperClient was reset unexpectedly #1248

@xingsuo-zbz

Description

I set sessionTimeoutMs to 1 day (86400000 ms), but the effective session timeout was only 500654 ms (about 8 minutes).

Testing Details

server conf:

tickTime=2000
initLimit=10
syncLimit=5
minSessionTimeout=7200000
maxSessionTimeout=86400000

curator client conf:

CuratorFrameworkFactory.builder()
    .connectString(zkQuorum)
    .sessionTimeoutMs(86400000)
    .connectionTimeoutMs(15000)
    .simulatedSessionExpirationPercent(100)
    .retryPolicy(new ExponentialBackoffRetry(5000, 24))
    .namespace("xxx")
    .aclProvider(aclProvider);

The ensemble has 3 ZooKeeper servers. Kill 2 of them to simulate a long-term ZooKeeper outage.

The Curator client enters the SUSPENDED state once the leader becomes unavailable. It is expected to enter the LOST state after 1 day, but it actually enters LOST after about 8 minutes.

Related logs:
2025-02-21 18:55:12,181 [main-EventThread] DEBUG org.apache.flink.shaded.curator5.org.apache.curator.ConnectionState - Negotiated session timeout: 86400000
2025-02-21 19:03:33,443 [Curator-ConnectionStateManager-0] WARN org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager - Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 500654. Adjusted session timeout ms: 500654

Root cause

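The int multiplication (useSessionTimeoutMs * sessionExpirationPercent) in getUseSessionTimeoutMs() overflows. With the values above, 86400000 * 100 = 8640000000 exceeds Integer.MAX_VALUE (2147483647) and wraps around to 50065408, which divided by 100 gives 500654, exactly the "Adjusted session timeout ms" in the log. As a result, a session expiration is injected early and the CuratorZookeeperClient is reset unexpectedly: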

private int getUseSessionTimeoutMs() {
    int lastNegotiatedSessionTimeoutMs = client.getZookeeperClient().getLastNegotiatedSessionTimeoutMs();
    int useSessionTimeoutMs =
            (lastNegotiatedSessionTimeoutMs > 0) ? lastNegotiatedSessionTimeoutMs : sessionTimeoutMs;
    useSessionTimeoutMs = sessionExpirationPercent > 0 && startOfSuspendedEpoch != 0
            ? (useSessionTimeoutMs * sessionExpirationPercent) / 100
            : useSessionTimeoutMs;
    return useSessionTimeoutMs;
}
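
The arithmetic can be reproduced outside Curator. The following is a minimal standalone sketch, not the project's code or its actual patch: the class name SessionTimeoutOverflowDemo and both method names are hypothetical. It repeats the int expression from the snippet above and compares it with one possible fix, which is to widen to long before multiplying.

```java
// Hypothetical demo class; it does not exist in Curator.
class SessionTimeoutOverflowDemo {

    // Same int arithmetic as in getUseSessionTimeoutMs(): the product is
    // computed in 32 bits and wraps around before the division happens.
    public static int overflowing(int sessionTimeoutMs, int percent) {
        return (sessionTimeoutMs * percent) / 100;
    }

    // One possible fix (an assumption, not the project's patch): do the
    // multiplication in 64 bits. For percent <= 100 the result is never
    // larger than sessionTimeoutMs, so the cast back to int is safe.
    public static int widened(int sessionTimeoutMs, int percent) {
        return (int) (((long) sessionTimeoutMs * percent) / 100);
    }

    public static void main(String[] args) {
        int oneDayMs = 86_400_000;
        System.out.println(overflowing(oneDayMs, 100)); // 500654
        System.out.println(widened(oneDayMs, 100));     // 86400000
        // Largest timeout that survives the int product at 100 percent.
        System.out.println(Integer.MAX_VALUE / 100);    // 21474836
    }
}
```

With simulatedSessionExpirationPercent(100), any negotiated session timeout above 21474836 ms (just under 6 hours) would hit this overflow, so the 2-hour minSessionTimeout in the server conf alone would not trigger it, while the 1-day value does.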
