Reproduction Procedure
- prepare a 3-node etcd cluster and the matching etcdctl
/go/src/github.com/coreos/etcd # etcd --version
etcd Version: 3.4.9
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.14.2
Go OS/Arch: linux/amd64
Although I don't think the configuration matters for this issue, here is the config of one node:
name: etcd2
data-dir: /data/zz_6129484611666145821
listen-peer-urls: http://0.0.0.0:23800
listen-client-urls: http://0.0.0.0:23790
initial-advertise-peer-urls: http://10.213.20.39:23800
advertise-client-urls: http://10.213.20.39:23790
initial-cluster: etcd0=http://10.213.20.40:23800,etcd1=http://10.213.20.38:23800,etcd2=http://10.213.20.39:23800
initial-cluster-token: zz
initial-cluster-state: new
auto-compaction-retention: "1"
quota-backend-bytes: -1
The etcd processes are started simply by
etcd --config-file /etc/etcd.conf
with no environment variables and no extra command-line arguments.
- enable auth
etcdctl user add root
# then type root password, mine is 'root'
etcdctl --user root:root auth enable
- watch
etcdctl --user root:root watch / --prefix
Put a key to confirm the watch is working at this point.
- wait for the token to be deleted
Wait 5 minutes until the token is deleted by simpleTokenKeeper, then kill the etcd processes one by one, restarting each immediately after killing it.
Note: do NOT kill the next process until the cluster has recovered to healthy.
- watch fails
You will now see that the watch is down, with the output permission deny.
Analysis
The issue is caused by simpleTokenKeeper. Here is the timeline:
- etcdctl dials gRPC and fetches an auth token, let's say TOKEN-A
- etcdctl dials gRPC with TOKEN-A and runs watch / --prefix as expected
- after 5 min, simpleTokenKeeper deletes TOKEN-A
- the watch keeps working even though TOKEN-A has been deleted, because the token is only checked upon gRPC invocation
- killing the etcd process terminates the connection, and the etcdv3 client re-invokes gRPC Watch with the same token TOKEN-A
- authStore.AuthInfoFromCtx returns ErrInvalidAuthToken because TOKEN-A no longer exists
Impact
The experiment was conducted on v3.4.9. The good part of this version is that the client raises a permission deny error and terminates the watch.
However, in our live cluster the etcd server is v3.4.3 and the etcdv3 client is v3.3.8, and there is no error, no log, no output, no termination: everything looks fine, but the watch has failed in silence. This is bad.
Sometimes we can hardly control the client version; for example, calico-felix v3.4 is built against clientv3 v3.3.8, and upgrading the Calico version on a live cluster is delicate.
Improvement
In my opinion there are two ways to improve:
- we improve the watch keepalive mechanism on the server side. Correct me if I'm wrong, but we already have watch control messages that are sent periodically from server to client; if the watch client could answer with a keepalive response or something similar, the server could invoke simpleTokenKeeper.resetSimpleToken to renew the token's TTL
- we improve the client side. The etcdv3 Watcher is merely an interface that returns <-chan WatchResponse; if we could encapsulate re-fetching the token and re-dialing the gRPC connection after receiving a transport is closing WatchResponse, Watcher consumers outside the interface would not be affected
Related issues
I presume the following issues describe exactly the same problem:
#11121
#11381
Looking forward to your kind feedback.