This is a spin-off of #35290 to address the error "Unable to complete atomic operation, key modified", which causes tasks to fail when they are scheduled on the node.
Oct 26 09:43:13 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:13.534861035-07:00" level=error msg="fatal task error" error="Unable to complete atomic operation, key modified" module=node/agent/taskmanager node.id=mv4e72vawng0s82vvnb0tatc7 service.id=m9484e8pmri988v4n3bossl0x task.id=c1qap6z4nr50k9cacus63d8is
Thanks to @abhi, we reproduced the issue with debug logging enabled. The root cause seems to have been introduced via #34674.
Looking at the attached logs:
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806458067-07:00" level=debug msg="releasing IPv4 pools from network orange (pz8tw8vw40ogxi01c3pcthldi)"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806483219-07:00" level=debug msg="ReleaseAddress(LocalDefault/10.0.4.0/24, 10.0.4.1)"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806504836-07:00" level=debug msg="ReleasePool(LocalDefault/10.0.4.0/24)"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806643570-07:00" level=debug msg="moby-demo-ubuntu-1(48d7c3d0f4a7): leaving network pz8tw8vw40ogxi01c3pcthldi"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806657694-07:00" level=debug msg="cleanupServiceDiscovery for network:pz8tw8vw40ogxi01c3pcthldi"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806665552-07:00" level=debug msg="cleanupServiceBindings for pz8tw8vw40ogxi01c3pcthldi"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806890107-07:00" level=warning msg="driver error deleting endpoint orange-endpoint : network id \"pz8tw8vw40ogxi01c3pcthldi\" not found"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806904850-07:00" level=debug msg="Releasing addresses for endpoint orange-endpoint's interface on network orange"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806915275-07:00" level=debug msg="ReleaseAddress(LocalDefault/10.0.4.0/24, 10.0.4.16)"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806935796-07:00" level=warning msg="Failed to release ip address 10.0.4.16 on delete of endpoint orange-endpoint (d68ba4cb8deea9ed20c05c0abbd19c4db3250824dbc9133afc5bf2977f8e4765): cannot find address pool for poolID:LocalDefault/10.0.4.0/24"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806986817-07:00" level=error msg="network orange remove failed: unknown network orange id pz8tw8vw40ogxi01c3pcthldi" module=node/agent node.id=mv4e72vawng0s82vvnb0tatc7
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.807004206-07:00" level=error msg="remove task failed" error="unknown network orange id pz8tw8vw40ogxi01c3pcthldi" module=node/agent node.id=mv4e72vawng0s82vvnb0tatc7 task.id=d675fekf2hc5vxx188d5c0r5e
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.807146117-07:00" level=warning msg="driver error deleting endpoint orange-endpoint : network id \"pz8tw8vw40ogxi01c3pcthldi\" not found"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.807160203-07:00" level=debug msg="Releasing addresses for endpoint orange-endpoint's interface on network orange"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.807169786-07:00" level=debug msg="ReleaseAddress(LocalDefault/10.0.4.0/24, 10.0.4.16)"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.807182810-07:00" level=warning msg="Failed to release ip address 10.0.4.16 on delete of endpoint orange-endpoint (d68ba4cb8deea9ed20c05c0abbd19c4db3250824dbc9133afc5bf2977f8e4765): cannot find address pool for poolID:LocalDefault/10.0.4.0/24"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.809217709-07:00" level=error msg="network orange remove failed: unknown network orange id pz8tw8vw40ogxi01c3pcthldi" module=node/agent node.id=mv4e72vawng0s82vvnb0tatc7
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.809233253-07:00" level=error msg="remove task failed" error="unknown network orange id pz8tw8vw40ogxi01c3pcthldi" module=node/agent node.id=mv4e72vawng0s82vvnb0tatc7 task.id=qofrrwn7nf917tq9s93es6s8h
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.809383599-07:00" level=warning msg="driver error deleting endpoint orange-endpoint : network id \"pz8tw8vw40ogxi01c3pcthldi\" not found"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.809398355-07:00" level=debug msg="Releasing addresses for endpoint orange-endpoint's interface on network orange"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.809408880-07:00" level=debug msg="ReleaseAddress(LocalDefault/10.0.4.0/24, 10.0.4.16)"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.809421881-07:00" level=warning msg="Failed to release ip address 10.0.4.16 on delete of endpoint orange-endpoint (d68ba4cb8deea9ed20c05c0abbd19c4db3250824dbc9133afc5bf2977f8e4765): cannot find address pool for poolID:LocalDefault/10.0.4.0/24"
It seems that even after the network is cleaned up on a worker node, cleanup of the load-balancer endpoint and sandbox for that network is retried continuously, and every attempt fails.
As a result, the endpoint_cnt key is created in the store during the endpoint cleanup phase. When the network is created again, creation fails because the key already exists.
We could hack endpoint_cnt to prevent this failure, but I think there is a fundamental issue with the way the per-network, per-node LB endpoint and sandbox are managed. We must resolve this as soon as possible.
Should we back out #34674 until the issue is properly addressed?
cc @pradipd @vieux @fcrisciani @abhi
daemon.log