Description
Our overlay network intermittently leaves specific machines unreachable: containers on the affected nodes cannot reach containers on other nodes, and vice versa. This happens much more frequently on larger clusters, but we've seen it on smaller ones as well.
We are using Consul as our KV store.
Steps to reproduce the issue:
- Create a large overlay network, spanning many nodes. I can trigger this ~80% of the time with 25 nodes.
- Launch containers on multiple nodes, which need to communicate. In our case, mongodb, kafka, elasticsearch, and others.
- Watch them fail.
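For reference, a minimal sketch of the setup the steps above describe (the Consul address, network name, and images are illustrative, not our exact configuration; the subnet matches the 10.50.48.0/22 range seen in the logs below):

```shell
# Each daemon is started against the Consul KV store, roughly:
#   dockerd --cluster-store=consul://consul.example.com:8500 \
#           --cluster-advertise=eth0:2376

# Create one overlay network spanning all nodes (run once, from any node):
docker network create -d overlay --subnet 10.50.48.0/22 my-overlay

# Launch containers on different nodes, attached to that network:
docker run -d --name mongodb-configsvr-host1  --net my-overlay mongo:3.4   # on node 1
docker run -d --name mongodb-configsvr-host22 --net my-overlay mongo:3.4   # on node 22

# Cross-node connectivity check; this is what fails intermittently:
docker exec mongodb-configsvr-host1 ping -c 3 mongodb-configsvr-host22
```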
Describe the results you received:
Containers on some nodes are unable to connect to containers on other nodes, although they can still connect to containers on the same node.
[exabeam@ip-10-10-2-39 ~]$ sudo docker exec -it mongodb-configsvr-host1 bash
(docker) [root@mongodb-configsvr-host1 /]# ping mongodb-configsvr-host22
PING mongodb-configsvr-host22 (10.50.48.23) 56(84) bytes of data.
From mongodb-configsvr-host1 (10.50.48.32) icmp_seq=1 Destination Host Unreachable
From mongodb-configsvr-host1 (10.50.48.32) icmp_seq=2 Destination Host Unreachable
From mongodb-configsvr-host1 (10.50.48.32) icmp_seq=3 Destination Host Unreachable
From mongodb-configsvr-host1 (10.50.48.32) icmp_seq=4 Destination Host Unreachable
^C
--- mongodb-configsvr-host22 ping statistics ---
5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4006ms
Journal logs:
-- Logs begin at Fri 2017-12-15 00:32:45 UTC, end at Fri 2017-12-15 01:00:37 UTC. --
Dec 15 00:40:12 ip-10-10-2-39 systemd[1]: Starting Docker Application Container Engine...
-- Subject: Unit docker.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has begun starting up.
Dec 15 00:40:12 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:12.125750601Z" level=info msg="libcontainerd: new containerd process, pid: 15804"
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.168203097Z" level=warning msg="failed to rename /var/lib/docker/tmp for background deletion: rename /var/lib/docker/tmp /var/lib/docker/tmp-old: no such file or directory. Deleting synchronously"
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.356143183Z" level=warning msg="devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man dockerd` to refer to dm.thinpooldev section."
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.415855355Z" level=info msg="devmapper: Creating filesystem xfs on device docker-202:1-352328870-base"
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.551678692Z" level=info msg="devmapper: Successfully created filesystem xfs on device docker-202:1-352328870-base"
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.674373157Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.674413916Z" level=info msg="Initializing discovery with TLS"
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.722352220Z" level=info msg="Loading containers: start."
Dec 15 00:40:13 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:13.722438367Z" level=info msg="Option Initializing KV with TLS"
Dec 15 00:40:14 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:14.992434302Z" level=info msg="2017/12/15 00:40:14 [INFO] serf: EventMemberJoin: ip-10-10-2-39 10.10.2.39
Dec 15 00:40:14 ip-10-10-2-39 dockerd[15797]: "
Dec 15 00:40:14 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:14.999344260Z" level=error msg="joining serf neighbor 10.10.2.39 failed: Failed to join the cluster at neigh IP 10.10.2.127: 1 error(s) occurred:
Dec 15 00:40:14 ip-10-10-2-39 dockerd[15797]: * Failed to join 10.10.2.127: dial tcp 10.10.2.127:7946: getsockopt: connection refused"
Dec 15 00:40:15 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:15.000971589Z" level=info msg="2017/12/15 00:40:15 [INFO] serf: EventMemberJoin: ip-10-10-2-77 10.10.2.77
Dec 15 00:40:15 ip-10-10-2-39 dockerd[15797]: "
Dec 15 00:40:15 ip-10-10-2-39 dockerd[15797]: time="2017-12-15T00:40:15.001047174Z" level=info msg="2017/12/15 00:40:15 [INFO] serf: EventMemberJoin: ip-10-10-2-122 10.10.2.122
Dec 15 00:40:15 ip-10-10-2-39 dockerd[15797]: "
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.351785574Z" level=info msg="Loading containers: done."
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.351834341Z" level=info msg="2017/12/15 00:40:15 [INFO] serf: EventMemberJoin: ip-10-10-2-186 10.10.2.186
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: "
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.364524778Z" level=info msg="Docker daemon" commit=19e2cf6 graphdriver(s)=devicemapper version=17.09.1-ce
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.364643045Z" level=info msg="Daemon has completed initialization"
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.372985688Z" level=info msg="2017/12/15 00:40:15 [INFO] serf: EventMemberJoin: ip-10-10-2-127 10.10.2.127
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: "
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.374887725Z" level=info msg="API listen on [::]:2376"
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal systemd[1]: Started Docker Application Container Engine.
-- Subject: Unit docker.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has finished starting up.
--
-- The start-up result is done.
Dec 15 00:40:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:40:15.374958646Z" level=info msg="API listen on /var/run/docker.sock"
Dec 15 00:43:40 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:40.581427400Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:41 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:41.581709777Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:42 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:42.582011510Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:43 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:43.582278440Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:44 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:44.582540380Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:45 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:45.582809896Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:46 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:46.584500194Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:47 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:47.586258805Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:43:47 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:47.586875346Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.26 02:42:0a:32:30:1a cardinality:2 db state:Set{{948b9200505f4f702d5c745b161c941537ff16d8735cced81ded356091209a1d 10.10.2.39 22 32 true}, {dummy 10.10.2.39 22 32 false}}"
Dec 15 00:43:48 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:43:48.588299858Z" level=error msg="could not resolve peer \"10.50.48.12\": timed out resolving peer by querying the cluster"
Dec 15 00:44:42 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:42.228925625Z" level=error msg="could not resolve peer \"10.50.48.9\": timed out resolving peer by querying the cluster"
Dec 15 00:44:43 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:43.235079630Z" level=error msg="could not resolve peer \"10.50.48.9\": timed out resolving peer by querying the cluster"
Dec 15 00:44:44 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:44.235249108Z" level=error msg="could not resolve peer \"10.50.48.9\": timed out resolving peer by querying the cluster"
Dec 15 00:44:45 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:45.235502623Z" level=error msg="could not resolve peer \"10.50.48.9\": timed out resolving peer by querying the cluster"
Dec 15 00:44:46 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:46.236353923Z" level=error msg="could not resolve peer \"10.50.48.9\": timed out resolving peer by querying the cluster"
Dec 15 00:44:46 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:46.237015773Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.51 02:42:0a:32:30:33 cardinality:2 db state:Set{{a83296771927045209a3f4986b8d54819f861b70ae9357f813bbd6beffb4a350 10.10.2.39 22 32 true}, {dummy 10.10.2.39 22 32 false}}"
Dec 15 00:44:47 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:44:47.238280566Z" level=error msg="could not resolve peer \"10.50.48.9\": timed out resolving peer by querying the cluster"
Dec 15 00:45:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:21.813035003Z" level=warning msg="peerDbDelete transient condition - Key:10.50.48.51 02:42:0a:32:30:33 cardinality:1 db state:Set{{dummy 10.10.2.39 22 32 false}}"
Dec 15 00:45:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:21.813113462Z" level=warning msg="Neighbor entry already present for IP 10.50.48.51, mac 02:42:0a:32:30:33 neighbor:&{dstIP:[10 50 48 51] dstMac:[2 66 10 50 48 51] linkName:vx-000100-dcf48 linkDst:vxlan0 family:0} forceUpdate:false"
Dec 15 00:45:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:21.813139381Z" level=warning msg="Peer operation failed:could not add neighbor entry for nid:dcf48923f107727f755869f05ab10ba24a56444737f981c898ec14b2032ea952 eid:dummy into the sandbox:Search neighbor failed for IP 10.50.48.51, mac 02:42:0a:32:30:33, present in db:true op:&{2 dcf48923f107727f755869f05ab10ba24a56444737f981c898ec14b2032ea952 a83296771927045209a3f4986b8d54819f861b70ae9357f813bbd6beffb4a350 [10 50 48 51] [255 255 252 0] [2 66 10 50 48 51] [0 0 0 0 0 0 0 0 0 0 255 255 10 10 2 39] false false true Leave}"
Dec 15 00:45:24 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:24.555935217Z" level=warning msg="peerDbDelete transient condition - Key:10.50.48.26 02:42:0a:32:30:1a cardinality:1 db state:Set{{dummy 10.10.2.39 22 32 false}}"
Dec 15 00:45:29 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:29.061280932Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.26 02:42:0a:32:30:1a cardinality:2 db state:Set{{8384c22dd90f28304cbff4b50d59a22ea1e89cef573e3f68328d9238b35b3362 10.10.2.186 22 32 false}, {dummy 10.10.2.39 22 32 false}}"
Dec 15 00:45:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:32.281175806Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.51 02:42:0a:32:30:33 cardinality:2 db state:Set{{16035419d19fdb10db6da6d90924f0c630ce279cc0fd7aa9003b3ce7df30e09a 10.10.2.122 22 32 false}, {dummy 10.10.2.39 22 32 false}}"
Dec 15 00:45:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:32.281235119Z" level=warning msg="Neighbor entry already present for IP 10.50.48.51, mac 02:42:0a:32:30:33 neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 50 48 51] dstMac:[2 66 10 50 48 51] linkName:vx-000100-dcf48 linkDst:vxlan0 family:0} forceUpdate:false"
Dec 15 00:45:36 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:36.037319650Z" level=error msg="could not resolve peer \"10.50.48.5\": timed out resolving peer by querying the cluster"
Dec 15 00:45:37 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:37.037589884Z" level=error msg="could not resolve peer \"10.50.48.13\": timed out resolving peer by querying the cluster"
Dec 15 00:45:38 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:38.037839646Z" level=error msg="could not resolve peer \"10.50.48.5\": timed out resolving peer by querying the cluster"
Dec 15 00:45:39 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:39.038093412Z" level=error msg="could not resolve peer \"10.50.48.13\": timed out resolving peer by querying the cluster"
Dec 15 00:45:40 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:40.038438277Z" level=error msg="could not resolve peer \"10.50.48.13\": timed out resolving peer by querying the cluster"
Dec 15 00:45:41 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:41.038748530Z" level=error msg="could not resolve peer \"10.50.48.5\": timed out resolving peer by querying the cluster"
Dec 15 00:45:42 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:42.039029344Z" level=error msg="could not resolve peer \"10.50.48.5\": timed out resolving peer by querying the cluster"
Dec 15 00:45:43 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:43.040062307Z" level=error msg="could not resolve peer \"10.50.48.13\": timed out resolving peer by querying the cluster"
Dec 15 00:45:44 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:44.040738594Z" level=error msg="could not resolve peer \"10.50.48.14\": timed out resolving peer by querying the cluster"
Dec 15 00:45:45 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:45.041500982Z" level=error msg="could not resolve peer \"10.50.48.20\": timed out resolving peer by querying the cluster"
Dec 15 00:45:46 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:46.041736518Z" level=error msg="could not resolve peer \"10.50.48.14\": timed out resolving peer by querying the cluster"
Dec 15 00:45:47 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:47.041944529Z" level=error msg="could not resolve peer \"10.50.48.5\": timed out resolving peer by querying the cluster"
Dec 15 00:45:48 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:48.042211415Z" level=error msg="could not resolve peer \"10.50.48.13\": timed out resolving peer by querying the cluster"
Dec 15 00:45:49 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:49.042602371Z" level=error msg="could not resolve peer \"10.50.48.20\": timed out resolving peer by querying the cluster"
Dec 15 00:45:50 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:50.042877823Z" level=error msg="could not resolve peer \"10.50.48.14\": timed out resolving peer by querying the cluster"
Dec 15 00:45:51 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:51.043060362Z" level=error msg="could not resolve peer \"10.50.48.5\": timed out resolving peer by querying the cluster"
Dec 15 00:45:52 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:52.043401234Z" level=error msg="could not resolve peer \"10.50.48.13\": timed out resolving peer by querying the cluster"
Dec 15 00:45:53 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:53.043678780Z" level=error msg="could not resolve peer \"10.50.48.20\": timed out resolving peer by querying the cluster"
Dec 15 00:45:54 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:54.044024680Z" level=error msg="could not resolve peer \"10.50.48.14\": timed out resolving peer by querying the cluster"
Dec 15 00:45:55 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:55.044331507Z" level=error msg="could not resolve peer \"10.50.48.20\": timed out resolving peer by querying the cluster"
Dec 15 00:45:56 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:56.044669511Z" level=error msg="could not resolve peer \"10.50.48.25\": timed out resolving peer by querying the cluster"
Dec 15 00:45:57 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:57.044938297Z" level=error msg="could not resolve peer \"10.50.48.24\": timed out resolving peer by querying the cluster"
Dec 15 00:45:58 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:58.045236075Z" level=error msg="could not resolve peer \"10.50.48.14\": timed out resolving peer by querying the cluster"
Dec 15 00:45:59 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:45:59.045511083Z" level=error msg="could not resolve peer \"10.50.48.20\": timed out resolving peer by querying the cluster"
Dec 15 00:46:00 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:00.045775091Z" level=error msg="could not resolve peer \"10.50.48.25\": timed out resolving peer by querying the cluster"
Dec 15 00:46:01 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:01.046082853Z" level=error msg="could not resolve peer \"10.50.48.24\": timed out resolving peer by querying the cluster"
Dec 15 00:46:02 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:02.046398348Z" level=error msg="could not resolve peer \"10.50.48.14\": timed out resolving peer by querying the cluster"
Dec 15 00:46:03 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:03.046680381Z" level=error msg="could not resolve peer \"10.50.48.20\": timed out resolving peer by querying the cluster"
Dec 15 00:46:04 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:04.046902046Z" level=error msg="could not resolve peer \"10.50.48.25\": timed out resolving peer by querying the cluster"
Dec 15 00:46:05 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:05.047049988Z" level=error msg="could not resolve peer \"10.50.48.24\": timed out resolving peer by querying the cluster"
Dec 15 00:46:05 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:05.047751298Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.32 02:42:0a:32:30:20 cardinality:2 db state:Set{{6e13374231d5792b3b16d8c551c9567b37b8e61744feb5f495fba7b66b368b6e 10.10.2.39 22 32 true}, {dummy 10.10.2.39 22 32 false}}"
Dec 15 00:46:06 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:06.047943322Z" level=error msg="could not resolve peer \"10.50.48.25\": timed out resolving peer by querying the cluster"
Dec 15 00:46:07 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:07.048273526Z" level=error msg="could not resolve peer \"10.50.48.24\": timed out resolving peer by querying the cluster"
Dec 15 00:46:08 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:08.048642394Z" level=error msg="could not resolve peer \"10.50.48.4\": timed out resolving peer by querying the cluster"
Dec 15 00:46:09 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:09.050425793Z" level=error msg="could not resolve peer \"10.50.48.17\": timed out resolving peer by querying the cluster"
Dec 15 00:46:10 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:10.053972757Z" level=error msg="could not resolve peer \"10.50.48.23\": timed out resolving peer by querying the cluster"
Dec 15 00:46:11 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:11.054348470Z" level=error msg="could not resolve peer \"10.50.48.25\": timed out resolving peer by querying the cluster"
Dec 15 00:46:12 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:12.054840861Z" level=error msg="could not resolve peer \"10.50.48.24\": timed out resolving peer by querying the cluster"
Dec 15 00:46:13 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:13.061006391Z" level=error msg="could not resolve peer \"10.50.48.4\": timed out resolving peer by querying the cluster"
Dec 15 00:46:14 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:14.061270824Z" level=error msg="could not resolve peer \"10.50.48.17\": timed out resolving peer by querying the cluster"
Dec 15 00:46:15 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:15.061522089Z" level=error msg="could not resolve peer \"10.50.48.23\": timed out resolving peer by querying the cluster"
Dec 15 00:46:16 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:16.061839113Z" level=error msg="could not resolve peer \"10.50.48.25\": timed out resolving peer by querying the cluster"
Dec 15 00:46:17 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:17.062076822Z" level=error msg="could not resolve peer \"10.50.48.24\": timed out resolving peer by querying the cluster"
Dec 15 00:46:18 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:18.062385094Z" level=error msg="could not resolve peer \"10.50.48.4\": timed out resolving peer by querying the cluster"
Dec 15 00:46:19 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:19.062696554Z" level=error msg="could not resolve peer \"10.50.48.17\": timed out resolving peer by querying the cluster"
Dec 15 00:46:20 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:20.064790638Z" level=error msg="could not resolve peer \"10.50.48.23\": timed out resolving peer by querying the cluster"
Dec 15 00:46:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:21.067539938Z" level=error msg="could not resolve peer \"10.50.48.4\": timed out resolving peer by querying the cluster"
Dec 15 00:46:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:21.068683195Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.32 02:42:0a:32:30:20 cardinality:2 db state:Set{{dummy 10.10.2.39 22 32 false}, {6e13374231d5792b3b16d8c551c9567b37b8e61744feb5f495fba7b66b368b6e 10.10.2.39 22 32 true}}"
Dec 15 00:46:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:21.068714041Z" level=warning msg="Entry already present in db: nid:dcf48923f107727f755869f05ab10ba24a56444737f981c898ec14b2032ea952 eid:dummy peerIP:10.50.48.32 peerMac:02:42:0a:32:30:20 isLocal:false vtep:10.10.2.39"
Dec 15 00:46:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:21.068752275Z" level=warning msg="Neighbor entry already present for IP 10.50.48.32, mac 02:42:0a:32:30:20 neighbor:&{dstIP:[10 50 48 32] dstMac:[2 66 10 50 48 32] linkName:vx-000100-dcf48 linkDst:vxlan0 family:0} forceUpdate:true"
Dec 15 00:46:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:21.068868897Z" level=warning msg="Neighbor entry already present for IP 10.10.2.39, mac 02:42:0a:32:30:20 neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 10 2 39] dstMac:[2 66 10 50 48 32] linkName:vx-000100-dcf48 linkDst:vxlan0 family:7} forceUpdate:false"
Dec 15 00:46:21 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:21.068903013Z" level=warning msg="Peer operation failed:could not add fdb entry for nid:dcf48923f107727f755869f05ab10ba24a56444737f981c898ec14b2032ea952 eid:dummy into the sandbox:Search neighbor failed for IP 10.10.2.39, mac 02:42:0a:32:30:20, present in db:true op:&{1 dcf48923f107727f755869f05ab10ba24a56444737f981c898ec14b2032ea952 dummy [10 50 48 32] [255 255 252 0] [2 66 10 50 48 32] [0 0 0 0 0 0 0 0 0 0 255 255 10 10 2 39] false true false watchMiss}"
Dec 15 00:46:22 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:22.070154351Z" level=error msg="could not resolve peer \"10.50.48.17\": timed out resolving peer by querying the cluster"
Dec 15 00:46:22 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:22.070956149Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.72 02:42:0a:32:30:48 cardinality:2 db state:Set{{1a3ef204892e76e1cbfcca7576f28bbfba022cd1e49c6e897a62e2a0738911a3 10.10.2.39 22 32 true}, {dummy 10.10.2.39 22 32 false}}"
Dec 15 00:46:23 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:23.086049272Z" level=error msg="could not resolve peer \"10.50.48.23\": timed out resolving peer by querying the cluster"
Dec 15 00:46:24 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:24.086365853Z" level=error msg="could not resolve peer \"10.50.48.4\": timed out resolving peer by querying the cluster"
Dec 15 00:46:25 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:25.086694403Z" level=error msg="could not resolve peer \"10.50.48.17\": timed out resolving peer by querying the cluster"
Dec 15 00:46:26 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:26.087032562Z" level=error msg="could not resolve peer \"10.50.48.23\": timed out resolving peer by querying the cluster"
Dec 15 00:46:27 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:27.090327610Z" level=error msg="could not resolve peer \"10.50.48.4\": timed out resolving peer by querying the cluster"
Dec 15 00:46:28 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:28.100048345Z" level=error msg="could not resolve peer \"10.50.48.17\": timed out resolving peer by querying the cluster"
Dec 15 00:46:29 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:46:29.112959868Z" level=error msg="could not resolve peer \"10.50.48.23\": timed out resolving peer by querying the cluster"
Dec 15 00:47:23 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:23.283265485Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.80 02:42:0a:32:30:50 cardinality:2 db state:Set{{9fe2a0870a6f14dac6751c83b62cce30c18082eaae11dea5e2d43d5a168ad776 10.10.2.127 22 32 false}, {76c8bf940ed1f0422345bcb9d37c8db2067a48161c8b6ff6689038cd6c05ebcb 10.10.2.127 22 32 false}}"
Dec 15 00:47:23 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:23.283330673Z" level=warning msg="Neighbor entry already present for IP 10.50.48.80, mac 02:42:0a:32:30:50 neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 50 48 80] dstMac:[2 66 10 50 48 80] linkName:vx-000100-dcf48 linkDst:vxlan0 family:0} forceUpdate:false"
Dec 15 00:47:25 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:25.677710734Z" level=warning msg="peerDbDelete transient condition - Key:10.50.48.80 02:42:0a:32:30:50 cardinality:1 db state:Set{{9fe2a0870a6f14dac6751c83b62cce30c18082eaae11dea5e2d43d5a168ad776 10.10.2.127 22 32 false}}"
Dec 15 00:47:28 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:28.854566986Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.80 02:42:0a:32:30:50 cardinality:2 db state:Set{{9fe2a0870a6f14dac6751c83b62cce30c18082eaae11dea5e2d43d5a168ad776 10.10.2.127 22 32 false}, {78c2e91119b59a60216c8aea38319bbf7d70ca5b7288f427dfc4cd319cfc8fe1 10.10.2.186 22 32 false}}"
Dec 15 00:47:28 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:28.854643458Z" level=warning msg="Neighbor entry already present for IP 10.50.48.80, mac 02:42:0a:32:30:50 neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 50 48 80] dstMac:[2 66 10 50 48 80] linkName:vx-000100-dcf48 linkDst:vxlan0 family:0} forceUpdate:false"
Dec 15 00:47:30 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:30.860523220Z" level=warning msg="peerDbDelete transient condition - Key:10.50.48.80 02:42:0a:32:30:50 cardinality:1 db state:Set{{9fe2a0870a6f14dac6751c83b62cce30c18082eaae11dea5e2d43d5a168ad776 10.10.2.127 22 32 false}}"
Dec 15 00:47:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:32.068871701Z" level=warning msg="peerDbAdd transient condition - Key:10.50.48.80 02:42:0a:32:30:50 cardinality:2 db state:Set{{fe6cefb9b974800e2917a350b6357aceed271a6dbbbf4b4da45d380196180380 10.10.2.127 22 32 false}, {9fe2a0870a6f14dac6751c83b62cce30c18082eaae11dea5e2d43d5a168ad776 10.10.2.127 22 32 false}}"
Dec 15 00:47:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:32.068921895Z" level=warning msg="Neighbor entry already present for IP 10.50.48.80, mac 02:42:0a:32:30:50 neighbor:&{dstIP:[0 0 0 0 0 0 0 0 0 0 255 255 10 50 48 80] dstMac:[2 66 10 50 48 80] linkName:vx-000100-dcf48 linkDst:vxlan0 family:0} forceUpdate:false"
Dec 15 00:47:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:47:32.494377690Z" level=warning msg="peerDbDelete transient condition - Key:10.50.48.80 02:42:0a:32:30:50 cardinality:1 db state:Set{{fe6cefb9b974800e2917a350b6357aceed271a6dbbbf4b4da45d380196180380 10.10.2.127 22 32 false}}"
Dec 15 00:55:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:55:32.937897501Z" level=warning msg="2017/12/15 00:55:32 [WARN] serf: received old event jl 10.10.2.77 dcf48923f107727f755869f05ab10ba24a56444737f981c898ec14b2032ea952 92afa9f48852c73e1d8c41b904e2eed40e561ac9ed51966d1d7062397548118a from time 373 (current: 894)
Dec 15 00:55:32 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: "
Dec 15 00:56:14 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: time="2017-12-15T00:56:14.118302987Z" level=error msg="Error running exec in container: oci runtime error: exec failed: container_linux.go:265: starting container process caused \"process_linux.go:84: executing setns process caused \\"exit status 15\\"\"
Dec 15 00:56:14 ip-10-10-2-39.us-west-2.compute.internal dockerd[15797]: "
Describe the results you expected:
Containers on all hosts should be able to connect to each other.
Additional information you deem important (e.g. issue happens only occasionally):
The issue occurs intermittently and at random, and is much more frequent on larger clusters.
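The "dial tcp 10.10.2.127:7946: getsockopt: connection refused" line in the journal suggests the serf/gossip control plane is the first thing to check when a node drops out. A rough diagnostic we can run from an affected host (ports are the standard overlay-network requirements; the peer IP is just the example from the log above):

```shell
PEER=10.10.2.127   # example: the neighbor the daemon failed to join

# serf/gossip control plane: TCP and UDP 7946 must be open between all hosts
nc -zv "$PEER" 7946

# VXLAN data plane: UDP 4789 must be open between all hosts
# (note: UDP probes with nc are best-effort and can report false positives)
nc -zvu "$PEER" 4789

# confirm the local daemon is actually listening for gossip traffic
ss -lntu | grep 7946
```

On the hosts we have checked while the issue was occurring, we had no reason to believe a firewall was blocking these ports, which is why this looks to us like a peer-database/serf state problem rather than plain connectivity loss.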
Output of docker version:
Tested with 17.05, 17.09.1, and 17.11.0. I don't currently have a 17.05 host running to capture its docker version output.
Client:
Version: 17.09.1-ce
API version: 1.32
Go version: go1.8.3
Git commit: 19e2cf6
Built: Thu Dec 7 22:23:40 2017
OS/Arch: linux/amd64
Server:
Version: 17.09.1-ce
API version: 1.32 (minimum version 1.12)
Go version: go1.8.3
Git commit: 19e2cf6
Built: Thu Dec 7 22:25:03 2017
OS/Arch: linux/amd64
Experimental: false
Client:
Version: 17.11.0-ce
API version: 1.34
Go version: go1.8.3
Git commit: 1caf76c
Built: Mon Nov 20 18:35:47 2017
OS/Arch: linux/amd64
Server:
Version: 17.11.0-ce
API version: 1.34 (minimum version 1.12)
Go version: go1.8.3
Git commit: 1caf76c
Built: Mon Nov 20 18:45:06 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 6
Running: 5
Paused: 0
Stopped: 1
Images: 5
Server Version: 17.09.1-ce
Storage Driver: devicemapper
Pool Name: docker-202:1-352328870-pool
Pool Blocksize: 65.54kB
Base Device Size: 10.74GB
Backing Filesystem: xfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 1.729GB
Data Space Total: 107.4GB
Data Space Available: 105.6GB
Metadata Space Used: 3.641MB
Metadata Space Total: 2.147GB
Metadata Space Available: 2.144GB
Thin Pool Minimum Free Space: 10.74GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.140-RHEL7 (2017-05-03)
Logging Driver: none
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-693.2.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 29.79GiB
Name: ip-10-10-2-39.us-west-2.compute.internal
ID: 73EN:WWPI:2QJ4:2GPV:4I4Z:BHMS:EZBX:BF75:HAJ7:M7CO:LLSA:FYJW
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Cluster Store: consul://10.10.2.39:8080
Cluster Advertise: 10.10.2.39:2376
Insecure Registries:
docker-exabeam.repo.exabeam.com
127.0.0.0/8
Live Restore Enabled: true
WARNING: devicemapper: usage of loopback devices is strongly discouraged for production use.
Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
Containers: 3
Running: 3
Paused: 0
Stopped: 0
Images: 2
Server Version: 17.11.0-ce
Storage Driver: devicemapper
Pool Name: docker-202:1-486539717-pool
Pool Blocksize: 65.54kB
Base Device Size: 10.74GB
Backing Filesystem: xfs
Udev Sync Supported: true
Data file: /dev/loop0
Metadata file: /dev/loop1
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Data Space Used: 1.23GB
Data Space Total: 107.4GB
Data Space Available: 106.1GB
Metadata Space Used: 2.036MB
Metadata Space Total: 2.147GB
Metadata Space Available: 2.145GB
Thin Pool Minimum Free Space: 10.74GB
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.140-RHEL7 (2017-05-03)
Logging Driver: none
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 992280e8e265f491f7a624ab82f3e238be086e49
runc version: 0351df1c5a66838d0c392b4ac4cf9450de844e2d
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-693.2.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 29.79GiB
Name: ip-10-10-2-23.us-west-2.compute.internal
ID: QTRG:VGMN:HOW5:4IJ6:RHTG:3GNF:SVP6:OF53:KI4M:H647:CV64:45VR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Cluster Store: consul://10.10.2.23:8080
Cluster Advertise: 10.10.2.23:2376
Insecure Registries:
docker-exabeam.repo.exabeam.com
127.0.0.0/8
Live Restore Enabled: true
WARNING: devicemapper: usage of loopback devices is strongly discouraged for production use.
Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
Additional environment details (AWS, VirtualBox, physical, etc.):
AWS and physical, running CentOS 7. Tested with consul 0.7.2 and 1.0.1.
Restarting the docker daemon on an unreachable host sometimes restores connectivity for that host.
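For reference, the workaround we apply is just a daemon restart followed by a connectivity re-test; a sketch (the container and hostnames are the ones from the ping output above):

```shell
# Workaround on an affected host. With live-restore enabled (see
# daemon.json below), running containers survive the daemon restart.
sudo systemctl restart docker

# Re-test cross-node connectivity from a container on that host:
sudo docker exec -it mongodb-configsvr-host1 ping -c 3 mongodb-configsvr-host22
```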
daemon.json
{
"bip": "172.17.0.1/16",
"cluster-advertise": "10.10.2.39:2376",
"cluster-store": "consul://10.10.2.39:8080",
"cluster-store-opts": {
"kv.cacertfile": "/etc/docker/ssl/exa-ca.pem",
"kv.certfile": "/etc/docker/ssl/docker-cert.pem",
"kv.keyfile": "/etc/docker/ssl/docker-key.pem"
},
"disable-legacy-registry": true,
"dns": [],
"hosts": [
"unix:///var/run/docker.sock",
"tcp://0.0.0.0:2376"
],
"insecure-registries": [
"docker-exabeam.repo.exabeam.com"
],
"live-restore": true,
"log-driver": "none",
"shutdown-timeout": 90,
"storage-driver": "devicemapper",
"tlsverify": true,
"tlscacert": "/etc/docker/ssl/exa-ca.pem",
"tlscert": "/etc/docker/ssl/docker-cert.pem",
"tlskey": "/etc/docker/ssl/docker-key.pem"
}
consul config file:
{
"data_dir":"/var/lib/consul",
"enable_syslog":false,
"server":true,
"node_name":"host1",
"advertise_addr":"10.10.2.39",
"bind_addr":"10.10.2.39",
"bootstrap_expect": 5,
"retry_join": ["10.10.2.39","10.10.2.43","10.10.2.52","10.10.2.58","10.10.2.69"],
"client_addr": "0.0.0.0",
"verify_outgoing": true,
"verify_incoming": true,
"verify_server_hostname": false,
"encrypt": "BuZSrsHSx1MCQ9j1kFuQmA==",
"key_file": "/opt/consul/etc/consul-key.pem",
"cert_file": "/opt/consul/etc/consul-cert.pem",
"ca_file": "/opt/consul/etc/exa-ca.pem",
"ports": {
"https": 8080,
"http": -1
}
}
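If it helps triage: the overlay state can be inspected directly in consul, since libnetwork writes it under the `docker/network/v1.0/` prefix in the external KV store. A sketch using the TLS material and HTTPS port from the configs above (whether a given node's endpoint keys are present/stale is what I'd expect to distinguish healthy from unreachable hosts, but that's my assumption, not something I've confirmed):

```shell
# List the keys Docker has written for the overlay driver. The response
# is a JSON array of key names; compare it between a healthy host and an
# unreachable one to see whether they have diverged from the KV store.
curl -s \
  --cacert /etc/docker/ssl/exa-ca.pem \
  --cert /etc/docker/ssl/docker-cert.pem \
  --key /etc/docker/ssl/docker-key.pem \
  "https://10.10.2.39:8080/v1/kv/docker/network/v1.0/?keys"
```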
Possibly related to #32841 , #32195 , or #33721 , but I'm not using swarm, so I figured it warranted its own ticket.