Description
I am pulling my hair out... any ideas for things I should look into would be helpful!! I am all out of ideas.
I have 17 containers running on a custom network (bridge driver). I point containers to my own DNS server. Everything runs great for a few days, but at some point I'll notice essentially all connections between containers on the network, and between the containers and the host OS, are failing. Restarting the docker engine clears the issue for another 2-4 days.
External connections to the internet (both coming in and going out) seem to be fine.
I've pored over the log files and cannot find a culprit.
- Ping between containers/host/internet all work fine
- DNS lookups to my DNS server (which is running on the same host, on the host OS) fail. DNS lookups to an external server (e.g. Google's) succeed. Here's a blurb from dockerd.log:
[2023-12-14T20:21:43.728658844Z][dockerd][I] time="2023-12-14T20:21:43.727917261Z" level=error msg="[resolver] failed to query DNS server: 10.0.0.100:53, query: ;ports.ubuntu.com.\tIN\t A" error="read udp 192.168.32.103:53420->10.0.0.100:53: i/o timeout"
[2023-12-14T20:21:47.731933221Z][dockerd][I] time="2023-12-14T20:21:47.731412846Z" level=error msg="[resolver] failed to query DNS server: 10.0.0.100:53, query: ;ports.ubuntu.com.\tIN\t A" error="read udp 192.168.32.103:58153->10.0.0.100:53: i/o timeout"
The DNS queries do not reach the software running on the host OS. I've restarted the DNS software and that does not fix the issue. The DNS server is responding to all other devices' queries on my network, so I'm certain the error is occurring somewhere within the Docker networking and/or host OS networking.
- Websocket connections between containers, on the same docker network, fail. Here's an example of one container trying to open a web socket to another:
UnboundLocalError: cannot access local variable 'r' where it is not associated with a value
2023-12-14 14:42:56 - ERROR :: CP Server Thread-12 : Failed to access uri endpoint /users/account. Connection error: HTTPSConnectionPool(host='plex.tv', port=443): Max retries exceeded with url: /users/account (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0xffff781e8790>: Failed to establish a new connection: [Errno -3] Try again'))
2023-12-14 14:42:56 - WARNING :: CP Server Thread-12 : Tautulli PlexTV :: Unable to parse XML for get_plexpass_status: 'NoneType' object has no attribute 'getElementsByTagName'.
2023-12-14 14:42:57 - DEBUG :: CP Server Thread-8 : IP Checker :: Resolved plex.private to 192.168.32.106.
2023-12-14 14:42:57 - INFO :: CP Server Thread-8 : Tautulli PlexTV :: Requesting resources for server...
2023-12-14 14:42:57 - INFO :: CP Server Thread-8 : Tautulli PlexTV :: Pinging Plex.tv to refresh token.
2023-12-14 14:43:02 - ERROR :: CP Server Thread-8 : Failed to access uri endpoint /api/v2/ping. Connection error: HTTPSConnectionPool(host='plex.tv', port=443): Max retries exceeded with url: /api/v2/ping (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0xffff7a04dc90>: Failed to establish a new connection: [Errno -3] Try again'))
2023-12-14 14:43:02 - WARNING :: CP Server Thread-8 : Tautulli PlexTV :: Unable to parse XML for ping: 'NoneType' object has no attribute 'getElementsByTagName'.
2023-12-14 14:43:04 - ERROR :: CP Server Thread-13 : Failed to access uri endpoint /api/downloads/1.json. Connection error: HTTPSConnectionPool(host='plex.tv', port=443): Max retries exceeded with url: /api/downloads/1.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0xffff78207850>: Failed to establish a new connection: [Errno -3] Try again'))
2023-12-14 14:43:04 - WARNING :: CP Server Thread-13 : Tautulli PlexTV :: Unable to load JSON for get_plex_updates: the JSON object must be str, bytes or bytearray, not NoneType
2023-12-14 14:43:04 - ERROR :: CP Server Thread-6 : WebUI :: /settings#tabs_tabs-plex_media_server : TypeError: undefined is not an object (evaluating 'downloads.computer[platform]'). (settings:5505)
2023-12-14 14:43:07 - ERROR :: CP Server Thread-8 : Failed to access uri endpoint /api/resources?includeHttps=1. Connection error: HTTPSConnectionPool(host='plex.tv', port=443): Max retries exceeded with url: /api/resources?includeHttps=1 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0xffff78207590>: Failed to establish a new connection: [Errno -3] Try again'))
2023-12-14 14:43:07 - WARNING :: CP Server Thread-8 : Tautulli PlexTV :: Unable to parse XML for get_server_urls: 'NoneType' object has no attribute 'getElementsByTagName'.
2023-12-14 14:43:12 - ERROR :: CP Server Thread-8 : Failed to access uri endpoint /users/account. Connection error: HTTPSConnectionPool(host='plex.tv', port=443): Max retries exceeded with url: /users/account (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0xffff792eb690>: Failed to establish a new connection: [Errno -3] Try again'))
2023-12-14 14:43:12 - WARNING :: CP Server Thread-8 : Tautulli PlexTV :: Unable to parse XML for get_plexpass_status: 'NoneType' object has no attribute 'getElementsByTagName'.
2023-12-14 14:43:12 - DEBUG :: CP Server Thread-8 : Testing secure websocket connection...
2023-12-14 14:43:17 - ERROR :: CP Server Thread-8 : Websocket connection test failed: [Errno -3] Try again
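The `[Errno -3] Try again` in these lines looks like a name-resolution error rather than a refused or timed-out connection: on Linux, -3 is `EAI_AGAIN`, the temporary-failure code from `getaddrinfo`. A rough sketch for telling the two cases apart from inside a container is below. The function name `classify` and the 3-second timeout are illustrative assumptions, not part of any tool mentioned above.

```python
import socket


def classify(host: str, port: int, timeout: float = 3.0) -> str:
    """Report whether a connection attempt fails at name resolution or at connect."""
    try:
        # Step 1: resolve the name the same way most client libraries do.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # EAI_AGAIN is the temporary resolver failure that is reported as
        # "[Errno -3] Try again" on Linux.
        kind = "temporary" if exc.errno == socket.EAI_AGAIN else "permanent"
        return f"resolve-failed ({kind}): {exc}"

    # Step 2: resolution worked, so try a plain TCP connect to the first address.
    family, socktype, proto, _canonname, addr = infos[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(addr)
    except OSError as exc:
        return f"connect-failed to {addr[0]}: {exc}"
    return f"connected to {addr[0]}"
```

Running something like `classify("plex.tv", 443)` and `classify("plex.private", 443)` while the problem is active (the port for `plex.private` is a guess and would need adjusting) should show whether the failure is in the resolver path or in the container-to-container connection itself.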
Possible clues:
The Docker subnet defined in the Desktop app settings is 192.168.65.0/24 (the default). I've noticed these errors repeating in init.log:
[2023-12-14T20:48:56.106136461Z][init][I] [2023-12-14T20:48:56.105950586Z][init][I] eth1: DHCPv4
[2023-12-14T20:48:59.118487629Z][init][I] [2023-12-14T20:48:59.118322379Z][init][I] eth1: DHCPv4 failure: timed out while listening for replies
[2023-12-14T20:49:00.122143379Z][init][I] [2023-12-14T20:49:00.122028296Z][init][I] eth1: DHCPv4
[2023-12-14T20:49:03.130142047Z][init][I] [2023-12-14T20:49:03.130094464Z][init][I] eth1: DHCPv4 failure: timed out while listening for replies
My DHCP server shows repeating DHCP requests coming from 192.168.65.3. unbolt.000-dhcpcd.log shows that IP address was offered from vpnkit, but I don't know exactly what is using that IP address or why it's sending DHCP messages to my DHCP server. I suspect this is not related to my issue but I can't find any other relevant errors in the docker logs.
Reproduce
It will be difficult for others to reproduce. In my case, it takes letting a stable multi-container installation run for a period of days until it fails.
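Since the failure only shows up after days, a small probe running inside one of the containers could at least record when it starts. The sketch below sends a raw DNS A query over UDP, bypassing Docker's embedded resolver, and logs whether any reply comes back. The names `build_query`, `probe` and `watch`, the 2-second timeout and the 60-second interval are all assumptions for illustration; 10.0.0.100 and ports.ubuntu.com are taken from the dockerd.log lines above.

```python
import socket
import struct
import time


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length.
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # Zero byte ends the name, then QTYPE=A (1) and QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def probe(server: str, name: str, timeout: float = 2.0, port: int = 53) -> bool:
    """Return True if a reply with the matching query id arrives before the timeout."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable errors
            return False
    return reply[:2] == query[:2]


def watch(server: str, name: str, interval: float = 60.0, count=None) -> None:
    """Probe repeatedly and print one timestamped line per attempt.

    count=None keeps going until interrupted.
    """
    done = 0
    while count is None or done < count:
        status = "ok" if probe(server, name) else "NO REPLY"
        print(time.strftime("%Y-%m-%dT%H:%M:%S"), server, name, status, flush=True)
        done += 1
        if count is None or done < count:
            time.sleep(interval)
```

Calling `watch("10.0.0.100", "ports.ubuntu.com")` in a container on the custom network would give a timestamp for the first `NO REPLY`, which can then be lined up against dockerd.log and init.log.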
Expected behavior
Network connections between the containers and the host using the custom bridge network should remain stable.
docker version
Client:
Cloud integration: v1.0.35+desktop.5
Version: 24.0.6
API version: 1.43
Go version: go1.20.7
Git commit: ed223bc
Built: Mon Sep 4 12:28:49 2023
OS/Arch: darwin/arm64
Context: desktop-linux
Server: Docker Desktop 4.25.2 (129061)
Engine:
Version: 24.0.6
API version: 1.43 (minimum version 1.12)
Go version: go1.20.7
Git commit: 1a79695
Built: Mon Sep 4 12:31:36 2023
OS/Arch: linux/arm64
Experimental: false
containerd:
Version: 1.6.22
GitCommit: 8165feabfdfe38c65b599c4993d227328c231fca
runc:
Version: 1.1.8
GitCommit: v1.1.8-0-g82f18fe
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client:
Version: 24.0.6
Context: desktop-linux
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.11.2-desktop.5
Path: /Users/jacob/.docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.23.0-desktop.1
Path: /Users/jacob/.docker/cli-plugins/docker-compose
dev: Docker Dev Environments (Docker Inc.)
Version: v0.1.0
Path: /Users/jacob/.docker/cli-plugins/docker-dev
extension: Manages Docker extensions (Docker Inc.)
Version: v0.2.20
Path: /Users/jacob/.docker/cli-plugins/docker-extension
init: Creates Docker-related starter files for your project (Docker Inc.)
Version: v0.1.0-beta.9
Path: /Users/jacob/.docker/cli-plugins/docker-init
sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc.)
Version: 0.6.0
Path: /Users/jacob/.docker/cli-plugins/docker-sbom
scan: Docker Scan (Docker Inc.)
Version: v0.26.0
Path: /Users/jacob/.docker/cli-plugins/docker-scan
scout: Docker Scout (Docker Inc.)
Version: v1.0.9
Path: /Users/jacob/.docker/cli-plugins/docker-scout
Server:
Containers: 17
Running: 17
Paused: 0
Stopped: 0
Images: 18
Server Version: 24.0.6
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc io.containerd.runc.v2
Default Runtime: runc
Init Binary: docker-init
containerd version: 8165feabfdfe38c65b599c4993d227328c231fca
runc version: v1.1.8-0-g82f18fe
init version: de40ad0
Security Options:
seccomp
Profile: unconfined
cgroupns
Kernel Version: 6.4.16-linuxkit
Operating System: Docker Desktop
OSType: linux
Architecture: aarch64
CPUs: 8
Total Memory: 8.729GiB
Name: linuxkit-0e3b4880c424
ID: 8998ca3f-fffc-43a8-8471-66e3ec92d688
Docker Root Dir: /var/lib/docker
Debug Mode: false
HTTP Proxy: http.docker.internal:3128
HTTPS Proxy: http.docker.internal:3128
No Proxy: hubproxy.docker.internal
Experimental: false
Insecure Registries:
hubproxy.docker.internal:5555
127.0.0.0/8
Live Restore Enabled: false
WARNING: daemon is not using the default seccomp profile
Diagnostics ID
C4EA34A4-CB8E-4BE0-81EE-9F972775E736/20231214205155
Additional Info
No response