
Container HEALTHCHECKs can lead to hanging API calls and bad State #36661

@jahkeup

Description

Occasionally the Docker daemon stops responding to interactions with containers that have HEALTHCHECKs. The problem presents itself in several older versions of Docker, in the latest packaged versions on Ubuntu, and in v17.12.0-ce on Amazon Linux.

From our observations, this looks like a race condition that ends in a deadlock, which then blocks further calls against the affected containers.

This issue may be related to #35933. In any case, I'm bisecting the releases (using https://github.com/docker/docker-ce) with this repro to narrow the problem down further.

Steps to reproduce the issue:

A repro case has been built and run against several versions of Docker, with positive results after a few rounds of execution (I recommend 10-20 rounds to tickle the bug). There is likely nothing special about using 2 containers, but that count has reliably triggered the bug in this test.

  1. Build container image with HEALTHCHECK defined (echo hello every 1s)
  2. Start 2 containers using image
  3. Wait some time (10s in our test)
  4. Stop containers
  5. Inspect containers
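The steps above can be sketched as a single round in Python. This is a minimal sketch rather than the actual repro runner: the `busybox` base image, the helper names, and the 60-second timeout are assumptions, while the image tag and the `sh -c 'sleep 30m'` command are taken from the `docker ps` output in this report.

```python
import subprocess
import tempfile
import time
from pathlib import Path

IMAGE = "docker-poke:healthchecks"  # tag as shown in the `docker ps` output
BASE_IMAGE = "busybox"              # assumption: the report does not name a base image


def dockerfile_text(base_image=BASE_IMAGE):
    """Step 1: a Dockerfile whose HEALTHCHECK runs `echo hello` every second."""
    return (
        f"FROM {base_image}\n"
        "HEALTHCHECK --interval=1s CMD echo hello\n"
    )


def run_command(image=IMAGE):
    """Step 2: start a detached container running `sleep 30m`."""
    return ["docker", "run", "-d", image, "sh", "-c", "sleep 30m"]


def docker(args, timeout=None):
    """Run a docker CLI command and return its stripped stdout."""
    result = subprocess.run(
        args, check=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout.strip()


def run_round(wait_seconds=10, call_timeout=60):
    """One round: build, start 2 containers, wait, stop, inspect."""
    with tempfile.TemporaryDirectory() as context:
        Path(context, "Dockerfile").write_text(dockerfile_text())
        docker(["docker", "build", "-t", IMAGE, context])
    container_ids = [docker(run_command()) for _ in range(2)]
    time.sleep(wait_seconds)  # step 3
    for container_id in container_ids:
        # Raises subprocess.TimeoutExpired if the daemon stops answering.
        docker(["docker", "stop", container_id], timeout=call_timeout)
    for container_id in container_ids:
        docker(["docker", "inspect", container_id], timeout=call_timeout)
```

Calling `run_round()` on a host with a running daemon performs one round; a `subprocess.TimeoutExpired` from the stop or inspect step would correspond to the hang described in this report.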

Describe the results you received:

Started containers appear to continue running and to be healthy despite being non-responsive.

ubuntu@ip-172-31-37-156:~$ docker ps
CONTAINER ID        IMAGE                      COMMAND               CREATED             STATUS                    PORTS               NAMES
0cf518c205f7        docker-poke:healthchecks   "sh -c 'sleep 30m'"   17 minutes ago      Up 17 minutes (healthy)                       sad_hugle
ubuntu@ip-172-31-37-156:~$ docker inspect 0cf518c205f7
^C

Additionally, docker ps continues to report the container as up and running even though its process should have exited after 30 minutes (it was started with sleep 30m).

0cf518c205f7        docker-poke:healthchecks   "sh -c 'sleep 30m'"   37 minutes ago      Up 37 minutes (healthy)                               sad_hugle
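Rather than interrupting a hung `docker inspect` by hand with ^C, the check can be wrapped in a timeout. The sketch below is one way to do that, assuming a hypothetical helper name and an arbitrary 30-second limit; it only uses `subprocess.run` with its `timeout` argument.

```python
import subprocess


def responds_within(command, timeout_seconds):
    """Return True if `command` exits within the timeout, False if it hangs.

    A non-zero exit status still counts as a response; only a timeout
    is treated as a hang.
    """
    try:
        subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return False
    return True


def inspect_responds(container_id, timeout_seconds=30):
    """Check whether `docker inspect` answers for the given container."""
    return responds_within(["docker", "inspect", container_id], timeout_seconds)
```

For the container shown above, `inspect_responds("0cf518c205f7")` would be expected to return `False` while the daemon is in the deadlocked state.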

Describe the results you expected:

I expected that I would be able to inspect this container.

docker inspect 0cf518c205f7
{
   ...
}

Additional information you deem important (e.g. issue happens only occasionally):

The issue shows up readily with a few concurrent runs, but otherwise lies dormant even across many serial runs.
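Since concurrency appears to be what exposes the bug, a runner can fan the rounds out over a small thread pool. The following is a sketch under assumptions: `run_round` stands for any callable that performs one repro round and raises on failure, and the worker count of 4 is an arbitrary choice, not a value from this report.

```python
from concurrent.futures import ThreadPoolExecutor


def run_rounds_concurrently(run_round, rounds=20, workers=4):
    """Run `run_round(index)` for each round on a thread pool.

    Returns the indices of the rounds that raised an exception,
    in ascending order.
    """
    def attempt(index):
        try:
            run_round(index)
            return None
        except Exception:
            return index

    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(attempt, range(rounds)))
    return [index for index in outcomes if index is not None]
```

An empty list means every round completed; any index in the result marks a round whose stop or inspect step failed or timed out.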

Output of docker version:

Client:
 Version:       17.12.1-ce
 API version:   1.35
 Go version:    go1.9.4
 Git commit:    7390fc6
 Built:         Tue Feb 27 22:17:40 2018
 OS/Arch:       linux/amd64

Server:
 Engine:
  Version:      17.12.1-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   7390fc6
  Built:        Tue Feb 27 22:16:13 2018
  OS/Arch:      linux/amd64
  Experimental: false

Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 17.12.1-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 4
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-1052-aws
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.625GiB
Name: ip-172-31-37-156
ID: L4W4:V4WA:OHSS:QTGL:DRJG:32GX:7DKK:FFLO:WKR2:IJYV:NKDG:GWRA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):

Ubuntu 16.04 on AWS EC2 using the repro runner

Thanks @samuelkarp!

Package                    Result
17.09.1~ce-0~ubuntu        pass
17.10.0~ce-0~ubuntu        pass
17.11.0~ce-0~ubuntu        pass
17.12.0~ce~rc1-0~ubuntu    fail
17.12.0~ce-0~ubuntu        fail
17.12.1~ce-0~ubuntu        fail
18.01.0~ce-0~ubuntu        fail
18.02.0~ce-0~ubuntu        fail
18.03.0~ce~rc4-0~ubuntu    fail
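The Ubuntu results narrow the regression to a single release boundary. As a small illustration, assuming the packages are listed oldest first as in the table above, the last passing and first failing versions can be read off like this (the function name is hypothetical):

```python
UBUNTU_RESULTS = [
    ("17.09.1~ce-0~ubuntu", "pass"),
    ("17.10.0~ce-0~ubuntu", "pass"),
    ("17.11.0~ce-0~ubuntu", "pass"),
    ("17.12.0~ce~rc1-0~ubuntu", "fail"),
    ("17.12.0~ce-0~ubuntu", "fail"),
    ("17.12.1~ce-0~ubuntu", "fail"),
    ("18.01.0~ce-0~ubuntu", "fail"),
    ("18.02.0~ce-0~ubuntu", "fail"),
    ("18.03.0~ce~rc4-0~ubuntu", "fail"),
]


def regression_window(results):
    """Return (last passing, first failing) for results listed oldest first."""
    last_pass = None
    for package, outcome in results:
        if outcome == "fail":
            return last_pass, package
        last_pass = package
    return last_pass, None
```

For the table above this yields `17.11.0~ce-0~ubuntu` and `17.12.0~ce~rc1-0~ubuntu`, so the bisect would only need to cover the commits between those two releases.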

Amazon Linux on AWS EC2 using the repro runner

Thanks @jhaynes!

Package       Result
17.12.0-ce    fail
17.09.1-ce    pass

Labels: area/runtime, kind/bug
