Mitigate the impact of slow exec starts on health checks by corhere · Pull Request #43480 · moby/moby

corhere · 2022-04-12T23:20:15Z

Relates to How many docker health checks can be handled in a single docker node ? #33933

- What I did
While investigating a report of slow docker execs and failing health checks on heavily-loaded systems, I instrumented dockerd with OpenTelemetry tracing (https://github.com/corhere/moby/tree/otel-trace) and captured traces with tens of health-checked containers running concurrently. I was surprised to find that—on my machine and setup—the containerd.services.tasks.v1.Tasks/Start RPC could take upwards of a full second to complete with ~50 health-checked containers.

for i in {0..50}; do
  docker run -d --name healthstress${i} \
  --health-cmd true --health-interval 1s --health-timeout 5s \
  alpine sleep infinity
done

The timeout for a container health check starts the moment dockerd starts setting up the exec for the probe command. Consequently, the time it takes to set up the command execution, including all containerd RPCs, takes away from the time the probe command has to run to completion. I have yet to determine the root cause of the latency, but in the meantime I have attempted to mitigate the worst impacts by not counting it against the probe command's timeout.

- How I did it
I moved the probe-timeout timer into the probe implementation, and started the timer only once the probe command has started up.
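
A minimal sketch of that idea in Go, for illustration only; it is not the code in daemon/health.go. The `probe` type, `runProbe`, `probeResultMs`, and `ms` are hypothetical names, and the exec start and the command run are simulated with sleeps. The point it shows is that the timeout timer is created only after the start phase has finished.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// probe is a hypothetical stand-in for an exec'd health-check command.
type probe struct {
	startDelay time.Duration // simulated time to set up and start the exec
	runTime    time.Duration // simulated time the command itself runs
}

var errProbeTimeout = errors.New("health check timed out")

// runProbe waits for the simulated exec start, and only then arms the
// timeout timer, so start latency is not counted against the timeout.
func runProbe(ctx context.Context, p probe, timeout time.Duration) error {
	// Start phase: deliberately outside the probe timeout.
	select {
	case <-time.After(p.startDelay):
	case <-ctx.Done():
		return ctx.Err()
	}

	// The command is now "running": arm the timer.
	tm := time.NewTimer(timeout)
	defer tm.Stop()

	select {
	case <-time.After(p.runTime):
		return nil
	case <-tm.C:
		return errProbeTimeout
	case <-ctx.Done():
		return ctx.Err()
	}
}

// ms converts a millisecond count to a time.Duration.
func ms(n int) time.Duration { return time.Duration(n) * time.Millisecond }

// probeResultMs runs a simulated probe and reports the outcome as a string.
func probeResultMs(startMs, runMs, timeoutMs int) string {
	p := probe{startDelay: ms(startMs), runTime: ms(runMs)}
	err := runProbe(context.Background(), p, ms(timeoutMs))
	if errors.Is(err, errProbeTimeout) {
		return "timeout"
	}
	if err != nil {
		return err.Error()
	}
	return "healthy"
}

func main() {
	// Start latency (300ms) exceeds the timeout (250ms), but the command is fast.
	fmt.Println(probeResultMs(300, 1, 250))
	// Start is instant, but the command runs far longer than the timeout.
	fmt.Println(probeResultMs(0, 400, 20))
}
```

With the timer armed before the start phase instead, the first call would time out even though the command itself finishes almost immediately.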

- How to verify it
Test that container health checks and timeouts continue to function as before.

# Should be unhealthy
docker run --health-cmd "sleep 5" --health-interval 1s --health-timeout 2s -d --name healthtest-unhealthy alpine sleep infinity
# Should be healthy
docker run --health-cmd true --health-interval 1s --health-timeout 2s -d --name healthtest-healthy alpine sleep infinity

Configure dockerd to expose the metrics server, make an HTTP request against it (curl localhost:9323/metrics) and check that the newly-added engine_daemon_health_check_start_duration_seconds histogram metric is returned.
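
The metrics check can also be scripted. The following Go sketch assumes the metrics server listens on localhost:9323 as in the curl example above, and that the endpoint returns the Prometheus text exposition format, in which a histogram shows up as `_bucket`, `_sum` and `_count` sample lines. `hasMetricInText` is a hypothetical helper, not part of dockerd.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// hasMetricInText reports whether the Prometheus text output contains at
// least one sample for the named metric, including histogram series.
func hasMetricInText(text, name string) bool {
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "#") {
			continue // skip HELP and TYPE comment lines
		}
		// A sample line looks like `name{labels} value` or `name value`.
		metric := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			metric = line[:i]
		}
		for _, suffix := range []string{"", "_bucket", "_sum", "_count"} {
			if metric == name+suffix {
				return true
			}
		}
	}
	return false
}

func main() {
	const name = "engine_daemon_health_check_start_duration_seconds"

	// Assumes dockerd exposes its metrics server on this address.
	resp, err := http.Get("http://localhost:9323/metrics")
	if err != nil {
		fmt.Println("could not reach metrics server:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("could not read metrics:", err)
		return
	}
	fmt.Println(name, "present:", hasMetricInText(string(body), name))
}
```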

- Description for the changelog

Health check timeout now applies only to the duration that the health-check command is running. The time it takes to start the command no longer counts against the timeout.

- A picture of a cute animal (not mandatory but encouraged)

corhere · 2022-04-14T22:00:59Z

Adding a timeout to starting the exec as discussed in the maintainer call looks to have done the trick: tests are no longer hanging. PTAL @thaJeztah @cpuguy83 @tianon @tonistiigi

cpuguy83 · 2022-04-21T18:34:12Z

Execs are expected to only have second-precision.
It may happen faster than a second but cannot be guaranteed.
Consequently the timeouts for execs only use second precision as well.

I'm not sure I understand the purpose of the change given that.
If someone's healthcheck probes are timing out due to the latency involved in starting up the actual exec, most likely the timeout for that needs to be adjusted because it is much too small.
If the system is overloaded and cannot run the exec within the timeout period, then most likely the service is unhealthy, and at the very least we are unable to determine if it is healthy or not.

thaJeztah · 2022-04-21T20:00:11Z

gave CI a kick to do another run 👍

corhere · 2022-04-22T17:05:54Z

@cpuguy83 The swagger docs for the ContainerCreate API document the Healthcheck field as "[a] test to perform to check that the container is healthy." A health check is supposed to probe the health of a container, independently from the container runtime. Applying the timeout to the entire cmdProbe.run call lumps the runtime's health with that of the container, blaming the container if the runtime takes too long. It would be unreasonable to expect users to take the runtime time into consideration when configuring the timeout as health check configuration can be baked into images, which are runtime-independent, but the amount of time consumed by the runtime could vary significantly by runtime flavor and version.

> Execs are expected to only have second-precision.
> It may happen faster than a second but cannot be guaranteed.
> Consequently the timeouts for execs only use second precision as well.
>
> I'm not sure I understand the purpose of the change given that.

The purpose of the change is to decouple timeouts for health checks from the time it takes to start an exec'd process precisely because the exec start time is so large and variable.

The Healthcheck.Timeout field is documented to be "[t]he time to wait before considering the check to have hung. It should be 0 or at least 1000000 (1 ms)." The Dockerfile reference for the HEALTHCHECK instruction describes it similarly. There is no mention of the timeout encompassing both the time it takes to start the command inside the container and the time it takes for that command to run. I would expect a health check timeout to cover exclusively the time it takes for the configured command to run, and I suspect that most users would think the same. The other interpretation leads to absurdities where health checks with short timeouts can time out before the command has even started.

> If someone's healthcheck probes are timing out due to the latency involved in starting up the actual exec, most likely the timeout for that needs to be adjusted because it is much too small. If the system is overloaded and cannot run the exec within the timeout period, then most likely the service is unhealthy, and at the very least we are unable to determine if it is healthy or not.

Not necessarily. Consider the case of a container configured with reserved CPU and memory. The dockerd, containerd and shim are running in the host namespace, and therefore the work required to start the health-check exec does not get to leverage the container's reserved resources. The exec'd health check process, however, does get to take advantage of the dedicated resources reserved for the container. The container could keep ticking along, happy and responsive, even while dockerd and containerd are bogged down by an overloaded system. Without this change, that container's health check could fail even if the exec'd command takes exactly the same amount of time to run to completion, simply because it takes dockerd and containerd more time than usual to exec the command.

tianon · 2022-04-28T18:29:15Z

+	// start the exec is time that the probe process is not running, and so
+	// should not count towards the health check's timeout. Apply a separate
+	// timeout to abort if the exec request is wedged.
+	tm := time.NewTimer(30 * time.Second)


Definitely concerned about how large this is, but the added prometheus metric (and @corhere's commitment to keep digging) makes me feel good about it. 👍
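
For readers following along, the separate start timeout discussed here can be sketched roughly as below. This is an illustrative Go sketch under assumptions, not the code under review: `startWithTimeout`, `startOutcomeMs`, and the simulated start function are hypothetical, and the limit is a parameter rather than the 30 seconds in the diff.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errStartTimeout = errors.New("timed out starting health check exec")

// startWithTimeout runs start in a goroutine and gives up if it has not
// returned within limit. The probe's own timeout is not involved here.
func startWithTimeout(ctx context.Context, limit time.Duration, start func(context.Context) error) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // tells a wedged start call to stop once we return

	errCh := make(chan error, 1) // buffered so the goroutine never blocks
	go func() { errCh <- start(ctx) }()

	tm := time.NewTimer(limit)
	defer tm.Stop()

	select {
	case err := <-errCh:
		return err
	case <-tm.C:
		return errStartTimeout
	}
}

// startOutcomeMs simulates a start that takes startMs milliseconds and
// reports what startWithTimeout returns for the given limit.
func startOutcomeMs(startMs, limitMs int) string {
	limit := time.Duration(limitMs) * time.Millisecond
	err := startWithTimeout(context.Background(), limit, func(ctx context.Context) error {
		select {
		case <-time.After(time.Duration(startMs) * time.Millisecond):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	})
	switch {
	case err == nil:
		return "started"
	case errors.Is(err, errStartTimeout):
		return "start timeout"
	default:
		return err.Error()
	}
}

func main() {
	fmt.Println(startOutcomeMs(1, 500))   // start completes well within the limit
	fmt.Println(startOutcomeMs(2000, 20)) // start is wedged past the limit
}
```

A generous limit only bounds how long a wedged start request can hang; it does not shorten the time the probe command is given once it is running.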

cpuguy83

One nit, otherwise seems ok as discussed on our call.

Starting an exec can take a significant amount of time while under heavy container operation load. In extreme cases the time to start the process can take upwards of a second, which is a significant fraction of the default health probe timeout (30s). With a shorter timeout, the exec start delay could make the difference between a successful probe and a probe timeout!

Mitigate the impact of excessive exec start latencies by only starting the probe timeout timer after the exec'ed process has started.

Add a metric to sample the latency of starting health-check exec probes.

Signed-off-by: Cory Snider <[email protected]>

thaJeztah

LGTM

thaJeztah · 2022-04-29T13:07:27Z

windows failure is unrelated (TestNetworkDBIslands, known flaky test)

corhere force-pushed the mitigate-slow-health-check-start branch 2 times, most recently from f579e48 to be2f260 on April 14, 2022 20:50

thaJeztah added this to the 22.04.0 milestone Apr 21, 2022

thaJeztah added the status/2-code-review label Apr 21, 2022

rumpl mentioned this pull request Apr 28, 2022

investigate terminal resize issues, and (if possible) set size on creation docker/cli#3554

Closed

corhere force-pushed the mitigate-slow-health-check-start branch from be2f260 to 5654059 on April 28, 2022 17:35

cpuguy83 reviewed Apr 28, 2022


Comment thread daemon/health.go Outdated

tianon approved these changes Apr 28, 2022


cpuguy83 approved these changes Apr 28, 2022


tonistiigi approved these changes Apr 28, 2022


corhere force-pushed the mitigate-slow-health-check-start branch from 5654059 to bdc6473 on April 28, 2022 21:21

tianon approved these changes Apr 28, 2022


thaJeztah added the impact/changelog label Apr 29, 2022

thaJeztah approved these changes Apr 29, 2022


thaJeztah merged commit 545cf19 into moby:master Apr 29, 2022

corhere deleted the mitigate-slow-health-check-start branch April 29, 2022 16:08

pgi-jsanchez mentioned this pull request May 30, 2022

Very slow container start fabiocicerchia/nginx-lua#51

Closed

corhere mentioned this pull request Jul 4, 2022

don't use canceled context to send KILL signal to healthcheck process #43739

Merged
