Skip to content

Mitigate the impact of slow exec starts on health checks#43480

Merged
thaJeztah merged 1 commit into
moby:masterfrom
corhere:mitigate-slow-health-check-start
Apr 29, 2022
Merged

Mitigate the impact of slow exec starts on health checks#43480
thaJeztah merged 1 commit into
moby:masterfrom
corhere:mitigate-slow-health-check-start

Conversation

@corhere
Copy link
Copy Markdown
Contributor

@corhere corhere commented Apr 12, 2022

- What I did
While investigating a report of slow docker execs and failing health checks on heavily-loaded systems, I instrumented dockerd with OpenTelemetry tracing (https://github.com/corhere/moby/tree/otel-trace) and captured traces with tens of health-checked containers running concurrently. I was surprised to find that—on my machine and setup—the containerd.services.tasks.v1.Tasks/Start RPC could take upwards of a full second to complete with ~50 health-checked containers.

for i in {0..50}; do
  docker run -d --name healthstress${i} \
  --health-cmd true --health-interval 1s --health-timeout 5s \
  alpine sleep infinity
done

Screenshot of Jaeger UI showing a trace of a health check

The timeout for a container health check starts the moment dockerd starts setting up the exec for the probe command. Consequently, the time it takes to setup the command execution, including all containerd RPCs, takes away from the time the probe command has to run to completion. I have yet to determine the root cause of the latency, but in the meantime I have attempted to mitigate the worst impacts by not counting it against the probe command's timeout.

- How I did it
I moved the the probe-timeout timer into the probe implementation, and started the timer only once the probe command has started up.

- How to verify it
Test that container health checks and timeouts continue to function as before.

# Should be unhealthy
docker run --health-cmd "sleep 5" --health-interval 1s --health-timeout 2s -d --name healthtest -d alpine sleep infinity
# Should be healthy
docker run --health-cmd true --health-interval 1s --health-timeout 2s -d --name healthtest -d alpine sleep infinity

Configure dockerd to expose the metrics server, make an HTTP request against it (curl localhost:9323/metrics) and check that the newly-added engine_daemon_health_check_start_duration_seconds histogram metric is returned.

- Description for the changelog

  • Health check timeout now applies only to the duration that the health-check command is running. The time it takes to start the command no longer counts against the timeout.

- A picture of a cute animal (not mandatory but encouraged)

@corhere corhere force-pushed the mitigate-slow-health-check-start branch 2 times, most recently from f579e48 to be2f260 Compare April 14, 2022 20:50
@corhere
Copy link
Copy Markdown
Contributor Author

corhere commented Apr 14, 2022

Adding a timeout to starting the exec as discussed in the maintainer call looks to have done the trick: tests are no longer hanging. PTAL @thaJeztah @cpuguy83 @tianon @tonistiigi

@cpuguy83
Copy link
Copy Markdown
Member

cpuguy83 commented Apr 21, 2022

Execs are expected to only have second-precision.
It may happen faster than a second but cannot be guaranteed.
Consequently the timeouts for execs only use second precision as well.

I'm not sure I understand the the purpose of the change given that.
If someone's healthcheck probes are timing out due to the latency involved in starting up the actual exec, most likely the timeout for that needs to be adjusted because it is much too small.
If the system is overloaded and cannot run the exec within the timeout period, then most likely the service is unhealthy, and at the very least we are unable to determine if it is healthy or not.

@thaJeztah thaJeztah added this to the 22.04.0 milestone Apr 21, 2022
@thaJeztah
Copy link
Copy Markdown
Member

gave CI a kick to do another run 👍

@corhere
Copy link
Copy Markdown
Contributor Author

corhere commented Apr 22, 2022

@cpuguy83 The swagger docs for the ContainerCreate API document the Healthcheck field as "[a] test to perform to check that the container is healthy." A health check is supposed to probe the health of a container, independently from the container runtime. Applying the timeout to the entire cmdProbe.run call lumps the runtime's health with that of the container, blaming the container if the runtime takes too long. It would be unreasonable to expect users to take the runtime time into consideration when configuring the timeout as health check configuration can be baked into images, which are runtime-independent, but the amount of time consumed by the runtime could vary significantly by runtime flavor and version.

Execs are expected to only have second-precision.
It may happen faster than a second but cannot be guaranteed.
Consequently the timeouts for execs only use second precision as well.

I'm not sure I understand the the purpose of the change given that.

The purpose of the change is to decouple timeouts for health checks from the time it takes to start an exec'd process precisely because the exec start time is so large and variable.

The Healthcheck.Timeout field is documented to be "[t]he time to wait before considering the check to have hung. It should be 0 or at least 1000000 (1 ms)." The Dockerfile reference for the HEALTHCHECK instruction describes it similarly. There is no mention of the timeout encompassing both the time it takes to run the command inside the container and the time it takes for the command inside the container to run. I would expect a health check timeout to cover exclusively the time it takes for the configured command to run, and I suspect that most users would think the same. The other interpretation leads to absurdities where health checks with short timeouts can time out before the command has even started.

If someone's healthcheck probes are timing out due to the latency involved in starting up the actual exec, most likely the timeout for that needs to be adjusted because it is much too small. If the system is overloaded and cannot run the exec within the timeout period, then most likely the service is unhealthy, and at the very least we are unable to determine if it is healthy or not.

Not necessarily. Consider the case of a container configured with reserved CPU and memory. The dockerd, containerd and shim are running in the host namespace, and therefore the work required to start the health-check exec does not get to leverage the container's reserved resources. The exec'd health check process, however, does get to take advantage of the dedicated resources reserved for the container. The container could keep ticking along, happy and responsive, even while dockerd and containerd are bogged down by an overloaded system. Without this change, that container's health check could fail even if the exec'd command takes exactly the same amount of time to run to completion, simply because it takes dockerd and containerd more time than usual to exec the command.

Comment thread daemon/health.go Outdated
Comment thread daemon/health.go
// start the exec is time that the probe process is not running, and so
// should not count towards the health check's timeout. Apply a separate
// timeout to abort if the exec request is wedged.
tm := time.NewTimer(30 * time.Second)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely concerned about how large this is, but the added prometheus metric (and @corhere's commitment to keep digging) makes me feel good about it. 👍

Copy link
Copy Markdown
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit, otherwise seems ok as discussed on our call.

Starting an exec can take a significant amount of time while under heavy
container operation load. In extreme cases the time to start the process
can take upwards of a second, which is a significant fraction of the
default health probe timeout (30s). With a shorter timeout, the exec
start delay could make the difference between a successful probe and a
probe timeout! Mitigate the impact of excessive exec start latencies by
only starting the probe timeout timer after the exec'ed process has
started.

Add a metric to sample the latency of starting health-check exec probes.

Signed-off-by: Cory Snider <[email protected]>
@corhere corhere force-pushed the mitigate-slow-health-check-start branch from 5654059 to bdc6473 Compare April 28, 2022 21:21
Copy link
Copy Markdown
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thaJeztah
Copy link
Copy Markdown
Member

windows failure is unrelated (TestNetworkDBIslands, known flaky test)

@thaJeztah thaJeztah merged commit 545cf19 into moby:master Apr 29, 2022
@corhere corhere deleted the mitigate-slow-health-check-start branch April 29, 2022 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants