Mitigate the impact of slow exec starts on health checks#43480
Conversation
f579e48 to
be2f260
Compare
|
Adding a timeout to starting the exec as discussed in the maintainer call looks to have done the trick: tests are no longer hanging. PTAL @thaJeztah @cpuguy83 @tianon @tonistiigi |
|
Execs are expected to only have second-precision. I'm not sure I understand the the purpose of the change given that. |
|
gave CI a kick to do another run 👍 |
|
@cpuguy83 The swagger docs for the ContainerCreate API document the Healthcheck field as "[a] test to perform to check that the container is healthy." A health check is supposed to probe the health of a container, independently from the container runtime. Applying the timeout to the entire
The purpose of the change is to decouple timeouts for health checks from the time it takes to start an exec'd process precisely because the exec start time is so large and variable. The
Not necessarily. Consider the case of a container configured with reserved CPU and memory. The dockerd, containerd and shim are running in the host namespace, and therefore the work required to start the health-check exec does not get to leverage the container's reserved resources. The exec'd health check process, however, does get to take advantage of the dedicated resources reserved for the container. The container could keep ticking along, happy and responsive, even while dockerd and containerd are bogged down by an overloaded system. Without this change, that container's health check could fail even if the exec'd command takes exactly the same amount of time to run to completion, simply because it takes dockerd and containerd more time than usual to exec the command. |
be2f260 to
5654059
Compare
| // start the exec is time that the probe process is not running, and so | ||
| // should not count towards the health check's timeout. Apply a separate | ||
| // timeout to abort if the exec request is wedged. | ||
| tm := time.NewTimer(30 * time.Second) |
There was a problem hiding this comment.
Definitely concerned about how large this is, but the added prometheus metric (and @corhere's commitment to keep digging) makes me feel good about it. 👍
cpuguy83
left a comment
There was a problem hiding this comment.
One nit, otherwise seems ok as discussed on our call.
Starting an exec can take a significant amount of time while under heavy container operation load. In extreme cases the time to start the process can take upwards of a second, which is a significant fraction of the default health probe timeout (30s). With a shorter timeout, the exec start delay could make the difference between a successful probe and a probe timeout! Mitigate the impact of excessive exec start latencies by only starting the probe timeout timer after the exec'ed process has started. Add a metric to sample the latency of starting health-check exec probes. Signed-off-by: Cory Snider <[email protected]>
5654059 to
bdc6473
Compare
|
windows failure is unrelated ( |
- What I did
While investigating a report of slow
docker execs and failing health checks on heavily-loaded systems, I instrumenteddockerdwith OpenTelemetry tracing (https://github.com/corhere/moby/tree/otel-trace) and captured traces with tens of health-checked containers running concurrently. I was surprised to find that—on my machine and setup—thecontainerd.services.tasks.v1.Tasks/StartRPC could take upwards of a full second to complete with ~50 health-checked containers.The timeout for a container health check starts the moment
dockerdstarts setting up the exec for the probe command. Consequently, the time it takes to setup the command execution, including all containerd RPCs, takes away from the time the probe command has to run to completion. I have yet to determine the root cause of the latency, but in the meantime I have attempted to mitigate the worst impacts by not counting it against the probe command's timeout.- How I did it
I moved the the probe-timeout timer into the probe implementation, and started the timer only once the probe command has started up.
- How to verify it
Test that container health checks and timeouts continue to function as before.
Configure
dockerdto expose the metrics server, make an HTTP request against it (curl localhost:9323/metrics) and check that the newly-addedengine_daemon_health_check_start_duration_secondshistogram metric is returned.- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)
