Container stuck in Running state when system clock steps backwards #52153

@anatolebeuzon

Description

After upgrading from Docker Engine 29.1.4 to 29.2.1, short-lived containers can get permanently stuck in Running state even though their process has exited. docker wait hangs forever, docker inspect shows Running: true with a dead PID, and no die event is emitted. docker rm -f is the only way to unblock. This happens when the system clock steps backward (e.g., NTP correction after VM snapshot restore) during a container's lifetime.

Root cause

PR #51925 added shouldIgnoreExitEventWithLock in daemon/monitor.go to filter duplicate TaskExit events. The StateRunning case compares e.ExitedAt (recorded by the containerd shim's time.Now()) against c.State.StartedAt (recorded by dockerd's time.Now()):

case containertypes.StateRunning:
    return !e.ExitedAt.IsZero() && e.ExitedAt.Before(c.State.StartedAt)

If CLOCK_REALTIME steps backward between dockerd capturing startupTime (line 225 of start.go) and the shim capturing exitedAt, a legitimate first-and-only exit event is silently dropped.

(SetRunning stores StartedAt via .UTC() which strips the monotonic clock reading, forcing Before() to use wall-clock comparison.)

Logs from production

# Only one create + start, no restart:
docker events:
  04:02:04  container create c47e906d...
  04:02:04  container start  c47e906d...
 
# The exit event was dropped:
dockerd log:
  "ignoring duplicate container exit event"
  container=c47e906d...  state=running  exitCode=0
  exitedAt="2026-03-06 04:02:04.096"
 
docker inspect:
  StartedAt = 2026-03-06T04:02:15.267   (pre-step, ahead clock)
 
# CLOCK_REALTIME stepped backward during the container's lifetime:
journald:
  "Time jumped backwards, rotating" at 04:01:58

exitedAt (04:02:04, post-step corrected clock) < StartedAt (04:02:15, pre-step ahead clock). There was no previous run, and the filter incorrectly classified a legitimate exit as a stale duplicate.

Reproduce

Any backward step of CLOCK_REALTIME during a container's lifetime triggers this. Our case:

  1. Boot a Firecracker VM, let NTP sync, take a snapshot
  2. Restore the snapshot a day later (guest clock is ahead by several seconds due to imprecise TSC offset)
  3. systemd-timesyncd detects the offset and steps the clock backward via clock_adjtime(ADJ_SETOFFSET)
  4. Any container that was started before the step, and whose process exits after it, has its exit event dropped (exitedAt < startupTime)

Expected behavior

No response

docker version

Client: Docker Engine - Community
 Version:           29.2.1
 API version:       1.53
 Go version:        go1.25.6
 Git commit:        a5c7197
 Built:             Mon Feb  2 17:17:24 2026
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          29.2.1
  API version:      1.53 (minimum version 1.44)
  Go version:       go1.25.6
  Git commit:       6bc6209
  Built:            Mon Feb  2 17:17:24 2026
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v2.2.1
  GitCommit:        dea7da592f5d1d2b7755e3a161be07f43fad8f75
 runc:
  Version:          1.3.4
  GitCommit:        v1.3.4-0-gd6d73eb8
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

n/a

Additional Info

No response

Metadata

Assignees

No one assigned

    Labels

    area/daemon, kind/bug, status/0-triage, version/29.2
