Unable to stop/cancel running jobs after worker terminated by k8s #2040

@pommetjehorlepiep

Dagu version: 2.6.1
5 Workers on k8s deployed using Helm chart (1.0.6)
Backend: SQLite

Problem description

A worker had an active DAG run when it was terminated by K8s.
I expected the workflow to be marked as failed, but it stayed in the running state.
There is no cleanup of jobs that were running on workers that no longer exist.

  • Stop from the UI: does nothing
  • Stopping and restarting Dagu: the run still shows as running in the UI
  • There is no way to remove the stuck run

Scheduler log excerpts

time=2026-04-26T19:12:08.333+10:00 level=INFO msg="Scheduler initialization" dir=/data/dags log-format=text
time=2026-04-26T19:12:08.338+10:00 level=INFO msg="Starting service registry" service=scheduler service-id=dagu-scheduler-55d47645d4-ttgth-1-1777194728 host=dagu-scheduler-55d47645d4-ttgth port=8090 status=inactive
time=2026-04-26T19:12:08.370+10:00 level=INFO msg="Registered with service registry as inactive" service-id=dagu-scheduler-55d47645d4-ttgth-1-1777194728 host=dagu-scheduler-55d47645d4-ttgth port=8090
time=2026-04-26T19:12:08.370+10:00 level=INFO msg="Waiting to acquire scheduler lock"
time=2026-04-26T19:12:08.370+10:00 level=INFO msg="Starting health check server" service=scheduler port=8090
time=2026-04-26T19:12:08.407+10:00 level=INFO msg="Acquired scheduler lock"
time=2026-04-26T19:12:08.409+10:00 level=INFO msg="Updated scheduler status to active"
time=2026-04-26T19:12:08.411+10:00 level=INFO msg="Queue watcher setup complete" dir=/data/queue
time=2026-04-26T19:12:08.412+10:00 level=INFO msg="Loading DAGs" dir=/data/dags
time=2026-04-26T19:12:08.458+10:00 level=INFO msg="Loaded scheduler watermark" lastTick=2026-04-26T19:08:00.000+10:00 dagCount=3
time=2026-04-26T19:12:08.458+10:00 level=INFO msg="Scheduler started"
time=2026-04-26T19:12:08.458+10:00 level=INFO msg="Started zombie detector" interval=45s
time=2026-04-26T19:12:08.458+10:00 level=INFO msg="Started retry scanner" interval=30s retry_failure_window=24h0m0s
time=2026-04-26T19:12:53.462+10:00 level=WARN msg="Proc entry appears stale, waiting for threshold" dag=updater run-id=019d69be-b897-7270-8ee1-1a412fa85413 attempt-id=a5477d queue=updater stale_count=1 threshold=3
time=2026-04-26T19:12:53.462+10:00 level=WARN msg="Proc entry appears stale, waiting for threshold" dag=updater run-id=019d6c88-d504-77ce-9160-609be3eb1821 attempt-id=401fe7 queue=updater stale_count=1 threshold=3
time=2026-04-26T19:12:53.462+10:00 level=WARN msg="Proc entry appears stale, waiting for threshold" dag=updater run-id=019db4d8-cb84-74f2-a967-5559c9e6268b attempt-id=904c8e queue=updater stale_count=1 threshold=3
:::
time=2026-04-26T19:13:38.462+10:00 level=WARN msg="Proc entry appears stale, waiting for threshold" dag=updater run-id=019d69be-b897-7270-8ee1-1a412fa85413 attempt-id=a5477d queue=updater stale_count=2 threshold=3
time=2026-04-26T19:13:38.462+10:00 level=WARN msg="Proc entry appears stale, waiting for threshold" dag=updater run-id=019d6c88-d504-77ce-9160-609be3eb1821 attempt-id=401fe7 queue=updater stale_count=2 threshold=3
time=2026-04-26T19:13:38.462+10:00 level=WARN msg="Proc entry appears stale, waiting for threshold" dag=updater run-id=019db4d8-cb84-74f2-a967-5559c9e6268b attempt-id=904c8e queue=updater stale_count=2 threshold=3
:::
time=2026-04-26T19:14:23.487+10:00 level=ERROR msg="Failed to check zombie status" name=updater run-id=019d69be-b897-7270-8ee1-1a412fa85413 attempt-id=a5477d err="find attempt: no status data"
time=2026-04-26T19:14:23.491+10:00 level=ERROR msg="Failed to check zombie status" name=updater run-id=019d6c88-d504-77ce-9160-609be3eb1821 attempt-id=401fe7 err="find attempt: no status data"
time=2026-04-26T19:14:23.494+10:00 level=ERROR msg="Failed to check zombie status" name=updater run-id=019db4d8-cb84-74f2-a967-5559c9e6268b attempt-id=904c8e err="find attempt: no status data"
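
The sequence in the scheduler log (two "appears stale" warnings 45 s apart, then an error on the third tick) can be sketched roughly as below. This is only an illustration inferred from the log lines, not Dagu's actual implementation: `StaleTracker`, `NoStatusData`, and `lookup_status` are hypothetical names, and the assumption is that the detector counts stale observations up to `threshold=3` and then looks up the attempt status before failing the run.

```python
from dataclasses import dataclass, field


class NoStatusData(Exception):
    """Hypothetical error: no status record exists for the attempt."""


@dataclass
class StaleTracker:
    # threshold=3 matches the value printed in the scheduler log.
    threshold: int = 3
    counts: dict = field(default_factory=dict)

    def observe(self, run_id, lookup_status):
        """Record one stale observation and act once the threshold is reached."""
        n = self.counts.get(run_id, 0) + 1
        self.counts[run_id] = n
        if n < self.threshold:
            return f"stale {n}/{self.threshold}, waiting"
        try:
            lookup_status(run_id)  # assumed step: load the attempt's status
        except NoStatusData as exc:
            # Assumed failure mode: the lookup fails, so the run is never
            # marked as failed and keeps showing as running.
            return f"failed to check zombie status: {exc}"
        self.counts.pop(run_id, None)
        return "marked failed"


def missing_status(run_id):
    # Stand-in for a status store that has no data for this attempt.
    raise NoStatusData("find attempt: no status data")


tracker = StaleTracker()
results = [tracker.observe("019d69be", missing_status) for _ in range(3)]
for line in results:
    print(line)
```

If this reading is right, the third tick (about 135 s after the scheduler starts, given `interval=45s`) is where cleanup would be expected to happen, and the "no status data" error is what prevents it.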

Coordinator logs

time=2026-04-26T19:12:08.400+10:00 level=INFO msg="Coordinator initialization" bind-address=0.0.0.0 advertise-address=dagu-coordinator.default.svc.k8s.cluster port=50055 instance-id=dagu-coordinator-86c87599cf-hx4mf@50055
time=2026-04-26T19:12:08.401+10:00 level=INFO msg="Started zombie detector" interval=45s
time=2026-04-26T19:12:08.401+10:00 level=INFO msg="Starting service registry" service=coordinator service-id=dagu-coordinator-86c87599cf-hx4mf@50055 host=dagu-coordinator.default.svc.k8s.cluster port=50055 status=active
time=2026-04-26T19:12:08.407+10:00 level=INFO msg="Registered with service registry" service-id=dagu-coordinator-86c87599cf-hx4mf@50055 configured-host=dagu-coordinator.default.svc.k8s.cluster port=50055 addr=[::]:50055
time=2026-04-26T19:12:08.407+10:00 level=INFO msg="Starting to serve on coordinator service" addr=[::]:50055
time=2026-04-26T19:12:08.407+10:00 level=INFO msg="Starting health check server" service=coordinator port=8091
time=2026-04-26T19:12:08.610+10:00 level=ERROR msg="Failed to fail stale distributed run" run-id=GL3cKtu4mFbDcbjHd871W1zHw2G8aeVaE7Knn8BF3yNe expected_status=not_started err="dag-run ID not found: GL3cKtu4mFbDcbjHd871W1zHw2G8aeVaE7Knn8BF3yNe"
time=2026-04-26T19:12:08.687+10:00 level=ERROR msg="Failed to fail stale distributed run" run-id=DVaFAU73nSEm5Fgf1jxdUJRkmUHLLjAhdqxbwiuisxuP expected_status=running err="dag-run ID not found: DVaFAU73nSEm5Fgf1jxdUJRkmUHLLjAhdqxbwiuisxuP"
time=2026-04-26T19:12:08.804+10:00 level=ERROR msg="Failed to fail stale distributed run" run-id=DVaFAU73nSEm5Fgf1jxdUJRkmUHLLjAhdqxbwiuisxuP expected_status=running err="dag-run ID not found: DVaFAU73nSEm5Fgf1jxdUJRkmUHLLjAhdqxbwiuisxuP"
time=2026-04-26T19:12:53.977+10:00 level=ERROR msg="Failed to fail stale distributed run" run-id=GL3cKtu4mFbDcbjHd871W1zHw2G8aeVaE7Knn8BF3yNe expected_status=not_started err="dag-run ID not found: GL3cKtu4mFbDcbjHd871W1zHw2G8aeVaE7Knn8BF3yNe"
time=2026-04-26T19:12:54.075+10:00 level=ERROR msg="Failed to fail stale distributed run" run-id=DVaFAU73nSEm5Fgf1jxdUJRkmUHLLjAhdqxbwiuisxuP expected_status=running err="dag-run ID not found: DVaFAU73nSEm5Fgf1jxdUJRkmUHLLjAhdqxbwiuisxuP"
time=2026-04-26T19:12:54.238+10:00 level=ERROR msg="Failed to fail stale distributed run" run-id=DVaFAU73nSEm5Fgf1jxdUJRkmUHLLjAhdqxbwiuisxuP expected_status=running err="dag-run ID not found: DVaFAU73nSEm5Fgf1jxdUJRkmUHLLjAhdqxbwiuisxuP"
time=2026-04-26T19:13:11.035+10:00 level=INFO msg="Looking up DAG attempt for cancellation" dag=updater run-id=019dc715-fc01-7bb0-8f7e-74602a403bb1
time=2026-04-26T19:13:11.038+10:00 level=INFO msg="DAG run cancellation requested successfully" dag=updater run-id=019dc715-fc01-7bb0-8f7e-74602a403bb1
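
To see which run IDs keep producing errors, the key=value log lines can be grouped with a small script. This is a minimal sketch that assumes the text log format shown above (quoted values, no stray apostrophes); `parse_logfmt` and `count_errors_by_run` are hypothetical helper names, and the sample lines are copied from the coordinator log.

```python
import shlex
from collections import Counter


def parse_logfmt(line):
    """Split one key=value log line into a dict; quoted values may contain spaces."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def count_errors_by_run(lines):
    """Count ERROR lines per (message, run-id) pair."""
    counts = Counter()
    for line in lines:
        rec = parse_logfmt(line)
        if rec.get("level") == "ERROR":
            counts[(rec.get("msg"), rec.get("run-id"))] += 1
    return counts


sample = [
    'time=2026-04-26T19:12:08.610+10:00 level=ERROR msg="Failed to fail stale distributed run" run-id=GL3cKtu4mFbDcbjHd871W1zHw2G8aeVaE7Knn8BF3yNe expected_status=not_started err="dag-run ID not found: GL3cKtu4mFbDcbjHd871W1zHw2G8aeVaE7Knn8BF3yNe"',
    'time=2026-04-26T19:12:08.687+10:00 level=ERROR msg="Failed to fail stale distributed run" run-id=DVaFAU73nSEm5Fgf1jxdUJRkmUHLLjAhdqxbwiuisxuP expected_status=running err="dag-run ID not found: DVaFAU73nSEm5Fgf1jxdUJRkmUHLLjAhdqxbwiuisxuP"',
    'time=2026-04-26T19:12:08.804+10:00 level=ERROR msg="Failed to fail stale distributed run" run-id=DVaFAU73nSEm5Fgf1jxdUJRkmUHLLjAhdqxbwiuisxuP expected_status=running err="dag-run ID not found: DVaFAU73nSEm5Fgf1jxdUJRkmUHLLjAhdqxbwiuisxuP"',
]

counts = count_errors_by_run(sample)
for (msg, run_id), n in counts.items():
    print(n, msg, run_id)
```

Run against the full coordinator log, this makes it easy to confirm that the same two run IDs fail on every pass of the zombie detector.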
