fix(cron): default enabled to true for new jobs #8698
emmick4 wants to merge 3 commits into openclaw:main from
Conversation
Adds a new lifecycle state module that tracks gateway process state to disk. This enables crash detection on restart:

- markLifecycleStarting() - called on startup, detects if previous run crashed
- markLifecycleRunning() - called after successful startup
- markLifecycleStopping() - called when shutdown begins
- markLifecycleStopped() - called after clean shutdown

If the previous state was 'running' but we're starting up, we know the gateway crashed. This is the foundation for recovery hooks that can resume interrupted work (GUPP-style persistence).

Part of self-architecture project to enable more autonomous operation.
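For a rough idea of how such a disk-backed lifecycle file can work, here is a minimal sketch. It is not the PR's code: the file path, the state shape, and the return type are assumptions (the real `markLifecycleStarting()` returns a `PreviousShutdown` object rather than a plain classification string).

```ts
// Hypothetical sketch of a disk-backed lifecycle file; paths and field names are assumptions.
import { promises as fs } from "node:fs";
import * as path from "node:path";

type LifecycleStatus = "starting" | "running" | "stopping" | "stopped";

interface LifecycleState {
  status: LifecycleStatus;
  startedAt: number;
  lastHeartbeat: number;
}

// Assumed location; the actual module may store this elsewhere.
const STATE_FILE = path.join(process.env.OPENCLAW_STATE_DIR ?? ".openclaw", "lifecycle.json");

async function writeState(state: LifecycleState): Promise<void> {
  await fs.mkdir(path.dirname(STATE_FILE), { recursive: true });
  await fs.writeFile(STATE_FILE, JSON.stringify(state, null, 2), "utf-8");
}

async function readState(): Promise<LifecycleState | null> {
  try {
    return JSON.parse(await fs.readFile(STATE_FILE, "utf-8")) as LifecycleState;
  } catch {
    return null; // first run, or the file is missing/unreadable
  }
}

// If the previous run left the file at "running", it never reached a clean
// shutdown, so the restart is classified as a crash.
export async function markLifecycleStarting(): Promise<"crash" | "clean" | "unknown"> {
  const previous = await readState();
  const now = Date.now();
  await writeState({ status: "starting", startedAt: now, lastHeartbeat: now });
  if (!previous) return "unknown";
  return previous.status === "running" ? "crash" : "clean";
}

export async function markLifecycleRunning(): Promise<void> {
  const previous = await readState();
  const now = Date.now();
  await writeState({ status: "running", startedAt: previous?.startedAt ?? now, lastHeartbeat: now });
}
```

The key property is that the file is rewritten on every lifecycle transition, so a file stuck at "running" on the next start implies the process died without shutting down.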
Implements a checkpoint system that allows agents to save progress on long-running tasks. Checkpoints survive crashes and can be used to resume interrupted work.

New module: src/agents/checkpoint.ts

- saveCheckpoint() - Save task progress
- loadCheckpoint() - Load checkpoint by session ID
- clearCheckpoint() - Remove checkpoint when work complete
- listActiveCheckpoints() - Find all active checkpoints
- updateCheckpoint() - Create or update checkpoint
- findStaleCheckpoints() - Find checkpoints that may need attention

Integration with lifecycle-state.ts:

- On crash detection, lists any interrupted checkpoints
- Logs interrupted work on startup for visibility

This builds on the earlier lifecycle-state module to provide a complete crash recovery foundation for OpenClaw agents.
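A hypothetical usage sketch showing how an agent might lean on these helpers; the `declare` stubs below stand in for the real exports of src/agents/checkpoint.ts, and any Checkpoint field beyond `sessionId`/`progress` is a guess.

```ts
// Stubs with guessed signatures; the real exports live in src/agents/checkpoint.ts.
interface Checkpoint {
  sessionId: string;
  progress: number;  // 0-100, as used by the interrupted-work filter
  updatedAt: number; // assumed field
}
declare function saveCheckpoint(checkpoint: Checkpoint): Promise<void>;
declare function loadCheckpoint(sessionId: string): Promise<Checkpoint | null>;
declare function clearCheckpoint(sessionId: string): Promise<void>;

export async function runWithCheckpoints(
  sessionId: string,
  items: string[],
  work: (item: string) => Promise<void>,
): Promise<void> {
  // Resume roughly where a previous run left off, if a checkpoint exists.
  const existing = await loadCheckpoint(sessionId);
  let i = existing ? Math.floor((existing.progress / 100) * items.length) : 0;

  for (; i < items.length; i++) {
    await work(items[i]);
    // Persist progress after each unit of work so a crash leaves a resumable trail.
    await saveCheckpoint({
      sessionId,
      progress: Math.round(((i + 1) / items.length) * 100),
      updatedAt: Date.now(),
    });
  }

  // Done: clear the checkpoint so it is not reported as interrupted on the next start.
  await clearCheckpoint(sessionId);
}
```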
Jobs created via the cron.add tool/API were silently disabled because `enabled` was not set by default. The scheduler treated `undefined` as disabled, so jobs never ran. This fix adds `enabled: true` to the defaults applied during job creation, matching the documented behavior that jobs are enabled by default. Fixes openclaw#6988
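A minimal sketch of the defaulting behavior this describes, assuming a simplified job shape and a `normalizeCronJob`-style entry point (both assumptions; the actual normalize.ts may differ):

```ts
// Simplified shapes for illustration; the real cron job type has more fields.
interface CronJobInput {
  name: string;
  schedule: string;  // cron expression
  enabled?: boolean; // optional on input
}

interface CronJob extends Omit<CronJobInput, "enabled"> {
  enabled: boolean; // always concrete after normalization
}

export function normalizeCronJob(input: CronJobInput): CronJob {
  return {
    ...input,
    // Default new jobs to enabled so the scheduler actually runs them;
    // an explicit `enabled: false` passes through unchanged.
    enabled: input.enabled ?? true,
  };
}
```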
src/gateway/server-close.ts, lines 35-37:

    // Mark lifecycle as stopping for crash detection
    await markLifecycleStopping();
markLifecycleStopping() is awaited before any shutdown work, but markLifecycleStopped() is only reached if all later cleanup steps succeed. If any awaited call throws (e.g. tailscaleCleanup(), stopChannel(), or httpServer.close()), the lifecycle file will stay at stopping and the next startup won’t be classified as a crash (it’ll be unknown). Consider wrapping the shutdown body in try { ... } finally { await markLifecycleStopped(); } (or otherwise ensuring stopped is recorded on clean exits even when some teardown steps fail).
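A concise sketch of the suggested shape, with `declare` stubs standing in for whatever server-close.ts actually imports (the signatures are guesses):

```ts
import type { Server } from "node:http";

// Assumed to be provided by the surrounding module.
declare function markLifecycleStopping(): Promise<void>;
declare function markLifecycleStopped(): Promise<void>;
declare function tailscaleCleanup(): Promise<void>;
declare function stopChannel(): Promise<void>;
declare const httpServer: Server;

export async function closeGateway(): Promise<void> {
  // Mark lifecycle as stopping for crash detection.
  await markLifecycleStopping();
  try {
    await tailscaleCleanup();
    await stopChannel();
    await new Promise<void>((resolve, reject) =>
      httpServer.close((err) => (err ? reject(err) : resolve())),
    );
  } finally {
    // Record "stopped" even if a teardown step above threw, so the next
    // startup is not left looking at a stale "stopping" state.
    await markLifecycleStopped();
  }
}
```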
src/infra/lifecycle-state.ts, lines 56-70:

    export async function markLifecycleStarting(): Promise<PreviousShutdown | null> {
      const previous = await loadLifecycleState();
      let previousShutdown: PreviousShutdown | null = null;

      if (previous) {
        const baseInfo = {
          time: previous.lastHeartbeat,
          uptime: previous.lastHeartbeat - previous.startedAt,
        };

        if (previous.status === "running") {
          // Previous state was "running" but we're starting up = crash
          // Check for interrupted checkpoints
          const checkpoints = await listActiveCheckpoints();
          const interrupted = checkpoints.filter((cp) => cp.progress < 100);
Crash detection here hinges on previous.status === "running", but nothing ever calls updateLifecycleHeartbeat() in this PR. Since markLifecycleRunning() sets lastHeartbeat only once, the reported crash time/uptime will be misleading for long-running gateways (it will look like it crashed shortly after start). Either wire updateLifecycleHeartbeat() into an existing periodic timer/heartbeat event, or remove the lastHeartbeat/uptime reporting until it’s actually updated regularly.
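One way to address this would be a small periodic timer, sketched here under the assumption that `updateLifecycleHeartbeat()` is an async write whose failures can be swallowed (the interval and signature are assumptions):

```ts
// Stub with a guessed signature; the real function lives in lifecycle-state.ts.
declare function updateLifecycleHeartbeat(): Promise<void>;

const HEARTBEAT_INTERVAL_MS = 30_000; // assumed interval

export function startLifecycleHeartbeat(): NodeJS.Timeout {
  const timer = setInterval(() => {
    // Best-effort: a failed heartbeat write should never take down the gateway.
    void updateLifecycleHeartbeat().catch(() => {});
  }, HEARTBEAT_INTERVAL_MS);
  timer.unref(); // don't keep the process alive just for heartbeats
  return timer;
}
```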
src/agents/checkpoint.ts, lines 28-34:

    export async function saveCheckpoint(checkpoint: Checkpoint): Promise<void> {
      await fs.mkdir(CHECKPOINT_DIR, { recursive: true });
      const filePath = path.join(CHECKPOINT_DIR, `${checkpoint.sessionId}.json`);
      const tmp = `${filePath}.${process.pid}.tmp`;
      await fs.writeFile(tmp, JSON.stringify(checkpoint, null, 2), "utf-8");
      await fs.rename(tmp, filePath);
    }
saveCheckpoint() writes to a temp file and renames it, but if the process crashes between writeFile and rename, listActiveCheckpoints() currently ignores *.tmp files, so those checkpoints become invisible (and markLifecycleStarting() won’t report them as interrupted). If you want crash recovery to be reliable, consider either including .tmp files in listActiveCheckpoints() (best-effort parse) or cleaning them up / using a more robust atomic-write pattern on platforms where rename may fail (e.g. cross-device).
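A sketch of the first suggestion (best-effort parsing of leftover *.tmp files); `CHECKPOINT_DIR` and the `Checkpoint` shape are assumed to match the module:

```ts
import { promises as fs } from "node:fs";
import * as path from "node:path";

// Assumed to match src/agents/checkpoint.ts.
declare const CHECKPOINT_DIR: string;
interface Checkpoint {
  sessionId: string;
  progress: number;
}

export async function listActiveCheckpoints(): Promise<Checkpoint[]> {
  let files: string[];
  try {
    files = await fs.readdir(CHECKPOINT_DIR);
  } catch {
    return []; // no checkpoint directory yet
  }
  const checkpoints: Checkpoint[] = [];
  for (const file of files) {
    if (!file.endsWith(".json") && !file.endsWith(".tmp")) continue;
    try {
      const raw = await fs.readFile(path.join(CHECKPOINT_DIR, file), "utf-8");
      checkpoints.push(JSON.parse(raw) as Checkpoint);
    } catch {
      // A half-written .tmp file may not parse; skip it rather than fail the scan.
    }
  }
  return checkpoints;
}
```

If both a .tmp file and its final .json survive a crash, callers may also want to dedupe by sessionId, keeping whichever entry parses and looks newest.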
bfc1ccb to f92900f
Problem

Jobs created via the `cron.add` tool/API were silently disabled because `enabled` was not set by default. The scheduler treated `undefined` as disabled, so jobs never ran.

Solution

Add `enabled: true` to the defaults applied during job creation in `normalize.ts`, matching the documented behavior that jobs are enabled by default.

Changes

- src/cron/normalize.ts: add `enabled: true` to the defaults applied during job creation
- src/cron/normalize.test.ts: tests that the default is applied and that an explicit `enabled: false` is preserved (see the sketch below)

Fixes #6988
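A sketch of what the added tests might look like, assuming a vitest setup and the `normalizeCronJob` shape from the earlier sketch (the real test file may use different names):

```ts
import { describe, expect, it } from "vitest";
import { normalizeCronJob } from "./normalize"; // hypothetical export name

describe("cron job creation defaults", () => {
  it("defaults enabled to true when omitted", () => {
    const job = normalizeCronJob({ name: "nightly-report", schedule: "0 2 * * *" });
    expect(job.enabled).toBe(true);
  });

  it("preserves an explicit enabled: false", () => {
    const job = normalizeCronJob({ name: "paused-job", schedule: "0 2 * * *", enabled: false });
    expect(job.enabled).toBe(false);
  });
});
```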
Greptile Overview
Greptile Summary
This PR fixes cron job creation defaults by setting `enabled: true` when `enabled` is omitted in `src/cron/normalize.ts`, with new tests ensuring the default is applied and explicit `enabled: false` is preserved.

In addition, it introduces a new disk-backed gateway lifecycle/crash detection mechanism (`src/infra/lifecycle-state.ts`) plus a checkpoint persistence layer (`src/agents/checkpoint.ts`), and wires lifecycle transitions into gateway startup/shutdown (`src/gateway/server.impl.ts`, `src/gateway/server-close.ts`). On startup, it logs a warning if the previous run appears to have crashed and surfaces any incomplete checkpoints.

Main things to double-check are the shutdown path reliably marking `stopped` even when teardown steps fail, and that lifecycle heartbeats are actually updated periodically so crash time/uptime reporting is accurate.

Confidence Score: 3/5

… `updateLifecycleHeartbeat()` call, and failures during shutdown can prevent marking `stopped`, affecting future crash classification. (2/5)