fix(a2a): drain stale loopback events (#2302), detect stale PID on startup (#2295) #2318

Merged
bug-ops merged 3 commits into main from 2302-a2a-empty-response on Mar 28, 2026

Conversation

bug-ops (Owner) commented Mar 28, 2026

Summary

  • bug(a2a): message/send returns completed task with no artifacts when LLM response is empty #2302: message/send returned completed with no artifacts on the second and subsequent requests. Root cause: after AgentTaskProcessor consumed LoopbackEvent::FullMessage and broke out of the recv loop, a trailing LoopbackEvent::Flush (emitted by flush_chunks()) remained buffered in output_rx. The next request consumed this stale event first, produced an empty ArtifactChunk, and broke out of the loop without ever seeing the real LLM response. Fix: drain output_rx with try_recv() after every recv-loop exit so that no stale events survive into the next request.

  • bug(a2a): daemon PID file not cleaned on abnormal exit — restart requires manual cleanup #2295: a PID file left by a crashed or SIGKILL'd daemon blocked restarts with WARN: failed to write PID file: File exists. Fix: before writing the PID file, read any existing file and check whether the stored PID refers to a live process (kill -0). If the PID is stale, remove the file and proceed; if the process is alive, bail with an actionable error message.
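
The stale-event bug in #2302 and its fix can be reproduced in a minimal sketch. This is not the project's code: it assumes a plain std::sync::mpsc channel in place of whatever channel type output_rx really is, and LoopbackEvent, process_request, and emit_response here are hypothetical stand-ins written for illustration. The drain flag exists only so that both behaviours can be compared side by side.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for the loopback events named in the summary.
#[derive(Debug)]
enum LoopbackEvent {
    FullMessage(String),
    Flush,
}

/// Simulates the agent side: a full message followed by a trailing Flush.
fn emit_response(output_tx: &Sender<LoopbackEvent>, text: &str) {
    output_tx
        .send(LoopbackEvent::FullMessage(text.to_string()))
        .unwrap();
    output_tx.send(LoopbackEvent::Flush).unwrap();
}

/// Handles one request: reads until the first terminal event.
/// Returns the artifact text, or None when no message was seen.
/// With `drain` set, anything still buffered is discarded afterwards.
fn process_request(output_rx: &Receiver<LoopbackEvent>, drain: bool) -> Option<String> {
    let mut artifact = None;
    while let Ok(event) = output_rx.recv() {
        match event {
            LoopbackEvent::FullMessage(text) => {
                artifact = Some(text);
                break;
            }
            // A Flush ends the request; if it is stale, the artifact stays empty.
            LoopbackEvent::Flush => break,
        }
    }
    if drain {
        // The fix: empty the buffer so nothing leaks into the next request.
        while output_rx.try_recv().is_ok() {}
    }
    artifact
}
```

Without the drain, the second call returns None because it sees the leftover Flush first; with it, every call returns its own message. The sketch assumes the trailing event is already buffered when the drain runs, which matches the root cause described above.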

Changes

  • crates/zeph-core/src/daemon.rs — add is_process_alive(pid: u32) -> bool (Unix: kill -0, non-Unix: always false)
  • src/daemon.rs — drain loop after recv loop (while let Ok(_) = handle.output_rx.try_recv() {}); stale-PID check in run_daemon before write_pid_file
  • src/tests.rs — 5 new tests covering both fixes
  • CHANGELOG.md: [Unreleased] section updated
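
A rough sketch of the stale-PID check follows. It is an assumption-laden illustration, not the code in this PR: only the name and signature of is_process_alive come from the list above, while clear_stale_pid_file, its error strings, and the choice to treat an unparsable PID file as stale are hypothetical. To stay dependency-free, the sketch runs the shell's kill -0 through sh; the real implementation may call kill(2) directly.

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::Path;
use std::process::{Command, Stdio};

/// Unix: probe the process with signal 0, which checks existence only.
#[cfg(unix)]
fn is_process_alive(pid: u32) -> bool {
    // PID 0 and values above i32::MAX do not name a single process.
    if pid == 0 || pid > i32::MAX as u32 {
        return false;
    }
    Command::new("sh")
        .arg("-c")
        .arg(format!("kill -0 {pid}"))
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Non-Unix: always false, as described in the Changes list.
#[cfg(not(unix))]
fn is_process_alive(_pid: u32) -> bool {
    false
}

/// Hypothetical helper: call before writing a new PID file.
/// Ok(()) means the path is free; Err means a live daemon owns it.
fn clear_stale_pid_file(path: &Path) -> Result<(), String> {
    let contents = match fs::read_to_string(path) {
        Ok(contents) => contents,
        Err(err) if err.kind() == ErrorKind::NotFound => return Ok(()),
        Err(err) => return Err(format!("failed to read {}: {err}", path.display())),
    };
    if let Ok(pid) = contents.trim().parse::<u32>() {
        if is_process_alive(pid) {
            return Err(format!(
                "daemon already running with PID {pid}; stop it or remove {}",
                path.display()
            ));
        }
    }
    // Assumption: a dead or unparsable PID means the file is stale.
    fs::remove_file(path)
        .map_err(|err| format!("failed to remove stale {}: {err}", path.display()))
}
```

One caveat: kill -0 also fails when the process exists but belongs to another user, so this sketch would report such a process as dead.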

Test plan

  • cargo +nightly fmt --check — passes
  • cargo clippy --features full --workspace -- -D warnings — passes
  • cargo nextest run --config-file .github/nextest.toml --workspace --features full --lib --bins — 6915/6915 passed
  • New tests: is_process_alive_current_process, is_process_alive_nonexistent_pid, loopback_stale_flush_drained_after_full_message, stale_pid_detection_dead_process, stale_pid_detection_live_process

github-actions bot added labels Mar 28, 2026: documentation (Improvements or additions to documentation), rust (Rust code changes), core (zeph-core crate), bug (Something isn't working), size/M (Medium PR, 51-200 lines)
bug-ops force-pushed the 2302-a2a-empty-response branch from 329a5a1 to 5715fe5 March 28, 2026 07:54
bug-ops enabled auto-merge (squash) March 28, 2026 07:55
bug-ops added 3 commits March 28, 2026 09:08
…ct stale PID on startup (#2295)

- Drain remaining events from `output_rx` with `try_recv()` after breaking
  out of the recv loop in `AgentTaskProcessor::process`; prevents stale
  `Flush` events emitted by `flush_chunks()` from bleeding into the next
  request and producing empty artifacts.
- Add `is_process_alive(pid)` to `zeph-core::daemon`; read and liveness-check
  an existing PID file before writing a new one — remove if stale, bail if
  the process is still alive.
- Add unit tests: `is_process_alive_{current,nonexistent}_pid`,
  `loopback_stale_flush_drained_after_full_message`,
  `stale_pid_detection_{dead,live}_process`.

Closes #2302, closes #2295
bug-ops force-pushed the 2302-a2a-empty-response branch from a2bba5d to 5868674 March 28, 2026 08:08
bug-ops merged commit 2a842c9 into main Mar 28, 2026
25 checks passed
bug-ops deleted the 2302-a2a-empty-response branch March 28, 2026 08:16