Handle cancelled on_complete for host subtasks #12632

Closed
jellevandenhooff wants to merge 2 commits into bytecodealliance:main from jellevandenhooff:fix-on-complete-after-cancel

Conversation

@jellevandenhooff
Contributor

While working on a program that issues many outgoing DNS requests, some of which get cancelled, I ran into a race in subtask management:

crates/wasmtime/src/runtime/component/concurrent.rs:5108:37:
     called `Result::unwrap()` on an `Err` value: NotPresent

The included reproducing test fails before the fix and works after.

Code and commit message by Claude. It looks sane to me, but I am not sure about eating the error in the branch; please consider if this makes sense. If the commit message is too long I am happy to rewrite it human-style.

Per the component-model spec (CanonicalABI.md), `subtask.cancel` on a
subtask that has already resolved collects the pending event and
returns `RETURNED`.  A subsequent `subtask.drop` is valid because
`resolve_delivered` is true at that point.

In the implementation, when an async-lowered host function's future
completes, a `WorkerFunction` (on_complete) is scheduled via the
high-priority work queue to lower the result and deliver the
`Returned` event.  Between the future completing and on_complete
running, another work item in the same batch (e.g. a `ResumeFiber`
delivering a different subtask's event) may allow the guest to
`subtask.cancel` + `subtask.drop` this task, removing it from the
table.  When on_complete then runs, it tries to look up the deleted
task's scope via `call_context`, causing a `NotPresent` panic in
`validate_scope_exit`.

Guard on_complete by checking whether the task still exists in the
table and whether its `join_handle` is still present (it is taken by
`subtask.cancel`).  If either check fails, the guest has already
observed the resolution and `cancel_scope` has released any
outstanding borrows, so on_complete is a no-op.

A new test (`cancel_completed_host_task_does_not_crash`) exercises
the race deterministically: two async host functions that yield once
then complete; the guest waits for the first, then cancels the second
whose on_complete is still queued.
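
A minimal sketch of the guard described in this commit message, using a toy table rather than Wasmtime's real types. `Table`, `HostTask`, `Completion`, and `demo_guard` are all hypothetical names introduced for illustration; only the two checks (task still present, `join_handle` still present) come from the description above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a host subtask entry; not Wasmtime's real type.
struct HostTask {
    /// `Some` until either the completion or a cancel takes it.
    join_handle: Option<u32>,
}

/// Hypothetical result of running a queued completion callback.
#[derive(Debug, PartialEq)]
enum Completion {
    Delivered,
    Skipped,
}

/// Toy task table keyed by guest-visible handle.
struct Table {
    tasks: HashMap<u32, HostTask>,
}

impl Table {
    /// Models `subtask.cancel` taking the join handle.
    fn cancel(&mut self, handle: u32) {
        if let Some(task) = self.tasks.get_mut(&handle) {
            task.join_handle = None;
        }
    }

    /// Models `subtask.drop` removing the entry from the table.
    fn drop_task(&mut self, handle: u32) {
        self.tasks.remove(&handle);
    }

    /// The guarded completion: bail out instead of unwrapping a failed lookup.
    fn on_complete(&mut self, handle: u32) -> Completion {
        // Check 1: the task may already have been cancelled and dropped.
        let Some(task) = self.tasks.get_mut(&handle) else {
            return Completion::Skipped;
        };
        // Check 2: the task exists, but cancel already took the join handle.
        if task.join_handle.take().is_none() {
            return Completion::Skipped;
        }
        Completion::Delivered
    }
}

/// Runs the three cases: untouched, cancelled, cancelled and dropped.
fn demo_guard() -> (Completion, Completion, Completion) {
    let mut table = Table { tasks: HashMap::new() };
    for handle in 1..=3 {
        table.tasks.insert(handle, HostTask { join_handle: Some(handle) });
    }
    table.cancel(2);
    table.cancel(3);
    table.drop_task(3);
    (table.on_complete(1), table.on_complete(2), table.on_complete(3))
}
```

In this model only the untouched task delivers its result; the other two completions return early without touching the table.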
@jellevandenhooff jellevandenhooff requested a review from a team as a code owner February 21, 2026 01:43
@jellevandenhooff jellevandenhooff requested review from pchickey and removed request for a team February 21, 2026 01:43
@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Feb 21, 2026
@jellevandenhooff jellevandenhooff force-pushed the fix-on-complete-after-cancel branch from fd4e5f8 to 25f1380 Compare February 21, 2026 06:06
@jellevandenhooff
Contributor Author

Okay, I then ran into a similar, related issue where a cancelled task's on_complete handler was able to steal a replacement task's on_complete. The test is kind of gnarly and I am concerned it's not deterministic. What it tries to show is that:

  • task A would run
  • task A gets cancelled
  • task B would run, with task A's original handle
  • task A's on_complete would run... and signal success to what is now task B's handle
  • task B's on_complete would never run
  • now the guest is very confused because B's results are garbage

The epoch fix seems clean, but I am sure there might be other approaches. Without the fixes in either commit, both tests fail.

@jellevandenhooff jellevandenhooff changed the title from "Skip on_complete for already-cancelled host subtasks" to "Handle cancelled on_complete for host subtasks" Feb 21, 2026
The previous guard checked `join_handle.is_none()` or table lookup
failure, but this doesn't catch the case where a cancelled+dropped
host task's table slot is reused by a new host task before the stale
on_complete runs. The new entry has `join_handle = Some`, so the
guard passes and the stale closure steals the new task's join_handle,
writes to the wrong retptr, and fires a spurious Returned event.

Add a monotonic `epoch` field to HostTask, incremented for each new
host task. The on_complete closure captures the epoch at creation
time and compares it against the current occupant's epoch. If they
differ, the slot was reused and the closure bails out.

Add a regression test that deterministically reproduces the slot
reuse scenario using FuturesUnordered LIFO polling order.
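
The slot-reuse problem and the epoch check can be sketched with a toy slot table. This is an illustration under assumptions, not the code in this PR: `Table`, `HostTask`, `insert`, `remove`, and `demo_slot_reuse` are hypothetical, and the table is assumed to hand out the first free index, which is what makes the stale handle alias the new task.

```rust
/// Hypothetical host task entry carrying the epoch described above.
struct HostTask {
    epoch: u64,
    join_handle: Option<&'static str>,
}

/// Toy table that reuses freed slots (an assumption of this sketch).
struct Table {
    slots: Vec<Option<HostTask>>,
    next_epoch: u64,
}

impl Table {
    fn new() -> Self {
        Table { slots: Vec::new(), next_epoch: 0 }
    }

    /// Inserts a task and returns its slot index plus the epoch that the
    /// completion closure would capture at creation time.
    fn insert(&mut self, name: &'static str) -> (usize, u64) {
        let epoch = self.next_epoch;
        self.next_epoch += 1; // monotonic: never reused, unlike slot indices
        let task = HostTask { epoch, join_handle: Some(name) };
        let idx = match self.slots.iter().position(|slot| slot.is_none()) {
            Some(free) => {
                self.slots[free] = Some(task);
                free
            }
            None => {
                self.slots.push(Some(task));
                self.slots.len() - 1
            }
        };
        (idx, epoch)
    }

    /// Models cancel followed by drop: the slot becomes free again.
    fn remove(&mut self, idx: usize) {
        self.slots[idx] = None;
    }

    /// Completion guarded by the epoch: returns the join handle it took,
    /// or `None` if the slot is empty or now belongs to a different task.
    fn on_complete(&mut self, idx: usize, captured_epoch: u64) -> Option<&'static str> {
        let task = self.slots.get_mut(idx)?.as_mut()?;
        if task.epoch != captured_epoch {
            return None; // slot was reused; this closure is stale
        }
        task.join_handle.take()
    }
}

/// Task A is removed, task B reuses its slot, then A's stale completion runs.
fn demo_slot_reuse() -> (bool, Option<&'static str>, Option<&'static str>) {
    let mut table = Table::new();
    let (idx_a, epoch_a) = table.insert("A");
    table.remove(idx_a);
    let (idx_b, epoch_b) = table.insert("B");
    let reused = idx_a == idx_b;
    let stale = table.on_complete(idx_a, epoch_a);
    let fresh = table.on_complete(idx_b, epoch_b);
    (reused, stale, fresh)
}
```

Without the epoch comparison, the stale call would take B's `join_handle` and B's own completion would find nothing left, which matches the symptom described in the comment above.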
@jellevandenhooff jellevandenhooff force-pushed the fix-on-complete-after-cancel branch from 25f1380 to 42ab9bf Compare February 21, 2026 06:23
@alexcrichton
Member

Thanks for the PR (and tests!)

Upon reading this it's actually related to what I was thinking of when I was reviewing the internals of #12631. I think the fix I have in mind there will resolve these issues too. So, like that PR, I'll work a bit locally and post back here with results. Many thanks for the report & tests & fix!

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 23, 2026
This commit refactors some of the internals of `subtask.cancel` with
respect to host subtasks. Notably a few panics and semantic bugs are
fixed here. The main bug was that host subtasks could be aborted but
their completion might have still been queued up which would produce the
result somewhere or assert that the task exists. Cancellation is changed
to use `wait_for_event` to ensure that this completion is executed
before `subtask.cancel` returns. This helps keep host subtasks looking
more similar to guest subtasks in that respect.

Co-authored-by: Jelle van den Hooff <[email protected]>

Closes bytecodealliance#12631
Closes bytecodealliance#12632
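
The ordering this commit message describes (a queued completion runs before `subtask.cancel` returns) can be illustrated with a toy model. Everything here is hypothetical: `Runtime`, `Status`, and `demo_cancel` are invented for the sketch, and the completion is run directly inside `cancel`, whereas the commit message says Wasmtime uses `wait_for_event`.

```rust
use std::collections::VecDeque;

/// Hypothetical status reported back to the guest by a cancel.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Status {
    Returned,
    Cancelled,
}

/// Toy runtime; none of these fields mirror Wasmtime's real state.
struct Runtime {
    /// Handles whose host future finished but whose completion is still queued.
    queue: VecDeque<u32>,
    /// Handles currently present in the task table.
    live: Vec<u32>,
    /// Handles whose result was delivered to the guest.
    delivered: Vec<u32>,
}

impl Runtime {
    /// A completion requires the task to still exist, like the original panic site.
    fn run_completion(&mut self, handle: u32) {
        assert!(self.live.contains(&handle), "completion ran for a missing task");
        self.delivered.push(handle);
    }

    /// Cancel first flushes any queued completion for this handle, so nothing
    /// stale is left in the queue once the guest is free to drop the task.
    fn cancel(&mut self, handle: u32) -> Status {
        if let Some(pos) = self.queue.iter().position(|h| *h == handle) {
            self.queue.remove(pos);
            self.run_completion(handle);
            Status::Returned
        } else {
            Status::Cancelled
        }
    }

    /// Models `subtask.drop`.
    fn drop_task(&mut self, handle: u32) {
        self.live.retain(|h| *h != handle);
    }

    /// Runs whatever completions remain in the queue.
    fn drain(&mut self) {
        while let Some(handle) = self.queue.pop_front() {
            self.run_completion(handle);
        }
    }
}

/// Task 1 has a queued completion when cancelled; task 2 does not.
fn demo_cancel() -> (Status, Status, Vec<u32>) {
    let mut rt = Runtime {
        queue: VecDeque::from(vec![1]),
        live: vec![1, 2],
        delivered: Vec::new(),
    };
    let first = rt.cancel(1);
    rt.drop_task(1);
    let second = rt.cancel(2);
    rt.drop_task(2);
    rt.drain(); // nothing stale remains, so the assert cannot fire
    (first, second, rt.delivered.clone())
}
```

Compared with guarding the completion after the fact, this ordering means no completion can run against a dropped or reused table entry in the first place.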
@alexcrichton
Member

Ok I've pushed up a "more official fix" to #12640 which includes the tests here and should resolve them. Thanks again @jellevandenhooff!

github-merge-queue bot pushed a commit that referenced this pull request Feb 23, 2026
This commit refactors some of the internals of `subtask.cancel` with
respect to host subtasks. Notably a few panics and semantic bugs are
fixed here. The main bug was that host subtasks could be aborted but
their completion might have still been queued up which would produce the
result somewhere or assert that the task exists. Cancellation is changed
to use `wait_for_event` to ensure that this completion is executed
before `subtask.cancel` returns. This helps keep host subtasks looking
more similar to guest subtasks in that respect.

Closes #12631
Closes #12632

Co-authored-by: Jelle van den Hooff <[email protected]>