[red-knot] Fix CLI hang when a dependent query panics #17631
MichaReiser merged 2 commits into main
Conversation
carljm left a comment
Nice! Thank you so much for tracking this down.
The fix looks good, and also provides a better user experience for panics.
Is my understanding correct that this means new ecosystem panics will once again not show up as a failure in the ecosystem job, but only as a diagnostic output diff? I think that's OK, but it does mean we need to be careful to check the ecosystem output on our diffs.
crates/red_knot_project/src/lib.rs (Outdated)
| "This indicates a bug in Red Knot.", | ||
| )); | ||
|
|
||
| let report_message = "If you could open an issue at https://github.com/astral-sh/ruff/issues/new?title=%5BRed%20Knot%20panic%5D, we'd be very appreciative!"; |
nit: today we use the `[red-knot]` prefix on all our issues and PRs; can we stay consistent with that? e.g. `[red-knot] panic:` maybe?
Of course we'll have to change this again soon :)
Looks like clippy is not happy?
That's correct. But panics come first with Andrew's new sorting, so they should be easy to discover (unless the output gets truncated).
Maybe we could still exit with a code different from 1 (type checking failed) and 2 (some other red knot error), like before? In that case, it would still be easy to detect panics (requiring no changes in mypy_primer).
Oh, you already proposed that change in #17640. In that case, mypy_primer CI runs will still fail in case of panics. Edit: well, not quite; it still uses error code 2.
We could change the error code to something else, but 2 is what Ruff/Red Knot already uses for other errors.
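To make the exit-code discussion concrete, here is a hypothetical sketch of the mapping being debated. Only codes 1 and 2 are established in the thread; the dedicated panic code (3 here) illustrates the proposal and is an assumption, not actual behavior:

```rust
use std::process::ExitCode;

// Hypothetical outcome type for illustration only; not Red Knot's API.
enum CheckOutcome {
    Clean,
    TypeErrorsFound, // exit code 1 (from the thread)
    OtherError,      // exit code 2 (from the thread)
    Panicked,        // a dedicated code would let mypy_primer detect panics
}

fn to_exit_code(outcome: CheckOutcome) -> ExitCode {
    match outcome {
        CheckOutcome::Clean => ExitCode::SUCCESS,
        CheckOutcome::TypeErrorsFound => ExitCode::from(1),
        CheckOutcome::OtherError => ExitCode::from(2),
        // The proposal: a distinct code for panics; 3 is an assumption.
        CheckOutcome::Panicked => ExitCode::from(3),
    }
}
```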
Summary
Fixes #17537
Red Knot's CLI sometimes hangs after encountering a panic. This was due to non-determinsim related to how panics are propagated by rayon and Salsa.
What happens is:

- The main loop listens for Ctrl + C commands and incoming file watcher changes.
- The check runs on a separate thread, scheduled with `rayon::spawn`. Rayon terminates the entire process if the task scheduled with `rayon::spawn` panics (we don't have any other panic handling today).
- The spawned task calls `db.check` right away, which calls `project.check`, but is wrapped in a `salsa::Cancelled::catch`. I assumed that this only catches cancellations because of a pending write (e.g. a file watcher change), but it turns out that it also catches panics if thread A depends on a query running in thread B and thread B panics.
- This brings us to `project.check`, where we use a thread pool and spawn a task for every file. `project.check` uses `rayon::scope`: unlike `rayon::spawn`, it propagates panics instead of terminating the process. However, it propagates an arbitrary panic if two threads panic (which is the case if thread A depends on a query in thread B and B panics). The sketch below illustrates the difference.
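A minimal standalone sketch (not code from this PR) of that propagation difference: a panic inside a `rayon::scope` task unwinds out of the `scope` call and can be caught, whereas a panic in a `rayon::spawn` task aborts the process unless a panic handler is installed on the thread pool.

```rust
fn main() {
    // `rayon::scope` propagates a task's panic to the caller of `scope`,
    // so it can be caught with `std::panic::catch_unwind`.
    let result = std::panic::catch_unwind(|| {
        rayon::scope(|scope| {
            scope.spawn(|_| panic!("panic inside a scoped task"));
        });
    });
    // The panic unwound out of `rayon::scope` instead of aborting.
    assert!(result.is_err());

    // By contrast, uncommenting this aborts the whole process, because
    // `rayon::spawn` has no caller to propagate the panic to:
    // rayon::spawn(|| panic!("panic inside a detached task"));

    // When several scoped tasks panic, `rayon::scope` re-raises an
    // arbitrary one of them; that is the non-determinism behind this bug.
}
```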
What happened is that sometimes rayon propagated the `Cancelled::PropagatedPanic` panic (the panic raised by Salsa in thread A, which depends on the panicking query running in thread B) over the actual panic in thread B, which then got swallowed by our `Cancelled::catch` wrapper inside `db.check`.

We should probably change Salsa to use a different mechanism to handle thread-dependent panics, because this is a logical error, whereas a pending write is not.
This PR fixes the hang by repeating the `Cancelled::catch` inside each spawned thread. This has the advantage that we propagate the real panic from thread B and never the placeholder `Cancelled::PropagatedPanic` from thread A (which doesn't contain any useful debug information). More specifically, we now catch panics inside `check_file_impl` and create a diagnostic for them; a simplified sketch of that shape follows below.

Ideally, we'd capture the backtrace too, but that's a) more complicated and b) mostly empty in production builds.
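To illustrate the shape of the fix, here is a hedged sketch: `Diagnostic` and `check_file` are simplified stand-ins rather than Red Knot's actual types, and the real code goes through `salsa::Cancelled::catch` rather than a bare `catch_unwind`.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Simplified stand-in for a real diagnostic type.
struct Diagnostic(String);

// Simplified stand-in for the per-file check (the real one runs Salsa queries).
fn check_file(_path: &str) -> Vec<Diagnostic> {
    Vec::new()
}

// Each per-file task catches its own panic and turns it into a
// diagnostic, so the real panic from thread B is reported instead of
// Salsa's placeholder `Cancelled::PropagatedPanic` from thread A.
fn check_file_catching_panics(path: &str) -> Vec<Diagnostic> {
    match catch_unwind(AssertUnwindSafe(|| check_file(path))) {
        Ok(diagnostics) => diagnostics,
        Err(payload) => {
            // Recover the panic message if its payload is a `&str` or `String`.
            let message = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "Box<dyn Any>".to_string());
            vec![Diagnostic(format!(
                "Panic while checking `{path}`: {message}. This indicates a bug in Red Knot."
            ))]
        }
    }
}
```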
Test Plan
I ran red knot on hydpy for a couple of minutes and couldn't reproduce the hang anymore (it normally reproduces after 30s or so).