Skip to content

Fix concurrency/exception bugs in asyncFetchPackages#7929

Merged
Mikolaj merged 11 commits intohaskell:masterfrom
robx:fix-interrupt
Feb 23, 2022
Merged

Fix concurrency/exception bugs in asyncFetchPackages#7929
Mikolaj merged 11 commits intohaskell:masterfrom
robx:fix-interrupt

Conversation

@robx
Copy link
Collaborator

@robx robx commented Jan 25, 2022

This fixes Ctrl-C not interrupting cabal properly while it is downloading. I believe
that might be #6322, although it would be good to verify that.

There are two three concrete bugs:

  • try catches asynchronous exceptions, preventing withAsync from interrupting it when its body action finishes prematurely.
  • the use of withAsync is not safe since we don't wait for the async and hence lose exceptions it might throw
  • the use of regular bracket in withTempFileName can "convert" AsyncCancelled into a regular IO exception

The first bug manifests during cabal build when hitting Ctrl-C during
download, e.g.:

  1. Ctrl-C interrupts a curl process
  2. This interrupt is converted into a UserInterrupt exception
    via the process package's delegate_ctlc functionality.
  3. UserInterrupt bubbles up through fetchPackage, is caught by
    the try (fine) and written to the result mvar.
    Meanwhile, fetchPackage continues its loop, starting to
    fetch the next package.
  4. Some rebuildTarget is waiting for the interrupted download;
    waitAsyncFetchPackage rethrows UserInterrupt in that thread.
  5. UserInterrupt bubbles up through collectJob, execute, the
    body action.
  6. withAsync's finalizer cancels the fetchPackages async.
  7. The fetchPackages async receives the AsyncCancelled exception
    while fetching the next package (see 3. above). This interrupts
    the download action (it shouldn't matter whether we happen to
    be running curl again or not), and AsyncCancelled bubbles up
    through fetchPackage, is caught by the try (not fine!) and
    written to the mvar. But no-one is reading that mvar anymore
    because the build job already aborted (step 5), and that
    AsyncCancelled exception is lost.
  8. fetchPackages keeps doing its thing, exiting eventually.

To fix this, use try from safe-exceptions, which doesn't catch
asynchronous exceptions. Note that this change affects both
instances of try in this story: UserInterrupt is also an asynchronous
exception, so after the change the UserInterrupt takes a different path
from step 3.

This relates to the second bug, uncovered during testing of the fix for
the first bug: UserInterrupt can now cause the async part of
withAsync to crash. That one is fixed by using concurrently instead
of withAsync.

Finally, the third bug showed on Windows CI during testing, where temporary
file cleanup trigged by async cancellation fails with an IO exception, which is
then treated as a regular download failure, while the attempt to cancel the
async is ignored.

  • To verify manually, call cabal build for a package with several undownloaded dependencies, and hit Ctrl-C when it gets started. Prior to these changes, cabal doesn't exit until it's done downloading, afterwards it does. I've checked this on macOS.
  • The PR includes unit tests for asyncFetchPackages with a faked HTTP transport, simulating several failure scenarios where the asyncFetchPackages might not exit promptly or even deadlock. Some of these fail before the change, and now pass.

Please include the following checklist in your PR:

@robx

This comment has been minimized.

@robx robx changed the title Fix lost asynchronous exception in asyncFetchPackages (fixes #6322) Fix lost asynchronous exception in asyncFetchPackages Jan 25, 2022
@robx robx changed the title Fix lost asynchronous exception in asyncFetchPackages Fix concurrency/exception bugs in asyncFetchPackages Jan 26, 2022
@robx
Copy link
Collaborator Author

robx commented Jan 29, 2022

I'm putting some finishing touches on these commits, then I think this should be ready for review.

Need some help with the bootstrap update though, since I don't have a Linux environment ready.
(Incidentally, should those .plan.json files be checked in? They seem transient and contain info about the committer's local dev environment.)

@Mikolaj
Copy link
Member

Mikolaj commented Jan 29, 2022

Great job. Re bootstrap files, I cargo-culted them recently (IIRC you just run a script), but I have no idea how they work. Please ask about them on #hackage IRC/Matrix --- I'd also like to learn.

@robx robx force-pushed the fix-interrupt branch 2 times, most recently from 0bf9d04 to 7b8b934 Compare February 2, 2022 22:35
@robx robx mentioned this pull request Feb 2, 2022
3 tasks
@robx
Copy link
Collaborator Author

robx commented Feb 22, 2022

Any idea who would be a good person to ask to review this? (And is there some way I can make it easier to review?)

@jneira
Copy link
Member

jneira commented Feb 22, 2022

I suppose it should be tested in windows, in addition to Linux, will try if I have time but don't wait for me

@Mikolaj
Copy link
Member

Mikolaj commented Feb 22, 2022

We have waited so long, we can wait a little longer.

@jneira
Copy link
Member

jneira commented Feb 22, 2022

i've tried in windows and it has responded pretty well to ctrl+c, i had only to press it more than one time:

Downloading  fourmolu-0.3.0.0   <--------------------------------------------------- first ctrl+c
Warning: ghc-lib-parser.cabal:149:5: The field "autogen-modules" is available
only since the Cabal specification version 2.0. This field will be ignored.
Warning: ghc-lib-parser.cabal:1:22: Packages with 'cabal-version: 1.12' or
later should specify a specific version of the Cabal spec of the form <------------- again here i think
'cabal-version: x.y'. Use 'cabal-version: 1.22'.
Configuring library for ghc-lib-parser-8.10.7.20210828..
Downloaded   fourmolu-0.3.0.0
Downloading  ormolu-0.1.4.1
Downloaded   ormolu-0.1.4.1
Downloading  stylish-haskell-0.13.0.0
AsyncCancelled <----------------------------------------------------------------------- here it made effect

But it is infinitely better than now for sure, great job

Copy link
Member

@jneira jneira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

@robx
Copy link
Collaborator Author

robx commented Feb 22, 2022

Thanks for testing! Unfortunately I don't have access to Windows, would love to dig in to what's happening there. I hope that using withCreateProcess might help there, which I'm planning to do anyway (compare #7921).

@robx robx added the merge me Tell Mergify Bot to merge label Feb 22, 2022
@jneira
Copy link
Member

jneira commented Feb 22, 2022

@Mergifyio rebase master

@mergify
Copy link
Contributor

mergify bot commented Feb 22, 2022

rebase master

✅ Branch has been successfully rebased

@hasufell
Copy link
Member

amazing

@Mikolaj
Copy link
Member

Mikolaj commented Feb 23, 2022

CI jobs are stuck, so let me merge manually.

@Mikolaj
Copy link
Member

Mikolaj commented Feb 23, 2022

@Mergifyio rebase

- Sort modules and tests alphabetically (they were maddeningly close to alphabetic)
- Consistently label unit test groups at top-level
robx added 10 commits February 23, 2022 08:19
This primarily trigger concurrency bugs that are fixed in the
follow-up commits.
This is a step towards fixing interrupt handling during
asynchronous download.

The concrete bug fixed by this changes is  that `try` catches
asynchronous exceptions, preventing `withAsync` from interrupting
it when its `body` action finishes prematurely.

This manifests during `cabal build` when hitting Ctrl-C during
download, e.g.:

1. Ctrl-C interrupts a curl process
2. This interrupt is converted into a UserInterrupt exception
   via the process package's delegate_ctlc functionality.
3. UserInterrupt bubbles up through `fetchPackage`, is caught by
   the `try` (fine) and writen to the result mvar.
   Meanwhile, `fetchPackage` continues its loop, starting to
   fetch the next package.
4. Some `rebuildTarget` is waiting for the interrupted download;
  `waitAsyncFetchPackage` rethrows UserInterrupt in that thread.
5. UserInterrupt bubbles up through `collectJob`, `execute`, the
   `body` action.
6. `withAsync`'s finalizer cancels the `fetchPackages` async.
7. The `fetchPackages` async receives the AsyncCancelled exception
   while fetching the next package (see 3. above). This interrupts
   the download action (it shouldn't matter whether we happen to
   be running curl again or not), and AsyncCancelled bubbles up
   through `fetchPackage`, is caught by the `try` (not fine!) and
   written to the mvar. But no-one is reading that mvar anymore
   because the build job already aborted (step 5), and that
   AsyncCancelled exception is lost.
8. `fetchPackages` keeps doing its thing, exiting eventually.

Note that this change affects both instances of `try` in this story:
`UserInterrupt` is also an asynchronous exception, so after the
change the `UserInterrupt` takes a different path from step 3.
Followup commits fix bugs along those paths.
This fixes deadlock in asyncFetchPackages:

  withAsync action1 (\_ -> action2)

loses exceptions thrown by action1. If in addition, action2
has a dependency on data produced by action1 (as is the case
here), this will block indefinitely.

(This bug caused some of the new tests to hang, since after
the change to use async-safe try, it became easier for
fetchPackages to throw an exception.)
If `withTempFileName` receives an asynchronous exception (e.g.
a canceled async), the bracket cleanup handler will attempt to
remove the temporary file. This can fail with an IO exception.

Regular bracket then throws that IO exception, swallowing the
asynchronous exception. To calling code, this appears no different
from an IO exception thrown from the body, and it won't be able
to tell that it should be exiting promptly.

This manifests concretely during temporary file clean-up of
`asyncFetchPackages` on Windows (seen during unit testing in CI),
where temporary file removal fails (on GHC 8.4), which leads
to `concurrently` failing to cancel the outstanding download
because it's handled like a regular download failure.
@mergify
Copy link
Contributor

mergify bot commented Feb 23, 2022

rebase

✅ Branch has been successfully rebased

@Mikolaj Mikolaj merged commit 001e3cc into haskell:master Feb 23, 2022
@robx robx deleted the fix-interrupt branch August 2, 2022 13:40
@mergify mergify bot added the merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days label Mar 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

attention: needs-review merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days merge me Tell Mergify Bot to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants