-
Notifications
You must be signed in to change notification settings - Fork 3k
Rollback failed optional installs only once #19042
Conversation
This patch removes all duplicates from the list of failed installs before the packages in the list are rolled back. This should speed up the rollback process and also prevent errors when the asynchronous rollback operations try to e.g. delete a file that has already been deleted, yielding errors under Windows.
|
@Cito CI is failing due to a style issue:
So instead of The linter wants The rest of it seems to be passing though. I'm going to add this locally and see how it goes. Cheers for taking this on! |
|
The root cause of this issue went undiagnosed for a while, suggesting this is a subtle thing. I think a small comment within code could be beneficial. |
|
Unfortunately I found that removing duplicate packages is sometimes not sufficient to prevent all problematic concurrent access to the file system. It can happen that a package is rolled back while the rollback of a child package is still going on. For instance, I noticed that both fsevents/node-pre-gyp and fsevents were rolled back. The latter failed, probably because the rollback of the former was still going on and locking the directory. My solution in the last commit is to not roll back child packages of packages that are rolled back anyway. Not completely sure whether this is ok - please review. Also, I'm not sure why CI is failing again. |
|
@Cito as far as I can tell CI is failing because of linting: In line 558, you need to add a space between I can't speak for the correctness of your changes though, since I don't really know the npm internals... @zkat perhaps could? I'm not sure who is the most appropriate person to bring up, but this issue affects a lot of Windows users so it would be nice to get it right 😞 |
|
@filipesilva - you're right it was just a missing space. A different approach of solving this would be to use chain instead of asyncMap when rolling back so that the rollbacks do not happen in parallel any more. But this would slow down the process. |
|
This is brilliant Cito. You have just lowered the stress level of zillions of Windows npm users. |
|
Here's a repo to test this fix (this is specifically targeted at Windows systems). Clone it and run: It repeatedly calls
So about a 68% success rate. 😢 I then patched
100% success rate! 🎉 And just to check that it wasn't a coincidence, I rolled back the patch and ran again, and the success rate was around the 60-70% mark. So I'm happy to declare that this PR definitely fixes #17671. There's some discussion in #17671 that suggests this PR isn't a complete fix, but I believe it's good enough and should be merged as a matter of priority. ℹ️ I should stress that this bug definitely seems to be dependent on a number of factors. For example, without these changes, on my work PC I get around a 60% success rate, whereas on my home PC it's closer to 100%. Running it from an SSD also results in a higher success rate than from a HDD, so if you're trying to reproduce the original issue, you may need to try it in a few different environments until you find one that results in a decent failure rate. |
| failed = failed.filter(function (pkg, i) { | ||
| return failed.indexOf(pkg) === i | ||
| }) | ||
| // also remove child packages since they are rolled back by their parents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actions should be sorted deepest first so strictly speaking this shouldn't be necessary. This would also fail to prune deeply nested failed optionals (eg one's where pkg.parent.parent is failing and pkg is failing but pkg.parent is not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, actions are sorted deepest first. But they are executed asynchronously. Therefore the delete operation on a top level package may start while the delete operation on a subpackage is still going on. And this can cause the error we are seeing under Windows.
|
Thank your @iarna. #19054 looks interesting and might be a better solution. As a first step we both are removing the duplicates from the list. This step is uncontroversial, and already greatly reduces the probability to hit the issue. However, I found that this is not sufficient, since as I wrote earlier, the problem can also appear when a package is removed while a subpackage is still being processed. That's why I added the second filter operation. Does your solution also handle this problem? It looks like you're doing something in that regard, but I only had a short look at it so far. Will look more closely later. |
|
@citro The deeper thing shouldn't be necessary as the action list is sorted deepest to shallowest, but in the light of morning, circular dependencies could produce the scenario you're describing. This is really a bug in |
|
I've also made rollback errors not fail the build, which should substantially improve the user experience even when something goes wrong. I'm closing in preference to #19054. |
|
@iarna Please note what I have written above: You're right that the list ist sorted from deep to shallow, but this does not help, since the rollback operations happen asynchronously. To be more precise, what happens is this: In my case, there were no circular dependencies. The |
|
@iarna Just found the time to take a deeper look at your PR #19054 and analyzed the problem a bit more. It looks good to me, but as far as I see, it does not address the issue that subdirectories are deleted asynchronously together with their top level directory. The errors happen less frequently, since you removed the duplicates and maybe also because you changed the timing a bit by using promises instead of callbacks in some places. But it still happens, the only difference is that you converted the error to a warning ("Rolling back ... failed (this is probably harmless)"). I was able to reproduce this warning message sporadically with your branch, repeating the same operation ("npm init") several times, which shows that the race condition still exists. Actually I believe we would even see a lot more of these warnings under Windows - the reason why it happens so infrequently is that the |
|
I've digged into this rat hole some more. Actually, there are two kinds of remove operations going on during the rollback: First, the top-down remove operations of the package and its subpackages. But second, So I see 3 ways of fixing this:
|
|
please reopen and merge this PR |
|
i'm pumping this, as seems a valid solution for Windows NPM nightmare scenario. |
|
Please reopen and merge this PR - it contains the only correct solution for EPERM problem on Windows. npm v5.6.0 still throws same EPERM error: ... npm ERR! A complete log of this run can be found in: babichea@USPLMWN116020 MINGW64 /d/npm-projects/fsm/FlashSetManager.FrontEnd (tokens-login-logout) $ npm -v |
|
I was going to leave this comment on #17671 but it's closed to non-contributors. You can delete open files on Windows. It requires that all open file handles have opened it with the Raymond Chen wrote about this behaviour at The FILE_FLAG_DELETE_ON_CLOSE flag applies to the handle, also known as the file object, which is not the same as the file. I'm not sure if that helps with fixing the |
This patch removes all duplicates from the list of failed
installs before the packages in the list are rolled back.
This should speed up the rollback process and also prevent
errors when the asynchronous rollback operations
try to e.g. delete a file that has already been deleted,
yielding errors under Windows.
This should solve issue #17671 and maybe similar ones
like #17747 and #18380.