Skip to content
This repository was archived by the owner on Aug 11, 2022. It is now read-only.

Conversation

@Cito
Copy link

@Cito Cito commented Nov 2, 2017

This patch removes all duplicates from the list of failed
installs before the packages in the list are rolled back.
This should speed up the rollback process and also prevent
errors when the asynchronous rollback operations
try to e.g. delete a file that has already been deleted,
yielding errors under Windows.

This should solve issue #17671 and maybe similar ones
like #17747 and #18380.

This patch removes all duplicates from the list of failed
installs before the packages in the list are rolled back.
This should speed up the rollback process and also prevent
errors when the asynchronous rollback operations
try to e.g. delete a file that has already been deleted,
yielding errors under Windows.
@Cito Cito requested a review from a team as a code owner November 2, 2017 11:26
@filipesilva
Copy link

filipesilva commented Nov 2, 2017

@Cito CI is failing due to a style issue:

/home/travis/build/npm/npm/lib/install.js:553:5: Closing curly brace should be on the same line as opening curly brace or on the line after the previous block.

So instead of

  failed = failed.filter(function (pkg, i) {
    return failed.indexOf(pkg) === i })

The linter wants

  failed = failed.filter(function (pkg, i) {
    return failed.indexOf(pkg) === i
  })

The rest of it seems to be passing though. I'm going to add this locally and see how it goes. Cheers for taking this on!

@eminence
Copy link

eminence commented Nov 2, 2017

The root cause of this issue went undiagnosed for a while, suggesting this is a subtle thing. I think a small comment within code could be beneficial.

@Cito
Copy link
Author

Cito commented Nov 2, 2017

Unfortunately I found that removing duplicate packages is sometimes not sufficient to prevent all problematic concurrent access to the file system. It can happen that a package is rolled back while the rollback of a child package is still going on. For instance, I noticed that both fsevents/node-pre-gyp and fsevents were rolled back. The latter failed, probably because the rollback of the former was still going on and locking the directory.

My solution in the last commit is to not roll back child packages of packages that are rolled back anyway. Not completely sure whether this is ok - please review. Also, I'm not sure why CI is failing again.

@filipesilva
Copy link

@Cito as far as I can tell CI is failing because of linting:

standard: Use JavaScript Standard Style (http://standardjs.com) 
  /home/travis/build/npm/npm/lib/install.js:558:34: Missing space before function parentheses.

In line 558, you need to add a space between function and (pkg).

I can't speak for the correctness of your changes though, since I don't really know the npm internals... @zkat perhaps could? I'm not sure who is the most appropriate person to bring up, but this issue affects a lot of Windows users so it would be nice to get it right 😞

@Cito
Copy link
Author

Cito commented Nov 3, 2017

@filipesilva - you're right it was just a missing space.

A different approach of solving this would be to use chain instead of asyncMap when rolling back so that the rollbacks do not happen in parallel any more. But this would slow down the process.

@ChicagoDave
Copy link

This is brilliant Cito. You have just lowered the stress level of zillions of Windows npm users.

@reduckted
Copy link

Here's a repo to test this fix (this is specifically targeted at Windows systems). Clone it and run:

node index.js

It repeatedly calls npm i and tracks the number of failures. Without the changes from this PR, after 200 iterations, the score was:

  • Success: 136 times
  • Failure: 64 times

So about a 68% success rate. 😢

I then patched lib/install.js with the changes from this PR and ran it again:

  • Success: 200 times
  • Failure: 0 times

100% success rate! 🎉

And just to check that it wasn't a coincidence, I rolled back the patch and ran again, and the success rate was around the 60-70% mark.

So I'm happy to declare that this PR definitely fixes #17671. There's some discussion in #17671 that suggests this PR isn't a complete fix, but I believe it's good enough and should be merged as a matter of priority.

ℹ️ I should stress that this bug definitely seems to be dependent on a number of factors. For example, without these changes, on my work PC I get around a 60% success rate, whereas on my home PC it's closer to 100%. Running it from an SSD also results in a higher success rate than from a HDD, so if you're trying to reproduce the original issue, you may need to try it in a few different environments until you find one that results in a decent failure rate.

@pmoleri
Copy link

pmoleri commented Nov 15, 2017

Hi @iarna , I'm pinging you because I saw you merged PR #18458.
That PR didn't solve the root issue, and it seems this one does.

Me and many other Windows users are stuck in old versions, 4.X or 5.0.4 because of this issue.
I hope this can be merged soon.
Thanks

failed = failed.filter(function (pkg, i) {
return failed.indexOf(pkg) === i
})
// also remove child packages since they are rolled back by their parents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actions should be sorted deepest first so strictly speaking this shouldn't be necessary. This would also fail to prune deeply nested failed optionals (eg one's where pkg.parent.parent is failing and pkg is failing but pkg.parent is not).

Copy link
Author

@Cito Cito Nov 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, actions are sorted deepest first. But they are executed asynchronously. Therefore the delete operation on a top level package may start while the delete operation on a subpackage is still going on. And this can cause the error we are seeing under Windows.

@iarna
Copy link
Contributor

iarna commented Nov 16, 2017

I'm curious if #19054 solves the problems you were seeing? It takes a bit of a different approach (and also solves the root cause that was getting multiples of the same module in the list to start with).

(Pinging @pmoleri in particular, since you asked…)

@Cito
Copy link
Author

Cito commented Nov 16, 2017

Thank your @iarna. #19054 looks interesting and might be a better solution. As a first step we both are removing the duplicates from the list. This step is uncontroversial, and already greatly reduces the probability to hit the issue. However, I found that this is not sufficient, since as I wrote earlier, the problem can also appear when a package is removed while a subpackage is still being processed. That's why I added the second filter operation. Does your solution also handle this problem? It looks like you're doing something in that regard, but I only had a short look at it so far. Will look more closely later.

@iarna
Copy link
Contributor

iarna commented Nov 16, 2017

@citro The deeper thing shouldn't be necessary as the action list is sorted deepest to shallowest, but in the light of morning, circular dependencies could produce the scenario you're describing.

This is really a bug in rollback which should never effect anything other than the module it's rolling back—I'll address that directly.

iarna added a commit that referenced this pull request Nov 16, 2017
@iarna
Copy link
Contributor

iarna commented Nov 16, 2017

I've also made rollback errors not fail the build, which should substantially improve the user experience even when something goes wrong.

I'm closing in preference to #19054.

@iarna iarna closed this Nov 16, 2017
@Cito
Copy link
Author

Cito commented Nov 16, 2017

@iarna Please note what I have written above: You're right that the list ist sorted from deep to shallow, but this does not help, since the rollback operations happen asynchronously. To be more precise, what happens is this: rollback() calls gentlyRm() which calls vacuum() with purge=true which calls rimraf() which calls rmdir() and unlink(). The latter are asynchronous operations, i.e. npm continues rolling back the next package on the list, while the remove operations are still happening in the background on the filesystem.

In my case, there were no circular dependencies. The failed list contained several subpackages of fsevents, such as fesvents/node_modules/node-pre-gyp as the second-to-last item, and then fsevents itself as the last item. So, while the remove operations on fsevents/node_modules/node-pre-gyp and maybe other subdirectories are still going on in the background, npm tries to delete the containing directory fsevents. This is a race condition that can cause the error we are observing under Windows, depending on how fast the removal of the subpackage files on the harddisk executes. Contrary to Linux, Windows throws an error when you try to remove a file that is currently read or deleted by another process. We need to make sure that #19054 addresses this problem as well.

@Cito
Copy link
Author

Cito commented Nov 16, 2017

@iarna Just found the time to take a deeper look at your PR #19054 and analyzed the problem a bit more. It looks good to me, but as far as I see, it does not address the issue that subdirectories are deleted asynchronously together with their top level directory. The errors happen less frequently, since you removed the duplicates and maybe also because you changed the timing a bit by using promises instead of callbacks in some places. But it still happens, the only difference is that you converted the error to a warning ("Rolling back ... failed (this is probably harmless)"). I was able to reproduce this warning message sporadically with your branch, repeating the same operation ("npm init") several times, which shows that the race condition still exists.

Actually I believe we would even see a lot more of these warnings under Windows - the reason why it happens so infrequently is that the rimraf module takes some efforts to intercept and deal with errors to make remove operations more fault tolerant like under Linux. But it seems rimraf is not perfect and some errors can still slip through under rare conditions, if you do a lot of parallel, overlapping remove operations. So on the one hand, rimraf has shielded us from these errors, but on the other hand, it made it harder to see that there is a problem.

@Cito
Copy link
Author

Cito commented Nov 17, 2017

I've digged into this rat hole some more. Actually, there are two kinds of remove operations going on during the rollback: First, the top-down remove operations of the package and its subpackages. But second, vacuum is also removing empty directories bottom-up. It seems the problem is not so much the first kind of operation, but the second one. This is because the first kind uses rimraf, which, as I wrote above, shields us against this Windows EPERM issue using a method called fixWinEPERM internally. But the second kind of operations inside vacuum are not shielded against EPERM. I was able to reproduce the problem in an isolated test case in npm/fs-vacuum#8.

So I see 3 ways of fixing this:

  1. Eliminating the nested rollbacks completely (as I am doing in this PR) to avoid the concurrent operations. This works as long the rollbacks consist only of the standard finalize rollback operation on the filesystem. However, the rollback mechanism might theoretically be used for other operations as well, which should not be skipped for subpackages.
  2. Improve fs-vacuum, so that it does not throw EPERM under Windows in concurrent, overlapping vacuum operations.
  3. Suppress the EPERM error in action/finalize if running under Windows (Stop dropping failed optional deps #19054 already converted it into a warning, but suppressing it completely might be less confusing).

@toebens
Copy link

toebens commented Nov 23, 2017

please reopen and merge this PR

@altsanz
Copy link

altsanz commented Jan 25, 2018

i'm pumping this, as seems a valid solution for Windows NPM nightmare scenario.

@alexeygt
Copy link

alexeygt commented Feb 8, 2018

Please reopen and merge this PR - it contains the only correct solution for EPERM problem on Windows.

npm v5.6.0 still throws same EPERM error:

...
npm info linkStuff [email protected]
npm ERR! path D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js.1796620347
npm ERR! code EPERM
npm ERR! errno -4048
npm ERR! syscall rename
npm ERR! Error: EPERM: operation not permitted, rename 'D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js.1796620347' -> 'D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js'
npm ERR! { Error: EPERM: operation not permitted, rename 'D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js.1796620347' -> 'D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js'
npm ERR! cause:
npm ERR! { Error: EPERM: operation not permitted, rename 'D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js.1796620347' -> 'D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js'
npm ERR! errno: -4048,
npm ERR! code: 'EPERM',
npm ERR! syscall: 'rename',
npm ERR! path: 'D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js.1796620347',
npm ERR! dest: 'D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js' },
npm ERR! stack: 'Error: EPERM: operation not permitted, rename 'D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js.1796620347' -> 'D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js'',
npm ERR! errno: -4048,
npm ERR! code: 'EPERM',
npm ERR! syscall: 'rename',
npm ERR! path: 'D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js.1796620347',
npm ERR! dest: 'D:\npm-projects\fsm\FlashSetManager.FrontEnd\node_modules\requirejs\bin\r.js',
npm ERR! parent: 'flash-set-manager' }
npm ERR!
npm ERR! Please try running this command again as root/Administrator.

npm ERR! A complete log of this run can be found in:
npm ERR! D:\npm-projects\npm-cache_logs\2018-02-08T19_39_35_541Z-debug.log

babichea@USPLMWN116020 MINGW64 /d/npm-projects/fsm/FlashSetManager.FrontEnd (tokens-login-logout)

$ npm -v
5.6.0

@MikeDimmickMnetics
Copy link

I was going to leave this comment on #17671 but it's closed to non-contributors.

You can delete open files on Windows. It requires that all open file handles have opened it with the FILE_SHARE_DELETE flag to CreateFile. This is not included in the normal flags that the Microsoft C runtime passes in _open, so few programs permit it. DeleteFile immediately deletes the actual file as long as all users have marked it shareable (it opens it with the FILE_FLAG_DELETE_ON_CLOSE flag and immediately closes its handle). However, the directory entry remains until the last file object referencing the file is closed (you can have multiple handles to each object, but each call to CreateFile creates a new file object), so the directory entry hangs around until all programs have closed all their handles to the file.

Raymond Chen wrote about this behaviour at The FILE_FLAG_DELETE_ON_CLOSE flag applies to the handle, also known as the file object, which is not the same as the file.

I'm not sure if that helps with fixing the npm issue, but it may well explain some of the odd behaviour observed.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.