Skip to content

[staging-next] coreutils: Disable SEEK_HOLE due to corruption#143097

Merged
vcunat merged 1 commit intoNixOS:staging-nextfrom
helsinki-systems:fix/coreutils-corruption
Oct 28, 2021
Merged

[staging-next] coreutils: Disable SEEK_HOLE due to corruption#143097
vcunat merged 1 commit intoNixOS:staging-nextfrom
helsinki-systems:fix/coreutils-corruption

Conversation

@dasJ
Copy link
Member

@dasJ dasJ commented Oct 27, 2021

Motivation for this change
coreutils: Disable SEEK_HOLE due to corruption

See https://github.com/openzfs/zfs/issues/11900 as an example. This only
happens on Coreutils 9.0. Reported here:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=51433
Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • 21.11 Release Notes (or backporting 21.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@ofborg ofborg bot added the 6.topic: darwin Running or building packages on Darwin label Oct 27, 2021
Copy link
Member

@vcunat vcunat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: over longer term it would be good to add some explanation around the patch, but it's possible that we'll soon have some update thanks to upstream.

@dasJ
Copy link
Member Author

dasJ commented Oct 27, 2021

I will amend the commit with the bug id in the coreutils bugtracker once the bug is opened

@ofborg ofborg bot added 10.rebuild-darwin-stdenv This PR causes stdenv to rebuild on Darwin and must target a staging branch. 10.rebuild-linux-stdenv This PR causes stdenv to rebuild on Linux and must target a staging branch. labels Oct 27, 2021
@ofborg ofborg bot requested review from edolstra, lovek323 and np October 27, 2021 09:50
@ofborg ofborg bot added 10.rebuild-darwin: 501+ This PR causes many rebuilds on Darwin and should normally target the staging branches. 10.rebuild-darwin: 5001+ This PR causes many rebuilds on Darwin and must target the staging branches. 10.rebuild-linux: 501+ This PR causes many rebuilds on Linux and should normally target the staging branches. 10.rebuild-linux: 5001+ This PR causes many rebuilds on Linux and must target the staging branches. labels Oct 27, 2021
@happysalada
Copy link
Contributor

I'm happy to run a darwin build tonight to check if the issue is resolved.

@vcunat
Copy link
Member

vcunat commented Oct 27, 2021

On chat I understood that @r-burns did that (bisecting coreutils and building jx), but confirmation from elsewhere would be nice. Especially as Hydra doesn't progress fast on this and it can't even cancel whole evaluations at this moment :-/

@dasJ dasJ force-pushed the fix/coreutils-corruption branch 2 times, most recently from 9f833dd to 538fcee Compare October 27, 2021 12:27
@dasJ dasJ changed the title [staging-next] coreutils: Disable SEEK_HOLE due to corruption and switch Darwin to 9.0 [staging-next] coreutils: Disable SEEK_HOLE due to corruption Oct 27, 2021
@ofborg ofborg bot added 11.by: package-maintainer This PR was created by a maintainer of all the package it changes. 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. and removed 10.rebuild-darwin-stdenv This PR causes stdenv to rebuild on Darwin and must target a staging branch. 10.rebuild-darwin: 501+ This PR causes many rebuilds on Darwin and should normally target the staging branches. 10.rebuild-darwin: 5001+ This PR causes many rebuilds on Darwin and must target the staging branches. labels Oct 27, 2021
@dasJ dasJ removed the 6.topic: darwin Running or building packages on Darwin label Oct 27, 2021
@pixelb
Copy link

pixelb commented Oct 27, 2021

@happysalada are you using zfs on darwin by any chance?
We know zfs has issues with SEEK_HOLE at least.
If not zfs what file system is the copy being done from. Thanks

@mohe2015
Copy link
Contributor

mohe2015 commented Oct 27, 2021

@happysalada are you using zfs on darwin by any chance? We know zfs has issues with SEEK_HOLE at least. If not zfs what file system is the copy being done from. Thanks

To my knowledge we could also reproduce (consistently even) with tmpfs and ext4. Edit: On Linux

@vcunat
Copy link
Member

vcunat commented Oct 27, 2021

Well, for now we need this PR as mitigation, regardless of whose bug it is or how exactly it happens.

@happysalada
Copy link
Contributor

Not using zfs, just a regular darwin systemd with apfs.
I've started that "quick" jx build (8 hours ago), I should be done tomorrow morning. Almost there.

@jonringer
Copy link
Contributor

jonringer commented Oct 27, 2021

I was wondering why my build server was having is issues:

sudo zpool status -v nixstore
  pool: nixstore
...
errors: Permanent errors have been detected in the following files:

        /nix/store/.links
        
$ ls /nix/store/.links
"/nix/store/.links": Invalid exchange (os error 52)

I've disabled auto-optimise for now. But if I want to re-enable it I would have to destroy that pool

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't easily undef SEEK_HOLE?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could instead undef/redef it somewhere under all includes. (The risk of includes inside the file should be low.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why? It will lead to the exact same result but require me to redo the patch, amend the commit, and retest it for no benefit at all

@dasJ dasJ force-pushed the fix/coreutils-corruption branch from 538fcee to da8fb80 Compare October 27, 2021 18:36
@dasJ dasJ force-pushed the fix/coreutils-corruption branch from da8fb80 to 924ccbf Compare October 27, 2021 18:54
@pixelb
Copy link

pixelb commented Oct 27, 2021

To my knowledge we could also reproduce (consistently even) with tmpfs and ext4. Edit: On Linux

That would be very surprising. A minimal reproducer would be exceptionally useful

@eggert
Copy link

eggert commented Oct 28, 2021

My guess is that coreutils is the victim of a kernel or filesystem bug here. If so, that bug really needs fixing as other programs use lseek+SEEK_DATA too. See my comment in GNU coreutils Bug#51433.

@happysalada
Copy link
Contributor

Ok, jx and alacritty build fine on this PR, all is good with darwin!
Happy to merge whenever (unless you still want more feedback).
I would say let's merge now and get the staging-next cycle merged as well.

@mohe2015
Copy link
Contributor

To my knowledge we could also reproduce (consistently even) with tmpfs and ext4. Edit: On Linux

That would be very surprising. A minimal reproducer would be exceptionally useful

After further thought and investigation with the help of others I strongly believe this to be a misattribution. Identifying the root cause in the case of peertube was difficult because of the unclear error message and strange reproducibility. I believe we downloaded dependencies from peertube from the so-called binary cache (which likely has zfs-backed builders) one of which was esbuild the actual broken derivation. So this issue is pretty surely not actually on ext4/tmpfs. I'm sorry for causing confusion.

@pixelb
Copy link

pixelb commented Oct 28, 2021

@mohe2015 Thanks for all the info and testing. We'll focus on handling ZFS in upstream coreutils

@vcunat vcunat merged commit 3f4420f into NixOS:staging-next Oct 28, 2021
@dasJ dasJ deleted the fix/coreutils-corruption branch October 28, 2021 18:42
@vcunat
Copy link
Member

vcunat commented Oct 29, 2021

Apparently this PR broke one test on coreutils-prefixed + aarch64-linux.

FAIL: tests/misc/cut-huge-range
===============================

FAIL tests/misc/cut-huge-range.sh (exit status: 1)

I reproduced the issue on the community box and also retried the two unprefixed builds and x86_64-linux builds, and all seems reproducible with failure only in this single combination. So far I have no idea why it happens and why --program-prefix=g apparently makes a difference here.

@mohe2015
Copy link
Contributor

Without looking at the code I assume that test would also need to be skipped like the other test.

@dasJ
Copy link
Member Author

dasJ commented Oct 29, 2021

Currently testing around this issue but it takes a lot of time because my aarch64 hardware isn't the fastest

@mohe2015
Copy link
Contributor

Can look at the code right now but maybe it's reproducible even on a slow HDD on x86

@dasJ
Copy link
Member Author

dasJ commented Oct 29, 2021

It built for me on aarch64-linux. So it's probably just a flakey test. Should I disable the test or should we just restart the build on Hydra, @vcunat ?

@vcunat
Copy link
Member

vcunat commented Oct 29, 2021

@vcunat
Copy link
Member

vcunat commented Oct 29, 2021

Either way, the coreutils-prefixed derivation doesn't seem important, so I don't consider it such a hot issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 501+ This PR causes many rebuilds on Linux and should normally target the staging branches. 10.rebuild-linux: 5001+ This PR causes many rebuilds on Linux and must target the staging branches. 10.rebuild-linux-stdenv This PR causes stdenv to rebuild on Linux and must target a staging branch. 11.by: package-maintainer This PR was created by a maintainer of all the package it changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants