Skip to content

coreutils: Disable sparse copy on macOS#11960

Closed
httpstorm wants to merge 1 commit intoopenwrt:masterfrom
httpstorm:coreutils-9.1
Closed

coreutils: Disable sparse copy on macOS#11960
httpstorm wants to merge 1 commit intoopenwrt:masterfrom
httpstorm:coreutils-9.1

Conversation

@httpstorm
Copy link
Contributor

@httpstorm httpstorm commented Feb 7, 2023

@hauke @graysky2 @neheb @Ansuel @nbd168
Fix build errors in toolchain/musl and toolchain/kernel-headers introduced in c560822 tools/coreutils: update to 9.1

Due to a bug in macOS, sparse copies are corrupted on virtual disks formatted with APFS. HFS is not affected.
Affected are coreutils install, and gcp when compiled with SEEK_HOLE, as well as macOS Finder.
When the compiler is installed, a corrupted copy is created, which cannot be executed, so the build fails.

While reading the entire file returns valid data, scanning for allocated segments may return holes where valid data is present. In this case a sparse copy does not contain these segments and return zeroes instead. Once the virtual disk is dismounted and then mounted again, a sparse copy produces correct results.

I tried contacting Apple and the coreutils team in 2021 to no avail. What would be the best way to make them aware?

Tested on macOS 12.6.3 x64, targets WRT3200ACM, x64

Steps to reproduce, config and build log: https://httpstorm.com/share/.openwrt/test/2023-02-06_coreutils-9.1/

  • open Disk Utility
  • press Command-N to create a new virtual disk
  • Image Format: sparse disk image
  • save as: ~/vhd/test
  • Name: test
  • Size: 50 GB
  • Format: APFS (Case-sensitive)
  • Encryption: none
  • Partitions: Single partition - GUID Partition Map
  • Save
  • clone and build OpenWRT

Build log

make[3] -C toolchain/musl compile
    ERROR: toolchain/musl failed to build.
    ERROR: toolchain/kernel-headers failed to build.

make toolchain/kernel-headers/{clean,compile} V=sc
...
gcc -L/Volumes/test/b/openwrt/staging_dir/host/lib -o scripts/kconfig/conf scripts/kconfig/conf.o scripts/kconfig/confdata.o scripts/kconfig/expr.o scripts/kconfig/lexer.lex.o scripts/kconfig/menu.o scripts/kconfig/parser.tab.o scripts/kconfig/preprocess.o scripts/kconfig/symbol.o scripts/kconfig/util.o   
scripts/kconfig/conf  --oldconfig Kconfig
arm-openwrt-linux-muslgnueabi-gcc: unknown compiler
scripts/Kconfig.include:44: Sorry, this compiler is not supported.
make[4]: *** [scripts/kconfig/Makefile:77: oldconfig] Error 1
make[3]: *** [Makefile:621: oldconfig] Error 2
make[3]: Leaving directory '/Volumes/test/b/openwrt/build_dir/toolchain-arm_cortex-a9+vfpv3-d16_gcc-12.2.0_musl_eabi/linux-5.15.91'
yes: stdout: Broken pipe
make[2]: *** [Makefile:114: /Volumes/test/b/openwrt/build_dir/toolchain-arm_cortex-a9+vfpv3-d16_gcc-12.2.0_musl_eabi/linux-5.15.91/.configured] Error 2
make[2]: Leaving directory '/Volumes/test/b/openwrt/toolchain/kernel-headers'
time: toolchain/kernel-headers/compile#11.74#14.76#24.36
    ERROR: toolchain/kernel-headers failed to build.
make[1]: *** [toolchain/Makefile:97: toolchain/kernel-headers/compile] Error 1
make[1]: Leaving directory '/Volumes/test/b/openwrt'
make: *** [/Volumes/test/b/openwrt/include/toplevel.mk:231: toolchain/kernel-headers/compile] Error 2

httpstorm referenced this pull request Feb 7, 2023
In addition to version update, this commit applies a fixup to allow building
on MacOS involving renaming: [gt_TYPE_WINT_T] --> [gt_TYPE_WINT_T_GNUTLS]
suggested by zhanhb.

Build system: x86_64
Build-tested: bcm2711/RPi4B

Signed-off-by: John Audia <[email protected]>
@github-actions github-actions bot added the build/scripts/tools pull request/issues for build, scripts and tools related changes label Feb 7, 2023
@httpstorm
Copy link
Contributor Author

httpstorm commented Feb 8, 2023

Overview
A sparse file copy ➡️ within a sparse disk image formatted with APFS, ➡️ stored on an APFS volume on physical media gets corrupted when the right conditions are met. The OpenWRT build system is guaranteed to trigger this every time we run:
make tools/compile -j 16
make toolchain/gcc/initial/{clean,compile} -j 16.
Then make toolchain/{clean,compile} fails because we don't have a valid compiler.

Details

  • First the build system compiles toolchain/gcc/initial, I can confirm that it works (in the build folder).
  • These binaries contain segments that are allocated and segments that are not (they read as zeroes).
  • Next it gets installed by coreutils/install, which performs a sparse copy since version 9.
  • The installed copy does not work.
  • So either some allocated segments are not reported by APFS as allocated and hence they are skipped.
  • Or the code in coretils which identifies them has issues.
  • We need to make coreutils and apple aware.

Note
A sparse copy is more efficient, since it only copies data that is allocated.
With APFS there is also copy on write. If you copy file A of size 1 GB to B, you still use 1 GB for both.
That's because both share the same disk area for the parts that have not changed since the copy.

I'm curious what conditions trigger this?

httpstorm referenced this pull request Feb 8, 2023
Kconfig docs say:
> The default value deliberately defaults to 'n' in order to avoid
> bloating the build.

Apply this rule everywhere, to avoid more cloning of bad examples

Signed-off-by: Tony Butler <[email protected]>
@Ansuel
Copy link
Member

Ansuel commented Feb 9, 2023

@httpstorm i would accept this patch as a workaround BUT we really need a way to tell coreutils that there is this kind of problem... To me it really seems a bug in apple virtual disk... a bug in compacting the virtual disk and something is dropped for sparse files?

A good idea would be create a repro env so coreutils guys can quickly repro this. (a good argument to gain some traction may be that we recently switched to new version on openwrt and this started to get broken and is caused by the new sparse feature)

httpstorm added a commit to httpstorm/coreutils that referenced this pull request Feb 9, 2023
Due to a bug in macOS, sparse copies are corrupted on virtual disks
formatted with APFS. HFS is not affected. Affected are coreutils
install, and gcp when compiled with SEEK_HOLE, as well as macOS Finder.

While reading the entire file returns valid data, scanning for
allocated segments may return holes where valid data is present.
In this case a sparse copy does not contain these segments and return
zeroes instead. Once the virtual disk is dismounted and then
mounted again, a sparse copy produces correct results.

This breaks OpenWRT build on macOS. Details:
openwrt/openwrt#11960

Signed-off-by: Georgi Valkov <[email protected]>
@httpstorm
Copy link
Contributor Author

httpstorm commented Feb 9, 2023

@Ansuel
I agree. Any idea how to reach the Apple developers?
I mailed the patch and details to the coreutils team.
Let's wait for a week and see if they are willing to accept it as a workaround.
If not, we can add it to OpenWRT.

I also created a pull request, but it seems they don't like using Github.
Is there an easy way to contribute or report bugs to them?
coreutils/coreutils#68

Creating a repro environment
It would be very helpful if we can invite someone from gcc to create PoC code.
Their build environment triggers it every time.

steps to reproduce

Create a sparse disk image with APFS

  • open Disk Utility
  • press Command-N to create a new virtual disk
  • Image Format: sparse disk image
  • save as: ~/vhd/test
  • Name: test
  • Size: 50 GB
  • Format: APFS (Case-sensitive)
  • Encryption: none
  • Partitions: Single partition - GUID Partition Map
  • Save
  • the image is mounted under /Volumes/test

clone and build coreutils

cd /Volumes/test
# patched with sparse support disabled on macOS
git clone https://github.com/httpstorm/coreutils.git
mv coreutils coreutils-no-sparse
cd coreutils-no-sparse
git checkout macos_disable-sparse-copy
git submodule foreach git pull origin master
./bootstrap
./configure
make -j 16
cd ..

# original code
git clone https://github.com/coreutils/coreutils.git
cd coreutils
git submodule foreach git pull origin master
./bootstrap
./configure
make -j 16

download OpenWRT and build gcc

cd /Volumes/test
git clone https://github.com/openwrt/openwrt.git
cd openwrt
./scripts/feeds update -a
./scripts/feeds install -a
make defconfig
make menuconfig
# Target System (Marvell EBU Armada)
# Target Profile (Linksys WRT3200ACM)
# Save, Exit
make tools/compile -j 16
make toolchain/gcc/initial/compile -j 16

proof of concept

cd /Volumes/test/openwrt
cd build_dir/toolchain-arm_cortex-a9+vfpv3-d16_gcc-12.2.0_musl_eabi/gcc-12.2.0-initial/gcc

/Volumes/test/coreutils/src/cp cc1 cc1-ori
./cc1-ori --help
-bash: ./cc1-ori: cannot execute binary file: Exec format error

/Volumes/test/coreutils-no-sparse/src/cp cc1 cc1-fix
./cc1-fix --help
# works correctly

sha1sum cc1
# fcf5d9458f54d5588efc5a76a89388925ca04eda  cc1
sha1sum cc1-fix 
# fcf5d9458f54d5588efc5a76a89388925ca04eda  cc1-fix
sha1sum cc1-ori
# 7b447132f56f3b549ef62a4780945e82d26f7d2c  cc1-ori

@httpstorm
Copy link
Contributor Author

Good news, I got a replay on the coretils mailing list.
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=61386

We'll try to test for a better solution and fall-back to accepting the patch.

@Ansuel
Copy link
Member

Ansuel commented Feb 9, 2023

@httpstorm nice! then lets wait

@httpstorm
Copy link
Contributor Author

httpstorm commented Feb 10, 2023

We implemented a solution: for copies on the same filesystem, fclonefileat is used, which clones a file CoW. The clone is instant even for large files and uses the same disk sectors so no additional space is allocated. For copies from one filesystem to another, we fall-back to regular copy. On macOS sparse copy is disabled, since the API is broken. I will post an update when a new coreutils release is available.

edit: This link explains why sparse copies were corrupted.
https://lists.gnu.org/archive/html/bug-gnulib/2018-09/msg00054.html

@Ansuel
Copy link
Member

Ansuel commented Feb 10, 2023

@httpstorm is the patch got merged in coreutils repo? If it's the case then lets just backport the fix to the current release... It's totally ok to backport merged patch... not ok to add workaround without approval from the related repo.

@httpstorm
Copy link
Contributor Author

httpstorm commented Feb 10, 2023

@Ansuel No, it's not merged yet, and I expect two patches: the one in this pull request, and another to enable CoW. But let's wait until they confirm everything is merged. I posted earlier to keep you updated. I just tested the suggested patch and confirmed everything is ok. You can find our conversation and patch details on the mailing list. Link in a previous comment.

@httpstorm
Copy link
Contributor Author

I found the underlaying issue: one way of creating sparse files is to map a file to memory view using mmap. Then we write to the memory view. Finally we munmap the view and close the file. This results in a valid file, however because we did not explicitly call msync, the filesystem is not fully updated. It will get updated eventually. Specifically the list of valid pages cached by the filesystem is old. As a result lseek(SEEK_DATA) does not return all of the valid pages, and install/cp create corrupted copies when sparse support is enabled (default). More info and PoC code here:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=61386#158

I am still working with the coreutil team. We wanted to make sure we know the reason for the issue, and that compliance is maintained.

@mcprat
Copy link
Contributor

mcprat commented Feb 24, 2023

(me being very picky) please wrap the link with the commit tag Link:

Link: https:// .....
Signed-off-by: .....

has anyone looked into whether coreutils can be configured this way instead of patching?

@httpstorm
Copy link
Contributor Author

has anyone looked into whether coreutils can be configured this way instead of patching?

@mpratt14 I'm still working with the coreutils team, you can check our discussion:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=61386

They recently committed a few patches to Savannah, that address the issue, however that broke some of the tests. They added more patches yesterday but now the tests hang, so I just sent them a new report. Hopefully they will address it soon. The fix is way more sophisticated than my PR, so I will close it when everything settles, and we'll update to a new version of coreutils. I will keep you updated.
Apple is also aware of the issue and have reproduced it.

Due to a bug in macOS, sparse copies are corrupted on virtual disks
formatted with APFS. HFS is not affected. Affected are coreutils
install, and gcp when compiled with SEEK_HOLE, as well as macOS Finder.

While reading the entire file returns valid data, scanning for
allocated segments may return holes where valid data is present.
In this case a sparse copy does not contain these segments and return
zeroes instead. Once the virtual disk is dismounted and then
mounted again, a sparse copy produces correct results.

This breaks OpenWRT build on macOS. Steps to reproduce:
https://httpstorm.com/share/.openwrt/test/2023-02-06_coreutils-9.1/

Coreutils bug report and discussion
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=61386

Signed-off-by: Georgi Valkov <[email protected]>
@httpstorm
Copy link
Contributor Author

coreutils-9.2 should be released in about 1.5 weeks. I'll make a PR when it is ready.

@Ansuel
Copy link
Member

Ansuel commented Mar 10, 2023

@httpstorm news with this?

@httpstorm
Copy link
Contributor Author

httpstorm commented Mar 11, 2023

@Ansuel
The fix is already on Savannah master, but coreutils is still on v9.1. v9.2 has not been released yet. They were working on split and tee in the past days. I asked for an update.
edit: I just received a replay from coreutils

Best guess is 9.2 release is 5 days away.

@Ansuel
Copy link
Member

Ansuel commented Mar 14, 2023

@httpstorm btw if we notice problem with 9.2 backporting the patch is a way (just for reference)

@Ansuel Ansuel self-assigned this Mar 14, 2023
@pixelb
Copy link

pixelb commented Mar 20, 2023

Coreutils 9.2 is now released

@httpstorm
Copy link
Contributor Author

@pixelb Thank you very much for all your work, Pádraig! I will send you the test results later. So far everything works well. I also installed and tested coreutils-9.2 from brew.
@Ansuel I am preparing to build OpenWRT in a fresh directory for the following targets: WRT3200ACM, x64 and 1043nd. I will come back in a few hours with the results and create a PR.

@httpstorm
Copy link
Contributor Author

httpstorm commented Mar 21, 2023

@pixelb @Ansuel
I need help. Updating to coreutils-9.2 is easy, and it compiles correctly. However the fix in gnulib also modifies m4 and there is no new release. The latest version is m4-1.4.19 from 2021-05-28 17:55. I would assume that we can clone gnulib, checkout f17d397771164c1b0f77fea8fb0abdc99cf4a3e1 or master dd723a3ed53cc3b969c6abdf7b0fb6ea8339079a, and build m4 from there. However I'm not sure how to configure tools/m4/Makefile to achieve this. We can also follow the steps for building coreutils without relying on tools/m4. The build process automatically downloads and builds gnulib and m4:

git clone git://git.savannah.gnu.org/coreutils.git
cd coreutils
./bootstrap
./configure
make -j 16

Once we update gnulib/m4, the following patch will no longer be needed:
001-m4.patch

I created a draft PR to update to coreutils-9.2. We can continue the discussion here: #12233

httpstorm added a commit to httpstorm/openwrt that referenced this pull request Mar 23, 2023
This resolves an error when building toolchain/musl on macOS due to
improper hole-detection caused by a bug in macOS/APFS:
openwrt#11960

As long as we don't reconfigure, 001-m4.patch is not needed.
If we keep it, it will force reconfigure the project,
since m4 files are changed. This works, but may not be optimal,
because the build should use files from coreutils/m4,
but OpenWRT uses legacy files from staging_dir/host/share/aclocal.
openwrt#12233 (comment)

Coreutils bug report and discussion
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=61386

Signed-off-by: Georgi Valkov <[email protected]>
httpstorm added a commit to httpstorm/openwrt that referenced this pull request Apr 1, 2023
This resolves an error when building toolchain/musl on macOS due to
improper hole-detection caused by a bug in macOS/APFS:
openwrt#11960

As long as we don't reconfigure, 001-m4.patch is not needed.
If we keep it, it will force reconfigure the project,
since m4 files are changed. This works, but may not be optimal,
because the build should use files from coreutils/m4,
but OpenWRT uses legacy files from staging_dir/host/share/aclocal.
openwrt#12233 (comment)

Coreutils bug report and discussion
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=61386

backport a couple of upstream patches as requested by mpratt14
date: diagnose -f read errors
copy: fix --reflink=auto to fallback in more cases

Signed-off-by: Georgi Valkov <[email protected]>
httpstorm added a commit to httpstorm/openwrt that referenced this pull request Apr 1, 2023
This resolves an error when building toolchain/musl on macOS due to
improper hole-detection caused by a bug in macOS/APFS:
openwrt#11960

As long as we don't reconfigure, 001-m4.patch is not needed.
If we keep it, it will force reconfigure the project,
since m4 files are changed. This works, but may not be optimal,
because the build should use files from coreutils/m4,
but OpenWRT uses legacy files from staging_dir/host/share/aclocal.
openwrt#12233 (comment)

Coreutils bug report and discussion
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=61386

backport a couple of upstream patches as requested by mpratt14
date: diagnose -f read errors
copy: fix --reflink=auto to fallback in more cases

Signed-off-by: Georgi Valkov <[email protected]>
@hauke
Copy link
Member

hauke commented Apr 2, 2023

This was fixed by updating coreutils to version 9.2, see #12233

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build/scripts/tools pull request/issues for build, scripts and tools related changes ci:target:all

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants