TarWriter: when writing entries for the same hard-linked file use HardLink entries.#123874
TarWriter: when writing entries for the same hard-linked file use HardLink entries.#123874tmds wants to merge 7 commits intodotnet:mainfrom
Conversation
|
Note that extracting an archive with hard links to a file system that doesn't support them will fail. With GNU tar, you can specify the |
src/libraries/System.Formats.Tar/tests/TarFile/TarFile.ExtractToDirectory.File.Tests.Unix.cs
Outdated
Show resolved
Hide resolved
0eedbe9 to
698d3ae
Compare
src/libraries/System.Formats.Tar/tests/TarFile/TarFile.ExtractToDirectory.File.Tests.cs
Outdated
Show resolved
Hide resolved
|
For #123874 (comment), we can make this behavior controllable by the user. For backwards compatibility we might want to keep dereferencing on tar creation by default. @bartonjs @karelz this would be another use of an options class for tar (#69780). These options are for the reader/writer. TarWriterOptions/TarReaderOptions might be more interesting than TarExtractOptions/TarCreateOptions. |
|
I would like to make the new behavior default (to be more like the GNU Tar many people are familiar with), but I agree with the backwards compatibility concerns. Should we hide it for now behind an appctx switch until we add the new API surface? @ericstj any thoughts? |
For GNU tar the behavior makes sense due to these expectations: hard links are not common, file systems are likely to support them. We could make the same assumptions for .NET.
So, the switch would be for opting in to the previous behavior which dereferences hard links on tar creation. |
|
One of the challenges with making it breaking by default is that user's might not discover they are broken right away. It's only later when the file is extracted that folks discover this.
AppContext switches are for "legacy" behavior where we know folks are broken. In this case I think an API is more appropriate since some people may prefer to produce a more portable archive. In addition to file-systems that don't support this, I think it's also a challenge for non-seek-able streams. I'd prefer us adding an API for it, like Gnu TAR has, changing the default, and only if folks complain about being broken in-place do we introduce an AppContext switch to rollback. GNU Tar doing this by default is a strong signal for us to do the same.
Do we expose enough information from the archive entries such that someone could implement extraction which falls-back to copy? Is the exception we throw in this case recoverable such that someone could fallback to doing a copy and continuing extraction? Similar case for compression- we should have API that lets the user choose to either represent the links or duplicate. It would be good to enumerate all the scenarios that might be broken and how one could workaround those on the consumption side. Those can be protected in test cases, and also included in a breaking change doc. |
I don't think this is straight forward and I don't think a developer will assume this is expected of him. When the archive is created, it has some intended purpose. The creator should choose if he wants hardlinks or not. We can continue to dereference hard links by default and let the user opt-in to using hardlinks if that matches their use-case. Then there is no breaking change. Or we choose to include hard links by default, and then the breaking change is that users might need to change their code to opt-out of the hardlinks. And perhaps an appctx switch that forces extraction to dereference hardlinks. |
@rzikm @ericstj if this makes sense, I can add it in this PR. It will apply to For For When an application uses the |
|
Sounds reasonable to me |
|
@tmds I was actually advising against a proactive AppContext switch, API seems more appropriate and inline with what GNU Tar and other Tar implementation have. We change the default, and then provide the API that let's folks choose to de-reference at production time if they want. AppContext switch (on production side) can come later if we find folks complain and can't use the API. I think you might have misread my advice around testing as well. I am saying we just look into how someone might extract an archive with hardlinks, and double check that they have a path to do so where they can dereference at extraction time. It might mean calling a different API that extracts files individually after inspecting the entry type. As with any breaking change, we need to provide workarounds / migration advice. I'm just asking that we think about that up front and add the test cases to validate it. I couldn't find precedent in any other tar implementation for auto-dereferencing links at extraction - but other framework API does allow for a client app to choose to do so. |
Yes, I also think we need an API that enables users to control the dereferencing behavior.
This is in existing API (and tests): the The The suggested switch was to enable an end-user to force the dereferencing of hard links through an envvar in case the developer had not forseen such a code path it in the app and their filesystem doesn't support hard links. I won't add an app switch in this PR, and I'll add an API proposal to #74404. Do we need the API proposal to be approved/implemented before merging this PR? |
I would be fine with merging this PR early, but having the change and new API all in one place would make marginally easier to link to the breaking change documentation (as the API for recommended steps would already exist). |
|
@rzikm @ericstj I've written an API proposal here: #74404 (comment). |
Got it. We can make that clear in the docs. API proposal looks great. I wasn't sure if you wanted to go so far as to implement dereferencing during extraction, but I do think it's a useful option. If we have that then it's less of an issue to describe and test the workarounds since folks can just call the new API (which itself will have tests). |
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarWriter.Unix.cs
Outdated
Show resolved
Hide resolved
|
I have no more changes for this PR. When API proposal from #74404 is approved, I'll work on it. |
🤖 Copilot Code Review — PR #123874Holistic AssessmentMotivation: The PR addresses a real gap — previously, hard-linked files were duplicated in tar archives, wasting space and losing the hard link relationship. The linked issue (#74404) clearly documents the need, and GNU tar exhibits the same default behavior. Approach: The implementation correctly tracks files by unique identifiers (dev+ino on Unix, volume+fileIndex on Windows) using lazy-allocated dictionaries. The first occurrence becomes a regular file entry; subsequent occurrences become Summary: Detailed Findings
|
|
@tmds I will move the PR to draft, since this is effectively blocked on API pending review. |
|
@rzikm this could be merged. The API can be implemented in a subsequent PR. I'm fine either way. |
Fixes #74404.
@dotnet/area-system-formats-tar ptal.
cc @MichaelSimons