Skip to content

Comments

TarWriter: when writing entries for the same hard-linked file use HardLink entries.#123874

Draft
tmds wants to merge 7 commits intodotnet:mainfrom
tmds:tarwriter_hardlinks
Draft

TarWriter: when writing entries for the same hard-linked file use HardLink entries.#123874
tmds wants to merge 7 commits intodotnet:mainfrom
tmds:tarwriter_hardlinks

Conversation

@tmds
Copy link
Member

@tmds tmds commented Feb 2, 2026

Fixes #74404.

@dotnet/area-system-formats-tar ptal.

cc @MichaelSimons

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 2, 2026
@tmds
Copy link
Member Author

tmds commented Feb 2, 2026

Note that extracting an archive with hard links to a file system that doesn't support them will fail.
This matches the behavior of GNU tar.

With GNU tar, you can specify the --hard-dereference option. On creation, that causes there to be no hard links in the archive. On extraction, it causes hard links to be extracted as separate files.

@rzikm rzikm requested a review from a team February 2, 2026 09:38
Copy link
Member

@rzikm rzikm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if tests pass

@rzikm rzikm requested a review from a team February 2, 2026 10:15
@tmds tmds force-pushed the tarwriter_hardlinks branch from 0eedbe9 to 698d3ae Compare February 2, 2026 10:19
@tmds
Copy link
Member Author

tmds commented Feb 3, 2026

For #123874 (comment), we can make this behavior controllable by the user. For backwards compatibility we might want to keep dereferencing on tar creation by default.

@bartonjs @karelz this would be another use of an options class for tar (#69780). These options are for the reader/writer. TarWriterOptions/TarReaderOptions might be more interesting than TarExtractOptions/TarCreateOptions.

@rzikm
Copy link
Member

rzikm commented Feb 3, 2026

I would like to make the new behavior default (to be more like the GNU Tar many people are familiar with), but I agree with the backwards compatibility concerns.

Should we hide it for now behind an appctx switch until we add the new API surface? @ericstj any thoughts?

@tmds
Copy link
Member Author

tmds commented Feb 3, 2026

I would like to make the new behavior default (to be more like the GNU Tar many people are familiar with)

For GNU tar the behavior makes sense due to these expectations: hard links are not common, file systems are likely to support them.

We could make the same assumptions for .NET.

Should we hide it for now behind an appctx switch until we add the new API surface?

So, the switch would be for opting in to the previous behavior which dereferences hard links on tar creation.

@ericstj ericstj added the breaking-change Issue or PR that represents a breaking API or functional change over a previous release. label Feb 3, 2026
@dotnet-policy-service dotnet-policy-service bot added the needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet label Feb 3, 2026
@ericstj
Copy link
Member

ericstj commented Feb 3, 2026

One of the challenges with making it breaking by default is that user's might not discover they are broken right away. It's only later when the file is extracted that folks discover this.

Should we hide it for now behind an appctx switch until we add the new API surface

AppContext switches are for "legacy" behavior where we know folks are broken. In this case I think an API is more appropriate since some people may prefer to produce a more portable archive. In addition to file-systems that don't support this, I think it's also a challenge for non-seek-able streams. I'd prefer us adding an API for it, like Gnu TAR has, changing the default, and only if folks complain about being broken in-place do we introduce an AppContext switch to rollback.

GNU Tar doing this by default is a strong signal for us to do the same.

Note that extracting an archive with hard links to a file system that doesn't support them will fail.
This matches the behavior of GNU tar.

Do we expose enough information from the archive entries such that someone could implement extraction which falls-back to copy? Is the exception we throw in this case recoverable such that someone could fallback to doing a copy and continuing extraction?

Similar case for compression- we should have API that lets the user choose to either represent the links or duplicate.

It would be good to enumerate all the scenarios that might be broken and how one could workaround those on the consumption side. Those can be protected in test cases, and also included in a breaking change doc.

@tmds
Copy link
Member Author

tmds commented Feb 3, 2026

Do we expose enough information from the archive entries such that someone could implement extraction which falls-back to copy? Is the exception we throw in this case recoverable such that someone could fallback to doing a copy and continuing extraction?

I don't think this is straight forward and I don't think a developer will assume this is expected of him.

When the archive is created, it has some intended purpose. The creator should choose if he wants hardlinks or not.

We can continue to dereference hard links by default and let the user opt-in to using hardlinks if that matches their use-case. Then there is no breaking change.

Or we choose to include hard links by default, and then the breaking change is that users might need to change their code to opt-out of the hardlinks. And perhaps an appctx switch that forces extraction to dereference hardlinks.

@tmds
Copy link
Member Author

tmds commented Feb 4, 2026

And perhaps an appctx switch that forces extraction to dereference hardlinks.

@rzikm @ericstj if this makes sense, I can add it in this PR.

It will apply to TarFile.ExtractToDirectory, but not to TarEntry.ExtractToFile.

For ExtractToDirectory, instead of hardlinking to the path, it will find the file at that path and copy it.

For ExtractToFile it can't do anything since there's no directory to use as a base for finding the file that is linked to. It will throw InvalidOperationException as it is already doing.

When an application uses the TarFile.ExtractToDirectory and that causes an issue due to the target filesystem not supporting hard links, we'll make it possible for the end-user to set the switch through an envvar so that it will extract the files separately instead of using hard links.

@rzikm
Copy link
Member

rzikm commented Feb 4, 2026

Sounds reasonable to me

@ericstj
Copy link
Member

ericstj commented Feb 5, 2026

@tmds I was actually advising against a proactive AppContext switch, API seems more appropriate and inline with what GNU Tar and other Tar implementation have. We change the default, and then provide the API that let's folks choose to de-reference at production time if they want. AppContext switch (on production side) can come later if we find folks complain and can't use the API.

I think you might have misread my advice around testing as well. I am saying we just look into how someone might extract an archive with hardlinks, and double check that they have a path to do so where they can dereference at extraction time. It might mean calling a different API that extracts files individually after inspecting the entry type. As with any breaking change, we need to provide workarounds / migration advice. I'm just asking that we think about that up front and add the test cases to validate it. I couldn't find precedent in any other tar implementation for auto-dereferencing links at extraction - but other framework API does allow for a client app to choose to do so.

@tmds
Copy link
Member Author

tmds commented Feb 5, 2026

API seems more appropriate and inline with what GNU Tar and other Tar implementation have.

Yes, I also think we need an API that enables users to control the dereferencing behavior.

I am saying we just look into how someone might extract an archive with hardlinks, and double check that they have a path to do so where they can dereference at extraction time.

This is in existing API (and tests): the TarReader returns a TarEntry with EntryType set to HardLink and the path it links is in the LinkName property. That path needs to be combined by the user to the location where the archive is being extracted. The user should ensure that the link is not being made to some location outside the destination directory. (This matches what TarFile.ExtractToDirectory does for HardLink entries.)

The TarFile.ExtractToDirectory API is not meant to be "resumed" at the point of failure. A developer might call it again with the API that dereferences hardlinks if he detects the IOException that is thrown for hard link creation.

The suggested switch was to enable an end-user to force the dereferencing of hard links through an envvar in case the developer had not forseen such a code path it in the app and their filesystem doesn't support hard links.

I won't add an app switch in this PR, and I'll add an API proposal to #74404.

Do we need the API proposal to be approved/implemented before merging this PR?

@rzikm
Copy link
Member

rzikm commented Feb 5, 2026

I won't add an app switch in this PR, and I'll add an API proposal to #74404.

Do we need the API proposal to be approved/implemented before merging this PR?

I would be fine with merging this PR early, but having the change and new API all in one place would make marginally easier to link to the breaking change documentation (as the API for recommended steps would already exist).

@tmds
Copy link
Member Author

tmds commented Feb 5, 2026

@rzikm @ericstj I've written an API proposal here: #74404 (comment).

@ericstj
Copy link
Member

ericstj commented Feb 5, 2026

The TarFile.ExtractToDirectory API is not meant to be "resumed" at the point of failure. A developer might call it again with the API that dereferences hardlinks if he detects the IOException that is thrown for hard link creation.

Got it. We can make that clear in the docs.

API proposal looks great. I wasn't sure if you wanted to go so far as to implement dereferencing during extraction, but I do think it's a useful option. If we have that then it's less of an issue to describe and test the workarounds since folks can just call the new API (which itself will have tests).

@rzikm rzikm self-assigned this Feb 9, 2026
@tmds
Copy link
Member Author

tmds commented Feb 11, 2026

I have no more changes for this PR. When API proposal from #74404 is approved, I'll work on it.

@stephentoub
Copy link
Member

🤖 Copilot Code Review — PR #123874

Holistic Assessment

Motivation: The PR addresses a real gap — previously, hard-linked files were duplicated in tar archives, wasting space and losing the hard link relationship. The linked issue (#74404) clearly documents the need, and GNU tar exhibits the same default behavior.

Approach: The implementation correctly tracks files by unique identifiers (dev+ino on Unix, volume+fileIndex on Windows) using lazy-allocated dictionaries. The first occurrence becomes a regular file entry; subsequent occurrences become TarEntryType.HardLink entries pointing to the first. This is the standard tar approach.

Summary: ⚠️ Needs Human Review. The implementation is fundamentally correct and follows established patterns. However, there are a few concerns worth human verification: (1) a type mismatch between native and managed struct fields, (2) potential Windows ReFS file identifier limitations, and (3) breaking change documentation requirements.


Detailed Findings

⚠️ Type Mismatch — HardLinkCount native vs managed type

The native header (pal_io.h line 37) defines HardLinkCount as uint32_t, but the C# struct (Interop.Stat.cs line 37) defines it as int (signed). While this works in practice (hard link counts are always positive and fit in either type), it's an inconsistency with the rest of the struct where types match precisely (e.g., uint32_t Uiduint Uid).

Suggestion: Change the C# field to internal uint HardLinkCount; for consistency and to avoid any future confusion.

✅ File Identifier Logic — Correctly handles uniqueness

Both implementations correctly use the standard file identity approach:

  • Unix: (Dev, Ino) tuple uniquely identifies a file across the system
  • Windows: (dwVolumeSerialNumber, nFileIndexHigh << 32 | nFileIndexLow) uniquely identifies a file on NTFS

The logic correctly restricts tracking to regular files only (Unix checks S_IFREG, Windows excludes Directory and ReparsePoint).

✅ Dictionary Lazy Allocation — Good performance pattern

_hardLinkTargets is allocated lazily (??=) only when a file with multiple hard links is encountered. This avoids allocations when archiving directories without hard links.

💡 Windows ReFS Consideration — Low-confidence observation

On ReFS volumes, GetFileInformationByHandle only returns the lower 64 bits of the 128-bit file identifier. This creates a theoretical collision risk on extremely large ReFS volumes. However, this is an edge case that:

  1. Affects few users (ReFS is primarily server/enterprise)
  2. Would require an enormous number of files to collide
  3. Can be addressed in a follow-up if reported

No action required for this PR, but worth documenting if users report issues.

✅ Test Coverage — Adequate for the feature

The new tests cover:

  • Basic hard link creation and reading back (WriteEntry_HardLinks)
  • Round-trip through ExtractToDirectory (HardLinkExtractionRoundtrip)
  • Multiple hard link pairs with different entry names
  • All tar formats (V7, Ustar, Pax, Gnu)

⚠️ Breaking Change Documentation — Required before merge

The PR is labeled breaking-change and needs-breaking-change-doc-created. Per the discussion in PR comments, this changes default behavior:

  • Before: Hard-linked files were written as separate regular file entries (duplicated content)
  • After: Second and subsequent hard links are written as HardLink entries (no data duplication)

This could break consumers that extract to filesystems not supporting hard links, or that don't handle HardLink entries. The breaking change documentation should be created as noted in the labels.

✅ Cross-Platform Consistency — Both platforms behave identically

The Unix and Windows implementations produce semantically equivalent tar archives — the first occurrence of a hard-linked file contains data, subsequent occurrences are hard link entries pointing to the first.


Summary

The code changes are correct and implement a standard tar feature that matches GNU tar behavior. The main concerns are:

  1. Should fix: The HardLinkCount type mismatch (int vs uint32_t) is a minor inconsistency
  2. Required: Breaking change documentation needs to be created per the PR labels
  3. Monitor: ReFS file identifier limitation is low-risk but worth noting

The implementation is solid and the approach matches industry standards. Human reviewers should verify the breaking change implications for their consumers are acceptable.


Review performed by Copilot using Claude Sonnet 4, Gemini 3 Pro, and GPT-5.1 models

@rzikm
Copy link
Member

rzikm commented Feb 12, 2026

@tmds I will move the PR to draft, since this is effectively blocked on API pending review.

@rzikm rzikm marked this pull request as draft February 12, 2026 14:19
@tmds
Copy link
Member Author

tmds commented Feb 12, 2026

@rzikm this could be merged. The API can be implemented in a subsequent PR. I'm fine either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-System.Formats.Tar breaking-change Issue or PR that represents a breaking API or functional change over a previous release. community-contribution Indicates that the PR has been added by a community member needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tar: archive creation should detect hard links to same file

4 participants