Tar: archive creation should detect hard links to same file #74404

@tmds

Background and Motivation

Currently hard links to the same file get duplicated in the archive.

Instead, when additional hard links to the same file are encountered, it should be possible to store them as hard links to the first entry.

When hard link entries are present in an archive, the current implementation already extracts them as hard links.
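As a rough illustration of the idea above, the following sketch writes the first occurrence of a file as a regular entry and any later occurrence as a TarEntryType.HardLink entry that points at the first one. It uses only the existing TarWriter and PaxTarEntry APIs. The file identities ("inode-1", "inode-2") and the in-memory input list are assumptions made for the example; a real implementation would have to obtain the identity from the file system, and this is not necessarily how the runtime would implement it.

```csharp
using System;
using System.Collections.Generic;
using System.Formats.Tar;
using System.IO;
using System.Text;

// Hypothetical input: (file identity, entry name, content).
// A real implementation would derive the identity from the file system
// (for example device + inode on Unix); here it is a made-up string.
var files = new (string FileId, string EntryName, string Content)[]
{
    ("inode-1", "data/a.txt", "hello"),
    ("inode-1", "data/b.txt", "hello"), // second hard link to the same file
    ("inode-2", "data/c.txt", "other"),
};

// Maps a file identity to the name of the first entry written for it.
var firstEntryById = new Dictionary<string, string>();
using var archive = new MemoryStream();

using (var writer = new TarWriter(archive, TarEntryFormat.Pax, leaveOpen: true))
{
    foreach (var (fileId, entryName, content) in files)
    {
        if (firstEntryById.TryGetValue(fileId, out string? firstName))
        {
            // Already archived under another name: store a link, not the data.
            writer.WriteEntry(new PaxTarEntry(TarEntryType.HardLink, entryName)
            {
                LinkName = firstName
            });
        }
        else
        {
            firstEntryById[fileId] = entryName;
            writer.WriteEntry(new PaxTarEntry(TarEntryType.RegularFile, entryName)
            {
                DataStream = new MemoryStream(Encoding.UTF8.GetBytes(content))
            });
        }
    }
}

// Read the archive back and list what was stored.
archive.Position = 0;
using var reader = new TarReader(archive);
while (reader.GetNextEntry() is TarEntry entry)
{
    Console.WriteLine($"{entry.EntryType} {entry.Name} {entry.LinkName}");
}
```

With DereferenceHardLinks = true as proposed, the writer would instead take the second branch for every file, which is the duplicating behavior described above.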

Hard links in tar archives can create difficulties when extracting to file systems that do not support them.

The following API proposal enables a user to:

  • Extract hard links as independent file copies instead of creating actual hard links
  • Create archives where hard-linked files are stored as separate files

API Proposal

namespace System.Formats.Tar;

public class TarWriter
{
    // New overload accepting options
    public TarWriter(Stream stream, TarWriterOptions options, bool leaveOpen = false);
}

public static class TarFile
{
    // New overloads accepting options for creation
    public static void CreateFromDirectory(string sourceDirectoryName, Stream destination, TarCreateOptions options);
    public static void CreateFromDirectory(string sourceDirectoryName, string destinationFileName, TarCreateOptions options);
    public static Task CreateFromDirectoryAsync(string sourceDirectoryName, Stream destination, TarCreateOptions options, CancellationToken cancellationToken = default);
    public static Task CreateFromDirectoryAsync(string sourceDirectoryName, string destinationFileName, TarCreateOptions options, CancellationToken cancellationToken = default);

    // New overloads accepting options for extraction
    public static void ExtractToDirectory(Stream source, string destinationDirectoryName, TarExtractOptions options);
    public static void ExtractToDirectory(string sourceFileName, string destinationDirectoryName, TarExtractOptions options);
    public static Task ExtractToDirectoryAsync(Stream source, string destinationDirectoryName, TarExtractOptions options, CancellationToken cancellationToken = default);
    public static Task ExtractToDirectoryAsync(string sourceFileName, string destinationDirectoryName, TarExtractOptions options, CancellationToken cancellationToken = default);
}

// New class
public sealed class TarCreateOptions
{
    // Corresponds to the arg being added in https://github.com/dotnet/runtime/pull/123407
    public TarEntryFormat Format { get; set; } = System.Formats.Tar.TarEntryFormat.Pax;
    // Corresponds to existing CreateFromDirectory argument.
    public bool IncludeBaseDirectory { get; set; } = false;

    /// This value is passed to TarWriterOptions.DereferenceHardLinks.
    public bool DereferenceHardLinks { get; set; } = false;
}

// New class
public sealed class TarWriterOptions
{
    // Corresponds to existing constructor argument.
    public TarEntryFormat Format { get; set; } = System.Formats.Tar.TarEntryFormat.Pax;

    /// When set to true, TarWriter.WriteEntry(string fileName, string? entryName) and TarWriter.WriteEntryAsync(string fileName, string? entryName, CancellationToken)
    /// will store hard-linked files as separate entries in the archive.
    public bool DereferenceHardLinks { get; set; } = false;
}

// New class
public sealed class TarExtractOptions
{
    // Corresponds to existing ExtractToDirectoryAsync argument.
    public bool OverwriteFiles { get; set; } = false;

    /// When set to true, TarEntryType.HardLink entries will be restored as a copy of the linked file instead of creating a hard link.
    public bool DereferenceHardLinks { get; set; } = false;
}

API Usage

Extracting with Dereferenced Hard Links

// Extract tar archive with hard links converted to independent file copies
TarFile.ExtractToDirectory(
    "archive.tar",
    "/path/to/output",
    new TarExtractOptions
    {
        OverwriteFiles = true,
        DereferenceHardLinks = true  // Hard links become independent file copies
    }
);

Creating Archives with Dereferenced Hard Links

// Create tar archive storing hard-linked files as separate entries
await TarFile.CreateFromDirectoryAsync(
    "/path/to/source",
    "output.tar",
    new TarCreateOptions
    {
        Format = TarEntryFormat.Pax,
        DereferenceHardLinks = true  // Store hard links as separate files
    }
);

Using TarWriter with Options

// File.Create truncates an existing file; File.OpenWrite would leave stale trailing bytes.
using var stream = File.Create("archive.tar");
using var writer = new TarWriter(
    stream,
    new TarWriterOptions
    {
        Format = TarEntryFormat.Pax,
        DereferenceHardLinks = true
    }
);

writer.WriteEntry("/path/to/file", "entry-name");

Alternative Designs

  • An alternative would be to add a bool dereferenceHardLinks parameter directly to the existing methods. The proposed options classes provide extensibility without requiring additional overloads for future arguments.

  • The proposed default is DereferenceHardLinks = false. This is a change from earlier .NET versions, which duplicate the files in the archive.

  • Instead of adding separate types for TarCreateOptions and TarWriterOptions, a single type could be used, with IncludeBaseDirectory "ignored" by the TarWriter constructor.

  • Instead of TarCreateOptions duplicating all TarWriterOptions properties, we can reference that type instead:

// New class
public sealed class TarCreateOptions
{
    // Corresponds to existing CreateFromDirectory argument.
    public bool IncludeBaseDirectory { get; set; } = false;

    public TarWriterOptions TarWriterOptions { get; set; } = new (); // possibly lazy-init during get
}

Alternatively, the TarFile.CreateFromDirectory methods could take a (bool includeBaseDirectory, TarWriterOptions writerOptions) parameter pair instead of a TarCreateOptions instance.

Notes

  • This proposal does not include TarReaderOptions.DereferenceHardLinks because when a TarEntry is read via TarReader, the TarEntry.ExtractToFile(string destinationFileName) method has no base directory context to resolve the LinkName and locate the file that should be copied. When using TarFile.ExtractToDirectory, the base directory is known.

  • DereferenceHardLinks only applies to hard links. The behavior of symlinks can be made configurable by adding another property to the Options classes.
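To make the base-directory point in the first note concrete, here is a minimal sketch of how hard link entries could be restored as copies with the existing TarReader API when the destination directory is known. HardLinkCopyExtractor and its Resolve helper are hypothetical names introduced for this example and are not part of System.Formats.Tar. The sketch also assumes the linked entry appears earlier in the archive than the link, and it skips entry types other than directories, regular files and hard links.

```csharp
using System;
using System.Formats.Tar;
using System.IO;

// Hypothetical helper for illustration; not part of System.Formats.Tar.
static class HardLinkCopyExtractor
{
    public static void Extract(Stream source, string destinationDirectory, bool overwrite)
    {
        string root = Path.TrimEndingDirectorySeparator(Path.GetFullPath(destinationDirectory));
        using var reader = new TarReader(source, leaveOpen: true);

        while (reader.GetNextEntry() is TarEntry entry)
        {
            string target = Resolve(root, entry.Name);

            switch (entry.EntryType)
            {
                case TarEntryType.Directory:
                    Directory.CreateDirectory(target);
                    break;

                case TarEntryType.RegularFile:
                case TarEntryType.V7RegularFile:
                    Directory.CreateDirectory(Path.GetDirectoryName(target)!);
                    entry.ExtractToFile(target, overwrite);
                    break;

                case TarEntryType.HardLink:
                    // LinkName is relative to the archive root, so it can only be
                    // resolved because the destination directory is known here.
                    string linked = Resolve(root, entry.LinkName);
                    Directory.CreateDirectory(Path.GetDirectoryName(target)!);
                    File.Copy(linked, target, overwrite);
                    break;

                // Other entry types (symbolic links, devices, ...) are skipped in this sketch.
            }
        }
    }

    // Maps an entry name to a full path and rejects names that escape the root.
    private static string Resolve(string root, string name)
    {
        string full = Path.GetFullPath(Path.Combine(root, name));
        if (full != root &&
            !full.StartsWith(root + Path.DirectorySeparatorChar, StringComparison.Ordinal))
        {
            throw new IOException($"Entry '{name}' resolves outside the destination directory.");
        }
        return full;
    }
}
```

A caller would use it as HardLinkCopyExtractor.Extract(stream, "/path/to/output", overwrite: true). TarEntry.ExtractToFile alone cannot do the HardLink case, because it receives only the destination file name and not the root against which LinkName must be resolved.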

Metadata

Labels

  • api-ready-for-review: API is ready for review, it is NOT ready for implementation
  • area-System.Formats.Tar
  • help wanted: [up-for-grabs] Good issue for external contributors
  • in-pr: There is an active PR which will close this issue when it is merged
