Skip to content

Generate reproducible tarball#9674

Closed
tamalsaha wants to merge 1 commit intohelm:mainfrom
x-helm:repobuild
Closed

Generate reproducible tarball#9674
tamalsaha wants to merge 1 commit intohelm:mainfrom
x-helm:repobuild

Conversation

@tamalsaha
Copy link
Contributor

@tamalsaha tamalsaha commented May 7, 2021

Signed-off-by: Tamal Saha [email protected]

What this PR does / why we need it:
Currently helm package does not produce reproducible tarballs. This change sets the timestamps to 0 to fix this. We store the tarballs in git repos. This helps us avoid unnecessary changes everytime we run helm package.

See also: https://unix.stackexchange.com/a/438330/42136

Special notes for your reviewer:

If applicable:

  • this PR contains documentation
  • this PR contains unit tests
  • this PR has been tested for backwards compatibility

@helm-bot helm-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 7, 2021
@helm-bot helm-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 7, 2021
@helm-bot helm-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 7, 2021
@mattfarina
Copy link
Collaborator

I think #3612 is the bug being fixed by this.

@mattfarina
Copy link
Collaborator

I have an open question on this PR with the other maintainers. That is, is the removal of the timestamp a breaking change? We have been discussing these nuances lately. I added this issue to the agenda for the next meeting.

Name: filepath.ToSlash(name),
Mode: 0644,
Size: int64(len(body)),
ModTime: time.Now(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing ModTime will reintroduce behavior described in #4158, which was fixed in #4161 and ported to V3 in #7272

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, the current implementation of modTime is also wrong because it just sets the time when helm package was run. If it wants to be correct, it should set the original template files timestamp.

Also, zero-ing out timestamps is a common pattern for reproducible builds GoogleContainerTools/distroless#242

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless, @baijum is right. We tried this, and someone reported it as a bug that was addressed with #4161. To go back and re-implement this again would be repeating past mistakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be this can be added as a flag to helm package command for those of us want to have reproducible charts.

Copy link
Member

@bacongobbler bacongobbler May 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everyone wants a reproducible build. It does not need to be hidden behind a flag. What I am trying to convey is that stripping mtime from the file is NOT the right way to go about addressing the issue.

#3612 (comment) suggests the issue is not due to timestamps, but instead the sorted order of the files within the tar file.

Can you please look further into the suggestions made in that thread before coming up with a solution. Even if you strip mtime fields from the archive, you may still not produce a reproducible build. We've looked into this issue multiple times and discovered it's not the mtime of the original files causing the issue here. Or at least, there are multiple layers causing this issue, and to solve it we need to address them all.

We tested this theory in another area and discovered stripping mtime from the archive does not resolve the issue. See #8216 (review).

In other words, it is possible that

  1. preserving the original file's mtime, and
  2. sorting files by name (as you've done so in this PR)

May fix this issue. but without any test infrastructure in place to confirm, we cannot prove it.

If you're attempting to fix something, please provide tests to ensure that your change addresses the issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a backwards-compatible flag, like helm package --timestamp where timestamp defaults to true, but users can set it to false for reproducibility on that level.

Copy link
Member

@bacongobbler bacongobbler May 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in the developer call today, the first step would be to write a test to identify the symptom. Then write a fix to alleviate the symptom.

It is possible that we only need to sort the files in the archive to fix the issue, which means there would not be any need for stripping file mtimes (and therefore no need for a --timestamp flag).

We did test the theory of stripping file mtimes before and that did not allay the symptom. @jdolitsky can also attest to this as he and his coworker tried to fix this for helm chart save. So I'm not confident that stripping file mtimes will fix the problem.

Let's get some tests in first before we start providing solutions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be tried without the timestamp going to 0. But, I suspect any timestamp difference will cause a different hash. A different time stored in the tar when everything else is the same is still different content that would produce a different hash.

In another part of the PR there is sorting on the content. The order of the files matters and I hope the change here solves the issue. I have not checked, though.

If the timestamp needs to be 0 then that would need to be hidden behind a flag. We can't break backwards compatibility. I would suggest a name like --reproducible-build.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be aware that Git doesn't preserve modification time on files. So an Application Distributor cannot create a reproducible build by preserving the original file's modification time because the modified time will be different whenever the chart source checks out. I think using a flag (--reproducible-build) and setting timestamp to 0 looks acceptable to me—some tests to avoid regression would be great.

Copy link
Collaborator

@mattfarina mattfarina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for putting this pull request together.

One thing this PR needs is tests to make sure the tarball build is reproducible. We need to build a package at least twice and compare the hash of them to verify the fix works.

This is important for Helm so that we can catch a regression in the future.

@mattfarina
Copy link
Collaborator

@tamalsaha are you interested or able to write the tests for this? If not, someone else can pick it up.

@tamalsaha
Copy link
Contributor Author

If not, someone else can pick it up.

If someone else can pick up that will be really great! Thanks.

@tamalsaha tamalsaha closed this Jun 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants