Resolving Binary Provenance in Spack for Git Ref Versions #48121

@psakievich

Summary

This issue outlines the known body of work and the challenges involved in assigning git shas
to binaries built from git ref based versions, i.e. branches and tags. The long-term vision is that Spack will always perform this operation so that every binary has complete provenance. This issue outlines what is known and serves
as a focal point for discussion of:

  1. How to implement this feature
  2. If it should be initially rolled out with a configuration option
  3. Tracking work across multiple PRs

Tasks to be implemented

This is a running list of things that need to be done to resolve this issue:

Required

Requested

  • Adding commit verification to a src mirror (target 1.1)
  • Changing/configuring version projections for git ref versions (Likely won't do)
  • Git version src mirror content verification (target 1.1)
  • Add git ref version support to package definitions (Likely won't do)

Introduction:

As noted in #3112 and #4674, Spack's git based versions have holes in their binary provenance.
We currently allow versions with the git attribute, or versions of packages that have a git
attribute, to be declared from a git reference, i.e. a commit, tag, or branch.
Commits are fixed in the git history, but tags and branches are just references
to commits and are therefore mutable over time.

This leads to issues where users installing foo@main, where the main version is associated with
the main branch of the git repository, can have two installations with exactly the same Spack spec (same full hash)
but different binaries, because the commit sha associated with each build was different.
These specs, and therefore their binaries, have incomplete provenance.

In #30998 a syntax was added that allows the user to set git based versions with the
@git.[ref]=[spack version] syntax. Functionally this maps to @git.[actual source code]=[spack's model of the source code]. The [ref] determines the actual code and drives the stage operation, and the [spack version] is used
to determine spec satisfaction for version based rules during concretization.
These versions have asymmetric satisfaction criteria to safely enable reuse:

a = Version("main")
b = Version("git.[sha]=main")
assert b.satisfies(a)
assert not a.satisfies(b)

In the Spack-Manager project a "pin" extension
was developed that combed through a concretized graph, used git ls-remote to find the exact commit
hash for any branch based version, rewrote the spec using @git.[latest sha]=[branch based version], and
updated the Spack environment through a second concretization.

Assigning the commit sha directly provides complete binary provenance for any git ref based version, so it is desirable
for this functionality to exist as the native behavior of Spack. #44319 provides an initial implementation: after the concretizer finds a solution, any spec whose version is associated with a branch or tag uses git ls-remote to find the current commit sha, and the version selected during concretization is then paired with that commit sha.

This ensures binary provenance by always translating any git ref to a commit. It also simplifies fetching since it can always be done based on the commit.
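
A minimal sketch of the lookup step, assuming plain git ls-remote called through subprocess. The helper names parse_ls_remote and resolve_ref are hypothetical and are not part of Spack or of #44319. Preferring tags over branches when both share a name is an arbitrary choice made for illustration.

```python
import subprocess
from typing import Optional


def parse_ls_remote(output: str, ref: str) -> Optional[str]:
    """Return the commit sha for `ref` from `git ls-remote` output, or None."""
    found = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Each line has the form "<sha>\t<full ref name>".
        sha, _, name = line.partition("\t")
        found[name.strip()] = sha.strip()
    # An annotated tag is listed as a tag object, with the commit it points to
    # under "<name>^{}", so the peeled entry is checked first.
    for name in (f"refs/tags/{ref}^{{}}", f"refs/tags/{ref}", f"refs/heads/{ref}"):
        if name in found:
            return found[name]
    return None


def resolve_ref(url: str, ref: str) -> Optional[str]:
    """Query the remote (hypothetical helper) and resolve `ref` to a sha."""
    result = subprocess.run(
        ["git", "ls-remote", url], check=True, capture_output=True, text=True
    )
    return parse_ls_remote(result.stdout, ref)
```

The parsing is kept separate from the network call so that the same function could be fed output from another source, such as a mirror.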

There are several known issues and related requests with moving this to the default behavior:

Known issues

User Queries

Historically, users query by the version they supplied, e.g. spack find trilinos@master. In this case the record of binary provenance (the commit sha) was lost, but the user input was preserved. With git ref versions a user can specify a version foo@git.[tag]=[version]. This representation technically has 3 components:

  1. user requested ref [tag]
  2. modeled version [version]
  3. the actual commit used in the checkout.

#44319 replaces item 1 with item 3, which accounts for the binary provenance but breaks the standard user query API.
spack find foo@git.[tag]=[version] would not return a value, since the spec was transformed to foo@git.[commit]=[version] during concretization. To preserve the desired behavior we need a way to account for all 3 components of the version rather than swapping one for another.

Git ref lookup

Spack currently supports a version syntax where users specify only a git ref without a modeled version, i.e. foo@git.[ref]. These specs are ambiguous to Spack, so a feature has been implemented that traverses the git history of the package and finds the closest matching Spack version. This suffers from the same issues as above, with the added complication that Spack now has to discover two components instead of one. It is questionable how much use this feature sees, and deprecation/removal might be warranted.

Source Mirrors

#44319 uses git ls-remote to find the latest hash as a post-concretization step. This is not usable with source mirrors,
since the git repo is stored as a tarball and is not guaranteed to contain the git history. We need an additional adaptation that pulls the commit hash from the source mirror. There is also the question of whether to consult the source mirror or the git URL first.
Spack's default behavior is to check a mirror first, so that air-gapped systems, where mirrors
are almost always required, need no behavior changes. For this feature there is an argument for reversing that order. If users have a mirror configured
but are not in an air-gapped environment, it is reasonable for them to assume that the remote will be queried for the latest commit without first having to update the mirror. This would require a version-specific exception to Spack's default behavior.

Concretization Concerns

Automatically querying for the latest hash may have a large impact on usability for users who have git versions in the trunk or leaves of their graph. Forcing reconcretization as it exists today would add a cadence of graph rebuilds driven by the git traffic of these projects. Unexpected rebuilds are one of the leading sources of user frustration. This adds another source of complexity that is also completely outside of Spack's ability to control. It seems wise to add a guard or a set of additional concretization logic before making this the default and only behavior in Spack.

An ideal implementation would be for the concretizer not to update git refs unless requested, essentially a softer version of --force. An alternative could be a pin command like the one added in Spack-Manager, which changes the spack.yaml so that roots with a paired sha are rewritten to include the sha in the version, and any dependencies relying on git ref versions are also added.

So this:

# foo@main also depends on bar@main
spack:
  specs:
    - foo@main 

Would change to this:

spack:
  specs:
  - foo@git.[sha]=main ^bar@git.[sha]=main

A change like this has the undesirable side effect of modifying spack.yaml, but that seems to be the only way to guard against forced concretization without changing the concretizer's behavior.
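
The rewrite performed by such a pin command can be sketched as a plain string transformation. This is an illustration only: pin_root is a hypothetical helper, the specs are assumed to be simple name@ref strings, and the shas are assumed to have been resolved already (for example with git ls-remote).

```python
def pin_root(root_spec, dep_specs, shas):
    """Build a pinned root spec string.

    root_spec and dep_specs are 'name@ref' strings; shas maps a package
    name to its resolved commit sha. Packages without an entry in `shas`
    are left unchanged.
    """
    parts = []
    for index, spec in enumerate([root_spec, *dep_specs]):
        name, _, ref = spec.partition("@")
        if name in shas:
            spec = f"{name}@git.{shas[name]}={ref}"
        # The first entry is the root; the rest are appended as dependencies.
        parts.append(spec if index == 0 else "^" + spec)
    return " ".join(parts)


foo_sha = "a" * 40  # placeholder shas for illustration
bar_sha = "b" * 40
pinned = pin_root("foo@main", ["bar@main"], {"foo": foo_sha, "bar": bar_sha})
```

Here pinned has the shape foo@git.[sha]=main ^bar@git.[sha]=main, matching the spack.yaml entry shown above. Real specs carry variants, compilers, and other constraints that this sketch does not attempt to parse.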

Projections

#44328 notes that the projections and installation paths for these versions are very long. Additionally, SNL staff using the pin command developed in Spack-Manager have noticed that the = character can break paths in certain CI and automated processes that do not support it.

Both of these are fixable with some adjustments for projections, but this is another instance where custom behavior will need to be added for this version behavior change.
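
One possible shape for such an adjustment, shown only as a sketch: path_safe_version is a hypothetical helper, not an existing Spack projection feature, and the output format (modeled version followed by a shortened sha) is an assumption.

```python
def path_safe_version(version: str, sha_len: int = 7) -> str:
    """Turn 'git.<sha>=<version>' into '<version>-<short sha>'.

    Versions that do not use the git.<ref>=<version> form are returned
    unchanged.
    """
    prefix = "git."
    if not version.startswith(prefix) or "=" not in version:
        return version
    ref, _, modeled = version[len(prefix):].partition("=")
    # Dropping '=' avoids the unsupported character; truncating the sha
    # shortens the path at the cost of a small collision risk.
    return f"{modeled}-{ref[:sha_len]}"
```

Truncating the sha trades uniqueness for path length, so any real implementation would need to decide how many characters are enough.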

First class version citizenship

In the initial implementation we did not add support to packages for the git ref version, because it seemed questionable to allow this level of blurring of what a version is inside the packaging ecosystem. While not strictly related to this feature implementation, completeness would seem to include adding package support if these versions are going to be native products of concretization rather than just user-created artifacts. Furthermore, adding a dynamic property like commits to the version complicates our current paradigm of using versions as dictionary keys to pair with all their attributes. This impacts the hash and the simplicity of construction for current versions.

It is also worth noting that several users have wanted to add depends_on("foo@git.[sha]=main") dependencies to their packages. The argument here is compelling: as a package maintainer I know I need this specific commit of my dependency, and the dependency does not have a version that meets my needs.

Proposed Implementations

Syntax

Most of the syntax issues can be resolved by adding bookkeeping for all 3 components:

  1. user requested ref [tag]
  2. modeled version [version]
  3. the actual commit used in the checkout.

The main idea that has been discussed so far is to store item 1 as metadata on the version. In this method if a user requested a spec with @git.mytag=main that concretized to @git.[hash]=main then spack find [package]@git.mytag=main would return [package]@git.[hash]=main. This implementation would require additional conditionality for the satisfaction checks of versions.
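
A minimal sketch of this bookkeeping, assuming a standalone record type. GitVersionRecord and matches_query are hypothetical names and do not reflect Spack's actual version classes. The extra conditionality appears as a match on either the commit or the stored user ref.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GitVersionRecord:
    commit: str                          # item 3: commit used in the checkout
    version: str                         # item 2: modeled version
    requested_ref: Optional[str] = None  # item 1: user ref, kept as metadata

    def matches_query(self, ref: str, version: str) -> bool:
        """True if a query for @git.<ref>=<version> should find this record."""
        if version != self.version:
            return False
        # Match on the concrete commit or on the ref the user originally typed.
        return ref == self.commit or ref == self.requested_ref


installed = GitVersionRecord(commit="a" * 40, version="main", requested_ref="mytag")
```

With this record, a query for git.mytag=main and a query for git.[hash]=main both find the same installation, while a query with a different tag or a different modeled version does not.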

One alternative idea for handling binary provenance would be to move the git commit sha from the version to the spec. Post concretization, the commit sha could be added as a variant, similar to how dev_path is added for develop specs. This captures the binary provenance data that is currently being lost as a node attribute rather than a version attribute.

Source mirrors

The source mirror implementation that has been proposed is to overload the operator that calls git ls-remote with an implementation that pulls the sha from a mirror tarball. Ideally this will just be reading the name of the cached file to get the sha. There is a lingering security concern, as noted in #3112. This change adds the information that could be used to solve it. If the mirror is guaranteed to contain the git metadata for the current commit, then verification can happen when the tar file is unpacked. Simply confirming the git sha would certainly be possible, but this would need to be implemented. An additional follow-on could be to use the tree hash, but that would also need to be obtained/provided to ensure security.
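
A sketch of both steps under stated assumptions: the cached file name is assumed to embed a full 40-character SHA-1 (the layout foo-git.[sha]=main.tar.gz used below is hypothetical, not Spack's actual mirror layout), and the unpacked archive is assumed to still contain its .git directory. The helper names are illustrative only.

```python
import re
import subprocess
from typing import Optional

# Exactly 40 lowercase hex characters, not embedded in a longer hex run.
_SHA_RE = re.compile(r"(?<![0-9a-f])[0-9a-f]{40}(?![0-9a-f])")


def sha_from_mirror_filename(filename: str) -> Optional[str]:
    """Extract a commit sha from a cached file name, or None if absent."""
    match = _SHA_RE.search(filename)
    return match.group(0) if match else None


def checkout_matches(repo_dir: str, expected_sha: str) -> bool:
    """Confirm that HEAD of the unpacked repository is the expected commit."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip() == expected_sha
```

Comparing HEAD against the sha only confirms which commit the metadata claims; it does not by itself prove the working tree contents are untampered, which is where the tree hash follow-on would come in.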

Concretization

There is currently no scoped/approved effort to address the concerns about increased misses and rebuilds resulting from tighter coupling of versions and git metadata. The pin command proposed under Concretization Concerns relies on the unsavory methodology of modifying root specs. It will also not work with matrices.

Conclusion

It is very desirable to get binary provenance in place for all version types. We want to get to a state where all binaries generated by Spack have complete provenance. However, the number of changes and areas of impact identified so far seems sufficient to warrant adding a configuration flag to turn the feature on and off for a time.

In my opinion the issue of repeated re-concretization is sufficient to delay rolling out this feature as the native
behavior of Spack until these issues can be addressed. There are also a number of details and tasks that are not high priority but would greatly improve the user experience once commit sha assignment is automated. It would be nice to get a base implementation into develop sooner rather than later, so the community can fan out and implement these features in parallel according to their interests and needs.

I would propose that we merge with a configuration flag ASAP and set a minimal target for removing that flag. Ideally we can meet our target and remove the flag with the release of 1.0, so it will just be a temporary flag that existed between 0.23 and 1.0. If not, then we carry it through the standard release deprecation cycle.

Finally, I have plans for this feature to help with my efforts on modeling a mono-repo as a series of spack packages. I would selfishly like to see the syntax finalized and merged ASAP to enable/accelerate binary caching to developers with versions pinned to commits rather than branches.

Rationale

No response

Description

No response

Additional information

No response

General information

  • I have searched the issues of this repo and believe this is not a duplicate
