Skip to content

spack develop: use git to detect changes#41013

Draft
scheibelp wants to merge 43 commits intospack:developfrom
scheibelp:features/develop-use-git
Draft

spack develop: use git to detect changes#41013
scheibelp wants to merge 43 commits intospack:developfrom
scheibelp:features/develop-use-git

Conversation

@scheibelp
Copy link
Copy Markdown
Member

@scheibelp scheibelp commented Nov 10, 2023

When a user edits in an environment packages marked as develop (using spack develop), Spack checks the timestamps on all those files to figure out whether a given spec needs a rebuild.

For packages that are developed using Git, it is faster to defer to Git's change detection: this PR updates Spack to maintain an additional file inside of the source directory of the developed package to track the prior hash state, and then compares that with the current Git hash to detect whether any files have changed.

In part, this speedup is achieved by avoiding untracked files, and for now is only enabled when using the detect-changes-with-git environment property (the default is to use timestamps).

This borrows some ideas from #35442 (e.g. more-abstract tests that rely on mocking over monkey-patching).

@scheibelp scheibelp added the WIP label Nov 10, 2023
@spackbot-app spackbot-app bot added core PR affects Spack core functionality environments stand-alone-tests Stand-alone (or smoke) tests for installed packages tests General test capability(ies) labels Nov 10, 2023
@scheibelp scheibelp marked this pull request as draft November 10, 2023 20:08
@scheibelp scheibelp removed the WIP label Nov 14, 2023
@tgamblin tgamblin changed the title "spack develop": use git to detect changes spack develop: use git to detect changes Dec 13, 2023
Comment on lines +499 to +503
# Note: originally I thought about returning NOT_CHANGED
# if the git state is "clean", but technically a user
# could have initially installed with a dirty state, and
# then reverted their changes.
return GitRepoChangeDetector.NO_PRIOR
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a good reason not to just say that this is CHANGED? They're likely to need to build the first time anyway.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For users who were depending on timestamp detection, this allows for them to use timestamps the first time after pulling this (and also calculate the Git state, so the next time, it can just use Git)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm saying why not build the first time, record that it was built, and calculate the git hash at that time, rather than relying on timestamps?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do that, but to clarify: if we do it that way, we may possibly rebuild unnecessarily the first time we run this logic after pulling the new feature. I decided it would be worthwhile to detect that case and fall back on timestamps, but it sounds like you don't see that as particularly useful.

IMO there's no issue with forcing a rebuild the first time to simplify the logic, but this assumes that someone would in fact find it simpler for us to default to indicating there was a change when we don't have a prior (which IMO is debatable).

Anyhow, I'll assume that I should remove the NO_PRIOR logic.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason the git logic has to be unaware of untracked files? We could use git status --porcelain to see even unknown files in the directory, e.g. in my spack repo right now:

> git status --porcelain
 M lib/spack/spack/traverse.py
?? check_all_dicts.py
?? non-const
?? nonconst.py

One modified file and 3 unknowns -- pretty easy to parse, and git status is probably the fastest way to find this out. You could add the unknowns to the index, which would include them in the tree hash.

Also, it seems like we could use this to avoid a full traversal of the directory. You know that git status reports only what is different from the last commit, and you know the modification date of the last commit. if we find that we have no prior, couldn't we avoid a full traversal by looking at the last commit and the modification date of any files mentioned by git status? You can terminate early if you find anything newer than the last build of the package.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this appears to be a response to #41013 (comment), so I'll respond with that in mind)

Is there a reason the git logic has to be unaware of untracked files?

It does not have to be unaware, but my guess is most of the speedup over timestamp checking is achieved by not looking for untracked files.

One modified file and 3 unknowns -- pretty easy to parse,

Parsing is not my concern. I think git status has to do a traversal, and I think that traversal is costly. For example, running git status on Spack's directory in an NFS dir yields:

$ git status
...
It took 13.95 seconds to enumerate untracked files. 'status -uno'
may speed it up, but you have to be careful not to forget to add
new files yourself (see 'git help status').

FWIW, repeated invocations of git status in this context take much less time - only ~3 seconds or so - although I'm not sure if that's on account of FS caching or something smart that Git is doing.

You know that git status reports only what is different from the last commit, and you know the modification date of the last commit. if we find that we have no prior, couldn't we avoid a full traversal by looking at the last commit and the modification date of any files mentioned by git status?

I don't know how git status itself finds untracked files other than traversing directories, e.g.

git status
touch a/file/deep/within/the/directory/hierarchy
git status # how would Git know where to look?

FWIW I assume that whenever Git has to do traversals, it does that as quickly as it can (which is to say, even if we did force git status it would still probably be faster than timestamp checking).

You could add the unknowns to the index, which would include them in the tree hash.

Also note that without something like #41373, build artifacts would appear as untracked files in the git repo (that PR was made with other intentions, but it would also ease this).

You can terminate early if you find anything newer than the last build of the package.

For single-file-change updates, that would on average cut the cost in half. When developing several packages that typically do not change, this still incurs the traversal cost each time for those.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, repeated invocations of git status in this context take much less time - only ~3 seconds or so - although I'm not sure if that's on account of FS caching or something smart that Git is doing.

Yep that's FS caching. What does a cold git status look like on a repo with feature.manyFiles=true? Is it faster?

FWIW I assume that whenever Git has to do traversals, it does that as quickly as it can (which is to say, even if we did force git status it would still probably be faster than timestamp checking).

Yes.

For single-file-change updates, that would on average cut the cost in half. When developing several packages that typically do not change, this still incurs the traversal cost each time for those.

I don't think this is completely true, because you're speeding up the traversal just by delegating it to git. One of the issues here is that python is so slow. Does a Python traversal take 13 seconds? Or more?

How fast is doing the tree hash? I don't see how git add -u is going to be too much faster than git status, simply because it has to look at tracked files. That's not necessarily a directory traversal, but it does require git to stat the files to see what tracked files have changed.

Copy link
Copy Markdown
Member Author

@scheibelp scheibelp Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a cold git status, it does not seem to speed things up for Spack

$ git config --global feature.manyFiles true
$ git status
...
It took 14.28 seconds...

However, follow-up invocations of git status take a fraction of a second (much faster than before), so this appears to speed things up sufficiently. I chased a few links to learn more and found https://git-scm.com/docs/git-update-index#_untracked_cache (one of the behaviors enabled by using the manyfiles feature).

Given that:

  • Using git add on new files along with git status seems reasonable from a speed perspective
  • Not all users can enable manyfiles, but we could start with git status by default, and then add an option to disable file tracking as a Spack config option (i.e. users can turn off git status to speed things up)
    • Admittedly telling each user to git config --global feature.manyFiles true would likely be taxing, so we might want to do that automatically as well, where we can (although I'm not sure if we can ever do it globally, since users could have distinct git implementations which are not all compatible with this feature)
  • While it's not strictly wrong to git add build artifacts, I'd want to think about the potential repercussions of that (again, this would also be resolved by spack develop: stage build artifacts in same root as non-dev builds #41373)
    • Note that while CMake packages build out-of-source in general, they don't do this for spack develop packages (or rather, the build directory is a subdir of the source directory, so isolation is achieved, but the build is not truly out-of-source)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would also be resolved by #41373

This PR has been merged, but in fact it only resolves the issue of git adding build artifacts for out-of-source builds (e.g. all CMake packages, so many would work well with this). This PR could potentially constrain its optimizations to such packages.

},
},
"specs": spack.schema.spec_list_schema,
"detect-changes-with-git": {"type": "boolean", "default": False},
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Why False by default, or why even have an option here? I think it would be better if we just default to git if a develop directory is cloned with git.
  2. If there is a config option for this we shouldn't put it in the env schema -- let's put it in config.yaml (so that it is in an actual config section)... but I'm not sure why it can't just be automatic.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only detects changes for tracked files: if the user adds a new file without calling git add, then this feature would fail to detect a change. IMO that meant this should be off by default (since the user has to be aware of a caveat to know when this will work).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(note, this conversation was accidentally continued at #41013 (comment))

@scheibelp
Copy link
Copy Markdown
Member Author

I think the following would resolve #41013 (comment)

  • Automatically do git change detection for sources that are git-managed and CMake-based (or out-of-source)
    • This means we can automatically track/record changes for unstaged files without worrying about accidentally tracking binary files
      • (there are probably pathological exceptions to this guard but I'm willing to let them arise "naturally")
  • Add a config option to enable it for git-based repositories that are not out-of-source (with the caveat that they must git add things explicitly)

The main thing I'm trying to avoid in this is tracking changes for binary artifacts generated as part of a build.

@scheibelp scheibelp marked this pull request as draft December 6, 2024 02:17
…dev_src_change and do attr test (this makes it easier to test git detection without causing problems with existing api for that function)
Copy link
Copy Markdown
Contributor

@psakievich psakievich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New changes will break Sierra. We need to hash this out.

@psakievich psakievich self-requested a review January 24, 2025 22:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core PR affects Spack core functionality environments stand-alone-tests Stand-alone (or smoke) tests for installed packages tests General test capability(ies)

Projects

Status: In-Progress PRs

Development

Successfully merging this pull request may close these issues.

3 participants