spack develop: use git to detect changes#41013
spack develop: use git to detect changes#41013scheibelp wants to merge 43 commits intospack:developfrom
spack develop: use git to detect changes#41013Conversation
… change without affecting other tests
…arameter (spec.installed bypasses)
…s since been abstracted
spack develop: use git to detect changes
| # Note: originally I thought about returning NOT_CHANGED | ||
| # if the git state is "clean", but technically a user | ||
| # could have initially installed with a dirty state, and | ||
| # then reverted their changes. | ||
| return GitRepoChangeDetector.NO_PRIOR |
There was a problem hiding this comment.
is there a good reason not to just say that this is CHANGED? They're likely to need to build the first time anyway.
There was a problem hiding this comment.
For users who were depending on timestamp detection, this allows for them to use timestamps the first time after pulling this (and also calculate the Git state, so the next time, it can just use Git)
There was a problem hiding this comment.
I'm saying why not build the first time, record that it was built, and calculate the git hash at that time, rather than relying on timestamps?
There was a problem hiding this comment.
We could do that, but to clarify: if we do it that way, we may possibly rebuild unnecessarily the first time we run this logic after pulling the new feature. I decided it would be worthwhile to detect that case and fall back on timestamps, but it sounds like you don't see that as particularly useful.
IMO there's no issue with forcing a rebuild the first time to simplify the logic, but this assumes that someone would in fact find it simpler for us to default to indicating there was a change when we don't have a prior (which IMO is debatable).
Anyhow, I'll assume that I should remove the NO_PRIOR logic.
There was a problem hiding this comment.
Is there a reason the git logic has to be unaware of untracked files? We could use git status --porcelain to see even unknown files in the directory, e.g. in my spack repo right now:
> git status --porcelain
M lib/spack/spack/traverse.py
?? check_all_dicts.py
?? non-const
?? nonconst.pyOne modified file and 3 unknowns -- pretty easy to parse, and git status is probably the fastest way to find this out. You could add the unknowns to the index, which would include them in the tree hash.
Also, it seems like we could use this to avoid a full traversal of the directory. You know that git status reports only what is different from the last commit, and you know the modification date of the last commit. if we find that we have no prior, couldn't we avoid a full traversal by looking at the last commit and the modification date of any files mentioned by git status? You can terminate early if you find anything newer than the last build of the package.
There was a problem hiding this comment.
(this appears to be a response to #41013 (comment), so I'll respond with that in mind)
Is there a reason the git logic has to be unaware of untracked files?
It does not have to be unaware, but my guess is most of the speedup over timestamp checking is achieved by not looking for untracked files.
One modified file and 3 unknowns -- pretty easy to parse,
Parsing is not my concern. I think git status has to do a traversal, and I think that traversal is costly. For example, running git status on Spack's directory in an NFS dir yields:
$ git status
...
It took 13.95 seconds to enumerate untracked files. 'status -uno'
may speed it up, but you have to be careful not to forget to add
new files yourself (see 'git help status').
FWIW, repeated invocations of git status in this context take much less time - only ~3 seconds or so - although I'm not sure if that's on account of FS caching or something smart that Git is doing.
You know that git status reports only what is different from the last commit, and you know the modification date of the last commit. if we find that we have no prior, couldn't we avoid a full traversal by looking at the last commit and the modification date of any files mentioned by git status?
I don't know how git status itself finds untracked files other than traversing directories, e.g.
git status
touch a/file/deep/within/the/directory/hierarchy
git status # how would Git know where to look?
FWIW I assume that whenever Git has to do traversals, it does that as quickly as it can (which is to say, even if we did force git status it would still probably be faster than timestamp checking).
You could add the unknowns to the index, which would include them in the tree hash.
Also note that without something like #41373, build artifacts would appear as untracked files in the git repo (that PR was made with other intentions, but it would also ease this).
You can terminate early if you find anything newer than the last build of the package.
For single-file-change updates, that would on average cut the cost in half. When developing several packages that typically do not change, this still incurs the traversal cost each time for those.
There was a problem hiding this comment.
FWIW, repeated invocations of git status in this context take much less time - only ~3 seconds or so - although I'm not sure if that's on account of FS caching or something smart that Git is doing.
Yep that's FS caching. What does a cold git status look like on a repo with feature.manyFiles=true? Is it faster?
FWIW I assume that whenever Git has to do traversals, it does that as quickly as it can (which is to say, even if we did force git status it would still probably be faster than timestamp checking).
Yes.
For single-file-change updates, that would on average cut the cost in half. When developing several packages that typically do not change, this still incurs the traversal cost each time for those.
I don't think this is completely true, because you're speeding up the traversal just by delegating it to git. One of the issues here is that python is so slow. Does a Python traversal take 13 seconds? Or more?
How fast is doing the tree hash? I don't see how git add -u is going to be too much faster than git status, simply because it has to look at tracked files. That's not necessarily a directory traversal, but it does require git to stat the files to see what tracked files have changed.
There was a problem hiding this comment.
For a cold git status, it does not seem to speed things up for Spack
$ git config --global feature.manyFiles true
$ git status
...
It took 14.28 seconds...
However, follow-up invocations of git status take a fraction of a second (much faster than before), so this appears to speed things up sufficiently. I chased a few links to learn more and found https://git-scm.com/docs/git-update-index#_untracked_cache (one of the behaviors enabled by using the manyfiles feature).
Given that:
- Using
git addon new files along withgit statusseems reasonable from a speed perspective - Not all users can enable
manyfiles, but we could start withgit statusby default, and then add an option to disable file tracking as a Spack config option (i.e. users can turn offgit statusto speed things up)- Admittedly telling each user to
git config --global feature.manyFiles truewould likely be taxing, so we might want to do that automatically as well, where we can (although I'm not sure if we can ever do it globally, since users could have distinct git implementations which are not all compatible with this feature)
- Admittedly telling each user to
- While it's not strictly wrong to
git addbuild artifacts, I'd want to think about the potential repercussions of that (again, this would also be resolved byspack develop: stage build artifacts in same root as non-dev builds #41373)- Note that while CMake packages build out-of-source in general, they don't do this for
spack developpackages (or rather, the build directory is a subdir of the source directory, so isolation is achieved, but the build is not truly out-of-source)
- Note that while CMake packages build out-of-source in general, they don't do this for
There was a problem hiding this comment.
this would also be resolved by #41373
This PR has been merged, but in fact it only resolves the issue of git adding build artifacts for out-of-source builds (e.g. all CMake packages, so many would work well with this). This PR could potentially constrain its optimizations to such packages.
lib/spack/spack/schema/env.py
Outdated
| }, | ||
| }, | ||
| "specs": spack.schema.spec_list_schema, | ||
| "detect-changes-with-git": {"type": "boolean", "default": False}, |
There was a problem hiding this comment.
- Why
Falseby default, or why even have an option here? I think it would be better if we just default to git if adevelopdirectory is cloned withgit. - If there is a config option for this we shouldn't put it in the env schema -- let's put it in
config.yaml(so that it is in an actual config section)... but I'm not sure why it can't just be automatic.
There was a problem hiding this comment.
This only detects changes for tracked files: if the user adds a new file without calling git add, then this feature would fail to detect a change. IMO that meant this should be off by default (since the user has to be aware of a caveat to know when this will work).
There was a problem hiding this comment.
(note, this conversation was accidentally continued at #41013 (comment))
…at a rebuild is needed; replace enum with boolean
|
I think the following would resolve #41013 (comment)
The main thing I'm trying to avoid in this is tracking changes for binary artifacts generated as part of a build. |
psakievich
left a comment
There was a problem hiding this comment.
New changes will break Sierra. We need to hash this out.
When a user edits in an environment packages marked as develop (using
spack develop), Spack checks the timestamps on all those files to figure out whether a given spec needs a rebuild.For packages that are developed using Git, it is faster to defer to Git's change detection: this PR updates Spack to maintain an additional file inside of the source directory of the developed package to track the prior hash state, and then compares that with the current Git hash to detect whether any files have changed.
In part, this speedup is achieved by avoiding untracked files, and for now is only enabled when using the
detect-changes-with-gitenvironment property (the default is to use timestamps).This borrows some ideas from #35442 (e.g. more-abstract tests that rely on mocking over monkey-patching).