git: Improve Status() speed with new index.ModTime check#1747
git: Improve Status() speed with new index.ModTime check#1747pjbgf merged 7 commits intogo-git:mainfrom
index.ModTime check#1747Conversation
|
For context, I functionally tried the same in: #1694 This PR surely is more production ready. However, in the time between making my PR and and now, I've looked at upstream git some more. There is are some edge cases that they catch. I think they are needed in the go-git implementation as well. A big one is documented here https://git-scm.com/docs/racy-git |
Oh, sorry, I missed it. I have just been hammering my local benchmark and going at everything that stick out.
I was aware of this issue while reading the git code, but that documentation is a lot more understandable with a lot more context. I had assumed that Golang code in https://github.com/golang/go/blob/master/src/time/time.go#L316 was going to be enough. Seems I need to double check. I wonder how I can write a tests case for this would help ensure we don't have this problem. Thanks a lot for the link. |
e2565f3 to
a0df348
Compare
|
The solution is actually simple, we just need to carry around the mtime of the index file and fallback to compute hash in case mtime of the file == mtime of the index. In practice, that shouldn't really happen very often, so the performance impact shouldn't be a big deal for anyone and we keep the race condition out in a simple manner. |
bb9e85f to
e58bf73
Compare
e58bf73 to
74b08ce
Compare
utils/merkletrie/filesystem/node.go
Outdated
| // Deprecated: Use NewRootNodeWithIndex instead for better performance. | ||
| // This function is kept for backward compatibility and now internally calls | ||
| // NewRootNodeWithIndex with a nil index, which disables the metadata | ||
| // optimization but maintains the same functionality. |
There was a problem hiding this comment.
Should I just keep this function or remove them as this is work for v6 and API break are still acceptable at this point in time?
There was a problem hiding this comment.
Let's keep NewRootNodeWithOptions and add the index in the Options instead.
74b08ce to
b431562
Compare
|
Performance improvement on the provided benchmark: |
e0ac8f6 to
68ad755
Compare
|
I hadn't covered the in memory storage case previously and recent main test improvement added a tests case that caught that. This solve the problem in both a defensive way (making sure that if we don't have a proper index or modification time set on the index, it will fallback to full status check) and then in an optimized way by setting the index time in memory to match the time we load the index. |
f2febfb to
751a53c
Compare
pjbgf
left a comment
There was a problem hiding this comment.
@cedric-appdirect good stuff, thanks for looking into this it largely LGTM. Please rebase and take a look at the changes below.
Thank you @hansbogert for sharing the additional information around the racy index and helping with the initial feedback. 🙇
utils/merkletrie/filesystem/node.go
Outdated
| // Deprecated: Use NewRootNodeWithIndex instead for better performance. | ||
| // This function is kept for backward compatibility and now internally calls | ||
| // NewRootNodeWithIndex with a nil index, which disables the metadata | ||
| // optimization but maintains the same functionality. |
There was a problem hiding this comment.
Let's keep NewRootNodeWithOptions and add the index in the Options instead.
|
For future reference, a |
Add benchmarks to measure Status() performance on repositories with thousands of files. These benchmarks help identify optimization opportunities in the status checking code path. BenchmarkStatusClean measures worst-case performance: a clean repository where every file's hash is computed unnecessarily. This is the primary target for the upcoming metadata-first comparison optimization. BenchmarkStatusModified measures a more realistic case with a small percentage of modified files. BenchmarkStatusLarge tests performance on larger repositories with 5000 files to ensure optimizations scale appropriately.
Store the modification time of the index file in the Index structure. This timestamp is captured when the index is loaded and will be used to properly handle the racy-git condition in the worktree status optimization. The ModTime is obtained by calling Stat() on the open file handle, ensuring it corresponds exactly to the index content that was read. Reference: https://git-scm.com/docs/racy-git
… check Implement metadata-first comparison to avoid unnecessary file hashing when checking status. This matches native git's ie_match_stat() approach. Before hashing a file, check if its metadata (mtime, size, mode) matches the index entry. If metadata matches AND the file's mtime is before the index file's mtime (not in racy-git window), reuse the hash from the index instead of reading and hashing the file content. This optimization dramatically reduces I/O operations for Status() calls on unchanged files: - Clean repository (2000 files): 769ms → 143ms (~5.4x faster) - Repository with 1% modified (2000 files): ~202ms - Large repository (5000 files): ~533ms The optimization includes: - O(1) index entry lookup using a pre-built map - Cached modification time from ReadDir (no extra Lstat calls) - Size, mtime, and mode comparison before any file I/O - Racy-git check using idx.ModTime to ensure correctness This is the same "trust the index" optimization that makes native git status fast. The implementation: 1. Builds an entry map on root node creation for fast lookups 2. Caches file metadata (mtime, size, mode) from ReadDir 3. Compares cached metadata with index entries before hashing 4. Checks for racy-git condition (file mtime >= index mtime) 5. Only hashes files when metadata differs or in racy window The optimization is backward compatible - when idx is nil, the function works without optimization. The old constructors are deprecated but continue to work by calling NewRootNodeWithIndex with nil. Includes test demonstrating proper racy-git handling. Reference: https://git-scm.com/docs/racy-git
The ModTime field added to index.Index is filesystem metadata that gets populated from the file modification time when reading, not from the serialized index content. This means it does not round-trip through SetIndex/Index operations. Update the test to exclude ModTime from the equality comparison since it is expected to differ between the original and read-back index.
… optimization The metadata-first optimization for git status compares file metadata (size, mtime, mode) against the index to avoid expensive content hashing. To ensure correctness, it relies on a "racy git" safety check when files are modified in the same second the index was written. However, with in-memory storage (memory.NewStorage()), idx.ModTime is never set (remains zero), silently skipping the racy git check. On Windows—where time.Now() has coarser resolution (~15ms)—a tracked file could be modified with the same size and mtime, incorrectly appearing unchanged and causing false negatives in status checks. This fix conservatively disables the metadata optimization when the racy git check cannot be performed (idx is nil or idx.ModTime is zero), ensuring correctness. This only affects in-memory storage used in tests; production filesystem storage always has idx.ModTime set and retains the full optimization. Fixes CI test failure in TestStatusCheckedInBeforeIgnored.
The metadata-first optimization relies on racy git detection, which requires idx.ModTime to be set. A prior change disabled this optimization when ModTime was zero to ensure correctness. However, this inadvertently disabled the optimization for all in-memory storage usage. This change re-enables the optimization by making memory storage set ModTime to time.Now() during SetIndex, simulating filesystem storage behavior where ModTime represents the index file's modification time. With ModTime properly set, the racy git check can now function correctly for in-memory storage. Additionally, update the index round-trip test to verify that ModTime is set by SetIndex before zeroing both sides for structural comparison.
… guard Move the index parameter into the Options struct and remove NewRootNodeWithIndex in favor of NewRootNodeWithOptions, addressing PR review feedback. The index can now be passed via options.Index, simplifying the API surface. Also add a nil guard in metadataMatches to safely handle cases where no index entry is available, preventing potential panics.
751a53c to
3c18588
Compare
index.ModTime check
pjbgf
left a comment
There was a problem hiding this comment.
@cedric-appdirect thanks for working on this. 🙇
This should help #181. This mimick what git cli does by relying on file mtime and size to decide if we need to compute its hash. This does not address the untracked file case.