Conversation

Contributor

@alexkad0 alexkad0 commented May 3, 2025

File path pattern matching turns out to be a significant performance
bottleneck while migrating a large (> 2 million objects) repository.
Caching the result of a file path pattern match may speed up LFS
migration in such a repository by up to 50%.

The pull request introduces the switch --cache-file-path-filter-results
for git-lfs-migrate to enable file path pattern match caching.

@alexkad0 alexkad0 requested a review from a team as a code owner May 3, 2025 22:39
@alexkad0 alexkad0 force-pushed the cache-file-path-filter-match-result branch 2 times, most recently from 83d3d8f to 3f00f87 Compare May 3, 2025 23:00
@chrisd8088 chrisd8088 added enhancement migration Related to `git lfs migrate` labels May 8, 2025
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request May 9, 2025
Building on the work in PR git-lfs#6047, we implement a common LRU cache
for the Filter structure in our "filepathfilter" package, which is
used by all commands that accept command-line or configuration
options to filter their actions by file paths.

Specifically, any of our commands which accept --include or --exclude
options (or their -I and -X short forms), or read the "lfs.fetchInclude"
and "lfs.fetchExclude" configuration options, should benefit from
the use of a cache to return previously-determined results when
matching file paths against the file path filter patterns.

We set the default cache size at 10,000 entries, and provide a new
"lfs.pathFilterCacheSize" configuration option which may be used to
resize the cache as desired.  The special values "none" and "0" disable
the cache, and the value "unlimited" allows the cache to grow without
bound.

Note that to pass the cache size setting to the "filepathfilter"
package we rename its "option" type to "Option" so it will be
exported and available from our "commands" package.

We use the Least-Recently Used (LRU) cache implementation from the
"golang/groupcache" Go module, which uses a map plus a doubly-linked
list to track the most- and least-recently used entries.  We add a
mutex around our accesses to the cache in case any of our commands
use the Allows() method of the Filter structure from concurrent
goroutines.
Member

@chrisd8088 chrisd8088 left a comment

Thank you so much for this great PR! These kinds of enhancements are really valuable.

After reflecting on this for a few days, I didn't see an obvious reason why we should constrain a potential performance improvement like this to just the git lfs migrate command, or require users to know about a specific command-line option to get the benefit of using the cache.

I did wonder whether we should, by default, provide an upper bound on the size of the cache, though, and if an LRU cache would be of value in this regard so we didn't arbitrarily eject entries from the cache which were frequently used.

To that end, I wrote up some possible further changes in my cache-file-path-filter-match-result branch. If you have some time to take a look at those, that would be fantastic, especially if you're able to run a git-lfs version from that branch against your large repositories to see what the timing differences are from your version or a standard Git LFS client.

(There are more complex LRU implementations out there, but the one I used is relatively straightforward internally and has a compatible license, so I'm hoping it might be sufficient.)

Again, thank you so much for these great ideas and PRs!

Comment on lines 142 to 148
	cachedResult, cacheHit := f.cache[filename]
	if cacheHit {
		return cachedResult
	}

	res := f.allowsUncached(filename)
	f.cache[filename] = res
Member

I don't think we call the Allows() method from multiple goroutines anywhere, but just to be on the safe side, we may want to add some locking around these cache map reads and writes.

I also wondered about providing an upper bound on the size of the cache, since some migrations of very large repositories with a lot of history may generate significant numbers of distinct file path matches against the filter patterns.

With that in mind I looked into the LRU (Least-Recently Used) cache implementation available in the lru package in the golang/groupcache module, and hacked together an implementation which uses it. From looking into its internals it appears there's a regular map, plus a doubly-linked list used to track which entries have been added or retrieved recently.

Since our use cases would never need to traverse that list, only access it at the head and tail, I think performance should still be relatively decent compared to just managing the map, but I'd love to hear if that's true or not!

I've put my interim, not-yet-complete implementation into the cache-file-path-filter-match-result branch in my chrisd8088/git-lfs repository. It's missing any shell tests in our /t folder and we'll need some to confirm that the parsing of the configuration option functions as expected, at a minimum.

If you are by chance able to compile a git-lfs binary from that branch and give it try with your use cases, and report some timing metrics compared to a stock client and to one with just your initial changes, that would be really fantastic and very much appreciated!
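As a rough illustration of the locking suggested above, here is a minimal, stdlib-only sketch. The cachedFilter type, its include field, and allowsUncached are simplified stand-ins for illustration, not Git LFS's actual filepathfilter.Filter API, and the "one prefix pattern" matching is a deliberate simplification:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// cachedFilter sketches a mutex-protected match-result cache.
// All names here are hypothetical; the real Filter matches full
// include/exclude pattern lists, not a single prefix.
type cachedFilter struct {
	mu      sync.Mutex
	cache   map[string]bool
	include string // a single prefix pattern, for illustration only
}

// allowsUncached performs the (expensive) uncached pattern match.
func (f *cachedFilter) allowsUncached(path string) bool {
	return strings.HasPrefix(path, f.include)
}

// Allows returns a cached result when available, locking around
// both the map read and the map write so concurrent callers are safe.
func (f *cachedFilter) Allows(path string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if res, ok := f.cache[path]; ok {
		return res
	}
	res := f.allowsUncached(path)
	f.cache[path] = res
	return res
}

func main() {
	f := &cachedFilter{cache: map[string]bool{}, include: "assets/"}
	fmt.Println(f.Allows("assets/big.bin")) // true
	fmt.Println(f.Allows("src/main.go"))    // false
	fmt.Println(f.Allows("assets/big.bin")) // true, now served from the cache
}
```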

Contributor Author

@alexkad0 alexkad0 May 10, 2025

I don't think we call the Allows() method from multiple goroutines anywhere

Yeah, this is the reason I haven't protected the cache with locks in my implementation.

It's missing any shell tests in our /t folder and we'll need some to confirm that the parsing of the configuration option functions as expected, at a minimum.

Hmmm, I wonder how this could be verified. The impact of the cache won't be noticeable on the repositories that are used in the t-migrate suite.

Contributor Author

If you are by chance able to compile a git-lfs binary from that branch and give it try with your use cases, and report some timing metrics compared to a stock client and to one with just your initial changes, that would be really fantastic and very much appreciated!

An unmodified git-lfs migrates my large repository in 23 min; git-lfs built from this branch does it in 11 min; git-lfs built from your branch does it in 13 min.

Member

Thank you very much for testing my branch with your repository! 🙇

Out of interest, was that with the cache size set to unlimited?

Contributor Author

Out of interest, was that with the cache size set to unlimited?

Yes, this experiment was conducted with the unlimited cache.

@alexkad0
Contributor Author

alexkad0 commented May 10, 2025

To that end, I wrote up some possible further changes in my cache-file-path-filter-match-result branch. If you have some time to take a look at those, that would be fantastic, especially if you're able to run a git-lfs version from that branch against your large repositories to see what the timing differences are from your version or a standard Git LFS client.

Your approach works fine on my large repository if the cache size is unlimited. I'd suggest using sync.RWMutex instead of sync.Mutex in your implementation [1].

(There are more complex LRU implementations out there, but the one I used is relatively straightforward internally and has a compatible license, so I'm hoping it might be sufficient.)

It seems that the performance of the LRU cache is slightly worse than that of a map in my workload.

[1] chrisd8088@eebf399#diff-dd529ea69f97da7c4eb30716cc2d3b0a0e2416b122427cae449a44b8510f6baaR25

@alexkad0
Contributor Author

After reflecting on this for a few days, I didn't see an obvious reason why we should constrain a potential performance improvement like this to just the git lfs migrate command, or require users to know about a specific command-line option to get the benefit of using the cache.

Agreed; however, I don't think the performance improvement will be noticeable anywhere but git-lfs-migrate.

@chrisd8088
Member

It seems that the performance of the LRU cache is slightly worse than that of a map in my workload.

That makes sense and is pretty much what I expected. The LRU cache actually just uses a map internally, but also pushes and pops things from the head and tail of a doubly-linked list.

I wonder if the main extra cost came from the locking? Would you by any chance be able to try a run with the locking removed from my branch?

I'm personally pretty comfortable with the slight loss of performance imposed by the extra cache management and locking (more like a ~45% improvement vs. your branch's ~55% improvement), given that those extra features provide some confidence we won't cause problems for ourselves in the future, but let me know if you think otherwise!

I am aware that we retain the complete set of commit SHA mappings in memory without any limit, so we have a precedent for that kind of implementation. But in that case, if we wanted to cap the potential memory usage, we'd have to write the map data out to a backing store. Since this change is just a performance boost and a true cache is OK, where items can be ejected and then regenerated and re-inserted as necessary, it seems reasonable to me to use one from the start.

I'd suggest using sync.RWMutex instead of sync.Mutex in your implementation

I thought about that too! However, the challenge is that a Get() operation is not just a simple read, because it also updates the doubly-linked list to move the just-read item to the head of the list. So it's effectively a write operation too.

There's a very similar LRU cache in Google's golang.org/x/build/internal/lru package which includes locking logic within its Get() method, and they use a sync.Mutex for the same reason, I presume.
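To make concrete why an LRU Get() is effectively a write, here is a toy LRU built only on the standard library's container/list (all names are hypothetical; this is not the groupcache code itself): the MoveToFront call inside Get mutates the list, which is exactly why a sync.RWMutex read lock would not be safe there.

```go
package main

import (
	"container/list"
	"fmt"
)

// miniLRU is a minimal illustration: a map for O(1) lookup plus a
// doubly-linked list ordered from most- to least-recently used.
type miniLRU struct {
	max   int
	ll    *list.List
	items map[string]*list.Element
}

type entry struct {
	key string
	val bool
}

func newMiniLRU(max int) *miniLRU {
	return &miniLRU{max: max, ll: list.New(), items: map[string]*list.Element{}}
}

// Get mutates the list even on a pure lookup: the hit entry moves to
// the front, so concurrent Gets need exclusive (not shared) locking.
func (c *miniLRU) Get(key string) (bool, bool) {
	if el, ok := c.items[key]; ok {
		c.ll.MoveToFront(el)
		return el.Value.(*entry).val, true
	}
	return false, false
}

// Add inserts or refreshes an entry and evicts the list tail
// (the least-recently used entry) once the cache is over capacity.
func (c *miniLRU) Add(key string, val bool) {
	if el, ok := c.items[key]; ok {
		c.ll.MoveToFront(el)
		el.Value.(*entry).val = val
		return
	}
	c.items[key] = c.ll.PushFront(&entry{key, val})
	if c.ll.Len() > c.max {
		oldest := c.ll.Back()
		c.ll.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newMiniLRU(2)
	c.Add("a.bin", true)
	c.Add("b.bin", false)
	c.Get("a.bin")       // touches "a.bin", making "b.bin" the eviction candidate
	c.Add("c.bin", true) // evicts "b.bin"
	_, ok := c.Get("b.bin")
	fmt.Println(ok) // false: "b.bin" was evicted
}
```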

@alexkad0
Contributor Author

I'd suggest using sync.RWMutex instead of sync.Mutex in your implementation

I thought about that too! However, the challenge is that a Get() operation is not just a simple read, because it also updates the doubly-linked list to move the just-read item to the head of the list. So it's effectively a write operation too.

Oh, I see.

I think you could merge your commit and mine into a single commit.

@chrisd8088
Member

I think you could merge your commit and mine into a single commit.

OK, thank you! I'll refactor my WIP commit into a few atomic changes and add some shell tests, and push those up. I might need to request an invitation to push to your branch.

alexkad0 added 4 commits May 17, 2025 14:51
File path pattern matching turns out to be a significant performance
bottleneck while migrating a large (> 2 million objects) repository.
Caching the result of a file path pattern match (viz. the result of
a filepathfilter.Allows() invocation) may speed up LFS migration in
such a repository by up to 50%.

Moreover, any Git LFS command which accepts --include or --exclude
options (or their -I and -X short forms), or reads the "lfs.fetchInclude"
and "lfs.fetchExclude" configuration options, should benefit from
the use of a cache to return previously-determined results when matching
file paths against the file path filter patterns.

This patch is a result of collaboration with Chris Darroch, who
replaced my map-based cache implementation with an LRU groupcache-based
one.
…caching

The new test is almost identical to TestRewriterIgnoresPathsThatDontMatchFilter;
the only difference is that it explicitly enables file path pattern match
caching in the filepathfilter instance it uses.
…caching

The patch introduces the new "lfs.pathFilterCacheSize" configuration
option which may be used to resize the cache that filepathfilter.Filter
uses.  The special values "none" and "0" disable the cache, and the
value "unlimited" allows the cache to grow without bound.  The default
cache size is 10,000 entries.
@alexkad0 alexkad0 force-pushed the cache-file-path-filter-match-result branch from 3f00f87 to 42de7cf Compare May 17, 2025 13:54
@alexkad0
Contributor Author

I think you could merge your commit and mine into a single commit.

OK, thank you! I'll refactor my WIP commit into a few atomic changes and add some shell tests, and push those up. I might need to request an invitation to push to your branch.

I've incorporated your patch into this branch.

@chrisd8088
Member

I've incorporated your patch into this branch.

Thank you very much! That definitely saves me some time and effort! 🙇

I will try to write some shell tests to exercise the cache option setting logic, and then ask for a second opinion from the other @git-lfs/core member, since some of my own contributions are now incorporated into this PR.

@chrisd8088
Member

OK, I've written a couple of basic expansions of our shell tests to cover the new configuration option in commit 4fed085, which is at the tip of my cache-file-path-filter-match-result-tests branch right now.

These checks aren't perfect, but I think they'll be sufficient to at least make sure various possible settings don't cause a crash or other failure. Let me know how they look to you!

Once we've added some checks along those lines to this PR, I think we should be good to approve and merge it, I hope!

In previous commits in this PR we implemented an LRU cache of file paths
that have matched against the set of file path filter patterns specified
for the current command, which may be any of our commands that accept
--include or --exclude options (or their -I and -X short forms) or read
the "lfs.fetchInclude" and "lfs.fetchExclude" configuration options.

We also implemented a new "lfs.pathFilterCacheSize" configuration option
which may be used to resize the cache as desired, or disable it entirely.

To validate whether the determineFilepathFilterCache() function we
introduced to our "commands" package is able to parse the range of
different values we accept for the "lfs.pathFilterCacheSize" option,
as well as handle invalid options, we update our shell test suite with
a new test in the t/t-ls-files.sh test script and a revised version
of another test in the t/t-migrate-import.sh test script.

These tests run the respective "git lfs ls-files" and "git lfs migrate
import" commands with a number of different settings of the
"lfs.pathFilterCacheSize" configuration option and confirm that the
same results are produced each time.
Member

@chrisd8088 chrisd8088 left a comment

Thanks again for developing this enhancement and collaborating on the latest implementation!

I am going to ask for a second review from @git-lfs/core because I tinkered with this PR quite a bit and added the LRU cache and related configuration option.

Member

@larsxschneider larsxschneider left a comment

This looks great! Thank you!

	if configSize, ok := config.Git.Get("lfs.pathfiltercachesize"); ok {
		switch configSize {
		case "none":
			return filepathfilter.DisableCache()
Member

In what situation would a user want to disable this feature?

(no criticism here... just for my own learning)

Member

That's a good question! I don't think many (any?) folks will want to do this under normal circumstances, but I imagine it might be useful for debugging purposes.

To be honest, the main reason I wrote in the none option is because the groupcache/lru module treats 0 to mean "unlimited", and I felt that was unintuitive to offer through a Git LFS configuration option.

So then we either disallow zero and pick some value to be the minimum cache size (1? 10? 100?), or allow zero but special-case it to disable the cache, which is what I did, as well as mapping the explicit configuration value unlimited back to zero to support that possibility too.

And then, having done that, it just seemed reasonable to offer none as a more readable alternative to 0.
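The value mapping described above could be sketched roughly as follows; parseCacheSize is a hypothetical helper for illustration, not the actual determineFilepathFilterCache() code:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseCacheSize maps a configuration string to a max-entries value.
// Since groupcache/lru treats a maximum of 0 as "unlimited", the
// user-facing "unlimited" maps to 0, while "0" and "none" are reported
// as a disabled cache instead. Hypothetical sketch, not Git LFS code.
func parseCacheSize(raw string) (maxEntries int, disabled bool, err error) {
	switch raw {
	case "none", "0":
		return 0, true, nil
	case "unlimited":
		return 0, false, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return 0, false, fmt.Errorf("invalid lfs.pathFilterCacheSize value %q", raw)
	}
	return n, false, nil
}

func main() {
	for _, raw := range []string{"none", "0", "unlimited", "500", "bogus"} {
		n, off, err := parseCacheSize(raw)
		fmt.Println(raw, n, off, err != nil)
	}
}
```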

Contributor Author

In what situation would a user want to disable this feature?

but I imagine it might be useful for debugging purposes.

Yeah, I think it's the primary reason.

On the other hand, the positive impact of the cache will likely only be noticeable in a large repository with a complex inclusion/exclusion filter, so I think some users may decide not to spend memory on a cache that doesn't noticeably speed up an LFS migration.

@chrisd8088 chrisd8088 merged commit 9e751d1 into git-lfs:main Jun 5, 2025
10 checks passed