Conversation

Contributor

@alexkad0 alexkad0 commented May 3, 2025

File path pattern matching turns out to be a significant performance
bottleneck while migrating a large (> 2 million objects) repository.
Caching the result of a file path pattern match may speed up LFS
migration in such a repository by up to 50%.

The pull request introduces the switch --cache-file-path-filter-results
for git-lfs-migrate to enable file path pattern match caching.

@alexkad0 alexkad0 requested a review from a team as a code owner May 3, 2025 22:39
@alexkad0 alexkad0 force-pushed the cache-file-path-filter-match-result branch 2 times, most recently from 83d3d8f to 3f00f87 Compare May 3, 2025 23:00
@chrisd8088 chrisd8088 added enhancement migration Related to `git lfs migrate` labels May 8, 2025
chrisd8088 added a commit to chrisd8088/git-lfs that referenced this pull request May 9, 2025
Building on the work in PR git-lfs#6047, we implement a common LRU cache
for the Filter structure in our "filepathfilter" package, which is
used by all commands that accept command-line or configuration
options to filter their actions by file paths.

Specifically, any of our commands which accept --include or --exclude
options (or their -I and -X short forms), or read the "lfs.fetchInclude"
and "lfs.fetchExclude" configuration options, should benefit from
the use of a cache to return previously-determined results when
matching file paths against the file path filter patterns.

We set the default cache size at 10,000 entries, and provide a new
"lfs.pathFilterCacheSize" configuration option which may be used to
resize the cache as desired.  The special values "none" and "0" disable
the cache, and the value "unlimited" allows the cache to grow without
bound.

Note that to pass the cache size setting to the "filepathfilter"
package we rename its "option" type to "Option" so it will be
exported and available from our "commands" package.

We use the Least-Recently Used (LRU) cache implementation from the
"golang/groupcache" Go module, which uses a map plus a doubly-linked
list to track the most- and least-recently used entries.  We add a
mutex around our accesses to the cache in case any of our commands
use the Allows() method of the Filter structure from concurrent
goroutines.
Member

@chrisd8088 chrisd8088 left a comment

Thank you so much for this great PR! These kinds of enhancements are really valuable.

After reflecting on this for a few days, I didn't see an obvious reason why we should constrain a potential performance improvement like this to just the git lfs migrate command, or require users to know about a specific command-line option to get the benefit of using the cache.

I did wonder whether we should, by default, provide an upper bound on the size of the cache, though, and if an LRU cache would be of value in this regard so we didn't arbitrarily eject entries from the cache which were frequently used.

To that end, I wrote up some possible further changes in my cache-file-path-filter-match-result branch. If you have some time to take a look at those, that would be fantastic, especially if you're able to run a git-lfs version from that branch against your large repositories to see what the timing differences are from your version or a standard Git LFS client.

(There are more complex LRU implementations out there, but the one I used is relatively straightforward internally and has a compatible license, so I'm hoping it might be sufficient.)

Again, thank you so much for these great ideas and PRs!

Comment on lines 142 to 148
	cachedResult, cacheHit := f.cache[filename]
	if cacheHit {
		return cachedResult
	}

	res := f.allowsUncached(filename)
	f.cache[filename] = res
Member

I don't think we call the Allows() method from multiple goroutines anywhere, but just to be on the safe side, we may want to add some locking around these cache map reads and writes.

I also wondered about providing an upper bound on the size of the cache, since some migrations of very large repositories with a lot of history may generate significant numbers of distinct file path matches against the filter patterns.

With that in mind I looked into the LRU (Least-Recently Used) cache implementation available in the lru package in the golang/groupcache module, and hacked together an implementation which uses it. From looking into its internals it appears there's a regular map, plus a doubly-linked list used to track which entries have been added or retrieved recently.

Since our use cases would never need to traverse that list, only access it at the head and tail, I think performance should still be relatively decent compared to just managing the map, but I'd love to hear if that's true or not!

I've put my interim, not-yet-complete implementation into the cache-file-path-filter-match-result branch in my chrisd8088/git-lfs repository. It's missing any shell tests in our /t folder and we'll need some to confirm that the parsing of the configuration option functions as expected, at a minimum.

If you are by chance able to compile a git-lfs binary from that branch and give it try with your use cases, and report some timing metrics compared to a stock client and to one with just your initial changes, that would be really fantastic and very much appreciated!
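As a rough illustration of the locking suggested above, here is a minimal, stdlib-only sketch. The cachedFilter type, its include field, and allowsUncached are simplified stand-ins for illustration, not Git LFS's actual filepathfilter.Filter API, and the "one prefix pattern" matching is a deliberate simplification:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// cachedFilter sketches a mutex-protected match-result cache.
// All names here are hypothetical; the real Filter matches full
// include/exclude pattern lists, not a single prefix.
type cachedFilter struct {
	mu      sync.Mutex
	cache   map[string]bool
	include string // a single prefix pattern, for illustration only
}

// allowsUncached performs the (expensive) uncached pattern match.
func (f *cachedFilter) allowsUncached(path string) bool {
	return strings.HasPrefix(path, f.include)
}

// Allows returns a cached result when available, locking around
// both the map read and the map write so concurrent callers are safe.
func (f *cachedFilter) Allows(path string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if res, ok := f.cache[path]; ok {
		return res
	}
	res := f.allowsUncached(path)
	f.cache[path] = res
	return res
}

func main() {
	f := &cachedFilter{cache: map[string]bool{}, include: "assets/"}
	fmt.Println(f.Allows("assets/big.bin")) // true
	fmt.Println(f.Allows("src/main.go"))    // false
	fmt.Println(f.Allows("assets/big.bin")) // true, now served from the cache
}
```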

Contributor Author

@alexkad0 alexkad0 May 10, 2025

I don't think we call the Allows() method from multiple goroutines anywhere

Yeah, this is the reason I haven't protected the cache with locks in my implementation.

It's missing any shell tests in our /t folder and we'll need some to confirm that the parsing of the configuration option functions as expected, at a minimum.

Hmmm, I wonder how this could be verified. The impact of the cache won't be noticeable on the repositories that are used in the t-migrate suite.

Contributor Author

If you are by chance able to compile a git-lfs binary from that branch and give it try with your use cases, and report some timing metrics compared to a stock client and to one with just your initial changes, that would be really fantastic and very much appreciated!

An unmodified git-lfs migrates my large repository in 23 min; git-lfs built from this branch does it in 11 min; git-lfs built from your branch does it in 13 min.

Member

Thank you very much for testing my branch with your repository! 🙇

Out of interest, was that with the cache size set to unlimited?

Contributor Author

Out of interest, was that with the cache size set to unlimited?

Yes, this experiment was conducted with the unlimited cache.

@alexkad0
Contributor Author

alexkad0 commented May 10, 2025

To that end, I wrote up some possible further changes in my cache-file-path-filter-match-result branch. If you have some time to take a look at those, that would be fantastic, especially if you're able to run a git-lfs version from that branch against your large repositories to see what the timing differences are from your version or a standard Git LFS client.

Your approach works fine on my large repository if the cache size is unlimited. I'd suggest using sync.RWMutex instead of sync.Mutex in your implementation [1].

(There are more complex LRU implementations out there, but the one I used is relatively straightforward internally and has a compatible license, so I'm hoping it might be sufficient.)

It seems that the performance of the LRU cache is slightly worse than that of a map in my workload.

[1] chrisd8088@eebf399#diff-dd529ea69f97da7c4eb30716cc2d3b0a0e2416b122427cae449a44b8510f6baaR25

@alexkad0
Contributor Author

After reflecting on this for a few days, I didn't see an obvious reason why we should constrain a potential performance improvement like this to just the git lfs migrate command, or require users to know about a specific command-line option to get the benefit of using the cache.

Agreed; however, I don't think the performance improvement will be noticeable anywhere but git-lfs-migrate.

@chrisd8088
Member

It seems that the performance of the LRU cache is slightly worse than that of a map in my workload.

That makes sense and is pretty much what I expected. The LRU cache actually just uses a map internally, but also pushes and pops things from the head and tail of a doubly-linked list.

I wonder if the main extra cost came from the locking? Would you by any chance be able to try a run with the locking removed from my branch?

I'm personally pretty comfortable with the slight loss of performance imposed by the extra cache management and locking (more like a ~45% improvement vs. your branch's ~55% improvement), given that those extra features provide some confidence we won't cause problems for ourselves in the future, but let me know if you think otherwise!

I am aware that we retain the complete set of commit SHA mappings in memory without any limit, so we have a precedent for that kind of implementation. But in that case, if we wanted to cap the potential memory usage, we'd have to write the map data out to a backing store. Since this change is just a performance boost and a true cache is OK, where items can be ejected and then regenerated and re-inserted as necessary, it seems reasonable to me to use one from the start.

I'd suggest using sync.RWMutex instead of sync.Mutex in your implementation

I thought about that too! However, the challenge is that a Get() operation is not just a simple read, because it also updates the doubly-linked list to move the just-read item to the head of the list. So it's effectively a write operation too.

There's a very similar LRU cache in Google's golang.org/x/build/internal/lru package which includes locking logic within its Get() method, and they use a sync.Mutex for the same reason, I presume.
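To make concrete why an LRU Get() is effectively a write, here is a toy LRU built only on the standard library's container/list (all names are hypothetical; this is not the groupcache code itself): the MoveToFront call inside Get mutates the list, which is exactly why a sync.RWMutex read lock would not be safe there.

```go
package main

import (
	"container/list"
	"fmt"
)

// miniLRU is a minimal illustration: a map for O(1) lookup plus a
// doubly-linked list ordered from most- to least-recently used.
type miniLRU struct {
	max   int
	ll    *list.List
	items map[string]*list.Element
}

type entry struct {
	key string
	val bool
}

func newMiniLRU(max int) *miniLRU {
	return &miniLRU{max: max, ll: list.New(), items: map[string]*list.Element{}}
}

// Get mutates the list even on a pure lookup: the hit entry moves to
// the front, so concurrent Gets need exclusive (not shared) locking.
func (c *miniLRU) Get(key string) (bool, bool) {
	if el, ok := c.items[key]; ok {
		c.ll.MoveToFront(el)
		return el.Value.(*entry).val, true
	}
	return false, false
}

// Add inserts or refreshes an entry and evicts the list tail
// (the least-recently used entry) once the cache is over capacity.
func (c *miniLRU) Add(key string, val bool) {
	if el, ok := c.items[key]; ok {
		c.ll.MoveToFront(el)
		el.Value.(*entry).val = val
		return
	}
	c.items[key] = c.ll.PushFront(&entry{key, val})
	if c.ll.Len() > c.max {
		oldest := c.ll.Back()
		c.ll.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newMiniLRU(2)
	c.Add("a.bin", true)
	c.Add("b.bin", false)
	c.Get("a.bin")       // touches "a.bin", making "b.bin" the eviction candidate
	c.Add("c.bin", true) // evicts "b.bin"
	_, ok := c.Get("b.bin")
	fmt.Println(ok) // false: "b.bin" was evicted
}
```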

@alexkad0
Contributor Author

I'd suggest using sync.RWMutex instead of sync.Mutex in your implementation

I thought about that too! However, the challenge is that a Get() operation is not just a simple read, because it also updates the doubly-linked list to move the just-read item to the head of the list. So it's effectively a write operation too.

Oh, I see.

I think you could merge your commit and mine into a single commit.

@chrisd8088
Member

I think you could merge your commit and mine into a single commit.

OK, thank you! I'll refactor my WIP commit into a few atomic changes and add some shell tests, and push those up. I might need to request an invitation to push to your branch.

alexkad0 added 4 commits May 17, 2025 14:51
File path pattern matching turns out to be a significant performance
bottleneck while migrating a large (> 2 million objects) repository.
Caching the result of a file path pattern match (viz. the result of
a filepathfilter.Allows() invocation) may speed up LFS migration in
such a repository by up to 50%.

Moreover, any Git LFS command which accepts --include or --exclude
options (or their -I and -X short forms), or reads the "lfs.fetchInclude"
and "lfs.fetchExclude" configuration options, should benefit from
the use of a cache to return previously-determined results when matching
file paths against the file path filter patterns.

This patch is a result of collaboration with Chris Darroch, who
replaced my map-based cache implementation with an LRU groupcache-based
one.
…caching

The new test is almost identical to TestRewriterIgnoresPathsThatDontMatchFilter;
the only difference is that it explicitly enables file path pattern match
caching in the filepathfilter instance it uses.
…caching

The patch introduces the new "lfs.pathFilterCacheSize" configuration
option which may be used to resize the cache that filepathfilter.Filter
uses.  The special values "none" and "0" disable the cache, and the
value "unlimited" allows the cache to grow without bound.  The default
cache size is 10,000 entries.
@alexkad0 alexkad0 force-pushed the cache-file-path-filter-match-result branch from 3f00f87 to 42de7cf Compare May 17, 2025 13:54
@alexkad0
Contributor Author

I think you could merge your commit and mine into a single commit.

OK, thank you! I'll refactor my WIP commit into a few atomic changes and add some shell tests, and push those up. I might need to request an invitation to push to your branch.

I've incorporated your patch into this branch.

@chrisd8088
Member

I've incorporated your patch into this branch.

Thank you very much! That definitely saves me some time and effort! 🙇

I will try to write some shell tests to exercise the cache option setting logic, and then ask for a second opinion from the other @git-lfs/core member, since some of my own contributions are now incorporated into this PR.

@chrisd8088
Member

OK, I've written a couple of basic expansions of our shell tests to cover the new configuration option in commit 4fed085, which is at the tip of my cache-file-path-filter-match-result-tests branch right now.

These checks aren't perfect, but I think they'll be sufficient to at least make sure various possible settings don't cause a crash or other failure. Let me know how they look to you!

Once we've added some checks along those lines to this PR, I think we should be good to approve and merge it, I hope!

In previous commits in this PR we implemented an LRU cache of file paths
that have matched against the set of file path filter patterns specified
for the current command, which may be any of our commands that accept
--include or --exclude options (or their -I and -X short forms) or read
the "lfs.fetchInclude" and "lfs.fetchExclude" configuration options.

We also implemented a new "lfs.pathFilterCacheSize" configuration option
which may be used to resize the cache as desired, or disable it entirely.

To validate whether the determineFilepathFilterCache() function we
introduced to our "commands" package is able to parse the range of
different values we accept for the "lfs.pathFilterCacheSize" option,
as well as handle invalid options, we update our shell test suite with
a new test in the t/t-ls-files.sh test script and a revised version
of another test in the t/t-migrate-import.sh test script.

These tests run the respective "git lfs ls-files" and "git lfs migrate
import" commands with a number of different settings of the
"lfs.pathFilterCacheSize" configuration option and confirm that the
same results are produced each time.
Member

@chrisd8088 chrisd8088 left a comment

Thanks again for developing this enhancement and collaborating on the latest implementation!

I am going to ask for a second review from @git-lfs/core because I tinkered with this PR quite a bit and added the LRU cache and related configuration option.

Member

@larsxschneider larsxschneider left a comment

This looks great! Thank you!

	if configSize, ok := config.Git.Get("lfs.pathfiltercachesize"); ok {
		switch configSize {
		case "none":
			return filepathfilter.DisableCache()
Member

In what situation would a user want to disable this feature?

(no criticism here... just for my own learning)

Member

That's a good question! I don't think many (any?) folks will want to do this under normal circumstances, but I imagine it might be useful for debugging purposes.

To be honest, the main reason I wrote in the none option is because the groupcache/lru module treats 0 to mean "unlimited", and I felt that was unintuitive to offer through a Git LFS configuration option.

So then we either disallow zero and pick some value to be the minimum cache size (1? 10? 100?), or allow zero but special-case it to disable the cache, which is what I did, as well as mapping the explicit configuration value unlimited back to zero to support that possibility too.

And then, having done that, it just seemed reasonable to offer none as a more readable alternative to 0.
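The value mapping described above could be sketched roughly as follows; parseCacheSize is a hypothetical helper for illustration, not the actual determineFilepathFilterCache() code:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseCacheSize maps a configuration string to a max-entries value.
// Since groupcache/lru treats a maximum of 0 as "unlimited", the
// user-facing "unlimited" maps to 0, while "0" and "none" are reported
// as a disabled cache instead. Hypothetical sketch, not Git LFS code.
func parseCacheSize(raw string) (maxEntries int, disabled bool, err error) {
	switch raw {
	case "none", "0":
		return 0, true, nil
	case "unlimited":
		return 0, false, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return 0, false, fmt.Errorf("invalid lfs.pathFilterCacheSize value %q", raw)
	}
	return n, false, nil
}

func main() {
	for _, raw := range []string{"none", "0", "unlimited", "500", "bogus"} {
		n, off, err := parseCacheSize(raw)
		fmt.Println(raw, n, off, err != nil)
	}
}
```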

Contributor Author

In what situation would a user want to disable this feature?

but I imagine it might be useful for debugging purposes.

Yeah, I think it's the primary reason.

On the other hand, the positive impact of the cache will likely only be noticeable in a large repository with a complex inclusion/exclusion filter, so I think some users may decide not to spend memory on a cache that doesn't noticeably speed up an LFS migration.

@chrisd8088 chrisd8088 merged commit 9e751d1 into git-lfs:main Jun 5, 2025
10 checks passed