Cache a file path filter match result #6047
Conversation
Force-pushed from 83d3d8f to 3f00f87
Building on the work in PR git-lfs#6047, we implement a common LRU cache for the Filter structure in our "filepathfilter" package, which is used by all commands that accept command-line or configuration options to filter their actions by file paths. Specifically, any of our commands which accept --include or --exclude options (or their -I and -X short forms), or read the "lfs.fetchInclude" and "lfs.fetchExclude" configuration options, should benefit from the use of a cache to return previously-determined results when matching file paths against the file path filter patterns.

We set the default cache size at 10,000 entries, and provide a new "lfs.pathFilterCacheSize" configuration option which may be used to resize the cache as desired. The special values "none" and "0" disable the cache, and the value "unlimited" allows the cache to grow without bound. Note that to pass the cache size setting to the "filepathfilter" package we rename its "option" type to "Option" so it is exported and available from our "commands" package.

We use the Least-Recently Used (LRU) cache implementation from the "golang/groupcache" Go module, which uses a map plus a doubly-linked list to track the most- and least-recently used entries. We add a mutex around our accesses to the cache in case any of our commands use the Allows() method of the Filter structure from concurrent goroutines.
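The exported Option type mentioned in the commit message suggests a functional-options construction pattern. The following is a minimal sketch of how such an option might carry a cache-size setting from one package to another; the names Filter, WithCacheSize, and New here are illustrative assumptions, not the actual git-lfs API.

```go
package main

import "fmt"

// Option configures a Filter; exporting the type lets other
// packages (such as a "commands" package) pass settings like
// the cache size. Hypothetical sketch only.
type Option func(*Filter)

// Filter stands in for the filepathfilter.Filter structure.
type Filter struct {
	cacheSize int
}

// WithCacheSize returns an Option that sets the match-result cache size.
func WithCacheSize(n int) Option {
	return func(f *Filter) { f.cacheSize = n }
}

// New builds a Filter, applying any supplied options over the
// 10,000-entry default described above.
func New(opts ...Option) *Filter {
	f := &Filter{cacheSize: 10000}
	for _, o := range opts {
		o(f)
	}
	return f
}

func main() {
	f := New(WithCacheSize(500))
	fmt.Println(f.cacheSize) // 500
}
```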
chrisd8088 left a comment
Thank you so much for this great PR! These kinds of enhancements are really valuable.
After reflecting on this for a few days, I didn't see an obvious reason why we should constrain a potential performance improvement like this to just the git lfs migrate command, or require users to know about a specific command-line option to get the benefit of using the cache.
I did wonder whether we should, by default, provide an upper bound on the size of the cache, though, and if an LRU cache would be of value in this regard so we didn't arbitrarily eject entries from the cache which were frequently used.
To that end, I wrote up some possible further changes in my cache-file-path-filter-match-result branch. If you have some time to take a look at those, that would be fantastic, especially if you're able to run a git-lfs version from that branch against your large repositories to see what the timing differences are from your version or a standard Git LFS client.
(There are more complex LRU implementations out there, but the one I used is relatively straightforward internally and has a compatible license, so I'm hoping it might be sufficient.)
Again, thank you so much for these great ideas and PRs!
filepathfilter/filepathfilter.go (Outdated)
```go
cachedResult, cacheHit := f.cache[filename]
if cacheHit {
	return cachedResult
}

res := f.allowsUncached(filename)
f.cache[filename] = res
```
I don't think we call the Allows() method from multiple goroutines anywhere, but just to be on the safe side, we may want to add some locking around these cache map reads and writes.
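A minimal sketch of that locking suggestion, wrapping the cache map reads and writes in a sync.Mutex. The field names mirror the snippet above, but the surrounding Filter type and the trivial prefix-based matching stand-in are assumptions for illustration, not the real filepathfilter code.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Filter is a simplified stand-in for filepathfilter.Filter.
type Filter struct {
	mu    sync.Mutex
	cache map[string]bool
}

// allowsUncached replaces the real pattern matching with a
// trivial prefix check, just for demonstration purposes.
func (f *Filter) allowsUncached(filename string) bool {
	return !strings.HasPrefix(filename, "vendor/")
}

// Allows consults and updates the cache under the lock, so
// concurrent goroutines may call it safely.
func (f *Filter) Allows(filename string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if cachedResult, cacheHit := f.cache[filename]; cacheHit {
		return cachedResult
	}
	res := f.allowsUncached(filename)
	f.cache[filename] = res
	return res
}

func main() {
	f := &Filter{cache: make(map[string]bool)}
	fmt.Println(f.Allows("a.go"), f.Allows("vendor/b.go")) // true false
}
```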
I also wondered about providing an upper bound on the size of the cache, since some migrations of very large repositories with a lot of history may generate significant numbers of distinct file path matches against the filter patterns.
With that in mind I looked into the LRU (Least-Recently Used) cache implementation available in the lru package in the golang/groupcache module, and hacked together an implementation which uses it. From looking into its internals it appears there's a regular map, plus a doubly-linked list used to track which entries have been added or retrieved recently.
Since our use cases would never need to traverse that list, only access it at the head and tail, I think performance should still be relatively decent compared to just managing the map, but I'd love to hear if that's true or not!
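To illustrate the map-plus-doubly-linked-list design described above, here is a toy LRU cache built on the standard container/list package. This is a sketch of the general technique only, not the groupcache code itself; all names are invented for the example.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key   string
	value bool
}

// lruCache evicts the least-recently used entry once maxEntries
// is exceeded. Lookups and inserts touch only the list head and
// tail, so both are O(1); the list is never traversed.
type lruCache struct {
	maxEntries int
	ll         *list.List
	items      map[string]*list.Element
}

func newLRU(maxEntries int) *lruCache {
	return &lruCache{
		maxEntries: maxEntries,
		ll:         list.New(),
		items:      make(map[string]*list.Element),
	}
}

// Get returns the cached value and whether it was present,
// marking the entry as most recently used on a hit.
func (c *lruCache) Get(key string) (bool, bool) {
	if el, ok := c.items[key]; ok {
		c.ll.MoveToFront(el)
		return el.Value.(*entry).value, true
	}
	return false, false
}

// Add inserts or refreshes an entry, evicting the oldest entry
// from the list tail if the size limit is exceeded.
func (c *lruCache) Add(key string, value bool) {
	if el, ok := c.items[key]; ok {
		c.ll.MoveToFront(el)
		el.Value.(*entry).value = value
		return
	}
	c.items[key] = c.ll.PushFront(&entry{key, value})
	if c.maxEntries > 0 && c.ll.Len() > c.maxEntries {
		oldest := c.ll.Back() // least-recently used entry
		c.ll.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newLRU(2)
	c.Add("a", true)
	c.Add("b", false)
	c.Get("a")       // touch "a" so "b" becomes least recently used
	c.Add("c", true) // evicts "b"
	_, ok := c.Get("b")
	fmt.Println(ok) // false
}
```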
I've put my interim, not-yet-complete implementation into the cache-file-path-filter-match-result branch in my chrisd8088/git-lfs repository. It's missing any shell tests in our /t folder and we'll need some to confirm that the parsing of the configuration option functions as expected, at a minimum.
If you are by chance able to compile a git-lfs binary from that branch and give it a try with your use cases, and report some timing metrics compared to a stock client and to one with just your initial changes, that would be really fantastic and very much appreciated!
I don't think we call the Allows() method from multiple goroutines anywhere
Yeah, this is the reason I haven't protected the cache with locks in my implementation.
It's missing any shell tests in our /t folder and we'll need some to confirm that the parsing of the configuration option functions as expected, at a minimum.
Hmmm, I wonder how this could be verified. The impact of the cache won't be noticeable on the repositories that are used in the t-migrate suite.
If you are by chance able to compile a git-lfs binary from that branch and give it try with your use cases, and report some timing metrics compared to a stock client and to one with just your initial changes, that would be really fantastic and very much appreciated!
Unmodified git-lfs migrates my large repository in 23 minutes; git-lfs built from this branch does it in 11 minutes, and git-lfs built from your branch in 13 minutes.
Thank you very much for testing my branch with your repository! 🙇
Out of interest, was that with the cache size set to unlimited?
Out of interest, was that with the cache size set to unlimited?
Yes, this experiment was conducted with the unlimited cache.
Your approach works fine on my large repository if the cache size is unlimited. I'd suggest using
It seems that the performance of the LRU cache is slightly worse than that of a plain map [1].

[1] chrisd8088@eebf399#diff-dd529ea69f97da7c4eb30716cc2d3b0a0e2416b122427cae449a44b8510f6baaR25
Agreed; however, I don't think the performance improvement will be noticeable anywhere but in git lfs migrate.
That makes sense and is pretty much what I expected. The LRU cache actually just uses a map internally, plus the list bookkeeping.

I wonder if the main extra cost came from the locking? Would you by any chance be able to try a run with the locking removed from my branch?

I'm personally pretty comfortable with the slight loss of performance imposed by the extra cache management and locking (more like a ~45% improvement vs. your branch's ~55% improvement), given that those extra features provide some confidence we won't cause problems for ourselves in the future, but let me know if you think otherwise!

I am aware that we retain the complete set of commit SHA mappings in memory without any limit, so we have a precedent for that kind of implementation. But in that case, if we wanted to cap the potential memory usage, we'd have to write the map data out to a backing store. Since this change is just a performance boost and a true cache is OK, where items can be ejected and then regenerated and re-inserted as necessary, it seems reasonable to me to use one from the start.
I thought about that too! However, the challenge is that a …

There's a very similar LRU cache in Google's …
Oh, I see. I think you could merge my commit and yours into a single commit.
OK, thank you! I'll refactor my WIP commit into a few atomic changes and add some shell tests, and push those up. I might need to request an invitation to push to your branch.
File path pattern matching turns out to be a significant performance bottleneck while migrating a large (> 2 million objects) repository. Caching the result of a file path pattern match (viz. the result of a filepathfilter.Allows() invocation) may speed up LFS migration in such a repository by up to 50%. Moreover, any Git LFS command which accepts --include or --exclude options (or their -I and -X short forms), or reads the "lfs.fetchInclude" and "lfs.fetchExclude" configuration options, should benefit from the use of a cache to return previously-determined results when matching file paths against the file path filter patterns.

This patch is the result of collaboration with Chris Darroch, who has replaced my map-based cache implementation with an LRU groupcache-based one.
…caching

The new test is almost identical to TestRewriterIgnoresPathsThatDontMatchFilter; the only difference is that it explicitly enables file path pattern match caching in the filepathfilter instance that it uses.
…caching

The patch introduces a new "lfs.pathFilterCacheSize" configuration option which may be used to resize the cache that filepathfilter.Filter uses. The special values "none" and "0" disable the cache, and the value "unlimited" allows the cache to grow without bound. The default cache size is 10,000 entries.
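The value handling described in that commit message could be parsed along these lines. This is a hedged sketch: the function name parseCacheSize, the sentinel values, and the treatment of an unset option are all assumptions for illustration, not the actual git-lfs code.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseCacheSize maps a "lfs.pathFilterCacheSize" string to an
// entry count: -1 means "disable the cache", 0 means "unlimited"
// (mirroring the groupcache/lru convention that 0 is no limit),
// and any other non-negative integer is used as-is. An empty
// string models an unset option and yields the 10,000 default.
func parseCacheSize(raw string) (int, error) {
	switch raw {
	case "none", "0":
		return -1, nil // cache disabled
	case "unlimited":
		return 0, nil // grow without bound
	case "":
		return 10000, nil // default size
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid lfs.pathFilterCacheSize: %q", raw)
	}
	return n, nil
}

func main() {
	for _, v := range []string{"none", "unlimited", "", "5000"} {
		n, err := parseCacheSize(v)
		fmt.Println(v, n, err)
	}
}
```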
Force-pushed from 3f00f87 to 42de7cf
I've incorporated your patch into this branch.
Thank you very much! That definitely saves me some time and effort! 🙇 I will try to write some shell tests to exercise the cache option setting logic, and then ask for a second opinion from the other @git-lfs/core member, since some of my own contributions are now incorporated into this PR.
OK, I've written a couple of basic expansions of our shell tests to cover the new configuration option in commit 4fed085, which is at the tip of my cache-file-path-filter-match-result branch.

These checks aren't perfect, but I think they'll be sufficient to at least make sure various possible settings don't cause a crash or other failure. Let me know how they look to you!

Once we've added some checks along those lines to this PR, I think we should be good to approve and merge it, I hope!
In previous commits in this PR we implemented an LRU cache of file paths that have matched against the set of file path filter patterns specified for the current command, which may be any of our commands that accept --include or --exclude options (or their -I and -X short forms) or read the "lfs.fetchInclude" and "lfs.fetchExclude" configuration options. We also implemented a new "lfs.pathFilterCacheSize" configuration option which may be used to resize the cache as desired, or disable it entirely.

To validate that the determineFilepathFilterCache() function we introduced to our "commands" package is able to parse the range of different values we accept for the "lfs.pathFilterCacheSize" option, as well as handle invalid options, we update our shell test suite with a new test in the t/t-ls-files.sh test script and a revised version of another test in the t/t-migrate-import.sh test script. These tests run the respective "git lfs ls-files" and "git lfs migrate import" commands with a number of different settings of the "lfs.pathFilterCacheSize" configuration option and confirm that the same results are produced each time.
Thanks again for developing this enhancement and collaborating on the latest implementation!
I am going to ask for a second review from @git-lfs/core because I tinkered with this PR quite a bit and added the LRU cache and related configuration option.
larsxschneider left a comment
This looks great! Thank you!
```go
if configSize, ok := config.Git.Get("lfs.pathfiltercachesize"); ok {
	switch configSize {
	case "none":
		return filepathfilter.DisableCache()
```
In what situation would a user want to disable this feature?
(no criticism here... just for my own learning)
That's a good question! I don't think many (any?) folks will want to do this under normal circumstances, but I imagine it might be useful for debugging purposes.
To be honest, the main reason I added the none option is that the groupcache/lru module treats 0 to mean "unlimited", and I felt that was unintuitive to offer through a Git LFS configuration option.
So then we either disallow zero and pick some value to be the minimum cache size (1? 10? 100?), or allow zero but special-case it to disable the cache, which is what I did, as well as mapping the explicit configuration value unlimited back to zero to support that possibility too.
And then, having done that, it just seemed reasonable to offer none as a more readable alternative to 0.
In what situation would a user want to disable this feature?
but I imagine it might be useful for debugging purposes.
Yeah, I think it's the primary reason.
On the other hand, the positive impact of the cache may only be noticed in a large repository with a complex inclusion/exclusion filter, so I think some users may decide not to spend memory on a cache that doesn't noticeably improve the speed of an LFS migration.
File path pattern matching turns out to be a significant performance
bottleneck while migrating a large (> 2 million objects) repository.
Caching the result of a file path pattern match may speed up LFS
migration in such a repository by up to 50%.
The pull request introduces the switch --cache-file-path-filter-results for git-lfs-migrate to enable file path pattern match caching.