demo: mmap'ed reads#2153
Conversation
before: time /tmp/merge-base ~/vc/linux/.git refs/tags/v7.0 refs/tags/v6.9 a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6 real 0m10.530s user 0m8.033s sys 0m3.195s after: $ time /tmp/merge-base ~/vc/linux/.git refs/tags/v7.0 refs/tags/v6.9 a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6 real 0m6.716s user 0m7.314s sys 0m0.161s CGit: time git --git-dir ~/vc/linux/.git/ merge-base refs/tags/v6.9 refs/tags/v7.0 a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6 real 0m2.948s user 0m2.701s sys 0m0.230s
|
merge_base @ current |
|
There is still space for optimization. Go-git should be able to reach JGit's speed: |
Materialising an `FSObject` returned by `EncodedObject` previously fell back to a direct `o.fs.Open(packPath)` on every `Reader()` call, bypassing the per-pack file-descriptor pool introduced in `billy.Filesystem` that pays an mmap/munmap on each Open, the regression amplified to ~400x more pack opens over 500 reads on the basic fixture[1]. The cure threads the dotgit-cached `*packhandle.PackHandle` into the `*packfile.Packfile` via a new `WithPackHandle` option taking a `PackHandleResolver`. `objectFromHeader` captures the resolver in the `FSObject.acquireRandom` closure, so every read acquires a fresh cursor against the pool-owned `SharedFile`. The resolver re-runs on every `Reader` call, which lets cached FSObjects survive `cleanPackList` invalidations such as the one `NewObjectPack` triggers before a repack. The internal `*packhandle.PackHandle` stays internal — the public `packfile.PackHandle` interface returns `io.ReadSeekCloser` and `RandomReader`, and `DotGit.PackHandle` returns the interface through an unexported adapter so callers can't accidentally call `Close()` on a catalog-owned handle. `Packfile.Close` joins the scanner-close and owned-handle close errors via `errors.Join`. [1]: go-git#2153 (comment) Assisted-by: Claude Opus 4.7 Signed-off-by: Hidde Beydals <[email protected]>
Materialising an `FSObject` returned by `EncodedObject` previously fell back to a direct `o.fs.Open(packPath)` on every `Reader()` call, bypassing the per-pack file-descriptor pool introduced in `billy.Filesystem` that pays an mmap/munmap on each Open, the regression amplified to ~400x more pack opens over 500 reads on the basic fixture[1]. The cure threads the dotgit-cached `*packhandle.PackHandle` into the `*packfile.Packfile` via a new `WithPackHandle` option taking a `PackHandleResolver`. `objectFromHeader` captures the resolver in the `FSObject.acquireRandom` closure, so every read acquires a fresh cursor against the pool-owned `SharedFile`. The resolver re-runs on every `Reader` call, which lets cached FSObjects survive `cleanPackList` invalidations such as the one `NewObjectPack` triggers before a repack. The internal `*packhandle.PackHandle` stays internal — the public `packfile.PackHandle` interface returns `io.ReadSeekCloser` and `RandomReader`, and `DotGit.PackHandle` returns the interface through an unexported adapter so callers can't accidentally call `Close()` on a catalog-owned handle. `Packfile.Close` joins the scanner-close and owned-handle close errors via `errors.Join`. [1]: go-git#2153 (comment) Assisted-by: Claude Opus 4.7 Signed-off-by: Hidde Beydals <[email protected]>
|
Have you looked at https://github.com/go-git/go-git/tree/main/storage/filesystem/mmap? Also mmap is tricky, have a read of https://valyala.medium.com/mmap-in-go-considered-harmful-d92a25cb161d , but basically using mmap in a server process will lead to unpredictable latency . mmap should be optional to not create regression on services that use go-git today. mmap is useful for local and desktop application especially in multi process case. The right approach for go-git will be to look into using io.ReaderAt instead, in my opinion. |
|
This looks somewhat suspect, because NewPackScanner is tied to packs. Reftables could be served out of memory backed storage too, and does not depend on packs.
Point taken, but I'm interested in a locally running process. Also, in the case that the data does fit in RAM, the alternative to mmap'ing data is to load the data into explicitly, which will cause delays on program startup.
IMO, the go-git FS abstraction lacks something |
Materialising an `FSObject` returned by `EncodedObject` previously fell back to a direct `o.fs.Open(packPath)` on every `Reader()` call, bypassing the per-pack file-descriptor pool introduced in `billy.Filesystem` that pays an mmap/munmap on each Open, the regression amplified to ~400x more pack opens over 500 reads on the basic fixture[1]. The cure threads the dotgit-cached `*packhandle.PackHandle` into the `*packfile.Packfile` via a new `WithPackHandle` option taking a `PackHandleResolver`. `objectFromHeader` captures the resolver in the `FSObject.acquireRandom` closure, so every read acquires a fresh cursor against the pool-owned `SharedFile`. The resolver re-runs on every `Reader` call, which lets cached FSObjects survive resets of dotgit's pack-handle store such as the one `NewObjectPack` triggers before a repack. The internal `*packhandle.PackHandle` stays internal — the public `packfile.PackHandle` interface returns `io.ReadSeekCloser` and `RandomReader`, and `DotGit.PackHandle` returns the interface through an unexported adapter so callers can't accidentally call `Close()` on a dotgit-owned handle. `Packfile.Close` releases only the scanner cursor; the resolver-owned handle outlives the Packfile. [1]: go-git#2153 (comment) Assisted-by: Claude Opus 4.7 Signed-off-by: Hidde Beydals <[email protected]>
Materialising an `FSObject` returned by `EncodedObject` previously fell back to a direct `o.fs.Open(packPath)` on every `Reader()` call, bypassing the per-pack file-descriptor pool introduced in `billy.Filesystem` that pays an mmap/munmap on each Open, the regression amplified to ~400x more pack opens over 500 reads on the basic fixture[1]. The cure threads the dotgit-cached `*packhandle.PackHandle` into the `*packfile.Packfile` via a new `WithPackHandle` option taking a `PackHandleResolver`. `objectFromHeader` captures the resolver in the `FSObject.acquireRandom` closure, so every read acquires a fresh cursor against the pool-owned `SharedFile`. The resolver re-runs on every `Reader` call, which lets cached FSObjects survive resets of dotgit's pack-handle store such as the one `NewObjectPack` triggers before a repack. The internal `*packhandle.PackHandle` stays internal — the public `packfile.PackHandle` interface returns `io.ReadSeekCloser` and `RandomReader`, and `DotGit.PackHandle` returns the interface through an unexported adapter so callers can't accidentally call `Close()` on a dotgit-owned handle. `Packfile.Close` releases only the scanner cursor; the resolver-owned handle outlives the Packfile. [1]: go-git#2153 (comment) Assisted-by: Claude Opus 4.7 Signed-off-by: Hidde Beydals <[email protected]>
Materialising an `FSObject` returned by `EncodedObject` previously fell back to a direct `o.fs.Open(packPath)` on every `Reader()` call, bypassing the per-pack file-descriptor pool introduced in `billy.Filesystem` that pays an mmap/munmap on each Open, the regression amplified to ~400x more pack opens over 500 reads on the basic fixture[1]. The cure threads the dotgit-cached `*packhandle.PackHandle` into the `*packfile.Packfile` via a new `WithPackHandle` option taking a `PackHandleResolver`. `objectFromHeader` captures the resolver in the `FSObject.acquireRandom` closure, so every read acquires a fresh cursor against the pool-owned `SharedFile`. The resolver re-runs on every `Reader` call, which lets cached FSObjects survive resets of dotgit's pack-handle store such as the one `NewObjectPack` triggers before a repack. The internal `*packhandle.PackHandle` stays internal — the public `packfile.PackHandle` interface returns `io.ReadSeekCloser` and `RandomReader`, and `DotGit.PackHandle` returns the interface through an unexported adapter so callers can't accidentally call `Close()` on a dotgit-owned handle. `Packfile.Close` releases only the scanner cursor; the resolver-owned handle outlives the Packfile. [1]: go-git#2153 (comment) Assisted-by: Claude Opus 4.7 Signed-off-by: Hidde Beydals <[email protected]>
before:
time /tmp/merge-base ~/vc/linux/.git refs/tags/v7.0 refs/tags/v6.9 a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
real 0m10.530s
user 0m8.033s
sys 0m3.195s
after:
$ time /tmp/merge-base ~/vc/linux/.git refs/tags/v7.0 refs/tags/v6.9 a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
real 0m6.716s
user 0m7.314s
sys 0m0.161s
this change, but osfs.New() with
KeepDescriptors: true$ time /tmp/merge-base ~/vc/linux/.git refs/tags/v7.0 refs/tags/v6.9
a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
real 0m11.368s
user 0m8.909s
sys 0m3.257s
CGit:
time git --git-dir ~/vc/linux/.git/ merge-base refs/tags/v6.9 refs/tags/v7.0 a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
real 0m2.948s
user 0m2.701s
sys 0m0.230s
After packhandle cache (#2132)
time /tmp/merge-base ~/vc/linux/.git refs/tags/v7.0 refs/tags/v6.9
a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
real 0m10.596s
user 0m9.272s
sys 0m2.489s