plumbing: format decoder input bounds and contracts#2119
Merged
Conversation
1c55abf to
ae6888c
Compare
pjbgf
reviewed
May 14, 2026
628d9eb to
f74c5cc
Compare
Apply the size formula canonical Git uses in `packfile.c:load_idx`
[1]: for an idx v2 file with `nr` objects and hash size `hashsz`,
the on-disk length must lie within `[minSize, maxSize]` where
minSize = 8 + 4*256 + nr*(hashsz+8) + 2*hashsz
maxSize = minSize + (nr-1)*8 when nr > 0
Overflow-checked multiplication rejects inputs whose claimed
object count overflows the formula rather than wrapping into a
smaller value.
The on-disk length is taken from `Stat` on the new `Input`
interface, which the `Decoder` now requires in place of a bare
`io.Reader`. `*os.File` and `billy.File` satisfy `Input`
directly.
The `buildOOBOffset64Idx` test fixture is updated to hold two
objects so its on-disk length satisfies the formula; only with
`nr > 1` does it admit any 8-byte offset64 slots. The
`truncated_object_names` case in `TestDecodeErrors` now reports
`ErrMalformedIdxFile` because the size check catches the input
before any reader sees end-of-stream.
[1]: https://github.com/git/git/blob/v2.54.0/packfile.c#L232-L258
Assisted-by: Claude Opus 4.7
Signed-off-by: Hidde Beydals <[email protected]>
f74c5cc to
1b09d04
Compare
pjbgf
reviewed
May 14, 2026
pjbgf
reviewed
May 14, 2026
pjbgf
reviewed
May 14, 2026
1b09d04 to
2ee249b
Compare
A packfile object header carries the object's uncompressed size, and a well-formed packfile never produces more inflated bytes than that value. Enforce this structural invariant at every site that consumes the inflated stream and surface mismatches as `ErrInflatedSizeMismatch`. The streaming and forward-scan paths in `scanner.go`, plus the delta call sites in `parser.go` and `packfile.go`, wrap their write sinks in a new `boundedWriter` that stops accepting input once the declared size is reached. The declared size threaded through varies by site: `oh.Size` for non-delta objects, `oh.Size` for delta objects in `ensureContent` (where it refers to the delta instruction stream, not the resolved result), and `parent.Size` for parent objects in `parentReader`. The lazy `FSObject.Reader` and mmap `ondemandObject.Reader` paths return a new `BoundedReadCloser` that wraps the underlying `zlibReadCloser`. It builds on `io.LimitedReader` with the standard overrun-detection trick — request `limit + 1` bytes and treat the sentinel byte as an explicit signal — so an `io.ReadAll` on an opened object cannot stream past the declared size. These changes do not reject any conforming pack. Assisted-by: Claude Opus 4.7 Signed-off-by: Hidde Beydals <[email protected]>
Replace the iterative single-star backtracker with a faithful port of the recursive `dowild` function in canonical Git's `wildmatch.c` at tag v2.54.0[1]. The two matchers were already aligned on the common cases — both pass the existing conformance suite anchored against `t3070-wildmatch.sh` and the live `git check-ignore` oracle — but the algorithms differed structurally: upstream is a recursive descent that propagates `WM_ABORT_ALL` and `WM_ABORT_TO_STARSTAR` codes back up to prune work at each `*`, while the previous Go implementation backtracked iteratively. The new `dowild(p, text, flags) int` mirrors the C function's control flow line for line: the outer loop walks p and text in lock-step, the `*` case classifies the run of stars and dispatches to the fast-skip-then-recurse loop, the `[` case implements the bracket state machine with literal entries, escapes, ranges, and `[:class:]` headers, and `\` consumes the next character or returns `WM_NOMATCH` when there is none. The bracket-expression helpers `findBracketEnd` and `matchBracket` are subsumed by the inline state machine, matching upstream layout. `wmCasefold` and `wmPathname` flags are implemented for parity with `wildmatch.h` even though go-git's public `Match` does not currently exercise them: the matcher splits paths on `/` before calling `wildmatch`, so each call sees only a single segment of pattern and text, and case-folding is not exposed on the public API. The package-level `globMatch` and `simpleNameMatch` keep their current contracts. POSIX character-class evaluation remains ASCII-only, matching Git's `sane-ctype.h`[2]: bytes with the high bit set never satisfy any class. An unrecognized class name now propagates `WM_ABORT_ALL` up from `matchPOSIXClass` rather than returning a bare `false`, matching the "malformed [:class:] string" branch in upstream. [1]: https://github.com/git/git/blob/v2.54.0/wildmatch.c#L59-L283 [2]: https://github.com/git/git/blob/v2.54.0/sane-ctype.h Assisted-by: Claude Opus 4.7 Signed-off-by: Hidde Beydals <[email protected]>
`Reader` defines a two-phase contract: callers must invoke `Header`
to consume the loose-object type and size fields, after which the
internal `multi` reader and `hasher` are initialised. Calling `Read`
or `Hash` before a successful `Header` accessed uninitialised fields.
Make the contract explicit: `Read` returns the new `ErrHeaderNotRead`
sentinel when `Header` has not been called successfully. `Hash`
returns a zero `plumbing.Hash` carrying the `Reader`'s configured
object format in the same situation — `plumbing.Hash` encodes the
format internally and a bare `plumbing.Hash{}` would be
SHA-1-shaped on a SHA-256 repository. Both methods guard on
`r.multi == nil`, which `prepareForRead` sets on the first
successful `Header` call.
Two new unit tests pin the behaviour: `TestReaderReadBeforeHeader`
runs across both object formats and asserts that `Read` returns
`ErrHeaderNotRead` while `Hash` returns a zero hash sized to the
configured format; `TestReaderReadAfterHeaderError` calls `Read`
after `Header` has returned an error.
Assisted-by: Claude Opus 4.7
Signed-off-by: Hidde Beydals <[email protected]>
2ee249b to
832a692
Compare
hiddeco
added a commit
to hiddeco/go-git
that referenced
this pull request
May 14, 2026
A packfile object header carries the object's uncompressed size, and a well-formed packfile never produces more inflated bytes than that value. Enforce this structural invariant at every site that consumes the inflated stream and surface mismatches as `ErrInflatedSizeMismatch`. Both `Scanner.NextObject` and `Scanner.ReadObject` capture the declared length from the most recent `pendingObject` before clearing it, then thread that bound into `copyObject`. The shared `copyObject(w, declaredSize)` wraps its writer in a new `boundedWriter` when a positive size is supplied, so an overrun returns the legal prefix alongside `ErrInflatedSizeMismatch` rather than being silently appended. Callers in `parser.go` and `packfile.go` go through `NextObject`, so they pick up the bound without further changes; for delta entries the declared length is the size of the delta instruction stream, not the resolved result, which matches what these call sites already pass through. The lazy `FSObject.Reader` big-object path returns a new `boundedReadCloser` that wraps the underlying zlib-backed reader. It builds on `io.LimitedReader` with the standard overrun-detection trick — request `limit + 1` bytes and treat the sentinel byte as an explicit signal — so an `io.ReadAll` on an opened object cannot stream past the declared size. The bound is `o.size`, the resolved object size, which matches what the caller is reading regardless of whether the underlying entry is a non-delta or delta-resolved object. The type and its constructor are unexported for now; they can be promoted once a stable use case outside the package emerges. These changes do not reject any conforming pack. Backport of go-git#2119. Assisted-by: Claude Opus 4.7 Signed-off-by: Hidde Beydals <[email protected]>
hiddeco
added a commit
to hiddeco/go-git
that referenced
this pull request
May 14, 2026
`Reader` defines a two-phase contract: callers must invoke `Header` to consume the loose-object type and size fields, after which the internal `multi` reader and `hasher` are initialised. Calling `Read` or `Hash` before a successful `Header` accessed uninitialised fields and would panic on `r.multi.Read` / `r.hasher.Sum`. Make the contract explicit: `Read` returns the new `ErrHeaderNotRead` sentinel when `Header` has not been called successfully. `Hash` returns `plumbing.ZeroHash` in the same situation. Both methods guard on `r.multi == nil`, which `prepareForRead` sets on the first successful `Header` call. Two unit tests pin the behaviour: `TestReaderReadBeforeHeader` asserts that `Read` returns `ErrHeaderNotRead` and `Hash` returns `ZeroHash`; `TestReaderReadAfterHeaderError` calls `Read` after `Header` has returned an error. Backport of go-git#2119. Assisted-by: Claude Opus 4.7 Signed-off-by: Hidde Beydals <[email protected]>
pjbgf
pushed a commit
that referenced
this pull request
May 14, 2026
Apply the size formula canonical Git uses in `packfile.c:load_idx`
[1]: for an idx v2 file with `nr` objects and hash size `hashsz`,
the on-disk length must lie within `[minSize, maxSize]` where
minSize = 8 + 4*256 + nr*(hashsz+8) + 2*hashsz
maxSize = minSize + (nr-1)*8 when nr > 0
Overflow-checked multiplication rejects inputs whose claimed object
count overflows the formula rather than wrapping into a smaller
value.
To keep the public `NewDecoder(io.Reader)` signature compatible on
v5, the decoder probes for an unexported `statInput` interface
(`Stat() (fs.FileInfo, error)`) at the top of `Decode`. Inputs that
satisfy it have the size formula applied; arbitrary readers retain
the pre-existing behaviour of erroring out at the truncated-payload
boundary. The two production call sites in `storage/filesystem`
pass a `billy.File`; on `osfs` that wraps `*os.File`, so the
formula does run there.
[1]: https://github.com/git/git/blob/v2.54.0/packfile.c#L232-L258
Backport of #2119.
Assisted-by: Claude Opus 4.7
Signed-off-by: Hidde Beydals <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Four changes align
plumbing/format/*decoder behavior with canonical Git invariants and clarify two internal API contracts.idxfile.Decoder.Decodeports the size-consistency formula frompackfile.c:load_idx1: an idx v2 file withnrobjects (the last fanout entry) and hash sizehashszmust have an on-disk length in[minSize, maxSize]derived fromnrvia overflow-checked multiplication. The decoder now buffers the input at the top ofDecodeand rejects inputs whose declared object count is inconsistent with the file size. All current callers instorage/filesystemand the HTTP dumb transport pass file-backed readers (the dumb transport opens the .idx after it has been downloaded to local storage), so the buffer cost is at most one allocation the size of the .idx file; in practice that is bounded by the size of the underlying pack file.packfile.Scannerenforces the structural invariant that a packfile object's inflated length does not exceed the size declared in its header, at every site that consumes the inflated stream. The streaminginflateContentand forward-scanobjectEntrypaths wrap their write sinks inboundedWriter, which returnsErrInflatedSizeMismatchon overrun. The lazyFSObject.Readerand mmapondemandObject.Readerwrap their returnedzlibReadCloserinBoundedReadCloser, so anio.ReadAllon an opened object cannot stream past the declared size either. The call sites inparser.goandpackfile.gothread the declared size through, including the delta paths where the size refers to the delta instruction stream rather than the resolved result.gitignore.wildmatchis replaced with a port of the recursivedowildfunction in canonical Git'swildmatch.c2 at tag v2.54.0. The previous Go implementation was an iterative backtracker; the port preserves the upstream algorithm exactly — including theWM_ABORT_TO_STARSTARandWM_ABORT_ALLshortcuts, the bracket-expression state machine, and POSIX character-class evaluation — while taking an idiomatic Go shape: string slicing in place of[]byteconversions,strings.IndexByteinstead of a hand-rolled helper, and a regular switch in place of the Cgoto-via-bool pattern. Thet3070-wildmatch.sh-anchored conformance suite passes unchanged.objfile.Reader.ReadandReader.Hashnow require a successfulHeadercall before they will return data, returning the newErrHeaderNotReadsentinel (and a zeroplumbing.Hash{}respectively) otherwise. The two-phase contract —Header, thenRead— is unchanged; this commit makes it explicit at the API surface.