Skip to content

plumbing: format decoder input bounds and contracts#2119

Merged
pjbgf merged 4 commits into
go-git:mainfrom
hiddeco:format-input-bounds
May 14, 2026
Merged

plumbing: format decoder input bounds and contracts#2119
pjbgf merged 4 commits into
go-git:mainfrom
hiddeco:format-input-bounds

Conversation

@hiddeco
Copy link
Copy Markdown
Member

@hiddeco hiddeco commented May 13, 2026

Four changes align plumbing/format/* decoder behavior with canonical Git invariants and clarify two internal API contracts.

idxfile.Decoder.Decode ports the size-consistency formula from packfile.c:load_idx1: an idx v2 file with nr objects (the last fanout entry) and hash size hashsz must have an on-disk length in [minSize, maxSize] derived from nr via overflow-checked multiplication. The decoder now buffers the input at the top of Decode and rejects inputs whose declared object count is inconsistent with the file size. All current callers in storage/filesystem and the HTTP dumb transport pass file-backed readers (the dumb transport opens the .idx after it has been downloaded to local storage), so the buffer cost is at most one allocation the size of the .idx file; in practice that is bounded by the size of the underlying pack file.

packfile.Scanner enforces the structural invariant that a packfile object's inflated length does not exceed the size declared in its header, at every site that consumes the inflated stream. The streaming inflateContent and forward-scan objectEntry paths wrap their write sinks in boundedWriter, which returns ErrInflatedSizeMismatch on overrun. The lazy FSObject.Reader and mmap ondemandObject.Reader wrap their returned zlibReadCloser in BoundedReadCloser, so an io.ReadAll on an opened object cannot stream past the declared size either. The call sites in parser.go and packfile.go thread the declared size through, including the delta paths where the size refers to the delta instruction stream rather than the resolved result.

gitignore.wildmatch is replaced with a port of the recursive dowild function in canonical Git's wildmatch.c2 at tag v2.54.0. The previous Go implementation was an iterative backtracker; the port preserves the upstream algorithm exactly — including the WM_ABORT_TO_STARSTAR and WM_ABORT_ALL shortcuts, the bracket-expression state machine, and POSIX character-class evaluation — while taking an idiomatic Go shape: string slicing in place of []byte conversions, strings.IndexByte instead of a hand-rolled helper, and a regular switch in place of the C goto-via-bool pattern. The t3070-wildmatch.sh-anchored conformance suite passes unchanged.

objfile.Reader.Read and Reader.Hash now require a successful Header call before they will return data, returning the new ErrHeaderNotRead sentinel (and a zero plumbing.Hash{} respectively) otherwise. The two-phase contract — Header, then Read — is unchanged; this commit makes it explicit at the API surface.

@hiddeco hiddeco force-pushed the format-input-bounds branch 3 times, most recently from 1c55abf to ae6888c Compare May 13, 2026 21:44
@hiddeco hiddeco marked this pull request as ready for review May 13, 2026 21:51
Comment thread plumbing/format/objfile/reader.go Outdated
@hiddeco hiddeco force-pushed the format-input-bounds branch 3 times, most recently from 628d9eb to f74c5cc Compare May 14, 2026 11:37
Apply the size formula canonical Git uses in `packfile.c:load_idx`
[1]: for an idx v2 file with `nr` objects and hash size `hashsz`,
the on-disk length must lie within `[minSize, maxSize]` where

    minSize = 8 + 4*256 + nr*(hashsz+8) + 2*hashsz
    maxSize = minSize + (nr-1)*8     when nr > 0

Overflow-checked multiplication rejects inputs whose claimed
object count overflows the formula rather than wrapping into a
smaller value.

The on-disk length is taken from `Stat` on the new `Input`
interface, which the `Decoder` now requires in place of a bare
`io.Reader`. `*os.File` and `billy.File` satisfy `Input`
directly.

The `buildOOBOffset64Idx` test fixture is updated to hold two
objects so its on-disk length satisfies the formula; only with
`nr > 1` does it admit any 8-byte offset64 slots. The
`truncated_object_names` case in `TestDecodeErrors` now reports
`ErrMalformedIdxFile` because the size check catches the input
before any reader sees end-of-stream.

[1]: https://github.com/git/git/blob/v2.54.0/packfile.c#L232-L258

Assisted-by: Claude Opus 4.7
Signed-off-by: Hidde Beydals <[email protected]>
@hiddeco hiddeco force-pushed the format-input-bounds branch from f74c5cc to 1b09d04 Compare May 14, 2026 13:04
Comment thread plumbing/format/packfile/scanner.go
Comment thread plumbing/format/packfile/scanner.go Outdated
Comment thread plumbing/format/packfile/scanner.go Outdated
@hiddeco hiddeco force-pushed the format-input-bounds branch from 1b09d04 to 2ee249b Compare May 14, 2026 13:31
hiddeco added 3 commits May 14, 2026 15:34
A packfile object header carries the object's uncompressed size, and
a well-formed packfile never produces more inflated bytes than that
value. Enforce this structural invariant at every site that consumes
the inflated stream and surface mismatches as
`ErrInflatedSizeMismatch`.

The streaming and forward-scan paths in `scanner.go`, plus the delta
call sites in `parser.go` and `packfile.go`, wrap their write sinks
in a new `boundedWriter` that stops accepting input once the declared
size is reached. The declared size threaded through varies by site:
`oh.Size` for non-delta objects, `oh.Size` for delta objects in
`ensureContent` (where it refers to the delta instruction stream,
not the resolved result), and `parent.Size` for parent objects in
`parentReader`.

The lazy `FSObject.Reader` and mmap `ondemandObject.Reader` paths
return a new `BoundedReadCloser` that wraps the underlying
`zlibReadCloser`. It builds on `io.LimitedReader` with the standard
overrun-detection trick — request `limit + 1` bytes and treat the
sentinel byte as an explicit signal — so an `io.ReadAll` on an
opened object cannot stream past the declared size.

These changes do not reject any conforming pack.

Assisted-by: Claude Opus 4.7
Signed-off-by: Hidde Beydals <[email protected]>
Replace the iterative single-star backtracker with a faithful port
of the recursive `dowild` function in canonical Git's `wildmatch.c`
at tag v2.54.0[1]. The two matchers were already aligned on the
common cases — both pass the existing conformance suite anchored
against `t3070-wildmatch.sh` and the live `git check-ignore`
oracle — but the algorithms differed structurally: upstream is a
recursive descent that propagates `WM_ABORT_ALL` and
`WM_ABORT_TO_STARSTAR` codes back up to prune work at each `*`,
while the previous Go implementation backtracked iteratively.

The new `dowild(p, text, flags) int` mirrors the C function's
control flow line for line: the outer loop walks p and text in
lock-step, the `*` case classifies the run of stars and dispatches
to the fast-skip-then-recurse loop, the `[` case implements the
bracket state machine with literal entries, escapes, ranges, and
`[:class:]` headers, and `\` consumes the next character or
returns `WM_NOMATCH` when there is none. The bracket-expression
helpers `findBracketEnd` and `matchBracket` are subsumed by the
inline state machine, matching upstream layout.

`wmCasefold` and `wmPathname` flags are implemented for parity
with `wildmatch.h` even though go-git's public `Match` does not
currently exercise them: the matcher splits paths on `/` before
calling `wildmatch`, so each call sees only a single segment of
pattern and text, and case-folding is not exposed on the public
API. The package-level `globMatch` and `simpleNameMatch` keep
their current contracts.

POSIX character-class evaluation remains ASCII-only, matching
Git's `sane-ctype.h`[2]: bytes with the high bit set never
satisfy any class. An unrecognized class name now propagates
`WM_ABORT_ALL` up from `matchPOSIXClass` rather than returning a
bare `false`, matching the "malformed [:class:] string" branch in
upstream.

[1]: https://github.com/git/git/blob/v2.54.0/wildmatch.c#L59-L283
[2]: https://github.com/git/git/blob/v2.54.0/sane-ctype.h

Assisted-by: Claude Opus 4.7
Signed-off-by: Hidde Beydals <[email protected]>
`Reader` defines a two-phase contract: callers must invoke `Header`
to consume the loose-object type and size fields, after which the
internal `multi` reader and `hasher` are initialised. Calling `Read`
or `Hash` before a successful `Header` accessed uninitialised fields.

Make the contract explicit: `Read` returns the new `ErrHeaderNotRead`
sentinel when `Header` has not been called successfully. `Hash`
returns a zero `plumbing.Hash` carrying the `Reader`'s configured
object format in the same situation — `plumbing.Hash` encodes the
format internally and a bare `plumbing.Hash{}` would be
SHA-1-shaped on a SHA-256 repository. Both methods guard on
`r.multi == nil`, which `prepareForRead` sets on the first
successful `Header` call.

Two new unit tests pin the behaviour: `TestReaderReadBeforeHeader`
runs across both object formats and asserts that `Read` returns
`ErrHeaderNotRead` while `Hash` returns a zero hash sized to the
configured format; `TestReaderReadAfterHeaderError` calls `Read`
after `Header` has returned an error.

Assisted-by: Claude Opus 4.7
Signed-off-by: Hidde Beydals <[email protected]>
@hiddeco hiddeco force-pushed the format-input-bounds branch from 2ee249b to 832a692 Compare May 14, 2026 13:34
Copy link
Copy Markdown
Member

@pjbgf pjbgf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hiddeco Thanks for working on this. 🙇

@pjbgf pjbgf merged commit d4d24cb into go-git:main May 14, 2026
17 checks passed
@hiddeco hiddeco deleted the format-input-bounds branch May 14, 2026 14:00
hiddeco added a commit to hiddeco/go-git that referenced this pull request May 14, 2026
A packfile object header carries the object's uncompressed size, and
a well-formed packfile never produces more inflated bytes than that
value. Enforce this structural invariant at every site that consumes
the inflated stream and surface mismatches as
`ErrInflatedSizeMismatch`.

Both `Scanner.NextObject` and `Scanner.ReadObject` capture the
declared length from the most recent `pendingObject` before clearing
it, then thread that bound into `copyObject`. The shared
`copyObject(w, declaredSize)` wraps its writer in a new
`boundedWriter` when a positive size is supplied, so an overrun
returns the legal prefix alongside `ErrInflatedSizeMismatch` rather
than being silently appended. Callers in `parser.go` and
`packfile.go` go through `NextObject`, so they pick up the bound
without further changes; for delta entries the declared length is
the size of the delta instruction stream, not the resolved result,
which matches what these call sites already pass through.

The lazy `FSObject.Reader` big-object path returns a new
`boundedReadCloser` that wraps the underlying zlib-backed reader.
It builds on `io.LimitedReader` with the standard overrun-detection
trick — request `limit + 1` bytes and treat the sentinel byte as an
explicit signal — so an `io.ReadAll` on an opened object cannot
stream past the declared size. The bound is `o.size`, the resolved
object size, which matches what the caller is reading regardless of
whether the underlying entry is a non-delta or delta-resolved
object. The type and its constructor are unexported for now; they
can be promoted once a stable use case outside the package emerges.

These changes do not reject any conforming pack.

Backport of go-git#2119.

Assisted-by: Claude Opus 4.7
Signed-off-by: Hidde Beydals <[email protected]>
hiddeco added a commit to hiddeco/go-git that referenced this pull request May 14, 2026
`Reader` defines a two-phase contract: callers must invoke `Header`
to consume the loose-object type and size fields, after which the
internal `multi` reader and `hasher` are initialised. Calling `Read`
or `Hash` before a successful `Header` accessed uninitialised
fields and would panic on `r.multi.Read` / `r.hasher.Sum`.

Make the contract explicit: `Read` returns the new `ErrHeaderNotRead`
sentinel when `Header` has not been called successfully. `Hash`
returns `plumbing.ZeroHash` in the same situation. Both methods
guard on `r.multi == nil`, which `prepareForRead` sets on the first
successful `Header` call.

Two unit tests pin the behaviour: `TestReaderReadBeforeHeader`
asserts that `Read` returns `ErrHeaderNotRead` and `Hash` returns
`ZeroHash`; `TestReaderReadAfterHeaderError` calls `Read` after
`Header` has returned an error.

Backport of go-git#2119.

Assisted-by: Claude Opus 4.7
Signed-off-by: Hidde Beydals <[email protected]>
pjbgf pushed a commit that referenced this pull request May 14, 2026
Apply the size formula canonical Git uses in `packfile.c:load_idx`
[1]: for an idx v2 file with `nr` objects and hash size `hashsz`,
the on-disk length must lie within `[minSize, maxSize]` where

    minSize = 8 + 4*256 + nr*(hashsz+8) + 2*hashsz
    maxSize = minSize + (nr-1)*8     when nr > 0

Overflow-checked multiplication rejects inputs whose claimed object
count overflows the formula rather than wrapping into a smaller
value.

To keep the public `NewDecoder(io.Reader)` signature compatible on
v5, the decoder probes for an unexported `statInput` interface
(`Stat() (fs.FileInfo, error)`) at the top of `Decode`. Inputs that
satisfy it have the size formula applied; arbitrary readers retain
the pre-existing behaviour of erroring out at the truncated-payload
boundary. The two production call sites in `storage/filesystem`
pass a `billy.File`; on `osfs` that wraps `*os.File`, so the
formula does run there.

[1]: https://github.com/git/git/blob/v2.54.0/packfile.c#L232-L258

Backport of #2119.

Assisted-by: Claude Opus 4.7
Signed-off-by: Hidde Beydals <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants