plumbing: object, tree entry validation#2105
Merged
Merged
Conversation
a04f693 to
39421d8
Compare
Tree.Encode previously rejected only embedded NUL bytes and mis-sorted entries, so a caller assembling a *Tree programmatically could produce a tree object containing components like ".git", "..", control characters, or HFS+/NTFS variants of ".git". Run each entry name through pathutil.ValidTreePath at the encoder boundary so the library cannot emit such bytes through its API. The same loop also rejects duplicate names, including the file and directory collision where `foo` and `foo/` resolve to the same sort key. With entries already in memory a single pass over `entry.Name` catches the non-consecutive case that upstream Git's `verify_ordered` in `fsck.c`[1] tracks with a name stack; the simpler form suffices because the file/dir variant shares the bare name. A test helper in worktree_commit_test.go that planted a tree containing ".git/config" for cherry-pick coverage can no longer do that through Tree.Encode and now writes the raw object bytes directly, matching the escape hatch documented in Tree.Encode's godoc. [1]: https://github.com/git/git/blob/v2.54.0/fsck.c#L544-L614 Assisted-by: Claude Opus 4.7 Signed-off-by: Hidde Beydals <[email protected]>
go-git's tree decoder is permissive on purpose: tools that inspect or recover repositories need to read tree objects whose entries do not pass the index-side path-validation gate. Upstream Git keeps the same split — `parse_tree_buffer`[1] is a no-op on names while `fsck_tree`[2] is the validator — and exposes the latter through `git fsck`. Add the read-side counterpart to Tree.Encode's producer gate: a Tree.Validate method that mirrors `fsck_tree`'s structural ruleset. Decode stays as-is; callers that want fsck-shaped reporting call Validate. The ported rules cover null entry hashes, slashes in names, empty names, ".", "..", ".git" with HFS+/NTFS disguises (already handled by pathutil.ValidTreePath), bad file modes, duplicate entries, mis-sort, names longer than 4096 bytes, and the four FSCK_MSG_*_SYMLINK rules for ".gitmodules", ".gitattributes", ".gitignore", and ".mailmap". Two fsck_tree warnings — zero-padded modes and the non-canonical 0100664 bits — are not surfaced: both require retaining the raw octal mode string from the wire, which canonicalTreeMode drops. The structural rules are the load-bearing ones for refusing malformed trees. The new pathutil helpers IsHFSDotGitattributes / IsHFSDotGitignore / IsHFSDotMailmap and their NTFS counterparts extend the existing IsHFSDotGitmodules / IsNTFSDotGitmodules pair with the short-name prefixes upstream uses for the same names (`gi7d29`, `gi250a`, `maba30`). Violations are returned as an errors.Join of individual rule errors, each wrapping ErrInvalidTree, so callers can match either the umbrella or specific rules (ErrDuplicateEntry, ErrEntriesNotSorted, pathutil.ErrInvalidPath) with errors.Is. [1]: https://github.com/git/git/blob/v2.54.0/tree.c#L175-L184 [2]: https://github.com/git/git/blob/v2.54.0/fsck.c#L616-L800 Assisted-by: Claude Opus 4.7 Signed-off-by: Hidde Beydals <[email protected]>
Encode previously enforced a subset of Tree.Validate's rules inline (ValidTreePath, duplicate detection, sort order). With Validate available as the structural ruleset, Encode can run it once at the top and skip the duplicated checks — both shrinking the encoder and widening the producer-side gate to refuse the rules Validate already catches but Encode did not: null entry hashes, oversize names, and symlinks disguised as `.gitmodules` / `.gitattributes` / `.gitignore` / `.mailmap`. The widened gate exposes one pre-existing path that built trees containing entries with the zero hash: buildTreeHelper turned every index entry into a tree entry, including those carried through go-git#1773 to avoid a panic during recursive tree construction. Skip such entries at the top of commitIndexEntry so commit produces a well-formed tree without regressing the panic-safe behaviour the go-git#1773 fix introduced. Assisted-by: Claude Opus 4.7 Signed-off-by: Hidde Beydals <[email protected]>
39421d8 to
5dc6b02
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes the encode/decode boundary that #2081 and #2097 left tolerant.
Tree.Encodepreviously accepted entry names like.git,.., control characters, or HFS+/NTFS.gitdisguises, and never checked for duplicates;Tree.Decodedid not validate names at all.Tree.Encodenow runs each name throughpathutil.ValidTreePathand rejects duplicate names — including the file/dir collision wherefooandfoo/share a bare name. Callers that need to emit non-conformant bytes write them viaplumbing.EncodedObjectdirectly.Tree.Decodestays permissive — upstream Git makes the same split betweenparse_tree_bufferandfsck_tree1, and the permissive parse keeps inspection and recovery use cases working. The newTree.Validateis the opt-in read-side counterpart, mirroringfsck_tree's structural rules: null hashes, slashes, empty/./../.gitnames, bad file modes, duplicates, mis-sort, oversized names, and the four*_SYMLINKrules for.gitmodules/.gitattributes/.gitignore/.mailmap. Violations come back aserrors.Joinof rule errors wrappingErrInvalidTree.Two
fsck_treewarnings need the raw octal mode string — zero-padded and non-canonical0100664modes — whichcanonicalTreeModedrops at parse. Those are wire-only and best handled by a future buffer-level validator.