git: improve sparse checkout support #5796
Hi @bk2204, I have tested this in my local repo setup, and it seems to be working well with sparse checkouts. Do you have a plan to merge this anytime soon?
My apologies, this slipped off my radar. I've rebased this PR to make CI run again, and assuming it goes green, I'll mark it for review.
This looks good, thank you! I'd love to squeeze this into the upcoming v3.6.0 release.
My only reservation is that we don't have many tests using sparse checkouts (just a pair of tests in one script), and none which exercise this change in particular.
Of course, we have lots of tests of git lfs checkout and git lfs pull, the only callers of the ScanLFSFiles() function, which is the only code path by which the LsFilesLFS() function is used. So we know this change doesn't break anything, which is the most important point to confirm.
But it would also be nice to validate that it has the desired effect of increasing the efficiency of the git ls-files command when a user has a sparse checkout with sparse index.
I managed to write one test, which I could push to this PR's branch, or if you'd prefer to just include it in a commit, that's cool too. I figured it might go into the t/t-pull.sh test script. I've confirmed that it fails without the changes from this PR.
I'll try to craft a similar version for the t/t-checkout.sh script which exercises the git lfs checkout command instead of git lfs pull, unless you beat me to it!
begin_test "pull with partial clone and sparse checkout"
(
  set -e

  # Only test with Git version 2.42.0 as it introduced support for the
  # "objecttype" format option to the "git ls-files" command, which our
  # code requires.
  ensure_git_version_isnt "$VERSION_LOWER" "2.42.0"

  reponame="pull-sparse"
  setup_remote_repo "$reponame"
  clone_repo "$reponame" "$reponame"

  git lfs track "*.dat"

  contents1="a"
  contents1_oid=$(calc_oid "$contents1")
  contents2="b"
  contents2_oid=$(calc_oid "$contents2")
  contents3="c"
  contents3_oid=$(calc_oid "$contents3")

  mkdir in out
  printf "%s" "$contents1" > a.dat
  printf "%s" "$contents2" > in/b.dat
  printf "%s" "$contents3" > out/c.dat
  git add .
  git commit -m "add files"

  git push origin main

  assert_server_object "$reponame" "$contents1_oid"
  assert_server_object "$reponame" "$contents2_oid"
  assert_server_object "$reponame" "$contents3_oid"

  # Create a partial clone with a cone-mode sparse checkout of one directory
  # and a sparse index, which is important because otherwise the "git ls-files"
  # command ignores the --sparse option and lists all Git LFS files.
  cd ..
  git clone --filter=tree:0 --depth=1 --no-checkout \
    "$GITSERVER/$reponame" "${reponame}-partial"

  cd "${reponame}-partial"
  git sparse-checkout init --cone --sparse-index
  git sparse-checkout set in
  git checkout main

  [ -d "in" ]
  [ ! -e "out" ]

  assert_local_object "$contents1_oid" 1
  assert_local_object "$contents2_oid" 1
  refute_local_object "$contents3_oid"

  git lfs pull 2>&1 | tee pull.log
  grep -q "Downloading LFS objects" pull.log && exit 1

  # Git LFS objects associated with files outside of the sparse cone
  # should not have been pulled.
  refute_local_object "$contents3_oid"
)
end_test
Just FYI, I've made some adjustments to my proposed test above, which I wrote quite late last night. I've put those into the comment above. I'm also working on a version for |
I've squashed that test into the PR with a co-author credit and added a test for |
When we invoke `git ls-files` to try to find all LFS files, we don't honour sparse file paths or exclusions. While we should never actually traverse excluded files, using the `--exclude-standard` option can avoid loading some data with filtered clones, which may result in less data being downloaded.
In addition, we can honour sparse checkouts, since this code path is only used to handle the working tree and we know that the only files we need to consider are those Git actually put in the working tree. The `--sparse` option is new in 2.35, but we already require 2.42 above, so we can use it unconditionally.
Co-authored-by: Chris Darroch <[email protected]>
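As a rough illustration of the invocation this commit message describes, the shell sketch below assembles a `git ls-files` command from the options named in this PR (`--cached`, `--sparse`, `--exclude-standard`, a `--format` using `objecttype`, and an `attr:filter=lfs` pathspec). The `lfs_ls_files` function name, the `GIT` override variable, and the exact `--format` string are assumptions made for illustration; they are not taken from the client's code.

```shell
# Sketch only: list files in the index that match a Git LFS filter attribute,
# without expanding sparse directories when a sparse index is in use.
# GIT is a hypothetical override so the command can be stubbed out in tests.
lfs_ls_files() {
  # --cached            list files tracked in the index
  # --sparse            keep sparse directory entries collapsed (Git 2.35+)
  # --exclude-standard  apply the standard Git exclusion rules
  # --format            mimic "git ls-tree" output; the "objecttype" field
  #                     requires Git 2.42+ (this format string is an assumption)
  "${GIT:-git}" ls-files \
    --cached \
    --sparse \
    --exclude-standard \
    --format='%(objectmode) %(objecttype) %(objectname)%x09%(path)' \
    ':(attr:filter=lfs)'
}
```

Run inside a repository with Git 2.42 or later, `lfs_ls_files` should print one line per matching index entry in roughly the same shape as `git ls-tree` output. As the test comment in this PR notes, `--sparse` only has an effect when the index itself is sparse.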
chrisd8088
left a comment
Thanks very much for collaborating on some tests for this PR!
I'm going to take the slightly unorthodox step of approving this PR without a second review from another @git-lfs/core team member, in the interest of trying to get to a v3.6.0 release PR quite soon after this is merged.
I think that's OK because (a) I only contributed to the additional tests in this PR, not the change to the client's code, (b) the tests pass CI, (c) we've both looked at the tests and made revisions, and (d) I've checked that each of the relevant assertions in the tests fails as expected if the client code change is removed, not just the first such assertion in each test.
Hi @chrisd8088, may I know when v3.6.0 is going to be released?
In commit 5aa7be5 of PR #5796 we added tests of the sparse checkout support provided by our "git lfs checkout" and "git lfs pull" commands, which makes use of the "git ls-files" command and the --sparse option that was introduced for that command in Git v2.35.0.
In practice, the "git lfs checkout" and "git lfs pull" commands require Git v2.42.0 or higher to be available before they invoke "git ls-files", and otherwise fall back to using the "git ls-tree" command. We require at least Git v2.42.0 because that version introduced support for the "objecttype" field name in the "git ls-files" command's --format option and we depend on that field to be able to mimic the output format of the "git ls-tree" command with the "git ls-files" command. We noted these details in commit beae114 of PR #5699, when we revised the runScanLFSFiles() function in our "lfs" package to choose between the use of "git ls-files" and "git ls-tree".
One difference between the "git ls-files" and "git ls-tree" commands, however, is that the former lists the files in the Git index (since we always pass the --cached option) while the latter lists the files in the Git tree associated with a given reference, which in the case of our "git lfs checkout" and "git lfs pull" commands is always the current "HEAD" symbolic reference.
As a consequence, as discussed in issue #6004, if certain files are absent from the current working tree and Git index as the result of a partial clone or sparse checkout, the behaviour of the "git lfs checkout" and "git lfs pull" commands varies depending on the installed version of Git.
If Git v2.42.0 or higher is installed, the "git lfs checkout" and "git lfs pull" commands invoke the "git ls-files" command and provide an "attr:filter=lfs" pathspec so the Git command will filter out files which do not match a Git LFS filter attribute. However, in order to be reported, Git LFS pointer files must exist in the Git index; if they only appear in the working tree or the Git tree associated with the "HEAD" reference, they will be ignored.
(Note that in a non-bare repository, the "git ls-files" command will only match the "attr:filter=lfs" pathspec against attributes defined in ".gitattributes" files in the index or working tree, plus any local files such as the "$GIT_DIR/info/attributes" file. Any ".gitattributes" files that are present only in the Git tree associated with the "HEAD" reference will not be consulted. In a bare repository, meanwhile, the "git ls-files" command will by default not match the pathspec against attributes defined in ".gitattributes" files at all, regardless of whether such files exist in the index or in the tree referenced by "HEAD".)
If a version of Git older than v2.42.0 is installed and so the "git ls-tree" command is invoked instead of the "git ls-files" command, then a full list of the files in the tree-ish referenced by "HEAD" is returned. The "git lfs checkout" and "git lfs pull" commands will then attempt to check out the Git LFS objects associated with all the Git LFS pointer files found in this list. In the case of the "git lfs pull" command, it will also try to fetch those objects if they are not already present in the local storage directories.
(Note, though, that when the "git lfs checkout" and "git lfs pull" commands retrieve a list of files using the "git ls-tree" command, they do not check whether the pointer files they find in that list actually match Git LFS filter attributes in any ".gitattributes" or other Git attributes files. So a user may remove all the ".gitattributes" files from their working tree and index, commit those changes to "HEAD", and the Git LFS commands will still attempt to check out objects for any files found in the "HEAD" commit's tree that can be parsed as valid pointers. When the "git ls-files" command is used instead of the "git ls-tree" command to retrieve a file list, this legacy behaviour does not occur, because the "attr:filter=lfs" pathspec requires that the "git ls-files" command only return a list of files which match at least one Git LFS filter attribute.)
In subsequent commits we will alter how the "git lfs checkout" and "git lfs pull" commands operate within bare repositories and how they handle file paths, including by changing the current working directory to the root of the current working tree, if one is present. Of necessity, our tests and documentation will also be expanded to reflect the variable behaviour of the "git lfs pull" command in particular, since its effects in a bare repository depend in part on the installed version of Git.
Before we make these changes, we first revise our existing tests of the "git lfs checkout" and "git lfs pull" commands with partial clones and sparse checkouts so that the tests confirm the key differences in behaviour when the installed version of Git is v2.42.0 or higher. Our tests now demonstrate that with an older version of Git, objects will be fetched (in the case of the "git lfs pull" command) and checked out for all Git LFS files, including those outside the configured sparse cone.
We also update the manual pages for these commands to include an explanation of how their operation varies depending on the installed version of Git, how this may affect repositories with partial clones and sparse checkouts, and the options available to users if they find the "git lfs checkout" and "git lfs pull" commands appear to be ignoring certain files.
As well, we edit the initial section in our git-lfs-pull(1) manual page where we incorrectly state that the command is always equivalent to running "git lfs fetch" followed by "git lfs checkout", and fix the formatting of the example commands provided in this section. When we converted our manual page source files from the Ronn format to AsciiDoc in commit 0c66dcf of PR #5054, the two example commands in this section were accidentally merged onto a single line, and the "<remote>" option for the "git lfs fetch" command was elided. We therefore restore the original version of these two example commands and add leading shell prompt indicators to further clarify that the example includes two separate commands.
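The version-dependent choice described in this commit message can be sketched in shell as follows. This is a minimal illustration, not the client's Go implementation: the `version_to_int` and `choose_scan_command` helper names are hypothetical, and the sketch assumes a bare `major.minor.patch` version string as input.

```shell
# Hypothetical helper: convert "major.minor.patch" into a single comparable
# integer. Missing or non-numeric trailing parts (e.g. "0-rc1") count as 0.
version_to_int() {
  printf '%s\n' "$1" | awk -F. '{ printf "%d\n", $1 * 1000000 + $2 * 1000 + $3 }'
}

# Hypothetical helper: print which Git command would be used to list files,
# following the rule described above (Git v2.42.0 or higher: "ls-files",
# otherwise fall back to "ls-tree").
choose_scan_command() {
  if [ "$(version_to_int "$1")" -ge "$(version_to_int "2.42.0")" ]; then
    echo "ls-files"
  else
    echo "ls-tree"
  fi
}
```

For example, `choose_scan_command "$(git version | awk '{print $3}')"` reports which listing command applies to the locally installed Git. With "ls-tree", pointer files outside the sparse cone are still listed, which is why the behaviour of "git lfs pull" differs between Git versions.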