@bk2204 bk2204 commented Jun 7, 2024

When we invoke git ls-files to try to find all LFS files, we don't honour sparse file paths or exclusions. While we should never actually traverse excluded files, using the --exclude-standard option can avoid loading some data with filtered clones, which may result in less data being downloaded.

In addition, we can honour sparse checkouts, since this code path is only used to handle the working tree and we know that the only files we need to consider are those Git actually put in the working tree. The --sparse option is new in 2.35, but we already require 2.42 above, so we can use it unconditionally.
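
As a rough illustration of the listing step described above, the call might look something like the sketch below. The helper name `list_lfs_paths` and the `GIT` override variable are hypothetical, and the exact argument list Git LFS passes is not shown in this thread; only the `--cached`, `--sparse` and `--exclude-standard` options and the `attr:filter=lfs` pathspec are taken from the discussion.

```shell
# Hypothetical helper: list tracked paths that carry the "filter=lfs"
# attribute, honouring sparse checkouts and standard exclusions.
# Running it for real assumes Git 2.35 or later (for --sparse).
# GIT may be overridden (for example GIT=echo) to preview the command.
list_lfs_paths() {
  ${GIT:-git} ls-files --cached --sparse --exclude-standard \
    ':(attr:filter=lfs)'
}
```

Running `GIT=echo list_lfs_paths` prints the command line instead of executing it, which is a convenient way to check the arguments without needing a repository.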

@bk2204 bk2204 changed the title from "git: improve sparse file support" to "git: improve sparse checkout support" Jun 10, 2024

manoraj commented Nov 14, 2024

Hi @bk2204, I have tested this in my local repo setup, and it seems to be working well with sparse checkouts. Do you have a plan to merge this anytime soon?

bk2204 commented Nov 14, 2024

My apologies, this slipped off my radar. I've rebased this PR to make CI run again, and assuming it goes green, I'll mark it for review.

@bk2204 bk2204 marked this pull request as ready for review November 14, 2024 14:49
@bk2204 bk2204 requested a review from a team as a code owner November 14, 2024 14:49

@chrisd8088 chrisd8088 left a comment

This looks good, thank you! I'd love to squeeze this into the upcoming v3.6.0 release.

My only reservation is that we don't have many tests using sparse checkouts (just a pair of tests in one script), and none which exercise this change in particular.

Of course we have lots of tests of git lfs checkout and git lfs pull, the only callers of the ScanLFSFiles() function, which is the only code path by which the LsFilesLFS() function is used. So we know this change doesn't break anything, which is the most important point to confirm.

But it would also be nice to validate that it has the desired effect of increasing the efficiency of the git ls-files command when a user has a sparse checkout with sparse index.

I managed to write one test, which I could push to this PR's branch, or if you'd prefer to just include it in a commit, that's cool too. I figured it might go into the t/t-pull.sh test script. I've confirmed that it fails without the changes from this PR.

I'll try to craft a similar version for the t/t-checkout.sh script which exercises the git lfs checkout command instead of git lfs pull, unless you beat me to it!

begin_test "pull with partial clone and sparse checkout"
(
  set -e

  # Only test with Git version 2.42.0 or later, as that version introduced
  # support for the "objecttype" format option to the "git ls-files"
  # command, which our code requires.
  ensure_git_version_isnt "$VERSION_LOWER" "2.42.0"

  reponame="pull-sparse"
  setup_remote_repo "$reponame"

  clone_repo "$reponame" "$reponame"

  git lfs track "*.dat"

  contents1="a"
  contents1_oid=$(calc_oid "$contents1")
  contents2="b"
  contents2_oid=$(calc_oid "$contents2")
  contents3="c"
  contents3_oid=$(calc_oid "$contents3")

  mkdir in out
  printf "%s" "$contents1" > a.dat
  printf "%s" "$contents2" > in/b.dat
  printf "%s" "$contents3" > out/c.dat
  git add .
  git commit -m "add files"

  git push origin main

  assert_server_object "$reponame" "$contents1_oid"
  assert_server_object "$reponame" "$contents2_oid"
  assert_server_object "$reponame" "$contents3_oid"

  # Create a partial clone with a cone-mode sparse checkout of one directory
  # and a sparse index, which is important because otherwise the "git ls-files"
  # command ignores the --sparse option and lists all Git LFS files.
  cd ..
  git clone --filter=tree:0 --depth=1 --no-checkout \
    "$GITSERVER/$reponame" "${reponame}-partial"

  cd "${reponame}-partial"
  git sparse-checkout init --cone --sparse-index
  git sparse-checkout set in
  git checkout main

  [ -d "in" ]
  [ ! -e "out" ]

  assert_local_object "$contents1_oid" 1
  assert_local_object "$contents2_oid" 1
  refute_local_object "$contents3_oid"

  git lfs pull 2>&1 | tee pull.log
  grep -q "Downloading LFS objects" pull.log && exit 1

  # Git LFS objects associated with files outside of the sparse cone
  # should not have been pulled.
  refute_local_object "$contents3_oid"
)
end_test
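
The effect of the sparse index that the test's comment relies on can also be observed outside the test suite. The following is only a sketch under stated assumptions: the function name `demo_sparse_listing` and the file names are hypothetical, a local repository is used instead of a partial clone, and Git 2.35 or later is assumed so that `git ls-files --sparse` is available.

```shell
# Hypothetical demo: build a throwaway repository with a cone-mode sparse
# checkout and a sparse index, then print both forms of the file listing.
# The temporary directory is left behind for inspection.
demo_sparse_listing() (
  set -e
  dir=$(mktemp -d)
  cd "$dir"
  git init -q .
  mkdir in out
  echo a > a.txt
  echo b > in/b.txt
  echo c > out/c.txt
  git add .
  git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add files"

  # Keep only the "in" directory (plus top-level files) in the working tree.
  git sparse-checkout init --cone --sparse-index
  git sparse-checkout set in

  echo "expanded:"
  git ls-files
  echo "sparse:"
  git ls-files --sparse
)
```

With a sparse index in place, the second listing should collapse the `out` directory into a single entry rather than naming `out/c.txt`, which is why the listing no longer reports Git LFS files outside the sparse cone.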

@bk2204 bk2204 force-pushed the sparse-ls-files branch 2 times, most recently from 6fa7936 to ec98486 November 18, 2024 16:42

chrisd8088 commented Nov 18, 2024

Just FYI, I've made some adjustments to my proposed test above, which I wrote quite late last night. I've put those into the comment above.

I'm also working on a version for git lfs checkout, but it's a bit different since it doesn't fetch Git LFS objects.

bk2204 commented Nov 18, 2024

I've squashed that test into the PR with a co-author credit and added a test for git lfs checkout as well (which also fails without these changes).

When we invoke `git ls-files` to try to find all LFS files, we don't
honour sparse file paths or exclusions.  While we should never actually
traverse excluded files, using the `--exclude-standard` option can
avoid loading some data with filtered clones, which may result in less
data being downloaded.

In addition, we can honour sparse checkouts, since this code path is
only used to handle the working tree and we know that the only files we
need to consider are those Git actually put in the working tree.  The
`--sparse` option is new in 2.35, but we already require 2.42 above, so
we can use it unconditionally.

Co-authored-by: Chris Darroch <[email protected]>

@chrisd8088 chrisd8088 left a comment

Thanks very much for collaborating on some tests for this PR!

I'm going to take the slightly unorthodox step of approving this PR without a second review from another @git-lfs/core team member, in the interest of trying to get to a v3.6.0 release PR quite soon after this is merged.

I think that's OK because (a) I only contributed to the additional tests in this PR, not the change to the client's code, (b) the tests pass CI, (c) we've both looked at the tests and made revisions, and (d) I've checked that each of the relevant assertions in the tests fails as expected if the client code change is removed, not just the first such assertion in each test.

@chrisd8088 chrisd8088 merged commit f124993 into git-lfs:main Nov 18, 2024
10 checks passed

manoraj commented Nov 19, 2024

in the interest of trying to get to a v3.6.0 release PR quite soon after this is merged

Hi @chrisd8088, may I know when v3.6.0 is going to be released?

@chrisd8088

May I know when v3.6.0 is going to be released?

@manoraj — You're in luck! As it so happens, we're planning to release v3.6.0 later this week, assuming there are no unexpected issues. See PR #5916 for details.

chrisd8088 added a commit that referenced this pull request Oct 16, 2025
In commit 5aa7be5 of PR #5796 we added tests of the sparse checkout
support provided by our "git lfs checkout" and "git lfs pull" commands,
which makes use of the "git ls-files" command and the --sparse option
that was introduced for that command in Git v2.35.0.

In practice, the "git lfs checkout" and "git lfs pull" commands require
Git v2.42.0 or higher to be available before they invoke "git ls-files",
and otherwise fall back to using the "git ls-tree" command.  We require
at least Git v2.42.0 because that version introduced support for the
"objecttype" field name in the "git ls-files" command's --format option
and we depend on that field to be able to mimic the output format of
the "git ls-tree" command with the "git ls-files" command.  We noted
these details in commit beae114 of PR #5699, when we revised the
runScanLFSFiles() function in our "lfs" package to choose between the
use of "git ls-files" and "git ls-tree".
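
The version gate described above can be sketched in shell as follows. This is an illustration only, not the Go code in the "lfs" package: the function names `version_at_least` and `choose_lister` are hypothetical, and the comparison assumes dotted versions with at least three numeric components.

```shell
# Hypothetical helper: succeed if dotted version $1 is >= version $2.
version_at_least() {
  # Sort the two versions numerically by field; the smaller one comes first.
  lowest=$(printf '%s\n%s\n' "$1" "$2" |
    sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
  [ "$lowest" = "$2" ]
}

# Name the Git command that would be used for the given Git version:
# "ls-files" from v2.42.0 onwards, otherwise the "ls-tree" fallback.
choose_lister() {
  if version_at_least "$1" "2.42.0"; then
    echo "ls-files"
  else
    echo "ls-tree"
  fi
}
```

For the installed Git, the function could be called as `choose_lister "$(git version | awk '{print $3}')"`.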

One difference between the "git ls-files" and "git ls-tree" commands,
however, is that the former lists the files in the Git index (since we
always pass the --cached option) while the latter lists the files in
the Git tree associated with a given reference, which in the case of our
"git lfs checkout" and "git lfs pull" commands is always the current
"HEAD" symbolic reference.
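
The difference between listing the index and listing the tree can be reproduced with a small experiment. The sketch below is an assumption-laden illustration: the function name `demo_index_vs_tree` and the file names are hypothetical, and a path is dropped from the index by hand rather than by a sparse checkout.

```shell
# Hypothetical demo: a path removed from the index disappears from
# "git ls-files --cached" but is still listed by "git ls-tree" for HEAD.
# The temporary directory is left behind for inspection.
demo_index_vs_tree() (
  set -e
  dir=$(mktemp -d)
  cd "$dir"
  git init -q .
  echo a > a.txt
  echo b > b.txt
  git add .
  git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add files"

  # Drop b.txt from the index only; the HEAD commit still records it.
  git rm -q --cached b.txt

  echo "index:"
  git ls-files --cached
  echo "tree:"
  git ls-tree -r --name-only HEAD
)
```

Calling `demo_index_vs_tree` lists only `a.txt` under "index:" and both `a.txt` and `b.txt` under "tree:".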

As a consequence, as discussed in issue #6004, if certain files are absent
from the current working tree and Git index as the result of a partial
clone or sparse checkout, the behaviour of the "git lfs checkout" and
"git lfs pull" commands varies depending on the installed version of Git.

If Git v2.42.0 or higher is installed, the "git lfs checkout" and
"git lfs pull" commands invoke the "git ls-files" command and provide
an "attr:filter=lfs" pathspec so the Git command will filter out files
which do not match a Git LFS filter attribute.  However, in order to
be reported, Git LFS pointer files must exist in the Git index; if
they only appear in the working tree or the Git tree associated with
the "HEAD" reference, they will be ignored.
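
To make the pathspec filtering concrete, here is a minimal sketch. The function name `demo_attr_pathspec` and the file names are hypothetical, and the example assumes no Git LFS filter is configured as required in the user's Git configuration; attribute matching itself does not need the filter to be installed.

```shell
# Hypothetical demo: only paths in the index that match the "filter=lfs"
# attribute are reported when the attr pathspec is supplied.
# The temporary directory is left behind for inspection.
demo_attr_pathspec() (
  set -e
  dir=$(mktemp -d)
  cd "$dir"
  git init -q .
  # Mark *.dat files with the filter attribute in the working tree.
  echo '*.dat filter=lfs' > .gitattributes
  echo data > a.dat
  echo text > a.txt
  git add .
  git ls-files --cached ':(attr:filter=lfs)'
)
```

Calling `demo_attr_pathspec` prints `a.dat` only; neither `a.txt` nor `.gitattributes` matches the attribute.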

(Note that in a non-bare repository, the "git ls-files" command will only
match the "attr:filter=lfs" pathspec against attributes defined in
".gitattributes" files in the index or working tree, plus any local files
such as the "$GIT_DIR/info/attributes" file.  Any ".gitattributes" files
that are present only in the Git tree associated with the "HEAD" reference
will not be consulted.  In a bare repository, meanwhile, the "git ls-files"
command will by default not match the pathspec against attributes defined
in ".gitattributes" files at all, regardless of whether such files exist
in the index or in the tree referenced by "HEAD".)

If a version of Git older than v2.42.0 is installed and so the
"git ls-tree" command is invoked instead of the "git ls-files" command,
then a full list of the files in the tree-ish referenced by "HEAD" is
returned.  The "git lfs checkout" and "git lfs pull" commands will then
attempt to check out the Git LFS objects associated with all the Git LFS
pointer files found in this list.  In the case of the "git lfs pull"
command, it will also try to fetch those objects if they are not already
present in the local storage directories.

(Note, though, that when the "git lfs checkout" and "git lfs pull" commands
retrieve a list of files using the "git ls-tree" command, they do not
check whether the pointer files they find in that list actually match
Git LFS filter attributes in any ".gitattributes" or other Git attributes
files.  So a user may remove all the ".gitattributes" files from their
working tree and index, commit those changes to "HEAD", and the Git LFS
commands will still attempt to check out objects for any files found in
the "HEAD" commit's tree that can be parsed as valid pointers.  When the
"git ls-files" command is used instead of the "git ls-tree" command to
retrieve a file list, this legacy behaviour does not occur, because the
"attr:filter=lfs" pathspec requires that the "git ls-files" command
only return a list of files which match at least one Git LFS filter
attribute.)
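
For readers unfamiliar with what "can be parsed as valid pointers" means, a simplified check might look like the sketch below. The function name `looks_like_lfs_pointer` is hypothetical and the test is deliberately looser than the real Git LFS parser, which applies further rules; it only looks for the version line, a SHA-256 "oid" line and a "size" line.

```shell
# Hypothetical, simplified check: succeed if the file named by $1 looks
# like a Git LFS pointer (version line first, then oid and size lines).
looks_like_lfs_pointer() {
  [ "$(sed -n '1p' "$1")" = "version https://git-lfs.github.com/spec/v1" ] &&
    grep -Eq '^oid sha256:[0-9a-f]{64}$' "$1" &&
    grep -Eq '^size [0-9]+$' "$1"
}
```

A check of this kind depends only on file contents, which is why the "git ls-tree" code path can pick up pointer files even when no Git attributes mark them as Git LFS files.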

In subsequent commits we will alter how the "git lfs checkout" and
"git lfs pull" commands operate within bare repositories and how they
handle file paths, including by changing the current working directory
to the root of the current working tree, if one is present.  Of necessity,
our tests and documentation will also be expanded to reflect the variable
behaviour of the "git lfs pull" command in particular, since its effects
in a bare repository depend in part on the installed version of Git.

Before we make these changes, we first revise our existing tests of
the "git lfs checkout" and "git lfs pull" commands with partial clones
and sparse checkouts so that the tests confirm the key differences in
behaviour when the installed version of Git is v2.42.0 or higher.
Our tests now demonstrate that with an older version of Git, objects
will be fetched (in the case of the "git lfs pull" command) and checked
out for all Git LFS files, including those outside the configured
sparse cone.

We also update the manual pages for these commands to include an
explanation of how their operation varies depending on the installed
version of Git, how this may affect repositories with partial clones
and sparse checkouts, and the options available to users if they find
the "git lfs checkout" and "git lfs pull" commands appear to be ignoring
certain files.

As well, we edit the initial section in our git-lfs-pull(1) manual page
where we incorrectly state that the command is always equivalent
to running "git lfs fetch" followed by "git lfs checkout", and fix the
formatting of the example commands provided in this section.

When we converted our manual page source files from the Ronn format to
AsciiDoc in commit 0c66dcf of PR #5054, the two example commands in this
section were accidentally merged onto a single line, and the "<remote>"
option for the "git lfs fetch" command was elided.

We therefore restore the original version of these two example commands
and add leading shell prompt indicators to further clarify that the
example includes two separate commands.
hswong3i pushed a commit to alvistack/git-lfs-git-lfs that referenced this pull request Oct 18, 2025