-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Optimize performance for scanning trees in partial clones #5699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2e66b48 to
fca6709
Compare
In two of our test repositories in this file, we never commit the `.gitattributes` file and instead rely on the fact that `git lfs clone` sometimes processes all files in the working tree to determine which are pointer files. This is inefficient and we don't want to rely on this behaviour, especially since it differs from that of `git clone`, so fix our tests so that we explicitly commit the `.gitattributes` file.
Right now, if the user is using partial clone, our call to `git ls-tree` against HEAD is expensive because `git ls-tree` needs to download each blob, which it does incrementally instead of all at once. If we're scanning the tree from HEAD, then we can avoid the expense of doing this by running `git ls-files` with a pattern that matches only LFS files, which makes the operation much cheaper, since we avoid needing to download blobs for many of those objects. We can format the data such that it matches the pattern we expect for `git ls-tree` so that we can avoid modifying most of the calls and continue to let things function in the same way. Do so, but limit our changes to Git 2.42.0 and newer, since the `objecttype` argument is new in that version.
fca6709 to
beae114
Compare
chrisd8088
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, thank you!
In commit cdec9f2 of PR git-lfs#5699 we updated two tests in our t/t-clone.sh test suite to ensure they add and commit the ".gitattributes" files created by "git lfs track" before proceeding to generate commit data with Git LFS objects. Per the commit description, this change was made so as to be explicit about the order of expected Git operations, and to avoid relying on the "git lfs clone" command to search for Git LFS pointer files in the current Git working tree. However, because the "cloneSSL" and "clone ClientCert" tests do not actually run at the moment, due to the bug described in git-lfs#5658. An improper check of the TRAVIS variable causes the tests to exit early and always declare success. Hence, the need to make corresponding changes to ensure these tests also explicitly add and commit their ".gitattributes" files were not made at the time of commit cdec9f2. We expect to enable these tests in a subsequent commit in this PR, at which time they will fail unless we also adjust them to call "git add" and "git commit" on their ".gitattributes" files, so we do that now. We also take the opportunity to make the matching changes to the "clone ClientCert with homedir certs" and "clone with flags" tests, although these tests do run fully and successfully now. While they do not depend on the ".gitattributes" file having been committed to their respective test repositories, aligning their code with that of the other tests in the same script may help avoid problems in the future.
In commit cdec9f2 of PR git-lfs#5699 we updated two tests in our t/t-clone.sh test suite to ensure they add and commit the ".gitattributes" files created by "git lfs track" before proceeding to generate commit data with Git LFS objects. Per the commit description, this change was made so as to be explicit about the order of expected Git operations, and to avoid relying on the "git lfs clone" command to search for Git LFS pointer files in the current Git working tree. However, because the "cloneSSL" and "clone ClientCert" tests do not actually run at the moment, due to the bug described in git-lfs#5658. An improper check of the TRAVIS variable causes the tests to exit early and always declare success. Hence, the need to make corresponding changes to ensure these tests also explicitly add and commit their ".gitattributes" files were not made at the time of commit cdec9f2. We expect to enable these tests in a subsequent commit in this PR, at which time they will fail unless we also adjust them to call "git add" and "git commit" on their ".gitattributes" files, so we do that now. We also take the opportunity to make the matching changes to the "clone ClientCert with homedir certs" and "clone with flags" tests, although these tests do run fully and successfully now. While they do not depend on the ".gitattributes" file having been committed to their respective test repositories, aligning their code with that of the other tests in the same script may help avoid problems in the future.
In commit cdec9f2 of PR git-lfs#5699 we updated two tests in our t/t-clone.sh test suite to ensure they add and commit the ".gitattributes" files created by "git lfs track" before proceeding to generate commit data with Git LFS objects. Per the commit description, this change was made so as to be explicit about the order of expected Git operations, and to avoid relying on the "git lfs clone" command to search for Git LFS pointer files in the current Git working tree. However, because the "cloneSSL" and "clone ClientCert" tests do not actually run at the moment, due to the bug described in git-lfs#5658. An improper check of the TRAVIS variable causes the tests to exit early and always declare success. Hence, the need to make corresponding changes to ensure these tests also explicitly add and commit their ".gitattributes" files were not made at the time of commit cdec9f2. We expect to enable these tests in a subsequent commit in this PR, at which time they will fail unless we also adjust them to call "git add" and "git commit" on their ".gitattributes" files, so we do that now. We also take the opportunity to make the matching changes to the "clone ClientCert with homedir certs" and "clone with flags" tests, although these tests do run fully and successfully now. While they do not depend on the ".gitattributes" file having been committed to their respective test repositories, aligning their code with that of the other tests in the same script may help avoid problems in the future.
In commit cdec9f2 of PR git-lfs#5699 we updated two tests in our t/t-clone.sh test script to ensure they add and commit the ".gitattributes" files created by "git lfs track" before proceeding to generate commit data with Git LFS objects. Per the commit description, this change was made so as to be explicit about the order of expected Git operations, and to avoid relying on the "git lfs clone" command to search for Git LFS pointer files in the current Git working tree. However, two other tests in the same test script have not run for a long time, due to the bug described in git-lfs#5658, where an improper check of the TRAVIS variable causes the tests to exit early and always declare success. Hence, the need to make corresponding changes to ensure these tests also explicitly add and commit their ".gitattributes" files was not clear at the time when commit cdec9f2 was written. We will enable these skipped tests in a subsequent commit in this PR, at which time they will fail unless we also adjust them to call "git add" and "git commit" on their ".gitattributes" files, so we do that now. We also take the opportunity to make matching changes to a third pair of tests, the "clone ClientCert with homedir certs" and "clone with flags" tests, although these do run fully and successfully now. While they do not depend on the ".gitattributes" file being committed to their respective test repositories, aligning their code with that of the other tests in the same test script may help avoid problems in the future.
In commit cdec9f2 of PR git-lfs#5699 we updated two tests in our t/t-clone.sh test script to ensure they add and commit the ".gitattributes" files created by "git lfs track" before proceeding to generate commit data with Git LFS objects. Per the commit description, this change was made so as to be explicit about the order of expected Git operations, and to avoid relying on the "git lfs clone" command to search for Git LFS pointer files in the current Git working tree. However, two other tests in the same test script have not run for a long time, due to the bug described in git-lfs#5658, where an improper check of the TRAVIS variable causes the tests to exit early and always declare success. Hence, the need to make corresponding changes to ensure these tests also explicitly add and commit their ".gitattributes" files was not clear at the time when commit cdec9f2 was written. We will enable these skipped tests in a subsequent commit in this PR, at which time they will fail unless we also adjust them to call "git add" and "git commit" on their ".gitattributes" files, so we do that now. We also take the opportunity to make matching changes to a third pair of tests, the "clone ClientCert with homedir certs" and "clone with flags" tests, although these do run fully and successfully now. While they do not depend on the ".gitattributes" file being committed to their respective test repositories, aligning their code with that of the other tests in the same test script may help avoid problems in the future.
In commit cdec9f2 of PR git-lfs#5699 we updated two tests in our t/t-clone.sh test script to ensure they add and commit the ".gitattributes" files created by "git lfs track" before proceeding to generate commit data with Git LFS objects. Per the commit description, this change was made so as to be explicit about the order of expected Git operations, and to avoid relying on the "git lfs clone" command to search for Git LFS pointer files in the current Git working tree. However, two other tests in the same test script have not run for a long time, due to the bug described in git-lfs#5658, where an improper check of the TRAVIS variable causes the tests to exit early and always declare success. Hence, the need to make corresponding changes to ensure these tests also explicitly add and commit their ".gitattributes" files was not clear at the time when commit cdec9f2 was written. We will enable these skipped tests in a subsequent commit in this PR, at which time they will fail unless we also adjust them to call "git add" and "git commit" on their ".gitattributes" files, so we do that now. We also take the opportunity to make matching changes to a third pair of tests, the "clone ClientCert with homedir certs" and "clone with flags" tests, although these do run fully and successfully now. While they do not depend on the ".gitattributes" file being committed to their respective test repositories, aligning their code with that of the other tests in the same test script may help avoid problems in the future.
|
@bk2204 Note that this change appears to have broken some CI jobs that relied on the previous behavior of scanning all files in the tree instead of only the ones that match Suppose you had a repository with these files: Let's say Let's also assume that the repository was cloned on a case-sensitive filesystem, like ext4. As a result, the Git config In With this change, in The solution, of course, is to update the Of course if I'd agree that the current method of being explicit in |
In commit 5aa7be5 of PR #5796 we added tests of the sparse checkout support provided by our "git lfs checkout" and "git lfs pull" commands, which makes use of the "git ls-files" command and the --sparse option that was introduced for that command in Git v2.35.0. In practice, the "git lfs checkout" and "git lfs pull" commands require Git v2.42.0 or higher to be available before they invoke "git ls-files", and otherwise fall back to using the "git ls-tree" command. We require at least Git v2.42.0 because that version introduced support for the "objecttype" field name in the "git ls-files" command's --format option and we depend on that field to be able to mimic the output format of the "git ls-tree" command with the "git ls-files" command. We noted these details in commit beae114 of PR #5699, when we revised the runScanLFSFiles() function in our "lfs" package to choose between the use of "git ls-files" and "git ls-tree". One difference between the "git ls-files" and "git ls-tree" commands, however, is that the former lists the files in the Git index (since we always pass the --cached option) while the latter lists the files in the Git tree associated with a given reference, which in the case of our "git lfs checkout" and "git lfs pull" commands is always the current "HEAD" symbolic reference. As a consequence, as discussed in issue #6004, if certain files are absent from the current working tree and Git index as the result of a partial clone or sparse checkout, the behaviour of the "git lfs checkout" and "git lfs pull" commands varies depending on the installed version of Git. If Git v2.42.0 or higher is installed, the "git lfs checkout" and "git lfs pull" commands invoke the "git ls-files" command and provide an "attr:filter=lfs" pathspec so the Git command will filter out files which do not match a Git LFS filter attribute. However, in order to be reported, Git LFS pointer files must exist in the Git index; if they only appear in the working tree or the Git tree associated with the "HEAD" reference, they will be ignored. (Note that in a non-bare repository, the "git ls-files" command will only match the "attr:filter=lfs" pathspec against attributes defined in ".gitattributes" files in the index or working tree, plus any local files such as the "$GIT_DIR/info/attributes" file. Any ".gitattributes" files that are present only in the Git tree associated with the "HEAD" reference will not be consulted. In a bare repository, meanwhile, the "git ls-files" command will by default not match the pathspec against attributes defined in ".gitattributes" files at all, regardless of whether such files exist in the index or in the tree referenced by "HEAD".) If a version of Git older than v2.42.0 is installed and so the "git ls-tree" command is invoked instead of the "git ls-files" command, then a full list of the files in the tree-ish referenced by "HEAD" is returned. The "git lfs checkout" and "git lfs pull" commands will then attempt to check out the Git LFS objects associated with all the Git LFS pointer files found in this list. In the case of the "git lfs pull" command, it will also try to fetch those objects if they are not already present in the local storage directories. (Note, though, that when the "git lfs checkout" and "git lfs pull" commands retrieve a list of files using the "git ls-tree" command, they do not check whether the pointer files they find in that list actually match Git LFS filter attributes in any ".gitattributes" or other Git attributes files. So a user may remove all the ".gitattributes" files from their working tree and index, commit those changes to "HEAD", and the Git LFS commands will still attempt to check out objects for any files found in the "HEAD" commit's tree that can be parsed as valid pointers. When the "git ls-files" command is used instead of the "git ls-tree" command to retrieve a file list, this legacy behaviour does not occur, because the "attr:filter=lfs" pathspec requires that the "git ls-files" command only return a list of files which match at least one Git LFS filter attribute.) In subsequent commits we will alter how the "git lfs checkout" and "git lfs pull" commands operate within bare repositories and how they handle file paths, including by changing the current working directory to the root of the current working tree, if one is present. Of necessity, our tests and documentation will also be expanded to reflect the variable behaviour of the "git lfs pull" command in particular, since its effects in a bare repository depend in part on the installed version of Git. Before we make these changes, we first revise our existing tests of the "git lfs checkout" and "git lfs pull" commands with partial clones and sparse checkouts so that the tests confirm the key differences in behaviour when the installed version of Git is v2.42.0 or higher. Our tests now demonstrate that with an older version of Git, objects will be fetched (in the case of the "git lfs pull" command) and checked out for all Git LFS files, including those outside the configured sparse cone. We also update the manual pages for these commands to include an explanation of how their operation varies depending on the installed version of Git, how this may affect repositories with partial clones and sparse checkouts, and the options available to users if they find the "git lfs checkout" and "git lfs pull" commands appear to be ignoring certain files. As well, we edit the initial section in our git-lfs-pull(1) manual page where we incorrectly state that the command is always equivalent to running "git lfs fetch" followed by "git lfs checkout", and fix the formatting of the example commands provided in this section. When we converted our manual page source files from the Ronn format to AsciiDoc in commit 0c66dcf of PR #5054, the two example commands in this section were accidentally merged onto a single line, and the "<remote>" option for the "git lfs fetch" command was elided. We therefore restore the original version of these two example commands and add leading shell prompt indicators to further clarify that the example includes two separate commands.
In commit 5aa7be5 of PR #5796 we added tests of the sparse checkout support provided by our "git lfs checkout" and "git lfs pull" commands, which makes use of the "git ls-files" command and the --sparse option that was introduced for that command in Git v2.35.0. In practice, the "git lfs checkout" and "git lfs pull" commands require Git v2.42.0 or higher to be available before they invoke "git ls-files", and otherwise fall back to using the "git ls-tree" command. We require at least Git v2.42.0 because that version introduced support for the "objecttype" field name in the "git ls-files" command's --format option and we depend on that field to be able to mimic the output format of the "git ls-tree" command with the "git ls-files" command. We noted these details in commit beae114 of PR #5699, when we revised the runScanLFSFiles() function in our "lfs" package to choose between the use of "git ls-files" and "git ls-tree". One difference between the "git ls-files" and "git ls-tree" commands, however, is that the former lists the files in the Git index (since we always pass the --cached option) while the latter lists the files in the Git tree associated with a given reference, which in the case of our "git lfs checkout" and "git lfs pull" commands is always the current "HEAD" symbolic reference. As a consequence, as discussed in issue #6004, if certain files are absent from the current working tree and Git index as the result of a partial clone or sparse checkout, the behaviour of the "git lfs checkout" and "git lfs pull" commands varies depending on the installed version of Git. If Git v2.42.0 or higher is installed, the "git lfs checkout" and "git lfs pull" commands invoke the "git ls-files" command and provide an "attr:filter=lfs" pathspec so the Git command will filter out files which do not match a Git LFS filter attribute. However, in order to be reported, Git LFS pointer files must exist in the Git index; if they only appear in the working tree or the Git tree associated with the "HEAD" reference, they will be ignored. (Note that in a non-bare repository, the "git ls-files" command will only match the "attr:filter=lfs" pathspec against attributes defined in ".gitattributes" files in the index or working tree, plus any local files such as the "$GIT_DIR/info/attributes" file. Any ".gitattributes" files that are present only in the Git tree associated with the "HEAD" reference will not be consulted. In a bare repository, meanwhile, the "git ls-files" command will by default not match the pathspec against attributes defined in ".gitattributes" files at all, regardless of whether such files exist in the index or in the tree referenced by "HEAD".) If a version of Git older than v2.42.0 is installed and so the "git ls-tree" command is invoked instead of the "git ls-files" command, then a full list of the files in the tree-ish referenced by "HEAD" is returned. The "git lfs checkout" and "git lfs pull" commands will then attempt to check out the Git LFS objects associated with all the Git LFS pointer files found in this list. In the case of the "git lfs pull" command, it will also try to fetch those objects if they are not already present in the local storage directories. (Note, though, that when the "git lfs checkout" and "git lfs pull" commands retrieve a list of files using the "git ls-tree" command, they do not check whether the pointer files they find in that list actually match Git LFS filter attributes in any ".gitattributes" or other Git attributes files. So a user may remove all the ".gitattributes" files from their working tree and index, commit those changes to "HEAD", and the Git LFS commands will still attempt to check out objects for any files found in the "HEAD" commit's tree that can be parsed as valid pointers. When the "git ls-files" command is used instead of the "git ls-tree" command to retrieve a file list, this legacy behaviour does not occur, because the "attr:filter=lfs" pathspec requires that the "git ls-files" command only return a list of files which match at least one Git LFS filter attribute.) In subsequent commits we will alter how the "git lfs checkout" and "git lfs pull" commands operate within bare repositories and how they handle file paths, including by changing the current working directory to the root of the current working tree, if one is present. Of necessity, our tests and documentation will also be expanded to reflect the variable behaviour of the "git lfs pull" command in particular, since its effects in a bare repository depend in part on the installed version of Git. Before we make these changes, we first revise our existing tests of the "git lfs checkout" and "git lfs pull" commands with partial clones and sparse checkouts so that the tests confirm the key differences in behaviour when the installed version of Git is v2.42.0 or higher. Our tests now demonstrate that with an older version of Git, objects will be fetched (in the case of the "git lfs pull" command) and checked out for all Git LFS files, including those outside the configured sparse cone. We also update the manual pages for these commands to include an explanation of how their operation varies depending on the installed version of Git, how this may affect repositories with partial clones and sparse checkouts, and the options available to users if they find the "git lfs checkout" and "git lfs pull" commands appear to be ignoring certain files. As well, we edit the initial section in our git-lfs-pull(1) manual page where we incorrectly state that the command is always equivalent to running "git lfs fetch" followed by "git lfs checkout", and fix the formatting of the example commands provided in this section. When we converted our manual page source files from the Ronn format to AsciiDoc in commit 0c66dcf of PR #5054, the two example commands in this section were accidentally merged onto a single line, and the "<remote>" option for the "git lfs fetch" command was elided. We therefore restore the original version of these two example commands and add leading shell prompt indicators to further clarify that the example includes two separate commands.
In commit 5aa7be5 of PR git-lfs#5796 we added tests of the sparse checkout support provided by our "git lfs checkout" and "git lfs pull" commands, which makes use of the "git ls-files" command and the --sparse option that was introduced for that command in Git v2.35.0. In practice, the "git lfs checkout" and "git lfs pull" commands require Git v2.42.0 or higher to be available before they invoke "git ls-files", and otherwise fall back to using the "git ls-tree" command. We require at least Git v2.42.0 because that version introduced support for the "objecttype" field name in the "git ls-files" command's --format option and we depend on that field to be able to mimic the output format of the "git ls-tree" command with the "git ls-files" command. We noted these details in commit beae114 of PR git-lfs#5699, when we revised the runScanLFSFiles() function in our "lfs" package to choose between the use of "git ls-files" and "git ls-tree". One difference between the "git ls-files" and "git ls-tree" commands, however, is that the former lists the files in the Git index (since we always pass the --cached option) while the latter lists the files in the Git tree associated with a given reference, which in the case of our "git lfs checkout" and "git lfs pull" commands is always the current "HEAD" symbolic reference. As a consequence, as discussed in issue git-lfs#6004, if certain files are absent from the current working tree and Git index as the result of a partial clone or sparse checkout, the behaviour of the "git lfs checkout" and "git lfs pull" commands varies depending on the installed version of Git. If Git v2.42.0 or higher is installed, the "git lfs checkout" and "git lfs pull" commands invoke the "git ls-files" command and provide an "attr:filter=lfs" pathspec so the Git command will filter out files which do not match a Git LFS filter attribute. However, in order to be reported, Git LFS pointer files must exist in the Git index; if they only appear in the working tree or the Git tree associated with the "HEAD" reference, they will be ignored. (Note that in a non-bare repository, the "git ls-files" command will only match the "attr:filter=lfs" pathspec against attributes defined in ".gitattributes" files in the index or working tree, plus any local files such as the "$GIT_DIR/info/attributes" file. Any ".gitattributes" files that are present only in the Git tree associated with the "HEAD" reference will not be consulted. In a bare repository, meanwhile, the "git ls-files" command will by default not match the pathspec against attributes defined in ".gitattributes" files at all, regardless of whether such files exist in the index or in the tree referenced by "HEAD".) If a version of Git older than v2.42.0 is installed and so the "git ls-tree" command is invoked instead of the "git ls-files" command, then a full list of the files in the tree-ish referenced by "HEAD" is returned. The "git lfs checkout" and "git lfs pull" commands will then attempt to check out the Git LFS objects associated with all the Git LFS pointer files found in this list. In the case of the "git lfs pull" command, it will also try to fetch those objects if they are not already present in the local storage directories. (Note, though, that when the "git lfs checkout" and "git lfs pull" commands retrieve a list of files using the "git ls-tree" command, they do not check whether the pointer files they find in that list actually match Git LFS filter attributes in any ".gitattributes" or other Git attributes files. So a user may remove all the ".gitattributes" files from their working tree and index, commit those changes to "HEAD", and the Git LFS commands will still attempt to check out objects for any files found in the "HEAD" commit's tree that can be parsed as valid pointers. When the "git ls-files" command is used instead of the "git ls-tree" command to retrieve a file list, this legacy behaviour does not occur, because the "attr:filter=lfs" pathspec requires that the "git ls-files" command only return a list of files which match at least one Git LFS filter attribute.) In subsequent commits we will alter how the "git lfs checkout" and "git lfs pull" commands operate within bare repositories and how they handle file paths, including by changing the current working directory to the root of the current working tree, if one is present. Of necessity, our tests and documentation will also be expanded to reflect the variable behaviour of the "git lfs pull" command in particular, since its effects in a bare repository depend in part on the installed version of Git. Before we make these changes, we first revise our existing tests of the "git lfs checkout" and "git lfs pull" commands with partial clones and sparse checkouts so that the tests confirm the key differences in behaviour when the installed version of Git is v2.42.0 or higher. Our tests now demonstrate that with an older version of Git, objects will be fetched (in the case of the "git lfs pull" command) and checked out for all Git LFS files, including those outside the configured sparse cone. We also update the manual pages for these commands to include an explanation of how their operation varies depending on the installed version of Git, how this may affect repositories with partial clones and sparse checkouts, and the options available to users if they find the "git lfs checkout" and "git lfs pull" commands appear to be ignoring certain files. As well, we edit the initial section in our git-lfs-pull(1) manual page where we incorrectly state that the command is always equivalent to running "git lfs fetch" followed by "git lfs checkout", and fix the formatting of the example commands provided in this section. When we converted our manual page source files from the Ronn format to AsciiDoc in commit 0c66dcf of PR git-lfs#5054, the two example commands in this section were accidentally merged onto a single line, and the "<remote>" option for the "git lfs fetch" command was elided. We therefore restore the original version of these two example commands and add leading shell prompt indicators to further clarify that the example includes two separate commands.
Right now, if the user is using partial clone, our call to
git ls-treeagainst HEAD is expensive becausegit ls-treeneeds to download each blob, which it does incrementally instead of all at once.If we're scanning the tree from HEAD, then we can avoid the expense of doing this by running
git ls-fileswith a pattern that matches only LFS files, which makes the operation much cheaper, since we avoid needing to download blobs for many of those objects. We can format the data such that it matches the pattern we expect forgit ls-treeso that we can avoid modifying most of the calls and continue to let things function in the same way. Do so, but limit our changes to Git 2.42.0 and newer, since theobjecttypeargument is new in that version.In addition, let's fix some tests which rely on an invalid assumption about how we discover and process LFS files.