Scan lockable files #1953
Merged
ttaylorr
approved these changes
Feb 17, 2017
Contributor
ttaylorr
left a comment
I think this looks good for a first pass of this change. I think there are a few areas that we could look at going forward:
- The set "Contains" operation, and how that performs with large trees/push ranges, or lots of locks.
- The callback interface, and whether or not LFS/non-LFS files should share a callback or not.
For now, 👍 .
chrisd8088
added a commit
to chrisd8088/git-lfs
that referenced
this pull request
Jun 6, 2023
In commit 781c7f5 of PR git-lfs#1953 the GitScannerFoundPointer "callback" element of the GitScanner structure was exported from the "lfs" package by renaming it to "FoundPointer". However, this element has never since been utilized outside of the package, so we can simplify the structure's set of exported elements by renaming it to "foundPointer".
chrisd8088
added a commit
to chrisd8088/git-lfs
that referenced
this pull request
Jun 6, 2023
The "git lfs push" and "git lfs pre-push" commands utilize the uploadForRefUpdates() function of the "commands" package, which calls the (*uploadContext).buildGitScanner() method to initialize a GitScanner structure with the additional elements necessary to exclude from the results any objects which are reachable from refs known to exist on the given remote. Two of these elements, the FoundLockable callback and the PotentialLockables set of pathspecs locked by other users (as used to report errors or warnings when attempting to push them, specifically when they are not Git LFS objects, for which lock verification is handled separately), were added in commit 9b8bed7 of PR git-lfs#1953, and are initialized following a call to the RemoteForPush() method of the GitScanner structure, which sets two other internal elements of the structure named "remote" and "skippedRefs". As a prelude to converting the GitScanner structure to an interface and then creating a separate UploadGitScanner interface dedicated to the context of scanning for objects to push to a remote, we first replace the RemoteForPush() method with a temporary NewGitScannerForUpload() function, which allows us to accept all the necessary arguments in a single function.
chrisd8088
added a commit
to chrisd8088/git-lfs
that referenced
this pull request
Jun 7, 2023
In commit 08c5ae6 of PR git-lfs#1953 the scanRefsToChan() function of the "lfs" package was updated to read the pathspecs of locked files (other than Git LFS objects) from channels created by the catFileBatchCheck() and catFileBatch() functions, and pass each to the GitScannerFoundLockable callback provided in the GitScanner structure. In the case where a callback function is not defined, a no-op callback is used in its place. As a consequence of the change we have now made in a previous commit in this PR, to create a separate UploadGitScanner interface for use when scanning for Git LFS objects to push to a remote, we can now more easily determine when the catFileBatchCheck() and catFileBatch() functions will report any lock paths to the channels, as they will only do so if we pass a non-nil lockableSet argument. (This is because they call that argument's Check() method before writing to the channels, and that method will return false if it has a nil receiver argument.) We can therefore revise the scanRefsToChan() function so it only creates an anonymous goroutine to read from the lock channel returned by catFileBatchCheck(), and only reads directly from the lock channel returned by catFileBatch(), if the lockableSet variable is set, which is true exclusively in the case where an UploadGitScanner type was passed to scanRefsToChan(). This avoids some unnecessary work in the more common cases where just a GitScanner type was passed.
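The nil-receiver behaviour this commit message relies on can be illustrated with a small sketch. It is not the code from the "lfs" package: the fields of `lockableNameSet`, the `scan()` helper and the sample paths are all assumptions made for illustration. It only shows why a `Check()` method that tolerates a nil receiver lets the caller skip the lock channel entirely when no set was supplied.

```go
package main

import "fmt"

// lockableNameSet is a hypothetical stand-in for the set of locked paths.
type lockableNameSet struct {
	names map[string]struct{}
}

// Check reports whether name is in the set. Calling it on a nil
// receiver is safe and always reports false.
func (s *lockableNameSet) Check(name string) bool {
	if s == nil {
		return false
	}
	_, ok := s.names[name]
	return ok
}

// scan returns the paths found in the set. When no set is supplied it
// creates neither a channel nor a goroutine, since nothing could ever
// be reported.
func scan(paths []string, set *lockableNameSet) []string {
	if set == nil {
		return nil
	}
	lockCh := make(chan string)
	go func() {
		defer close(lockCh)
		for _, p := range paths {
			if set.Check(p) {
				lockCh <- p
			}
		}
	}()
	var locked []string
	for p := range lockCh {
		locked = append(locked, p)
	}
	return locked
}

func main() {
	set := &lockableNameSet{names: map[string]struct{}{"a.bin": {}}}
	fmt.Println(scan([]string{"a.bin", "b.txt"}, set)) // [a.bin]
	fmt.Println(scan([]string{"a.bin", "b.txt"}, nil)) // []
}
```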
chrisd8088
added a commit
to chrisd8088/git-lfs
that referenced
this pull request
Jun 7, 2023
In commit 08c5ae6 of PR git-lfs#1953 the runCatFileBatchCheck() and runCatFileBatch() functions of the "lfs" package were updated to write the pathspecs of locked files which are not Git LFS pointers to a dedicated channel created by the catFileBatchCheck() and catFileBatch() wrapper functions, respectively. The scanRefsToChan() function, which calls both of these with a potentially non-nil *lockableNameSet parameter (and is the only caller to do so), starts a goroutine with an anonymous function to read any events on the channel returned by catFileBatchCheck(), and then reads all events on the channel returned by catFileBatch() directly. In the latter case, this would cause scanRefsToChan() to stall indefinitely unless the channel is closed, so the anonymous function started by the runCatFileBatch() function that writes to the channel always closes the channel upon exit. However, the anonymous function started by the runCatFileBatchCheck() function that writes to its lock path channel does not do the same. While this does not cause a stalled program because scanRefsToChan() creates its own anonymous function to read from the channel, that function will not exit until the program stops. By adding an explicit close() of the channel at the end of the anonymous function started by runCatFileBatchCheck(), we can ensure the anonymous function which reads that channel will also exit as soon as possible.
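The effect of the added close() can be seen in a minimal sketch. The `produce()` and `drain()` helpers and the sample paths are hypothetical and are not taken from the "lfs" package. The reader's `range` loop ends only once the writer closes the channel, so without the deferred close() the call to `wg.Wait()` would block forever.

```go
package main

import (
	"fmt"
	"sync"
)

// produce writes each path to a new channel from its own goroutine and
// closes the channel when it has nothing more to send.
func produce(paths []string) <-chan string {
	ch := make(chan string)
	go func() {
		defer close(ch) // lets any reader ranging over ch terminate
		for _, p := range paths {
			ch <- p
		}
	}()
	return ch
}

// drain starts a reader goroutine and waits for it to exit, which only
// happens once the channel has been closed by the writer.
func drain(ch <-chan string) []string {
	var (
		wg  sync.WaitGroup
		got []string
	)
	wg.Add(1)
	go func() {
		defer wg.Done()
		for p := range ch {
			got = append(got, p)
		}
	}()
	wg.Wait()
	return got
}

func main() {
	fmt.Println(drain(produce([]string{"locked/a.txt", "locked/b.txt"})))
}
```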
chrisd8088
added a commit
to chrisd8088/git-lfs
that referenced
this pull request
Jun 7, 2023
The "git lfs push" and "git lfs pre-push" commands utilize the uploadForRefUpdates() function of the "commands" package, which calls the (*uploadContext).buildGitScanner() method to initialize a GitScanner structure with the additional elements necessary to exclude from the results any objects which are reachable from refs known to exist on the given remote. Two of these elements, the FoundLockable callback and the PotentialLockables set of pathspecs locked by other users (as used to report errors or warnings when attempting to push them, specifically when they are not Git LFS objects, for which lock verification is handled separately), were added in commit 9b8bed7 of PR git-lfs#1953, and are initialized following a call to the RemoteForPush() method of the GitScanner structure, which sets two other internal elements of the structure named "remote" and "skippedRefs". We can therefore replace the RemoteForPush() method with a NewGitScannerForPush() function, which allows us to accept all the necessary arguments for push operations in a single function, and to avoid exporting the GitScannerFoundLockable and GitScannerSet elements used to identify locked pathspecs.
chrisd8088
added a commit
to chrisd8088/git-lfs
that referenced
this pull request
Apr 3, 2025
Since commit 1412d6e of PR git-lfs#3634, during push operations the Git LFS client has sometimes avoided reporting an error when an object to be pushed is missing locally if the remote server reports that it has a copy already. To implement this feature, a new Missing element was added to the Transfer and objectTuple structures in our "tq" package, and the Add() method of the TransferQueue structure in that package was updated to accept an additional "missing" flag, which the method uses to set the Missing element of the objectTuple structure it creates and sends to the "incoming" channel. Batches of objects to be pushed are then gathered from this channel by the collectBatches() method of the TransferQueue structure. As batches of objectTuple structures are collected, they are passed to the enqueueAndCollectRetriesFor() method, which converts them to Transfer structures using the ToTransfers() method and then passes them to the Batch() function, which is defined in our tq/api.go source file. This function initializes a batchRequest structure which contains the set of Transfer structures as its Objects element, and then passes those to the Batch() method specific to the current batch transfer adapter's structure. These Batch() methods return a BatchResponse structure, which the Batch() function then returns to the enqueueAndCollectRetriesFor() method. The BatchResponse structure also contains an Objects element which is another set of Transfer structures that represent the per-object metadata received from the remote server. After the enqueueAndCollectRetriesFor() method receives a BatchResponse during a push operation, if any of the Transfer structures in that response define an upload action to be performed, this implies that the remote server does not have a copy of those objects. 
As one of the changes we made in PR git-lfs#3634, we introduced a step into the enqueueAndCollectRetriesFor() method which halts the push operation if the server's response indicates that the server lacks a copy of an object, and if the "missing" value passed to the Add() method for that object was set to "true". (Initially, this step also decremented the count of the number of objects waiting to be transferred, but this created the potential for stalled push operations, and so another approach to handling an early exit from the batch transfer process was implemented in commit eb83fcd of PR git-lfs#3800.) Also in PR git-lfs#3634, several methods of the uploadContext structure in our "commands" package were revised to set the "missing" value for each object before calling the Add() method of the TransferQueue structure. Specifically, the ensureFile() method of the uploadContext structure first checks for the presence of the object data file in the .git/lfs/objects local storage directories. If that does not exist, then the method looks for a file in the working tree at the path associated with the Git LFS pointer that corresponds to the object. If that file also does not exist, and if the "lfs.allowIncompletePush" Git configuration option is set to "false", then the method returns "true", and this value is ultimately used for the "missing" argument in the call to the TransferQueue's Add() method for the object. Note that the file in the working tree may have any content, or be entirely empty; its simple presence is enough to change the value returned by the ensureFile() method, given the other conditions described above. We expect to revise this unintuitive behaviour in a subsequent commit in this PR. Before we make that change, however, we first adjust two aspects of the implementation from PR git-lfs#3634 so as to simplify our handling of missing objects during push operations. 
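The decision described above can be summarized as two small predicates. This is a simplified sketch and not the ensureFile() or enqueueAndCollectRetriesFor() code itself: the function names `missingForPush` and `shouldHalt` and their boolean parameters are hypothetical, and the real methods do more work (such as reporting errors) that is omitted here.

```go
package main

import "fmt"

// missingForPush mirrors the conditions described for the "missing"
// flag: it is true only when the object is absent from local storage,
// no file exists at the pointer's path in the working tree, and
// lfs.allowIncompletePush is false.
func missingForPush(inLocalStorage, inWorkingTree, allowIncompletePush bool) bool {
	if inLocalStorage || inWorkingTree {
		return false
	}
	return !allowIncompletePush
}

// shouldHalt mirrors the check made after a batch response: the push
// stops when the server asks for an upload of an object that was
// flagged as missing when it was added to the queue.
func shouldHalt(serverWantsUpload, missing bool) bool {
	return serverWantsUpload && missing
}

func main() {
	// Object absent everywhere, incomplete pushes not allowed.
	m := missingForPush(false, false, false)
	fmt.Println(m, shouldHalt(true, m), shouldHalt(false, m))

	// Any file in the working tree, whatever its content, clears the flag.
	fmt.Println(missingForPush(false, true, false))
}
```

Note how the second call reflects the unintuitive behaviour mentioned above: the mere presence of a working tree file changes the result.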
We make one of these adjustments in this commit, and the other in a following commit in this PR. As noted above, the enqueueAndCollectRetriesFor() method halts a push operation if the server's response indicates that the server lacks a copy of an object, and if the "missing" value passed to the Add() method for that object was set to "true". In order for the method to determine an object's "missing" value, at present it consults the Missing element of the object's Transfer structure from the set of those structures provided in the BatchResponse structure. However, Git LFS remote servers do not report an object's local "missing" status, so these Missing fields are not populated from the server's response. Instead, the appropriate values for these Missing fields in the Transfer structures representing the list of objects in the server's response are set by either the Batch() method of the tqClient structure, which handles HTTP-based Git LFS Batch API requests and responses, or the Batch() method of the SSHBatchClient structure, which handles the equivalent in the SSH version of the Git LFS object transfer protocol. These methods are invoked by the Batch() function in the "tq" package, and are passed a batchRequest structure, whose Objects element has been populated by that function from the list of Transfer structures passed to it by the enqueueAndCollectRetriesFor() method. The Batch() methods for both the HTTP and SSH versions of the Git LFS object transfer protocol iterate over this input set of Transfer structures and create local maps from each object's ID to the value of the Missing element from the object's input Transfer structure. After performing the relevant HTTP or SSH batch request and receiving a response from the remote server, the Batch() methods then iterate over the list of objects in the server's reply and set each object's output Transfer structure's Missing element from the value found in the local map. 
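The local-map technique described above can be sketched as follows. The `Transfer` type here is a reduced, hypothetical version with only the two fields needed for the illustration, and `carryMissing()` is an invented helper name; the real Batch() methods also perform the HTTP or SSH request between the two loops.

```go
package main

import "fmt"

// Transfer is a reduced stand-in holding only the fields used here.
type Transfer struct {
	Oid     string
	Missing bool
}

// carryMissing records each requested object's Missing value by object
// ID, then copies that value onto the matching object in the server's
// response, since the server itself never reports it.
func carryMissing(request, response []*Transfer) {
	missing := make(map[string]bool, len(request))
	for _, t := range request {
		missing[t.Oid] = t.Missing
	}
	for _, t := range response {
		t.Missing = missing[t.Oid]
	}
}

func main() {
	request := []*Transfer{{Oid: "aaa", Missing: true}, {Oid: "bbb"}}
	response := []*Transfer{{Oid: "bbb"}, {Oid: "aaa"}}
	carryMissing(request, response)
	for _, t := range response {
		fmt.Println(t.Oid, t.Missing)
	}
}
```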
This design dates from the original implementation for the HTTP-based Git LFS transfer protocol in commit 1412d6e of PR git-lfs#3634, and was then implemented for the SSH-based protocol in commit 594f8e3 of PR git-lfs#4446, when that protocol was introduced. However, we can simplify this design by eliminating the need to create and reference local maps in the Batch() methods for each transfer protocol, if we instead make use of the map of object IDs to lists of objectTuple structures held in the "transfers" element of the TransferQueue structure. This "transfers" element contains a map from object IDs to "objects" structures, which in turn have "objects" elements that are lists of objectTuple structures. These are populated by the remember() method of the TransferQueue structure, which is called by the Add() method to record all the input data it is passed when an object is added to the transfer queue. One potential complication is that the "objects" structure allows for multiple objectTuples to be recorded for a common object ID, in case the Add() method is called repeatedly for the same Git LFS object. Hypothetically, this could allow for different values of the "missing" argument to the Add() method to be recorded for the same object ID during the operation of the transfer queue. In practice, though, this is not possible, because the Add() method will only be called once per unique object ID during push operations. When a "git lfs pre-push" or "git lfs push" command runs, the UploadPointers() method of the uploadContext structure in the "commands" package is the only caller of the Add() method of the TransferQueue structure, and the UploadPointers() method only performs that call for the object IDs returned by the prepareUpload() method, which it calls before the TransferQueue's Add() method. 
In the common case where a Git LFS push command scans through the history of one or more Git references to find Git LFS pointers, the UploadPointers() method of the uploadContext structure is called for each individual pointer found in the Git history. As a consequence, the objects are passed individually to the prepareUpload() method. That method checks if the given object has been processed already by calling the HasUploaded() method, and if that returns "true", the prepareUpload() method returns nothing, so the UploadPointers() method in turn does not invoke the TransferQueue's Add() method for that object. If the HasUploaded() method returns "false", though, the object is returned by the prepareUpload() method, so the UploadPointers() method invokes the TransferQueue's Add() method, and then calls the SetUploaded() method to register the object ID in the set of IDs that have been processed so far. Note that it is unlikely for multiple Git LFS pointers with the same object ID to be found in a repository's Git history in the first place, because these pointers will normally be identical to each other, and so will have identical Git blob SHA-1 values. Therefore when we run "git rev-list --objects ..." to retrieve the Git blobs reachable from a set of Git references, Git LFS pointers with the same Git LFS object IDs will normally only be represented by a single Git blob. However, as demonstrated by our t/t-duplicate-oids.sh test script, it is possible for Git LFS pointers with legacy values for their "version" field to exist, which may then reference the same SHA-256 Git LFS object ID as a modern Git LFS pointer but be stored as distinct Git blobs. Similarly, it is possible for a Git LFS pointer with extension fields to resolve to the same final content data as a pointer without those fields, or with different extension fields, and so could also result in distinct Git blobs that are all valid Git LFS pointers with the same SHA-256 object IDs. 
Aside from the common case where a Git LFS push command scans through the history of one or more Git references to find Git LFS pointers, the "git lfs push" command may also be invoked with its --object-id or --stdin options, along with a list of specific Git LFS object IDs to be uploaded. When handling these command-line options, the uploadsWithObjectIDs() function in our "commands" package will invoke the UploadPointers() method of the uploadContext structure with the full set of Git LFS object IDs passed to the command. Because a user may inadvertently supply the same object ID more than once, this means duplicate IDs may be passed by the UploadPointers() method to the prepareUpload() method, in which case that method's call to the HasUploaded() method would return the same "false" value for all the duplicate object IDs in the list. Hence, this check is not sufficient in this case to avoid duplicate IDs being returned by the prepareUpload() method. Fortunately, though, the prepareUpload() method performs another round of de-duplication using a local StringSet from our "tools" package. Each object ID is added to this "uniqOids" set as it is processed, unless the set already contains that object ID, in which case the object is skipped. This second level of de-duplication logic ensures that it is not possible for the UploadPointers() method to receive object IDs from the prepareUpload() method unless they have not been processed before. This in turn guarantees that the Add() method of the TransferQueue structure will never be called for the same Git LFS object ID more than once during a push operation. (As a sidebar, the use of the "uniqOids" set to de-duplicate objects during push operations was originally introduced in commit d770dfa of PR git-lfs#1600 to guard against duplicate object IDs which arose from distinct Git LFS pointers with the same SHA-256 object IDs, as can result from the use of legacy pointer "version" values. 
At the time, the prepareUpload() method was called by an upload() function rather than the newer UploadPointers() method. During normal "git lfs pre-push" and "git lfs push" commands, without the --object-id or --stdin options, this upload() function would be passed all the Git LFS objects found from scanning the appropriate Git history, and so might encounter duplicate object IDs which would not be filtered by checking the set maintained by the SetUploaded() method, since that was only called after the Add() method of the TransferQueue structure. Later, with the changes in PR git-lfs#1953, the logic changed such that objects are only handled individually during push operations when the --object-id and --stdin options are not supplied, and since then the checks of the "uniqOids" set in the prepareUpload() method have only guarded against duplicate object IDs provided via those command-line options.) As described above, we can rely on the object ID de-duplication in the "commands" package to prevent the same object being requested for upload more than once. This ensures that the "objects" structure found in the "transfers" map element of the TransferQueue structure will only contain a single objectTuple. Therefore, in our revisions to the enqueueAndCollectRetriesFor() method, we can simply check the Missing element of that first (and only) objectTuple to determine whether the "missing" argument was set to "true" when the Add() method of the TransferQueue structure was called for that unique object. (Note that in PR git-lfs#2476, the "transfers" structure was updated to maintain a list of objectTuple structures for each object ID, in order to handle duplicate OIDs passed during "smudge" filter operations using Git's "delay" capability for long-running filter drivers like the "git lfs filter-process" command. 
This filter never handles upload operations, though, so the de-duplication logic in the "tq" package is not applicable to push operations, since those are fully de-duplicated in the "commands" package.) In prior commits in this PR we refactored and expanded the tests relevant to the handling of missing objects during push operations, in part to provide additional validation of the changes in this commit, including several new tests in our t/t-push-failures-local.sh test script which attempt to perform push operations with missing objects while using the SSH-based version of the Git LFS object transfer protocol. In particular, the "push reject missing object (lfs.allowincompletepush false) (git-lfs-transfer)" test checks that when an object is missing locally, and is detected as such by the ensureFile() method of the uploadContext structure because there is also no file in the working tree at the path associated with the object's Git LFS pointer, the Git LFS client abandons the push operation immediately and does not proceed to try to upload either the missing object or any other objects. The test validates this behaviour by checking that no objects are present on the server after the "git push" command exits, and also by looking for error and trace log messages the client should output when it halts the push operation. We added the trace log message in a previous commit in this PR because we intend to revise the error message in a subsequent commit, which will make the message more consistent, so it is not specific to the case where no file is found in the working tree, but will also mean we cannot use the revised error message as an indicator that the client has halted the push operation where we expect that it should.
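The two levels of de-duplication that this commit message depends on can be sketched as below. This is an illustration under stated assumptions, not the code from the "commands" package: the `uploadContext` here holds only a plain map, the per-call set is a Go map rather than the StringSet type, and the `add` callback is a hypothetical stand-in for the Add() method of the TransferQueue structure.

```go
package main

import "fmt"

// uploadContext is a reduced stand-in that only tracks processed IDs.
type uploadContext struct {
	uploaded map[string]struct{}
}

func (c *uploadContext) HasUploaded(oid string) bool {
	_, ok := c.uploaded[oid]
	return ok
}

func (c *uploadContext) SetUploaded(oid string) {
	c.uploaded[oid] = struct{}{}
}

// prepareUpload drops IDs handled by an earlier call (first level) and
// IDs repeated within this call (second level, the "uniqOids" set).
func (c *uploadContext) prepareUpload(oids ...string) []string {
	uniqOids := make(map[string]struct{})
	var out []string
	for _, oid := range oids {
		if c.HasUploaded(oid) {
			continue
		}
		if _, seen := uniqOids[oid]; seen {
			continue
		}
		uniqOids[oid] = struct{}{}
		out = append(out, oid)
	}
	return out
}

// UploadPointers calls add once per surviving ID, then records it.
func (c *uploadContext) UploadPointers(add func(oid string), oids ...string) {
	for _, oid := range c.prepareUpload(oids...) {
		add(oid)
		c.SetUploaded(oid)
	}
}

func main() {
	ctx := &uploadContext{uploaded: make(map[string]struct{})}
	calls := make(map[string]int)
	add := func(oid string) { calls[oid]++ }

	ctx.UploadPointers(add, "aaa", "aaa", "bbb") // duplicates in one call
	ctx.UploadPointers(add, "aaa")               // repeated in a later call
	fmt.Println(calls["aaa"], calls["bbb"])      // 1 1
}
```

Because each object ID reaches the `add` callback at most once, a per-ID list of queued entries would hold a single element in this sketch, which is the property the revised enqueueAndCollectRetriesFor() method is described as relying on.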
chrisd8088
added a commit
to chrisd8088/git-lfs
that referenced
this pull request
Apr 3, 2025
Since commit 1412d6e of PR git-lfs#3634, during push operations the Git LFS client has sometimes avoided reporting an error when an object to be pushed is missing locally if the remote server reports that it has a copy already. To implement this feature, a new Missing element was added to the Transfer and objectTuple structures in our "tq" package, and the Add() method of the TransferQueue structure in that package was updated to accept an additional "missing" flag, which the method uses to set the Missing element of the objectTuple structure it creates and sends to the "incoming" channel. Batches of objects to be pushed are then gathered from this channel by the collectBatches() method of the TransferQueue structure. As batches of objectTuple structures are collected, they are passed to the enqueueAndCollectRetriesFor() method, which converts them to Transfer structures using the ToTransfers() method and then passes them to the Batch() function, which is defined in our tq/api.go source file. This function initializes a batchRequest structure which contains the set of Transfer structures as its Objects element, and then passes those to the Batch() method specific to the current batch transfer adapter's structure. These Batch() methods return a BatchResponse structure, which the Batch() function then returns to the enqueueAndCollectRetriesFor() method. The BatchResponse structure also contains an Objects element which is another set of Transfer structures that represent the per-object metadata received from the remote server. After the enqueueAndCollectRetriesFor() method receives a BatchResponse during a push operation, if any of the Transfer structures in that response define an upload action to be performed, this implies that the remote server does not have a copy of those objects. 
As one of the changes we made in PR git-lfs#3634, we introduced a step into the enqueueAndCollectRetriesFor() method which halts the push operation if the server's response indicates that the server lacks a copy of an object, and if the "missing" value passed to the Add() method for that object was set to "true". (Initially, this step also decremented the count of the number of objects waiting to be transferred, but this created the potential for stalled push operations, and so another approach to handling an early exit from the batch transfer process was implemented in commit eb83fcd of PR git-lfs#3800.) Also in PR git-lfs#3634, several methods of the uploadContext structure in our "commands" package were revised to set the "missing" value for each object before calling the Add() method of the TransferQueue structure. Specifically, the ensureFile() method of the uploadContext structure first checks for the presence of the object data file in the .git/lfs/objects local storage directories. If that does not exist, then the method looks for a file in the working tree at the path associated with the Git LFS pointer that corresponds to the object. If that file also does not exist, and if the "lfs.allowIncompletePush" Git configuration option is set to "false", then the method returns "true", and this value is ultimately used for the "missing" argument in the call to the TransferQueue's Add() method for the object. Note that the file in the working tree may have any content, or be entirely empty; its simple presence is enough to change the value returned by the ensureFile() method, given the other conditions described above. We expect to revise this unintuitive behaviour in a subsequent commit in this PR. Before we make that change, however, we first adjust two aspects of the implementation from PR git-lfs#3634 so as to simplify our handling of missing objects during push operations. 
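As a rough illustration of the decision described above, the following sketch shows how the "missing" value might be derived. It is not the actual ensureFile() method: the isMissing() and fileExists() helpers, their parameters, and the paths used are all hypothetical, and the real method does more than this.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// fileExists is a hypothetical helper that reports whether a path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// isMissing is a hypothetical stand-in for the check described in the text.
// It returns true only when the object is absent from local storage, no file
// exists at the pointer's working tree path, and incomplete pushes are not
// allowed.
func isMissing(objectPath, worktreePath string, allowIncompletePush bool) bool {
	if fileExists(objectPath) {
		return false // object data is present in local storage
	}
	if fileExists(worktreePath) {
		return false // any file at this path counts, whatever its content
	}
	return !allowIncompletePush
}

func main() {
	dir, err := os.MkdirTemp("", "lfs-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	objectPath := filepath.Join(dir, "objects", "abc123") // never created
	worktreePath := filepath.Join(dir, "a.dat")

	fmt.Println(isMissing(objectPath, worktreePath, false)) // true

	// An empty working tree file is enough to change the result.
	if err := os.WriteFile(worktreePath, nil, 0o644); err != nil {
		panic(err)
	}
	fmt.Println(isMissing(objectPath, worktreePath, false)) // false
}
```

The second call shows the unintuitive case noted above: an empty file in the working tree is enough to clear the "missing" flag.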
We make one of these adjustments in this commit, and the other in a following commit in this PR. As noted above, the enqueueAndCollectRetriesFor() method halts a push operation if the server's response indicates that the server lacks a copy of an object, and if the "missing" value passed to the Add() method for that object was set to "true". In order for the method to determine an object's "missing" value, at present it consults the Missing element of the object's Transfer structure from the set of those structures provided in the BatchResponse structure. However, Git LFS remote servers do not report an object's local "missing" status, so these Missing fields are not populated from the server's response. Instead, the appropriate values for these Missing fields in the Transfer structures representing the list of objects in the server's response are set by either the Batch() method of the tqClient structure, which handles HTTP-based Git LFS Batch API requests and responses, or the Batch() method of the SSHBatchClient structure, which handles the equivalent in the SSH version of the Git LFS object transfer protocol. These methods are invoked by the Batch() function in the "tq" package, and are passed a batchRequest structure, whose Objects element has been populated by that function from the list of Transfer structures passed to it by the enqueueAndCollectRetriesFor() method. The Batch() methods for both the HTTP and SSH versions of the Git LFS object transfer protocol iterate over this input set of Transfer structures and create local maps from each object's ID to the value of the Missing element from the object's input Transfer structure. After performing the relevant HTTP or SSH batch request and receiving a response from the remote server, the Batch() methods then iterate over the list of objects in the server's reply and set each object's output Transfer structure's Missing element from the value found in the local map. 
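The local-map technique described above can be sketched as follows. This is a simplified illustration only: the Transfer type is trimmed to two fields, and batch() and fakeServer() are hypothetical names standing in for the protocol-specific Batch() methods and the remote request.

```go
package main

import "fmt"

// Transfer is a trimmed, hypothetical stand-in for the "tq" Transfer structure.
type Transfer struct {
	Oid     string
	Missing bool
}

// fakeServer stands in for the HTTP or SSH batch request. Like a real server,
// it knows nothing about the client's local "missing" status, so the Transfer
// values it returns always have Missing set to false.
func fakeServer(req []*Transfer) []*Transfer {
	res := make([]*Transfer, 0, len(req))
	for _, t := range req {
		res = append(res, &Transfer{Oid: t.Oid})
	}
	return res
}

// batch records each input object's Missing value in a local map, performs
// the request, and then copies the recorded values onto the response objects.
func batch(req []*Transfer) []*Transfer {
	missing := make(map[string]bool, len(req))
	for _, t := range req {
		missing[t.Oid] = t.Missing
	}

	res := fakeServer(req)
	for _, t := range res {
		t.Missing = missing[t.Oid]
	}
	return res
}

func main() {
	res := batch([]*Transfer{{Oid: "aaa", Missing: true}, {Oid: "bbb"}})
	for _, t := range res {
		fmt.Println(t.Oid, t.Missing)
	}
	// Prints:
	// aaa true
	// bbb false
}
```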
This design dates from the original implementation for the HTTP-based Git LFS transfer protocol in commit 1412d6e of PR git-lfs#3634, and was then implemented for the SSH-based protocol in commit 594f8e3 of PR git-lfs#4446, when that protocol was introduced. However, we can simplify this design by eliminating the need to create and reference local maps in the Batch() methods for each transfer protocol, if we instead make use of the map of object IDs to lists of objectTuple structures held in the "transfers" element of the TransferQueue structure. This "transfers" element contains a map from object IDs to "objects" structures, which in turn have "objects" elements that are lists of objectTuple structures. These are populated by the remember() method of the TransferQueue structure, which is called by the Add() method to record all the input data it is passed when an object is added to the transfer queue. One potential complication is that the "objects" structure allows for multiple objectTuples to be recorded for a common object ID, in case the Add() method is called repeatedly for the same Git LFS object. Hypothetically, this could allow for different values of the "missing" argument to the Add() method to be recorded for the same object ID during the operation of the transfer queue. In practice, though, this is not possible, because the Add() method will only be called once per unique object ID during push operations. When a "git lfs pre-push" or "git lfs push" command runs, the UploadPointers() method of the uploadContext structure in the "commands" package is the only caller of the Add() method of the TransferQueue structure, and the UploadPointers() method only performs that call for the object IDs returned by the prepareUpload() method, which it calls before the TransferQueue's Add() method. 
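To make the shape of the "transfers" map concrete, here is a minimal sketch of how an object's "missing" value could be read back from it. The type and field names follow the description above, but the queue type and the wasMissing() method are hypothetical, and the structures are trimmed to the fields relevant here.

```go
package main

import "fmt"

// objectTuple is a trimmed stand-in for the structure recorded by Add().
type objectTuple struct {
	Oid     string
	Missing bool
}

// objects holds every objectTuple recorded for a single object ID.
type objects struct {
	objects []*objectTuple
}

// queue is a hypothetical, reduced version of the TransferQueue structure.
type queue struct {
	transfers map[string]*objects
}

// remember records a tuple under its object ID, as Add() is described as doing.
func (q *queue) remember(t *objectTuple) {
	if _, ok := q.transfers[t.Oid]; !ok {
		q.transfers[t.Oid] = &objects{}
	}
	q.transfers[t.Oid].objects = append(q.transfers[t.Oid].objects, t)
}

// wasMissing reads the Missing value of the first tuple recorded for an
// object ID. This assumes at most one tuple per ID, which the text argues
// holds during push operations.
func (q *queue) wasMissing(oid string) bool {
	objs, ok := q.transfers[oid]
	if !ok || len(objs.objects) == 0 {
		return false
	}
	return objs.objects[0].Missing
}

func main() {
	q := &queue{transfers: make(map[string]*objects)}
	q.remember(&objectTuple{Oid: "aaa", Missing: true})
	q.remember(&objectTuple{Oid: "bbb"})

	fmt.Println(q.wasMissing("aaa"), q.wasMissing("bbb")) // true false
}
```

With a lookup like this available in the queue itself, the per-protocol Batch() methods would no longer need to carry the Missing values through their own maps.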
In the common case where a Git LFS push command scans through the history of one or more Git references to find Git LFS pointers, the UploadPointers() method of the uploadContext structure is called for each individual pointer found in the Git history. As a consequence, the objects are passed individually to the prepareUpload() method. That method checks if the given object has been processed already by calling the HasUploaded() method, and if that returns "true", the prepareUpload() method returns nothing, so the UploadPointers() method in turn does not invoke the TransferQueue's Add() method for that object. If the HasUploaded() method returns "false", though, the object is returned by the prepareUpload() method, so the UploadPointers() method invokes the TransferQueue's Add() method, and then calls the SetUploaded() method to register the object ID in the set of IDs that have been processed so far. Note that it is unlikely for multiple Git LFS pointers with the same object ID to be found in a repository's Git history in the first place, because these pointers will normally be identical to each other, and so will have identical Git blob SHA-1 values. Therefore when we run "git rev-list --objects ..." to retrieve the Git blobs reachable from a set of Git references, Git LFS pointers with the same Git LFS object IDs will normally only be represented by a single Git blob. However, as demonstrated by our t/t-duplicate-oids.sh test script, it is possible for Git LFS pointers with legacy values for their "version" field to exist, which may then reference the same SHA-256 Git LFS object ID as a modern Git LFS pointer but be stored as distinct Git blobs. Similarly, it is possible for a Git LFS pointer with extension fields to resolve to the same final content data as a pointer without those fields, or with different extension fields, and so could also result in distinct Git blobs that are all valid Git LFS pointers with the same SHA-256 object IDs. 
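The per-pointer de-duplication described above can be sketched as below. The HasUploaded() and SetUploaded() names come from the text, but the uploadContext type is reduced to a hypothetical minimum, uploadPointer() is an invented name, and the "queued" slice merely stands in for calls to the TransferQueue's Add() method.

```go
package main

import "fmt"

// uploadContext is a hypothetical, reduced version of the real structure.
type uploadContext struct {
	uploaded map[string]struct{} // object IDs processed so far
	queued   []string            // stands in for TransferQueue.Add() calls
}

// HasUploaded reports whether an object ID has already been processed.
func (c *uploadContext) HasUploaded(oid string) bool {
	_, ok := c.uploaded[oid]
	return ok
}

// SetUploaded registers an object ID as processed.
func (c *uploadContext) SetUploaded(oid string) {
	c.uploaded[oid] = struct{}{}
}

// uploadPointer handles one pointer found while scanning Git history. An
// object ID seen before is skipped, so it is queued at most once.
func (c *uploadContext) uploadPointer(oid string) {
	if c.HasUploaded(oid) {
		return
	}
	c.queued = append(c.queued, oid)
	c.SetUploaded(oid)
}

func main() {
	c := &uploadContext{uploaded: make(map[string]struct{})}

	// Two distinct pointers (for example, legacy and modern "version"
	// values) may carry the same object ID.
	for _, oid := range []string{"aaa", "bbb", "aaa"} {
		c.uploadPointer(oid)
	}
	fmt.Println(c.queued) // [aaa bbb]
}
```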
Aside from the common case where a Git LFS push command scans through the history of one or more Git references to find Git LFS pointers, the "git lfs push" command may also be invoked with its --object-id or --stdin options, along with a list of specific Git LFS object IDs to be uploaded. When handling these command-line options, the uploadsWithObjectIDs() function in our "commands" package will invoke the UploadPointers() method of the uploadContext structure with the full set of Git LFS object IDs passed to the command. Because a user may inadvertently supply the same object ID more than once, this means duplicate IDs may be passed by the UploadPointers() method to the prepareUpload() method, in which case that method's call to the HasUploaded() method would return the same "false" value for all the duplicate object IDs in the list. Hence, this check is not sufficient in this case to avoid duplicate IDs being returned by the prepareUpload() method. Fortunately, though, the prepareUpload() method performs another round of de-duplication using a local StringSet from our "tools" package. Each object ID is added to this "uniqOids" set as it is processed, unless the set already contains that object ID, in which case the object is skipped. This second level of de-duplication logic ensures that it is not possible for the UploadPointers() method to receive object IDs from the prepareUpload() method unless they have not been processed before. This in turn guarantees that the Add() method of the TransferQueue structure will never be called for the same Git LFS object ID more than once during a push operation. (As a sidebar, the use of the "uniqOids" set to de-duplicate objects during push operations was originally introduced in commit d770dfa of PR git-lfs#1600 to guard against duplicate object IDs which arose from distinct Git LFS pointers with the same SHA-256 object IDs, as can result from the use of legacy pointer "version" values. 
At the time, the prepareUpload() method was called by an upload() function rather than the newer UploadPointers() method. During normal "git lfs pre-push" and "git lfs push" commands, without the --object-id or --stdin options, this upload() function would be passed all the Git LFS objects found from scanning the appropriate Git history, and so might encounter duplicate object IDs which would not be filtered by checking the set maintained by the SetUploaded() method, since that was only called after the Add() method of the TransferQueue structure. Later, with the changes in PR git-lfs#1953, the logic changed such that objects are only handled individually during push operations when the --object-id and --stdin options are not supplied, and since then the checks of the "uniqOids" set in the prepareUpload() method have only guarded against duplicate object IDs provided via those command-line options.) As described above, we can rely on the object ID de-duplication in the "commands" package to prevent the same object being requested for upload more than once. This ensures that the "objects" structure found in the "transfers" map element of the TransferQueue structure will only contain a single objectTuple. Therefore, in our revisions to the enqueueAndCollectRetriesFor() method, we can simply check the Missing element of that first (and only) objectTuple to determine whether the "missing" argument was set to "true" when the Add() method of the TransferQueue structure was called for that unique object. (Note that in PR git-lfs#2476, the "transfers" structure was updated to maintain a list of objectTuple structures for each object ID, in order to handle duplicate OIDs passed during "smudge" filter operations using Git's "delay" capability for long-running filter drivers like the "git lfs filter-process" command. 
This filter never handles upload operations, though, so the de-duplication logic in the "tq" package is not applicable to push operations, since those are fully de-duplicated in the "commands" package.) In prior commits in this PR we refactored and expanded the tests relevant to the handling of missing objects during push operations, in part to provide additional validation of the changes in this commit, including several new tests in our t/t-push-failures-local.sh which attempt to perform push operations with missing objects while using the SSH-based version of the Git LFS object transfer protocol. In particular, the "push reject missing object (lfs.allowincompletepush false) (git-lfs-transfer)" test checks that when an object is missing locally, and is detected as such by the ensureFile() method of the uploadContext structure because there is also no file in the working tree at the path associated with the object's Git LFS pointer, the Git LFS client abandons the push operation immediately and does not proceed to try to upload either the missing object or any other objects. The test validates this behaviour by checking that no objects are present on the server after the "git push" command exits, and also by looking for error and trace log messages the client should output when it halts the push operation. We added the trace log message in a previous commit in this PR because we intend to revise the error message in a subsequent commit, which will make the message more consistent, so it is not specific to the case where no file is found in the working tree, but will also mean we cannot use the revised error message as an indicator that the client has halted the push operation where we expect that it should.
chrisd8088
added a commit
to chrisd8088/git-lfs
that referenced
this pull request
May 22, 2025
This updates the pre-push hook to verify that any changes to non-lfs objects are not locked by other users. The Git Scanner will already ignore any blobs if they are over the size threshold, or if the contents are not parseable LFS pointers. So, this PR does the following:
- Adds a FoundLockable callback on GitScanner. Since only the pre-push command cares, other Git Scanner uses don't need to bother setting this.
- When the catFileBatchCheckScanner encounters a file over the blob size cutoff, it gets the filename from the ScanRefsOptions (which the revListScanner fills), and then sees if that filename is locked by another user. If so, it writes to a chan that eventually calls the FoundLockable callback.
- When the catFileBatchScanner cannot parse a valid LFS pointer, it also gets the filename from the ScanRefsOptions and sees if that filename is locked by another user.
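The flow in the list above might look roughly like the sketch below. It is only an illustration under assumptions: scanLockables(), its parameters, and the map used to represent files locked by other users are all hypothetical and do not reflect the actual GitScanner code.

```go
package main

import "fmt"

// scanLockables is a hypothetical sketch. It takes the filenames of non-LFS
// blobs (oversized files or unparseable pointers), checks each against the
// set of paths locked by other users, and reports matches through a channel
// that feeds the foundLockable callback.
func scanLockables(names []string, lockedByOthers map[string]struct{}, foundLockable func(name string)) {
	ch := make(chan string)

	go func() {
		defer close(ch)
		for _, name := range names {
			// The set "Contains" check mentioned in the review.
			if _, ok := lockedByOthers[name]; ok {
				ch <- name
			}
		}
	}()

	// Drain the channel and invoke the callback for each locked file.
	for name := range ch {
		foundLockable(name)
	}
}

func main() {
	locked := map[string]struct{}{"assets/big.bin": {}}
	names := []string{"README.md", "assets/big.bin"}

	scanLockables(names, locked, func(name string) {
		fmt.Println("locked by another user:", name)
	})
	// Prints: locked by another user: assets/big.bin
}
```

A map lookup keeps each check constant-time for exact paths; whether that holds for large trees or for pattern-based locks is the open question raised in the review comment.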