
Conversation

@redsun82 (Contributor) commented Jan 21, 2025:

This was modelled after the `git-lfs push --dry-run` option. I've removed some apparently unused code:

  • computing and reporting of ready nodes
  • deduplication of multiple objects with the same OID (as far as I can tell, that is already carried out by the transfer queue class, so no need to redo it in the fetch command)
  • the output channel of the main fetch function, which was always `nil`

For testing, I added `--dry-run` testing to a couple of cases, but I wonder if we should add it for _all_ cases.

This is a part of what was discussed in #5888.

@chrisd8088 (Member) commented:

Hi @redsun82! Thanks again for all these PRs! I am starting to work my way through them and thought I'd begin here. I just wanted to let you know I haven't forgotten about these PRs and that we really appreciate all your hard work (and patience!)

@chrisd8088 (Member) left a review comment:

This PR looks great, thank you so much! We also really appreciate the removal of unnecessary, stale code; that's a very pleasant bonus.

I had a few small suggestions before we merge this PR, but other than those minor details, I think this is all set. Thank you again!

cfg.Remote(), tq.WithProgress(meter), tq.DryRun(fetchDryRunArg),
)

if out != nil {
@chrisd8088 (Member) commented:

This is a great simplification, thank you! It looks as though the last use of the `out` parameter was dropped in commit 8adf6ec of PR #1771 way back in 2016, so removing it is long overdue!

t/t-fetch.sh (outdated):
grep "Git LFS fsck OK" fsck.log

git lfs fetch --dry-run 2>&1 | tee fetch.log
! grep "fetch .* => a\.dat" fetch.log
@chrisd8088 (Member) commented:

The `!` negation here does work as expected, but only because this happens to be the last command in the test. It's a tricky situation, though: any command added after this one in the future would silently turn the check into one that passes in all cases, even when the `grep` finds the pattern in the file and succeeds and the `!` should therefore produce a failing exit code from the pipeline.

That would happen because we have the `set -e` option set, and as Bash's documentation states, that option has the following behaviour:

> Exit immediately if a pipeline ... returns a non-zero status.
> The shell does not exit if the command that fails is part of the command list immediately following a `while` or `until` keyword, part of the test in an `if` statement, part of any command executed in a `&&` or `||` list except the command following the final `&&` or `||`, any command in a pipeline but the last, or if the command’s return status is being inverted with `!`.

Over time we inadvertently accumulated a number of these silently non-functional checks due to the use of negated pipelines, which we resolved in commit a5d20de of PR #5282.

In order to try to avoid that problem in the future, I'd suggest we change this check and the one in the test below to use a different idiom:

Suggested change:
-! grep "fetch .* => a\.dat" fetch.log
+grep "fetch .* => a\.dat" fetch.log && exit 1
+true

@redsun82 (Contributor, Author) commented:

Good catch! I actually thought that a `!` command would count as a single command inverting the exit code as far as `-e` was concerned, but I stand corrected.

@chrisd8088 (Member) commented:

> I actually thought that a `!` command would count as a single command inverting the exit code as far as `-e` was concerned

I think pretty much everyone who's worked on this project, including myself, has had the same expectation. And it's sneaky because when a command with `!` is the last one in the test, everything appears to work as you'd expect: the test fails if the command succeeds and `!` inverts the exit code. But that's only because the inverted exit code from that last command is what's returned from the test; `-e` doesn't get involved in this special case.

I like the `command && exit 1 || true` technique you've used. It's a little tidier than the two-line idiom we used in a number of other places, and I don't think it will cause any problems if someone later adds subsequent commands to the test and forgets to remove the `|| true` from the preceding command list.

@chrisd8088 (Member) commented:

Intriguingly, we're getting a flaky test result from the modified "fetch with remote and branches" test in the t/t-fetch.sh script, where in one of the Linux package builds the fetch.log appears to be:

#     fetch: Fetching reference refs/heads/main
#     fetch ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb => a.dat
#     fetch: Fetching reference refs/heads/newbranch
#     fetch ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb => a.dat

The lack of the expected output line for b.dat means the test fails. I'm not sure what's causing that difference from all the other CI runs, but we'll have to figure that out and try to get the output and the test to have consistent results.

@redsun82 (Contributor, Author) commented Feb 4, 2025:

> Intriguingly, we're getting a flaky test result from the modified "fetch with remote and branches" test in the t/t-fetch.sh script, where in one of the Linux package builds the fetch.log appears to be:
>
> #     fetch: Fetching reference refs/heads/main
> #     fetch ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb => a.dat
> #     fetch: Fetching reference refs/heads/newbranch
> #     fetch ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb => a.dat
>
> The lack of the expected output line for b.dat means the test fails. I'm not sure what's causing that difference from all the other CI runs, but we'll have to figure that out and try to get the output and the test to have consistent results.

Hmm, that's indeed odd. I couldn't reproduce locally... Yeah, needs some more looking into.

@redsun82 (Contributor, Author) commented Feb 5, 2025:

@chrisd8088 oh silly me, I had added the printing coroutine without waiting for it to finish. Should be fixed now!

@chrisd8088 (Member) commented Feb 5, 2025:

> I had added the printing coroutine without waiting for it to finish.

Indeed, I just discovered that same issue. Adding a `time.Sleep(5 * time.Second)` into the original version of `printTransfers()` after each `Print()` call makes the problem reproducible on a faster CPU, and it explains why we only saw the flake on some of our slowest CI jobs where we're running an old Linux version in a Docker container.
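For illustration, here's a minimal, self-contained sketch of that failure mode; `fmt.Printf` and the `transfer` struct below are simplified stand-ins for the `Print()` helper and the `tq` package's transfer type, so the real code differs in detail:

```go
package main

import (
	"fmt"
	"time"
)

// transfer is a simplified stand-in for the tq package's transfer type.
type transfer struct{ Oid, Name string }

// printTransfers mimics the original helper: it prints each dry-run
// transfer it receives. The injected Sleep widens the window in which
// the caller can exit before this goroutine has drained the channel,
// which is what made the truncated-output flake reproducible.
func printTransfers(transfers <-chan *transfer) {
	for t := range transfers {
		fmt.Printf("fetch %s => %s\n", t.Oid, t.Name)
		time.Sleep(5 * time.Second) // artificial delay after each print
	}
}

func main() {
	ch := make(chan *transfer, 2)
	go printTransfers(ch) // note: nothing waits for this goroutine
	ch <- &transfer{Oid: "aaaa...", Name: "a.dat"}
	ch <- &transfer{Oid: "bbbb...", Name: "b.dat"}
	close(ch)
	// main returns here; with the delay above, b.dat (or even a.dat)
	// may never be printed, just like the missing line in the CI log.
}
```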

@chrisd8088 (Member) left a review comment:

Thanks for tackling the flaky test problem!

I think we'll have to make a few more changes, as I've suggested, to avoid the reverse situation where the command never finishes in the non-dry-run case. But with those additional changes, I think we'll be all set.

I'm happy to push a commit to your branch, if you grant me permission; otherwise I'll leave this in your hands. And thank you again very much for working on these PRs!

@redsun82 (Contributor, Author) commented Feb 5, 2025:

@chrisd8088 I've implemented your suggestions, but also went ahead and gave you access to my fork, so feel free to add any commits! Thanks for the feedback, it's good to brush up on some Go knowledge 🙂

The commit message for that change reads:

In prior commits in this PR we refactored and simplified the
fetchAndReportToChan() function in our "commands" package so it is
now just called fetch(), and also revised this function to introduce
support for a new --dry-run option to our "git lfs fetch" command.

When the --dry-run command-line option is specified, the fetch() function
starts an anonymous goroutine to output the path and object ID of each
Git LFS object the command would normally retrieve via a transfer queue.

At the moment, this goroutine calls the Watch() method of the TransferQueue
structure in our "tq" package to create and register a "watcher" channel
with the queue, and then starts a loop to read the values sent over the
channel by the queue, each of which represents an object the queue
would transfer if it were not executing in dry-run mode.

So long as the goroutine calls the Watch() method before the transfer
queue begins operating, this design works as we intend.  However, if
the goroutine's execution is delayed by the Go scheduler, the queue
may perform all of its actions before the goroutine calls the Watch()
method, and the fetch() function could invoke the Wait() method of
the TransferQueue structure before the goroutine calls the Watch()
method.

Should this happen, the "watcher" channel created by the Watch()
method will never be closed, because that only happens in the Wait()
method, and that method has already run at a time when there were
no extant channels to close.  Therefore the goroutine's loop over the
channel will never exit, and the goroutine will never decrement the
count of the WaitGroup structure by calling its Done() method, and in
turn the fetch() function's call to the WaitGroup's Wait() method will
never complete, so the "git lfs fetch" command will stall indefinitely.

To ensure we handle this scheduling condition, we create the "watcher"
channel in the fetch() function before starting our anonymous goroutine,
so the channel will always be registered with the TransferQueue before
its Wait() method is called.  Thus the channel will always be found
and closed by that method, and the goroutine cannot enter an infinite
wait state on the channel.
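As a rough, self-contained sketch of the pattern this commit message describes (the `queue` type below is a simplified stand-in for the `tq` package's `TransferQueue`; the real `Watch()` and `Wait()` methods differ in detail):

```go
package main

import (
	"fmt"
	"sync"
)

type transfer struct{ Oid, Name string }

// queue is a simplified stand-in for tq.TransferQueue: Watch registers
// a channel, and Wait closes every channel registered so far.
type queue struct{ watchers []chan *transfer }

func (q *queue) Watch() chan *transfer {
	c := make(chan *transfer, 16)
	q.watchers = append(q.watchers, c)
	return c
}

func (q *queue) Wait() {
	for _, c := range q.watchers {
		close(c) // only channels registered before Wait() are closed
	}
}

func main() {
	q := &queue{}

	// Register the watcher on the main goroutine, *before* starting the
	// printing goroutine, so q.Wait() is guaranteed to find and close it
	// no matter how the goroutine is scheduled.
	watcher := q.Watch()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for t := range watcher { // exits when q.Wait() closes the channel
			fmt.Printf("fetch %s => %s\n", t.Oid, t.Name)
		}
	}()

	watcher <- &transfer{Oid: "aaaa...", Name: "a.dat"} // queue activity
	q.Wait() // closes all registered watcher channels
	wg.Wait() // safe: the range loop above must terminate
}
```

Had `q.Watch()` been called inside the goroutine instead, a delayed goroutine could register its channel only after `q.Wait()` had already run and closed nothing, so the range loop, and with it `wg.Wait()`, would block forever.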
@chrisd8088 (Member) left a review comment:

This is looking really good!

Thank you for the invitation to collaborate; I've pushed up what I hope is one final change to deal with an edge case where the queue's `Wait()` method is called before the anonymous goroutine has a chance to register a new watcher channel. Otherwise, I found that inserting a `time.Sleep(1 * time.Second)` at the start of the goroutine was enough to make the command stall indefinitely.

out <- p
}
defer wg.Done()
for p := range q.Watch() {
@chrisd8088 (Member) commented:

I tried this myself last night, and noticed that if I injected an artificial delay before the for ... range loop, as if the goroutine was not dispatched immediately, the loop never exited and the command would stall indefinitely.

I thought I'd worked around this by calling q.Watch() in advance of the loop, and assigning it to a variable, but in the fresh light of morning I see that that actually makes no difference. :-)

What we need to do instead is call q.Watch() before starting the goroutine at all. Otherwise, there's an edge case where the goroutine doesn't start executing immediately, and the entire dry-run transfer process completes in the meantime. In this instance, the q.Wait() call below is made before the goroutine starts, and because it finds no watcher channels (as the goroutine hasn't created one yet), it returns without closing any.

Since q.Wait() will never be called again to close the channel the goroutine creates, the goroutine's loop never exits, wg.Done() is never called, and wg.Wait() never returns.

I'll push up a commit which makes this change and adds some explanation in the description.

@chrisd8088 (Member) commented:

OK, I think we're all good here ... except that, sigh, our Windows CI job is failing for an unrelated reason. I'll work on getting that fixed ASAP, and then we can re-run the CI jobs here and merge this.

@chrisd8088 merged commit fef3ea6 into git-lfs:main on Feb 6, 2025 (10 checks passed).