Skip to content

Conversation

@ConcurrentCrab
Copy link
Contributor

@ConcurrentCrab ConcurrentCrab commented Jun 30, 2024

Commit 44b8801 introduced a bug where connections were being created on-demand but were not being stored in the connections slice.

When trying to end connections at the end of a transfer, there were no handles to them. As a result, all but the primary connection of the session were hung with no "quit" message. LFS clients (and servers) were simply hung until idle timeout kicked in.

Found this while implementing pure SSH protocol for Gitea (go-gitea/gitea#17554).

This seems like important functionality, so I'd recommend adding some sort of pure SSH multiplexed tests in the suite as well.

Commit 44b8801 introduced
a bug where connections were being created on-demand
but were not being stored in the connections slice.

When trying to end connections at the end of a transfer,
there were no handles to them. As a result, all but the
primary connection of the session were hung with no "quit"
message. LFS clients were simply hung until idle timeout
kicked in.
@chrisd8088
Copy link
Member

Hey, thanks for diagnosing the problem and contributing a fix! This looks obviously correct to me.

Once our CI is back in good shape I'll run the test suite.

I think your suggestion of adding some tests to validate this behaviour is a very good one. I don't want to impose on your time, but since you're working in this area, if you have any ideas or notes on how you simulated this problem, those would be very helpful. Ideally we'd take an existing test like the batch transfers with ssh endpoint (git-lfs-transfer) test and run another version of it which failed without your change, and succeeded with it.

@ConcurrentCrab
Copy link
Contributor Author

ConcurrentCrab commented Jul 3, 2024

@chrisd8088 the test case you linked looks like it was close to catching this. The trick is, multiplexing only kicks in when there are >1 LFS objects to upload (since the flow for a single file has to be a linear batch -> upload -> verify). I found this problem while trying uploads of multiple large files, but it's reliably reproducible even with just 2 tiny test files.

Just adding multiple files to the test commit would be enough as a test case. Unfortunately both known implementations of git-lfs-transfer treat a timeout EOF as a graceful exit, so it won't exactly fail (not sure how to do that), but a running time of 2 minutes should indicate that something is awry :P

I tried adding a copy of that test, only with multiple LFS files, but that seems to fail in a new, entirely unexpected way:

#    [...] trace git-lfs: pure SSH connection successful
#    [...] trace git-lfs: failed to spawn pure SSH connection: Unable to negotiate version with remote side (unable to read capabilities): EOF
#    Failed to connect to remote SSH server: expected "/run/user/[...]/sock-[...]/lfs.sock" to exist
#    [...] trace git-lfs: shutting down pure SSH connection

Since it mentioned a socket, I suspected this is another multiplexing issue (as it uses a control socket to coordinate between the workers). And sure enough, adding a: git config lfs.ssh.automultiplexing false before the push lets the test past the push, though it later seems to fail doing the git lfs fsck (? looks like it at least).

I tried looking more into it but I don't think I'm nearly familiar enough with your test harness to be able to figure out why the the git-lfs process isn't able to access that control socket.

@chrisd8088
Copy link
Member

I tried looking more into it but I don't think I'm nearly familiar enough with your test harness to be able to figure out why the the git-lfs process isn't able to access that control socket.

OK, thanks for all the info! I'll try taking a look at this over the next while to see if I can reproduce that and also craft a test to demonstrate the fix in this PR.

@chrisd8088
Copy link
Member

Just FYI, I haven't forgotten about this PR; I've been busy with $DAYJOB and haven't managed to finish getting the CI suite back into good shape, although that work is nearly ready. Once I've got that done I'll look at this PR next.

Thank you for you patience and your contribution!

@ConcurrentCrab
Copy link
Contributor Author

It's cool, I knew you wouldn't forget :)

@6543
Copy link

6543 commented Sep 12, 2024

Any update?

@chrisd8088
Copy link
Member

chrisd8088 commented Sep 12, 2024

Any update?

As I wrote above, I've been working on getting our CI suite back to 100% success, and I have tracked down all the bugs and fixed them, but not yet had time to create a series of PRs for those changes. I hope to have that done in the next couple of weeks. Once those are merged, we can run this PR and several others against those repaired tests, so we'll have valid CI results for Windows and macOS (these are both broken right now in CI, for different reasons).

I appreciate everyone's patience while we get our CI jobs into good shape again.

@chrisd8088
Copy link
Member

Just a quick update on the state of our CI suite: PR #5866 fixed the macOS jobs, PR #5868 should fix the Windows jobs, and PR #5872 will upgrade us to Go 1.23, which should fix the Build with specific Go jobs. With any luck, we'll be back to 100% success after those are merged, and which point we can start pulling in other contributions again.

Thanks very much for your patience!

Copy link
Member

@chrisd8088 chrisd8088 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the bug report and patch! This is definitely a valuable fix.

FWIW, I spent some time this weekend refreshing my memory of the SSH connection code, and turned up a number of issues aside from this one. I've written up the most significant bug in #5880; it tracks with what you described in #5816 (comment), I think.

The upshot, though, is that it's not immediately possible (so far as I could tell) to write a nice test which would demonstrate the problem this PR addresses.

Since the problem is clear, though, and the fix looks entirely correct, I'll just merge this PR in as it stands and we'll try to craft a more thorough multiple-object batch SSH transfer test in subsequent PRs.

@chrisd8088 chrisd8088 merged commit 2daf828 into git-lfs:main Oct 7, 2024
@6543
Copy link

6543 commented Oct 7, 2024

thanks for the fix! & thanks for the maintenance work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants