-
Notifications
You must be signed in to change notification settings - Fork 2.2k
ssh: fix connection creation "leaking" connections #5816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Commit 44b8801 introduced a bug where connections were being created on-demand but were not being stored in the connections slice. When trying to end connections at the end of a transfer, there were no handles to them. As a result, all but the primary connection of the session were hung with no "quit" message. LFS clients were simply hung until idle timeout kicked in.
|
Hey, thanks for diagnosing the problem and contributing a fix! This looks obviously correct to me. Once our CI is back in good shape I'll run the test suite. I think your suggestion of adding some tests to validate this behaviour is a very good one. I don't want to impose on your time, but since you're working in this area, if you have any ideas or notes on how you simulated this problem, those would be very helpful. Ideally we'd take an existing test like the |
|
@chrisd8088 the test case you linked looks like it was close to catching this. The trick is, multiplexing only kicks in when there are >1 LFS objects to upload (since the flow for a single file has to be a linear batch -> upload -> verify). I found this problem while trying uploads of multiple large files, but it's reliably reproducible even with just 2 tiny test files. Just adding multiple files to the test commit would be enough as a test case. Unfortunately both known implementations of I tried adding a copy of that test, only with multiple LFS files, but that seems to fail in a new, entirely unexpected way: Since it mentioned a socket, I suspected this is another multiplexing issue (as it uses a control socket to coordinate between the workers). And sure enough, adding a: I tried looking more into it but I don't think I'm nearly familiar enough with your test harness to be able to figure out why the the |
OK, thanks for all the info! I'll try taking a look at this over the next while to see if I can reproduce that and also craft a test to demonstrate the fix in this PR. |
|
Just FYI, I haven't forgotten about this PR; I've been busy with Thank you for you patience and your contribution! |
|
It's cool, I knew you wouldn't forget :) |
|
Any update? |
As I wrote above, I've been working on getting our CI suite back to 100% success, and I have tracked down all the bugs and fixed them, but not yet had time to create a series of PRs for those changes. I hope to have that done in the next couple of weeks. Once those are merged, we can run this PR and several others against those repaired tests, so we'll have valid CI results for Windows and macOS (these are both broken right now in CI, for different reasons). I appreciate everyone's patience while we get our CI jobs into good shape again. |
|
Just a quick update on the state of our CI suite: PR #5866 fixed the macOS jobs, PR #5868 should fix the Windows jobs, and PR #5872 will upgrade us to Go 1.23, which should fix the Thanks very much for your patience! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for the bug report and patch! This is definitely a valuable fix.
FWIW, I spent some time this weekend refreshing my memory of the SSH connection code, and turned up a number of issues aside from this one. I've written up the most significant bug in #5880; it tracks with what you described in #5816 (comment), I think.
The upshot, though, is that it's not immediately possible (so far as I could tell) to write a nice test which would demonstrate the problem this PR addresses.
Since the problem is clear, though, and the fix looks entirely correct, I'll just merge this PR in as it stands and we'll try to craft a more thorough multiple-object batch SSH transfer test in subsequent PRs.
|
thanks for the fix! & thanks for the maintenance work! |
Commit 44b8801 introduced a bug where connections were being created on-demand but were not being stored in the connections slice.
When trying to end connections at the end of a transfer, there were no handles to them. As a result, all but the primary connection of the session were hung with no "quit" message. LFS clients (and servers) were simply hung until idle timeout kicked in.
Found this while implementing pure SSH protocol for Gitea (go-gitea/gitea#17554).
This seems like important functionality, so I'd recommend adding some sort of pure SSH multiplexed tests in the suite as well.