[net, ktcp] Enhancements: Speed, retransmits#68

Merged
Mellvik merged 1 commit into master from lance on Jul 15, 2024

Conversation

@Mellvik
Owner

@Mellvik Mellvik commented Jul 10, 2024

This PR fixes the ktcp performance problem described in #67 by increasing the allowed max send window. The problem popped up only on relatively fast systems and only if there was a router between the TLVC system and the peer. The fix eliminates a wait/wakeup cycle and allows the TLVC sender to send packets more or less continuously, reaching the same level of performance as if the hosts were on the same network segment.

Also fixed is an old ktcp problem which caused lots of unnecessary retransmissions when sending to a slow host. The retransmit timeout calculation simply didn't work in such settings. The fix is to check the round-trip time before setting the first retransmit timeout: if the peer system is found to be very slow or heavily loaded, the initial timeout is doubled.

Included:

  • The delay kludge previously added to the ee16 and the el3 drivers to get around the slow-send problem has been removed.
  • Shortening the read_select wait timeout from 100ms to 50ms did speed things up, but not as much as the selected solution, increasing the allowed retransmit window. Also, the shortened wait timeout seems to create deadlocks in certain situations. A comment about this has been added to the af_inet.c file.

@ghaerr

ghaerr commented Jul 11, 2024

Nice work solving these problems @Mellvik! Fixes look good to me. I am a bit interested to know whether ktcp shows any signs of running out of memory with the expanded send window and max retransmit queue sizes, given the number of simultaneous connections you're testing with.

@Mellvik
Owner Author

Mellvik commented Jul 11, 2024

Thanks, @ghaerr.
The answer is no, but here's the thing: testing (really, pushing) this requires more processes than the system has available. Which arguably means that ktcp is fine, since another resource runs out before it reaches its limits. Not good enough though - I'm adding the dynamic task array from ELKS as we speak.

Two heavy outgoing file transfers now run fine at great speed - same segment and through the router, even when disturbed by other activities. The litmus test that really pushes ktcp and the retransmit implementation is 3 incoming telnet sessions running hd on some file - concurrently. Add floppy IO to the equation and we get some rather unpredictable (and big) delays. An unrealistic load, but this is as much about robustness as it is about capability.

IOW - the jury is still out ...

@Mellvik
Owner Author

Mellvik commented Jul 11, 2024

@ghaerr,
Turns out I was too quick to declare the retransmit problem between slow systems fixed. It is not, and it runs deeper than just the RTT/RTO values. A classic 'state' problem - works in this case, but not in that.

As to the dynamic task array: I ported it and it seems to work fine, but as we're running low on memory, processes start to hang. The ktcp problem prevented me from diving into this one right away, but just out of curiosity - and before diving in: How much has this feature been tested? My setting is 20 tasks by the way.

@ghaerr

ghaerr commented Jul 11, 2024

I was too fast on declaring the retransmit problem between slow systems as fixed.

Oh geez, that's too bad. It looks as though you're on to something though. This first fix only doubles the initial RTO in one case, perhaps RTO needs to be calculated over a longer time period. I don't have enough data on how RTO and retransmit works to suggest anything intelligent at this point.

How much has this feature been tested? My setting is 20 tasks by the way.

I tested the feature very heavily and believe it works quite well, as I was careful not to change anything structurally other than using a pointer (struct task_struct *task) rather than a statically declared array. However, the full change may have occurred over more than one commit, so check that. The other big issue is that the task array is still allocated out of the kernel data segment (heap), and each task struct is large: 700+ bytes including stack, etc. To more smoothly determine whether dynamic tasks work on TLVC, I would suggest starting at 16, confirming no problems, then inching upwards.

Moving the task array into far memory and/or allocating different stack sizes per task turns out to be extremely complicated, so unfortunately tasks remain a heavy user of kernel heap. I added dynamic inodes later in an attempt at allowing a smaller inode store to increase kernel memory for tasks. Ultimately, all of this points to the huge need to get all driver buffers out of the kernel heap unless absolutely necessary. A rule of thumb would be one kernel buffer is about the same as an additional task in terms of data usage.

@Mellvik
Owner Author

Mellvik commented Jul 12, 2024

OK, I'm beginning to understand how ktcp does its timeout math and why it doesn't match reality when the peer is (very) slow:

  1. The RTT (which is used to calculate the RTO) is initially set high when the connection is opened and the CB allocated. Then, as packets flow, the RTT and RTO are adjusted down to what seems correct for the connection. In most cases the RTT ends up at 1 (as in 1/16th of a second) on a LAN.
  2. The calculation is done by looking at the time a packet was sent and the time we discover it has been acked. An approximation but fair enough. It delivers a reasonable average RTT.
  3. What happens when the peer is really slow is that packets build up in the retransmit buffer. These packets have obviously not been acked, and thus their RTT has not been calculated - and not used. When the buildup in the retransmit buffer reaches its max per CB, ktcp waits and the buffer eventually drains.
  4. Seemingly all good, but there are actually two problems. First, this works only if the RTO - the retransmit timeout - is high enough to prevent retransmits from kicking in before the peer gets the ACKs going. Such a high RTO is what we have initially, right after creating the connection and starting with a fresh CB. So: high RTO (i.e. no retransmits) and this works. But the RTO comes down fast, along with the RTT, as described above. (It's actually quite interesting to watch how it comes down and adjusts itself.) Second, in the 'normalized' setting, with RTT and RTO established (and low), the RTO will time out and the sender will dump its entire retransmit buffer at the poor peer, who hasn't had time to ACK the first ones yet. This works too, but becomes really slow - not to mention the mess on the wire.

IOW - the RTT/RTO mechanism simply doesn't lend itself to this particular scenario, and we need to come up with something else: either a different mechanism or an add-on, the latter being preferred until we see how this can work in practice. I found out from commented-out code in ktcp that I was en route to a fix about a year ago when I first discovered this problem, but got distracted - and I didn't understand it completely then. Still, I added a per-CB counter to keep track of how many packets the CB has in the retransmit buffer at any point - like tcp_timeruse, except per CB instead of total. Using this, a buildup of unacked packets can trigger a temporary increase in RTO for the CB.

tcp_timeruse is the variable I have been using while tracing this. It works like the counter above as long as only one connection is active. Thus a simple next step is to monitor tcp_timeruse instead, and add it - possibly with some weighting factor - to the RTO, the retransmit timeout. Alternatively, an RTO modification/scaling factor could be added in tcpdev.c at the point where TCP_SEND_WINDOW_MAX is checked. It occurs to me as I'm writing that this is the simplest starting point.

Anyway, now that the picture is clear(er), it is possible to think about possible solutions.

@ghaerr

ghaerr commented Jul 12, 2024

@Mellvik, super writeup and explanation on RTT/RTO. I don't think I ever fully understood it myself, but this helps.

To help me understand, is this PR fix trying to solve multiple problems (e.g. increasing max send window to allow for more time for ACKs so that efficiency is increased, and also trying a new RTT/RTO algorithm because of our previous multiple-ACK problem/hack), or something else?

It seems the initial issue was that you thought the send window size increase could solve the efficiency issue and allow removal of the read delay kludge. Did that get partially solved, or are you finding that somehow retransmits are getting involved in all of this, thus leading to the RTT/RTO calc issues?

@Mellvik
Owner Author

Mellvik commented Jul 12, 2024

Actually, @ghaerr - with the upcoming update to the PR, both are fixed. It is purely incidental that the two fixes end up on virtually adjacent lines in the code.

The original problem was fixed in the PR as is, and it's truly a relief to have found and fixed (testing now) the second one - finally. The fix needs testing in a real high-latency environment to make sure nothing is broken. Planned for tomorrow.

Then back to the dynamic task array testing. Thanks.

@ghaerr

ghaerr commented Jul 12, 2024

Still, I added a per-CB counter to keep track of how many packets this CB has in the retransmit buffer at any point. Like tcp_timeruse except per CB instead of total. Using this, a buildup of unacked packets can trigger a temporary increase in RTO for the CB.

This sounds like a potentially good idea. If the problem is that the retransmit buffer is unloading everything all at once when a single RTO expires, perhaps the first retrans packet should be sent, and then either the remaining RTT/RTOs adjusted and/or the send window itself decreased so that the retrans packets go out more slowly. That is, when the machine-to-machine communication "clogs up" and effectively halts, awaiting a retransmit (or provoking other problems with a faked double ACK), the sender might want to start sending quite differently than it otherwise would have, by starting quite slow with a single packet, then speeding up slowly. This could possibly be done using RTT/RTO fixups like you're thinking but applying them to all the packets in the retransmit queue as well, and adjusting their first/next retrans start time.

I hope I'm making sense here, I probably need to dive in to help more specifically. I sure wish there was an easy way to duplicate this using emulators, as of course localhost connections don't even hit the NIC cards.

@Mellvik
Owner Author

Mellvik commented Jul 13, 2024

Thanks @ghaerr - you are making sense.
I'm experimenting with various mechanisms right now, and sometimes discovering behaviour I still don't quite understand. More later.

-M

@Mellvik
Owner Author

Mellvik commented Jul 13, 2024

Still, I added a per-CB counter to keep track of how many packets this CB has in the retransmit buffer at any point. Like tcp_timeruse except per CB instead of total. Using this, a buildup of unacked packets can trigger a temporary increase in RTO for the CB.

This sounds like a potentially good idea. If the problem is that the retransmit buffer is unloading everything all at once when a single RTO expires, perhaps the first retrans packet should be sent, and then either the remaining RTT/RTOs adjusted and/or the send window itself decreased so that the retrans packets go out more slowly. That is, when the machine-to-machine communication "clogs up" and effectively halts, awaiting a retransmit (or provoking other problems with a faked double ACK), the sender might want to start sending quite differently than it otherwise would have, by starting quite slow with a single packet, then speeding up slowly. This could possibly be done using RTT/RTO fixups like you're thinking but applying them to all the packets in the retransmit queue as well, and adjusting their first/next retrans start time.

I hope I'm making sense here, I probably need to dive in to help more specifically. I sure wish there was an easy way to duplicate this using emulators, as of course localhost connections don't even hit the NIC cards.

You're definitely on to something, @ghaerr. Got me thinking through the whole thing again. It's tempting, now that the problem is (more) understood, to throw in a fix and move on - only to find that the not-so-well-understood consequences bite back later.

I've rigged my test environment to have a machine in Amsterdam log into the local TLVC system to get a picture of 'normal' jitter, delays and occasional retransmits. ktcp handles it well, and this becomes - to me - a baseline, as in 'if it ain't broken, don't fix it'.

Btw - per the above, ktcp does not really dump all the packets in the retrans buffer on the recipient; it just looks that way. They all time out, because the timeout is too short. So the idea of letting the first retrans go and then adjusting the timeout of the remaining packets is in fact very viable. More about that later.

Here's the thing, the challenge: we're dealing with a special case, one that breaks the rules of 'normality' sufficiently to create real problems. How do we (1) recognize this particular situation and (2) handle it?

Before doing more experimentation (not that it has been in vain - I've learned a lot), those two problems must be assessed, the first one by carefully studying the logs/traces (again). I'll have something to that effect tomorrow.
