
[ktcp] New fix for minimizing retransmits to slow peers#69

Merged
Mellvik merged 3 commits into master from misc
Jul 22, 2024

Conversation

@Mellvik
Owner

@Mellvik Mellvik commented Jul 15, 2024

A new fix for the ktcp problem discussed at length in #68: excessive retransmits when talking to very slow systems (like TLVC talking to TLVC or ELKS).

The scenario is this: A very slow system, say an XT or a 286 AT, telnets to another TLVC/ELKS host and does something that creates a lot of output, like hd /bin/ps for example. The server will deliver reliably but slowly because the ktcp retransmit timeouts don't work with such slow clients. A similar pattern may be observed when sending files to such a slow system (ftp).

After understanding the details of the problem (again, see #68), and testing several fixes that worked but had drawbacks, such as not being able to adapt to the degree of slowness on the client side, here's a solution that seems to satisfy reasonable demands, including not breaking anything(!).

The primary challenge was to find a reliable metric for the (network) performance of the client. The solution turned out to be hiding right under my nose: When the send window closes (the retransmit buffer is full), ktcp will 'loop' that particular connection until the window opens up again. Counting the number of times it loops turns out to be the metric we're looking for.

The fix monitors the loop and establishes a performance index for the client using the max # of consecutive loops recorded. It usually takes 2-3 attempts to get there and a few retransmits will be sent in the meantime, but the cost seems acceptable given the dynamics the solution brings to the table. For example, a 286 AT will get an index of 6 give or take 1, and an XT (4.77MHz) will end up between 13 and 15. Most systems will have a speed index of 0 - or, if on a long or slow connection, 1, which does not affect normal operation. The speed factor is then used to increase the retransmit timeout for the connection without touching the regular RTT and RTO values.

The price for the fix is 2 additional bytes in the CB struct and a little bit of code.
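The loop-counting mechanism described above could be sketched roughly like this - all names and the timeout scaling are illustrative assumptions, not the actual ktcp code (note how two one-byte counters match the "2 additional bytes in the CB struct"):

```c
#include <stdint.h>

/* Hypothetical sketch of the speed-index idea; names are invented. */
struct cb_speed {
    uint8_t wait_loops;   /* consecutive loops while send window closed */
    uint8_t speed_index;  /* max consecutive loops seen = peer slowness */
};

/* Called each time ktcp loops waiting for the send window to open. */
void window_closed_loop(struct cb_speed *cb)
{
    if (++cb->wait_loops > cb->speed_index)
        cb->speed_index = cb->wait_loops;   /* new maximum: slower peer */
}

/* Called when the window opens up again. */
void window_opened(struct cb_speed *cb)
{
    cb->wait_loops = 0;
}

/* Scale the retransmit threshold without touching RTT/RTO themselves.
 * The linear scaling here is an assumption for illustration. */
long retrans_timeout(long rto, const struct cb_speed *cb)
{
    return rto + rto * cb->speed_index;
}
```

A fast peer keeps speed_index at 0 and gets the plain RTO; a peer that repeatedly closes the window accumulates a higher index and a proportionally longer retransmit threshold.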

@ghaerr

ghaerr commented Jul 15, 2024

Wow, this sounds like a great fix, especially since it doesn't really change anything until the forced kernel wait has occurred.
Super glad to see an intelligent send back-off algorithm for a problem that has occurred for a long time. So does this and #68 fix all your present concerns with ktcp-to-ktcp transfers? I'll plan on adding the fixes to ELKS after your testing, in advance of the NIC compatibility work, with your go-ahead.

I notice though that the INITIAL_TIMEOUT_RTT is changed from 1 second to 1/4 second... does this help get things moving initially? I seem to remember having had to increase it to 1s, possibly for this same behavior, quite a while back. I'm interested in your comments on the initial timeout.

@Mellvik
Owner Author

Mellvik commented Jul 15, 2024

Wow, this sounds like a great fix, especially since it doesn't really change anything until the forced kernel wait has occurred. Super glad to see an intelligent send back-off algorithm for a problem that has occurred for a long time. So does this and #68 fix all your present concerns with ktcp-to-ktcp transfers?

Thanks, @ghaerr.
Yes, this covers the high-priority issues. There are some adjustments needed for the slowest machines, including faster convergence, I'm still experimenting with that. Also, there is the double ack issue, which I'd love to eliminate. If I get around to it I'll have a quick look tomorrow - just to take advantage of the 'inertia' while the stuff is fresh in memory.

I notice though that the INITIAL_TIMEOUT_RTT is changed from 1 second to 1/4 second... does this help get things moving initially? I seem to remember having had to increase it to 1s, possibly for this same behavior, quite a while back. I'm interested in your comments on the initial timeout.

Thanks for noticing - and mentioning. It has been part of the experimentation to improve convergence speed. It's definitely going back to 1s.

@Mellvik
Owner Author

Mellvik commented Jul 17, 2024

An update [developer notes really] pt I:

As I removed the (now) small debug printfs from ktcp, the behaviour for the slowest systems changed. The printf delays apparently allow more ACKs to get through. Anyway, the consequence of removing them was that convergence - the time it takes to determine the speed of the remote system - is slower. In turn this means many more retransmits are sent before the speed index reaches the correct level. The difference between with and without the printf in add_for_retrans:

    fprintf(stderr, "rtt/rto/tu: %ld/%ld/%d\n", cb->rtt, n->rto, tcp_timeruse);

which is quite small, is 5 retransmits with the printf, >50 without, sometimes 100s. So the mechanism for detecting peer speed is far from flawless. Initially it seemed strange that this printf would have such a big effect, until I realized that add_for_retrans is called before ip_sendpacket, which means the printf represents a (small) delay BEFORE sending a packet, allowing the peer a little more time to process. Again what we're seeing is how critical the rhythm of the flow is for slow systems.

Anyway, not good enough. Off the bat it seemed like counting consecutive retransmits would make a more reliable metric than consecutive waits-for-buffer-space calls. But before even trying that, I moved the add_for_retrans call to after the ip_sendpacket call. And voila - the picture changed completely. In fact the behaviour became almost satisfactory for the XT peer, and perfect for everything else. Also, the 'speed-index' is now generally lower, meaning the delay added to the retrans threshold for slow systems is also lower. Doesn't matter much but it's a positive thing.

The 'almost' for XT systems is that in some cases the speed index ends up just below what it should have been for the retransmits to be eliminated completely after establishing the index ('convergence'). This is ONLY for the very slow systems. We end up with retransmits at regular intervals, like every few seconds - not all that bad, but definitely not good. There is another problem too, although not related to this. If the peer, in addition to being slow, runs on floppies or has mounted floppies AND has sync=?? set in bootopts, the system will occasionally appear almost dead for seconds (even if running async floppy), and the speed index will skyrocket. Similarly, on such systems, multiple active network connections will have adverse performance effects when running concurrently. Which initially qualifies for just a shrug: so what, a very high speed index probably doesn't hurt much anyway, does it?

Indeed it does, and I don't (yet) understand why. A very high speed index (indicating a very slow system: normally 8-9 for a 12.5MHz 386, 15-18 for XT systems) turns out to cause frequent complete stops in sending, ktcp waiting for retrans buffer space to become available - until a retransmit finally gets it going again, now after several seconds because of the high delay factor. IOW - ktcp gets stuck. It seems we have a situation in which ACKs don't get read or processed until something (possibly unrelated traffic, in this case a retransmit) gets the process going again. Far-fetched - except this sounds a lot like the situation eventually resolved by the double ack back in the day, and it may actually be the client side that holds back (I haven't looked at tcpdump yet, that would need a change in my network setup). If this is the case, we may accidentally have a real handle on that problem too. It may also be that different timeouts accidentally work together and create havoc. Need to look at that later.

In the meanwhile, a cap on the speed index is required to avoid this situation, which I added. We're getting there.
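The cap might look like this - SPEED_INDEX_MAX is an invented name and value; the real limit is a tuning choice:

```c
#include <stdint.h>

/* Illustrative cap on the speed index, as described above. The value
 * 12 is an assumption, not the one used in the actual fix. */
#define SPEED_INDEX_MAX 12

uint8_t cap_speed_index(uint8_t idx)
{
    return (idx > SPEED_INDEX_MAX) ? SPEED_INDEX_MAX : idx;
}
```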

To be continued.

@ghaerr

ghaerr commented Jul 17, 2024

Wow @Mellvik, you're definitely in the middle of some complicated stuff. In general, I agree with the approach you're taking. I haven't looked to see if there are any implications in moving add_to_retrans to after ip_sendpacket, but sounds like that sends the packet first, then adds to retrans list. One would think it would always be better to send first, so not sure why the retrans add was coded that way in the first place.

It appears the longstanding problem of ktcp getting "stuck" is still here... no doubt any solution will likely require a timeout-kickstart in order to get going again in bad sync loss; the question is how long to wait? Which leads to my next thoughts:

So the mechanism for detecting peer speed is far from flawless.

Actually, I believe you're on to a potentially very good solution, since the fix (with the printf) initially seemed to completely fix the problem. I'm thinking that we move away from "counting" loops inside ktcp (which is still a good idea and the basis for the fix) but then move to a real wall time timestamp being compared, rather than adding a somewhat arbitrary loop count times a multiplier to the original timestamp. Perhaps this is essentially the same thing, but the algorithm selecting "what to do" on a retrans may need to be significantly expanded based on the real wall time passed when examining retrans packets, and possibly grouping them for different actions.

The same examination of real time passed when calculating a timeout-kickstart could also be done: the kickstart timeout could be higher or lower based on the last averaged throughput speed, so that fast transfers restart fast, whilst slower timeout later, fitting the systems involved.
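A wall-time based check of the kind suggested here could reduce to something like this sketch (all names are assumptions, not ktcp code): compare real elapsed time against a per-connection threshold instead of adding a loop count times a multiplier to the stored timestamp.

```c
#include <stdint.h>

typedef uint32_t timeq_t;   /* assumed ktcp-style time unit */

/* Decide on retransmission from real elapsed wall time. peer_delay is
 * a per-connection allowance for a slow peer; unsigned subtraction
 * handles timer wraparound. Illustrative only. */
int should_retransmit(timeq_t now, timeq_t sent_at,
                      timeq_t rto, timeq_t peer_delay)
{
    return (timeq_t)(now - sent_at) >= rto + peer_delay;
}
```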

Just some thoughts. The ktcp problem of getting stuck has been around since the beginning. We may need to read up on exactly how send window management is supposed to work, other than the current way of just stopping, and waiting for ACKs or a timeout with too little else. IIRC, there were some important articles by Van Jacobson(?) about sender backoff and RTT calculation back in the day that avoided too-quick changes. I can try to find those, they are mentioned in other TCP sources as well.

the system will occasionally appear almost dead for seconds (even if running async floppy), and the speed-index will skyrocket.

This points to possibly having to keep an "average" calculation, rather than a direct calculation only adding to the last summed value. I ran into this when computing the CPU % usage for ps display - one can't use just the last time tick's CPU usage or the value shown is quite inaccurate - instead an average is calculated using a decreasing percentage of the older values while a new value is added. This ended up having to use fixed point math; let's hope VJ's algorithm has a simpler way.
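For reference, the fixed-point averaging trick in Van Jacobson's RTT estimator is of exactly this decreasing-weight kind: keep the smoothed value scaled by 8 so integer math preserves the fraction. A minimal sketch (not ktcp code):

```c
#include <stdint.h>

static int32_t srtt8;   /* smoothed value, kept scaled by 8 */

/* srtt += (sample - srtt) / 8, done in scaled integer arithmetic so
 * the 1/8 gain survives without floating point. */
void ewma_update(int32_t sample)
{
    if (srtt8 == 0)
        srtt8 = sample << 3;            /* first sample initializes */
    else
        srtt8 += sample - (srtt8 >> 3);
}

int32_t ewma_value(void)
{
    return srtt8 >> 3;                  /* unscaled smoothed value */
}
```

A single wild sample (e.g. the floppy-sync stall) then moves the average by only an eighth of the spike, instead of letting the index skyrocket.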

@ghaerr

ghaerr commented Jul 17, 2024

[screenshot attachment: Screen Shot 2024-07-17 at 9.44.09 AM]

@Mellvik
Owner Author

Mellvik commented Jul 18, 2024

Thanks @ghaerr, this is very useful - the discussion itself and its forcing an understandable description of the challenges and potential solutions. It incites double checking, seeing more alternatives, filtering stupidities - you know the drill.

So the mechanism for detecting peer speed is far from flawless.

Actually, I believe you're on to a potentially very good solution, since the fix (with the printf) initially seemed to completely fix the problem. I'm thinking that we move away from "counting" loops inside ktcp (which is still a good idea and the basis for the fix) but then move to a real wall time timestamp being compared, rather than adding a somewhat arbitrary loop count times a multiplier to the original timestamp.

This is interesting. I was on a somewhat similar tangent the other day, not caring much for the inaccuracy introduced by checking wall time only once per ktcp loop. However, being unsure about the cost of calling timer_get_time (gettimeofday) frequently, I let it go. What's your take on that? By the same token, I'm tempted to change the timing resolution to 1/32s, because 1/16s, 62ms, means everything below about 90ms is really 1 unit. That's not very useful when the roundtrip from my TLVC system to Amsterdam and back is below 90 most of the time, which means that the ktcp rtt ends up being the same for most of Europe as for the local network. Among other things, this low resolution clock will cause many packets in the retrans buffer to have the same timeout and thus get dumped at once. Which in turn means that my turning around the retrans queue may be more useful than initially thought. Also, better clock resolution is useful for debugging, possibly tuning.
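The resolution argument can be seen in a small conversion sketch (illustrative, assuming gettimeofday-style input): with 1/16s ticks (62.5ms each) nearby RTTs collapse to the same unit, while a finer tick keeps them distinct.

```c
#include <sys/time.h>

/* Convert a timeval to coarse clock ticks (truncating). With
 * ticks_per_sec = 16 (62.5 ms units) most LAN and short-haul RTTs
 * land in 0 or 1 ticks; with 100 (10 ms units) they stay apart. */
long to_ticks(const struct timeval *tv, long ticks_per_sec)
{
    return tv->tv_sec * ticks_per_sec
         + tv->tv_usec / (1000000L / ticks_per_sec);
}
```

A 90ms roundtrip is 1 tick at 1/16s resolution but 9 ticks at 1/100s, so retrans timestamps that would all collide at the coarse rate become distinguishable.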

Perhaps this is essentially the same thing, but the algorithm selecting "what to do" on a retrans may need to be significantly expanded baed on the real wall time passed when examining retrans packets, and possibly grouping them for different actions.

That has to be looked at, evaluating complexity against value, remembering that the problem at hand - ignoring the old 'hang stuff' - is only with XT type systems.

This brings up this other itch that keeps growing on me. Why does ktcp work just fine with anything except ktcp? TCP itself is superb at adapting to speeds and jitter and packet loss and all kinds of irregularities, and does well with all these full scale (whatever that means) implementations, but not 'alone'. I have this feeling that there is something we're missing. Just a minor but also fundamental thing that pops up in certain circumstances and hits us in the back of the head.
I'm inclined to take on the drudgery of analysing packet traces, compare them and find out what's really going on. In fact I just set up another switch to mirror the traffic to/from the slowest system in order to do that. To completely understand what's going on before we eventually head into slow start and/or other sophisticated algorithms to alleviate the very-slow-system problem. BTW - after having given it some thought, the VJ slow start may actually be a good idea (I'm quite familiar with it from the old days). We could actually use the retrans mechanism to do it: while in slow start mode, add the packets to the retrans buffer with the appropriate delay and NOT put them on the wire the first time around. But more about that when/if we get there. It's in some ways refreshing to read RFC 2001 (which has been superseded, but remains (to me) the clearer source), which covers all four of the VJ algorithms.

Actually - in our case, and in particular related to the very slow systems, there are several situations in which slow start would come in handy in addition to the 'bootstrap' of a connection which my work has been concentrating on so far, like when the NIC is overrun and discards packets, floppy syncing as was mentioned before, even disk sync on a not-so-slow system with thousands of XMS buffers.
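For reference, the RFC 2001 slow start / congestion avoidance bookkeeping being discussed is small; a minimal model in segments (not ktcp code, values illustrative):

```c
#include <stdint.h>

/* RFC 2001-style congestion state, counted in segments. */
struct cc {
    uint16_t cwnd;      /* congestion window */
    uint16_t ssthresh;  /* slow start threshold */
};

void cc_init(struct cc *c)
{
    c->cwnd = 1;        /* slow start begins at one segment */
    c->ssthresh = 64;   /* arbitrary large initial threshold */
}

/* Each ACK during slow start opens the window by one segment, which
 * doubles it per round trip; past ssthresh, congestion avoidance
 * would add only one segment per RTT (omitted here). */
void cc_on_ack(struct cc *c)
{
    if (c->cwnd < c->ssthresh)
        c->cwnd++;
}

/* On a retransmit timeout: remember half the window, restart slowly. */
void cc_on_timeout(struct cc *c)
{
    c->ssthresh = c->cwnd / 2;
    if (c->ssthresh < 2)
        c->ssthresh = 2;
    c->cwnd = 1;
}
```

The cwnd part is what would adapt automatically to an XT-class peer: its ACKs arrive slowly, so the window simply never opens far, without any explicit speed index.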

The same examination of real time passed when calculating a timeout-kickstart could also be done: the kickstart timeout could be higher or lower based on the last averaged throughput speed, so that fast transfers restart fast, whilst slower timeout later, fitting the systems involved.

True. And it may seem that we now have a setting in which the problem is repeatable to the extent it may actually be debugged – a good time to finally nail this one.

Just some thoughts. The ktcp problem of getting stuck has been around since the beginning. We may need to read up on exactly how send window management is supposed to work, other than the current way of just stopping, and waiting for ACKs or a timeout with too little else.

Indeed! The RFC mentioned above seems like a great starting point. Let me get back to that when I have bored myself with some endless packet traces :-)

@ghaerr

ghaerr commented Jul 20, 2024

I'm tempted to change the timing resolution to 1/32s, because 1/16, 62ms, means everything below about 90ms is really 1 unit.

You make a good point to change the timing resolution, perhaps to match the HW clock interrupt at 10ms, instead of 1/32 seconds. That should bring in the best resolution possible, and it seems there's no real good reason to use 1/4, 1/8 etc. divisions. I haven't checked the ktcp code but gettimeofday's resolution also matches HZ (= 1/100 sec) so no conversions would be required.

Among other things, this low resolution clock will cause many packets in the retrans buffer to have the same timeout and thus getting dumped at once.

It would be interesting to know whether the retrans buffer timestamps would be the same with a 1/100s timer... I don't have ethernet specs handy, nor am I up on how long a typical ktcp socket-read/process/network-write cycle takes. How many packets can be sent per second, given your current testing throughput?

being unsure about the cost of calling timer_get_time (gettimeofday) frequently, I let it go. What's your take on that?

I would treat making a single system call per ktcp cycle as very fast. The system call overhead is almost identical to the hw interrupt processing overhead (both go through the same irqit), and of course 100 timer calls/second count as minuscule overhead.

That has to be looked at, evaluating complexity against value, remembering that the problem at hand - ignoring the old 'hang stuff' - is only with XT type systems.

True, but I would say it would be good to get an accurate reading on how much time is actually being spent performing basic ktcp cycles (e.g. read packet/process, input/process/write packet, etc). This can then be compared with max ethernet layer throughput speed to get another reading on value vs complexity and what a "max efficiency" would look like on real mode PCs.

Why does ktcp work just fine with anything except ktcp? TCP itself is super at adapting to speeds and jitter and packet loss and all kinds of irregularities.

That's a great question! I believe that ktcp is likely deficient in a core area that's never been addressed. Coming to an understanding of how real (Linux) TCP works in this area would really help. I'm almost certain that the reason ktcp<->Linux works well is because of Linux TCP's better implementation, but I don't actually know what we're missing.

I have this feeling that there is something we're missing. Just a minor but also fundamental thing that pops up in certain circumstances and hits us in the back of the head.

Exactly. But when I have thought about this before (and fixed some big problems), I realized that some issues are tied up in the basic paradigm of having ktcp run in user mode, with the kernel having to schedule/swap requests, both incoming and outgoing, between applications and yet another application, ktcp itself. For instance, ktcp can never be sent a request that actually gets queued, that doesn't generate an immediate reply. (I'm not talking about network requests queued into CBs, but rather direct kernel read/write requests to/from ktcp - these must return immediately in order to allow another application to run, specifically another network application that might need to run. Ktcp can't sleep the calling process, the kernel has to). Sometimes I think that, with the current ability to run medium model (far) code, it might be better to move TCP into the kernel. But all this is lots of work and for another day.

I also wonder whether the issue(s) with TCP would go away with more buffering, or whether we're dealing with a fundamental scheduling issue between applications or just ktcp's basic processing cycle. If the network congestion happens with only a single process on TLVC running, then the bigger issue for now is likely the old problem of not being able to handle network resyncs, which IMO has never actually worked as well as it needs to.

in our case, and in particular related to the very slow systems, there are several situations in which slow start would come in handy

Totally agree.

And it may seem that we now have a setting in which the problem is repeatable to the extent it may actually be debugged – a good time to finally nail this one.

Any thoughts on whether this problem can be duplicated within an emulator, rather than having to use real hardware? That would allow me to jump in much more easily to help.

@Mellvik
Owner Author

Mellvik commented Jul 20, 2024

Thank you @ghaerr, now we're really getting somewhere. This is good.

I'm tempted to change the timing resolution to 1/32s, because 1/16, 62ms, means everything below about 90ms is really 1 unit.

You make a good point to change the timing resolution, perhaps to match the HW clock interrupt at 100ms, instead of 1/32 seconds. That should bring in the best resolution possible, and it seems there's no real good reason to use 1/4, 1/8 etc divisions.

It's the old-fashioned attachment to powers of 2, thinking they are somehow faster. Thinking about it, I can't find any relevant cases where that is indeed the case. Let's go for 100; jiffies was my model all along.

Among other things, this low resolution clock will cause many packets in the retrans buffer to have the same timeout and thus getting dumped at once.

It would be interesting to know whether the retrans buffer timestamps would be the same with a 1/100s timer... I don't have ethernet specs handy, nor am up on how long a typical ktcp socket-read/process/network-write cycle takes. How many packets can be sent per second, given your current testing throughput?

I can read that off of the tcpdump logs. [COMING]

That has to be looked at, evaluating complexity against value, remembering that the problem at hand - ignoring the old 'hang stuff' - is only with XT type systems.

True, but I would say it would be good to get an accurate reading on how much time is actually being spent performing basic ktcp cycles (e.g. read packet/process, input/process/write packet, etc). This can then be compared with max ethernet layer throughput speed to get another reading on value vs complexity and what a "max efficiency" would look like on real mode PCs.

I agree. Given this discussion and the various issues encountered working on this issue (and even before I've concluded the packet tracing exercise), I'm leaning towards ripping out all the new counting stuff and adding slow start + congestion control. I had forgotten all about cwnd - the congestion window - part of slow start, which I suspect will handle (part of) the speed differences we're facing in an elegant, simple and generalized way.

Why does ktcp work just fine with anything except ktcp? TCP itself is super at adapting to speeds and jitter and packet loss and all kinds of irregularities.

That's a great question! I believe that ktcp is likely deficient in a core area that's never been addressed.

Exactly - and I'm coming to suspect that slowstart/congestion control is at least part of it (it has been mandatory since forever by now). Seeing how the big guys do the startup phase is first on my list now, pending getting my new hardware setup going (below).

I have this feeling that there is something we're missing. Just a minor but also fundamental thing that pops up in certain circumstances and hits us in the back of the head.

Exactly. But when I have thought about this before (and fixed some big problems), I realized that some issues are tied up in the basic paradigm of having ktcp run in user mode, with the kernel having to schedule/swap requests, both incoming and outgoing, between applications and yet another application, ktcp itself. For instance, ktcp can never be sent a request that actually gets queued, that doesn't generate an immediate reply. (I'm not talking about network requests queued into CBs, but rather direct kernel read/write requests to/from ktcp - these must return immediately in order to allow another application to run, specifically another network application that might need to run. Ktcp can't sleep the calling process, the kernel has to). Sometimes I think that, with the current ability to run medium model (far) code, it might be better to move TCP into the kernel. But all this is lots of work and for another day.

Indeed. We've had this discussion before and it had to come back. Now we're even better prepared for it. You may be right - there may be issues with the architecture, task switching, waits, timeouts, locks that end up interrelating in mysterious ways, creating situations that would not exist in a kernel implementation. I also agree that a kernel TCP implementation is a big project for another time and day. That said, I have - on and off - been thinking about moving ICMP processing, actually everything except tcp, into the kernel. Then add UDP, also in the kernel. A natural progression would be to top off the show with tcp.

I also wonder whether the issue(s) with TCP would go away with more buffering, or whether we're dealing with a fundamental scheduling issue between applications or just ktcp's basic processing cycle. If the network congestion happens with only a single process on TLVC running, then the bigger issue for now is likely the old problem of not being able to handle network resyncs, which IMO has never actually worked as well as it needs to.

I would love to have the answer to that. I've played quite a bit with NIC level buffering and found - remembering you warned me about this before I started - that it has little effect. Packets are coming down the layers from ktcp at a rate that is lower than the 10Mbps (old) Ethernet rate and there is never any congestion on the local segment, so they leave right away and there is never a queue.

As to incoming packets the story is different, but not by much (and depending on the speed of the system). I'm seeing 1, sometimes 2, rcv buffers being used, but rarely compared to the total # of packets. I need to add a counter for this, to be displayed by netstat. What's missing though is a closer look at the effect multiple concurrent flows of traffic and very slow systems will have on buffer usage.

Where (what layer) would you suggest we could experiment with more buffers?

in our case, and in particular related to the very slow systems, there are several situations in which slow start would come in handy

Totally agree.

ToDo list:

  • Commit what I have right now, which is quite functional. Then change the timer and the related stuff to [EDIT]10ms.
  • When it's working reliably, I'll rip out all the stuff added to handle very slow systems and add slowstart/congestion control.
  • While doing that, I intend to poke around in (my) memory/history to find the double ack story, and see if I can get rid of it. BTW I just learned that a double ack is a signal to the other end to "slow down, we have congestion problems". I didn't know that...

And it may seem that we now have a setting in which the problem is repeatable to the extent it may actually be debugged – a good time to finally nail this one.

Any thoughts on whether this problem can be duplicated within an emulator, rather than having to use real hardware? That would allow me to jump in much more easily to help.

I'll keep that in mind. If your QEMU setup allows ELKS to telnet out onto the Internet, I might open a path for you to telnet to one of my TLVC systems. That should give you ample delay to play with. I would need your IP address and allocate some odd port number you could use. Side note: While all the regular debug stuff in ktcp (and tcpdev) is useless to me when fighting timing related problems, this setting may actually allow for using them since they're so cheap in QEMU. Or maybe I'm not thinking straight.

I've been delayed in diving into the packet tracing exercise. I wanted to add another slow system to the setup, in order to expand the width of the testing and to eliminate the IBM5155 and the WD8003 driver from the list of suspects. Well, the Compaq Plus (4.77MHz 8088) didn't fire up (PS problems), so I decided to get Dream System II (aka Dream XT) going now that the components are in place. It took some time (old interfaces, no docs, many jumpers etc. - hardware is fun dept.) but I'm in operation as of this morning, floppy (720k) only and ne1k. Along with the process came a handful of bugs that need fixing in TLVC - floppy, RTC, IDE and more. That's for later. It's running at 8MHz with a V20 processor - about twice as fast as the XT, which is ideal.

IOW the setup now has no less than 4 TLVC systems spanning a really useful range of speeds, hardware and other capabilities. Eventually it will come down to two, but right now this is useful. How do the XT and the Dream XT communicate? I haven't tested that yet - it will be really interesting!

@ghaerr

ghaerr commented Jul 20, 2024

Let's go for 100; jiffies was my model all along.
Then change the timer and the related stuff to [EDIT]10ms.

Whether you choose 1, 10 or 100ms probably isn't a big deal if we stick with the jiffies-related (10ms) hw timer as base real time, but the lower the better - the base "tick" rate ought to be something that's less than half of the time spent sending or receiving an ethernet packet. How much time does that take with, say, the NE2K card?

For much more accurate timing, I have developed some debug code on ELKS that uses the 64-bit nanosecond real time derived from the RDTSC instruction, available on 386+ CPUs. While this is obviously overkill on 8086s, a single function call could be made that wraps either RDTSC and/or the gettimeofday syscall and returns real time in, say, microsecond accuracy. This would give much more accurate timing on 386 CPUs and allow for timing system calls themselves, which could still be useful on 8086s, and use the same constants in the TCP code. That is, obviously 1000 microseconds (usecs) = 1 ms (millisecond), and use usecs as the base. Another idea would be to use a defined TBASE value which might be 1 or 1000 based on the system you're testing. Take a look at struct timeval returned by gettimeofday: it uses microseconds, so using a 1 usec base would mean no multiplication or division after a gettimeofday syscall, and be faster.
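A portable version of the suggested wrapper might look like this sketch (the RDTSC path for 386+ is omitted; the function name is an assumption):

```c
#include <sys/time.h>
#include <stdint.h>

/* Single-call wall time in microseconds. On 386+ CPUs the body could
 * read the TSC instead; only the gettimeofday path is shown here.
 * A uint32_t wraps after ~71 minutes, fine for interval timing. */
uint32_t time_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, (void *)0);
    return (uint32_t)tv.tv_sec * 1000000u + (uint32_t)tv.tv_usec;
}
```

Since struct timeval already carries microseconds, no division is needed after the syscall, matching the "1 usec base" point above.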

Where (what layer - would you suggest we could experiment with more buffers?

This is a good question, and got me thinking about ktcp and its deficiencies. We ought to have a list of its non-implementations somewhere. For instance, the following are not supported:

  • Fragmentation reassembly (not really that useful)
  • ICMP (not a real layer, hacked ping response only)
  • UDP (possibly useful)
  • Buffering received packets past the very next unacknowledged packet (this is a big one - a single dropped packet means a required resend of all packets past that one again).

What does "real" TCP do? Ktcp literally discards any not-in-order packets. A real implementation buffers them, allowing processing of subsequent packets, or not? When a packet needs resending, the receiver TCP will ACK the last good packet, does that mean a sender always resends the entire outstanding window, and receiver TCPs should drop previously properly received but out of order packets? I don't recall exactly how this should work.

Anyways, buffering received but unprocessed packets might be good. One might think buffering output packets would be a good idea, but we need to read the spec to determine whether an application returning from write guarantees that all packets have been sent, or just queued. Of course, this could use up lots of memory needlessly, but a max buffering parameter could be set to allow tuning. This issue is compounded by the single /etc/tcpdev driver kernel in/out buffers, which are shared across all applications - definitely a bottleneck. But then so is ktcp itself, which has to manage all application requests in a single thread.

Regarding the above, kernel tcp buffers and ktcp, my original idea when I started improving TCP was to increase the buffers to at least a full ethernet buffer, then decrease the packet size and send window such that the system would somehow find a recognizable sweet spot in processing multiple requests between apps, kernel, ktcp, drivers, etc. In general, I believe this worked, except that the remote systems sometimes blew past any of our TCPs lower limits, and sent a large window of outstanding packets. IIRC we finally got the receive window size notification worked out, but this is another consideration when looking at congestion issues.

If your QEMU setup allows ELKS to telnet out onto the Internet

I'm actually thinking of the idea of loopback within QEMU (usually a bad idea, as the implementation doesn't use the NIC drivers at all), or perhaps trying to duplicate the file transfer between QEMU sessions using some pre-setup scripts.

the setup now has no less than 4 TLVC systems spanning a really useful range of speeds, hardware and other capabilities.

That's very cool!!

Along with the process came a handful of bugs that need fixing in TLVC - floppy, RTC, IDE and more.

Geez. Keep me posted if these are also likely in ELKS!

@Mellvik
Copy link
Owner Author

Mellvik commented Jul 21, 2024

Update on timings:

It would be interesting to know whether the retrans buffer timestamps would be the same with a 1/100s timer... I don't have ethernet specs handy, nor am I up on how long a typical ktcp socket-read/process/network-write cycle takes. How many packets can be sent per second, given your current testing throughput?

When a 286/12.5MHz receives a packet (the packet is on the wire), the ack is visible within 5 to 10ms. When the target is the V20/8MHz (which is about twice as fast as the XT), the time is between 9 and 20ms, sometimes as much as 39ms. (9ms is the ping rtt for the V20 on the local segment; add 2ms if a router is in between. Similarly, the ping rtt for the 286 is 3ms, and for the 386sx/40 it's 1.55ms.) (Different interfaces, i.e. different drivers, which accounts for something too.) That gives approx 11 packets/sec for the V20 (ne1k), 666 p/s for the 386 (ee16).

I think the ping data provide a reasonable clue as to the real capacity of a given system/NIC combination. These are minimal (64-byte) ping packets, and the local segment rtt is pretty much equivalent to the target's processing time - read-process-return, the return being trivial in that it's just copying the same data back into a new packet and kicking it off. What I'm seeing on the wire between the 286 and the V20 confirms that: 7-9 p/s, which covers packet processing, serial I/O etc.

As to the time a get_packet call inside a driver takes, jiffies don't have the resolution to measure that. @ghaerr, what's the smart way to get that given TLVC doesn't have the more advanced timing mechanisms you've added to ELKS?

@ghaerr
Copy link

ghaerr commented Jul 22, 2024

Thanks for the info on timings. It seems a 100ms timer will perhaps work for measuring ktcp cycles on a slow system, but won't give us much new information on normal packet throughput. It would be nice to have microsecond timing information available, but the only way I can think of getting it is by executing RDTSC on a 386 system, unless the PIT could be reprogrammed to read small increments of elapsed time from an unused timer.

given TLVC doesn't have the more advanced timing mechanisms you've added to ELKS?

Actually I haven't added anything more advanced to the ELKS kernel. The libc/debug code for --ftrace instrumentation has code for using the results of the 386+ RDTSC instruction in _get_micro_count, which displays elapsed microseconds in line 83.

If you think it worthwhile, I can look into wrapping RDTSC, gettimeofday, and/or a reprogrammed PIT register in order to get simple access to a microsecond timer. This could then possibly be automatically configured to give as accurate an elapsed time as possible. (Although I'm not quite sure how this might work on QEMU.)
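On the PIT option: the 8254 channel 0 counts down from LATCH at 1193182 Hz, so latching and reading the live counter gives sub-jiffy resolution without reprogramming anything. A sketch of the idea (hypothetical code, not from either kernel; the port I/O sequence is shown only in comments since it needs the kernel's inb/outb primitives):

```c
/* Sub-jiffy timing by latching 8254 PIT channel 0 - a sketch of the
 * approach discussed, not existing TLVC/ELKS code. */
#define PIT_FREQ 1193182UL          /* 8254 input clock, Hz */
#define HZ       100
#define LATCH    (PIT_FREQ / HZ)    /* counts per 10ms tick: 11931 */

/* In the kernel, the counter would be latched and read like this:
 *     outb(0x00, 0x43);            // latch command, channel 0
 *     count  = inb(0x40);          // low byte
 *     count |= inb(0x40) << 8;     // high byte
 */

/* Convert a latched countdown value to microseconds elapsed since the
 * last timer tick.  8381/10000 approximates 1/1.193182 us per count
 * without overflowing 32-bit arithmetic. */
unsigned int pit_to_usecs(unsigned int count)
{
    return (unsigned int)(((unsigned long)(LATCH - count) * 8381UL) / 10000UL);
}
```

Combined with jiffies this would give roughly microsecond resolution on any PC-class machine, no 386 required.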

@Mellvik
Copy link
Owner Author

Mellvik commented Jul 22, 2024

Thank you @ghaerr.

I'm inclined to let the timing/instrumentation issue rest for now given the number of other issues piling up - and while helpful, I'm not seeing it as a dealbreaker in hunting down the ktcp challenges. If this turns out to be wrong, let's go ahead and fix it. Using your suggestions, it doesn't seem like an insurmountable challenge at all.

That said, and in light of our discussion about the timer resolution in ktcp: when changing it, using the native gettimeofday 1us resolution may be smart - the grouping effect of a lower resolution doesn't seem to be relevant, we can still keep the timer at 32 bits, and it eliminates a long division to boot.
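A sketch of what that might look like (timeq_t and now_us are illustrative names, not existing ktcp code): a 32-bit microsecond timestamp taken straight from gettimeofday, with no scaling division; unsigned subtraction makes wraparound harmless as long as measured intervals stay well under 2^32 us, about 71 minutes:

```c
/* Hypothetical ktcp timestamp using native gettimeofday resolution. */
#include <sys/time.h>
#include <stddef.h>

typedef unsigned long timeq_t;  /* 32-bit us timestamp on ia16, wraps ~71 min */

/* Current time in microseconds, truncated to the timeq_t width. */
timeq_t now_us(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (timeq_t)tv.tv_sec * 1000000UL + (timeq_t)tv.tv_usec;
}

/* Interval math must always be done as unsigned subtraction, e.g.
 *     timeq_t rtt = (timeq_t)(now_us() - sent_stamp);
 * so a wrap between the two samples cancels out. */
```

The multiply-and-add is cheap even on an 8088, whereas converting to a coarser unit would cost a long division per timestamp.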

About buffering and out of order packets ...

What does "real" TCP do? Ktcp simply discards any out-of-order packets. Does a full implementation buffer them, allowing processing of subsequent packets, or not?

I like to call it a 'full' implementation rather than a 'real' one :-), and to me ktcp's shortcut is fine. It isn't going to break anything, it's just slow. That's the known price of the shortcut - and the case occurs very rarely.

When a packet needs resending, the receiving TCP will ACK the last good packet; does that mean a sender always resends the entire outstanding window, and that receiving TCPs should drop previously received but out-of-order packets? I don't recall exactly how this should work.

No, the sender will resend the un-acked packet only - unless a timeout triggers sending more. Duplicates are just discarded. ktcp is different, as you allude to: until the next (correct) segment arrives, incoming data packets are discarded because there is nowhere to put them. Again, it works, but it's slow. In the case at hand it adds to the 'penalties' incurred by having a slow system in the first place.

Anyways, buffering received but unprocessed packets might be good.

Yes, it seems like the best place to start adding buffers - something like a lightweight skbufs. BTW, this lack of flexibility on the receiving side makes 'helpers' like slow start and congestion avoidance even more important, like we talked about: reducing the real send window according to the recipient's capacity, not its buffer space.
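For later reference, the standard slow start / congestion avoidance bookkeeping (per RFC 5681) is small. A sketch with hypothetical names and an assumed 512-byte MSS - the effective send window would then be min(cwnd, peer's advertised window):

```c
/* Hypothetical congestion-control state for ktcp; cwnd/ssthresh
 * follow RFC 5681, MSS of 512 is an assumption. */
#define TCP_MSS 512U              /* assumed segment size, bytes */

struct cwnd_state {
    unsigned int cwnd;            /* congestion window, bytes */
    unsigned int ssthresh;        /* slow start threshold, bytes */
};

void cwnd_init(struct cwnd_state *c)
{
    c->cwnd = 2 * TCP_MSS;        /* start small: two segments */
    c->ssthresh = 65535;          /* effectively "no limit" yet */
}

/* On each ACK of new data: exponential growth below ssthresh
 * (slow start), roughly linear growth above it (congestion avoidance). */
void cwnd_on_ack(struct cwnd_state *c)
{
    if (c->cwnd < c->ssthresh)
        c->cwnd += TCP_MSS;                      /* doubles per RTT */
    else
        c->cwnd += TCP_MSS * TCP_MSS / c->cwnd;  /* ~one MSS per RTT */
}

/* On retransmit timeout: halve the target and restart slow start. */
void cwnd_on_timeout(struct cwnd_state *c)
{
    c->ssthresh = c->cwnd / 2;
    if (c->ssthresh < 2 * TCP_MSS)
        c->ssthresh = 2 * TCP_MSS;
    c->cwnd = TCP_MSS;
}
```

This adapts to a slow receiver automatically: its losses and timeouts keep cwnd pinned near one or two segments, regardless of how much buffer space it advertises.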

One might think buffering output packets would be a good idea, but we need to read the spec to determine whether an application's return from write guarantees that all packets have been sent, or just queued.

Buffering output packets is good (and it's just like writing to a disk buffer: once in the buffer, the data is the OS's responsibility and the application continues), but currently of limited value, as I've mentioned before. The network is generally faster than the machines, and the chance of packets piling up for sending is minimal. We don't even get to use the tx buffers on the NICs where available. On the ne2k, sending (put_packet) is (IIRC) synchronous - we're actually waiting for completion - but my added transmit buffer never got used. This corresponds with the experience from the ee16 driver development, where I got a race condition because the TX complete interrupt kicked in instantly after the 'go' command to the NIC, before the completion of the put_packet routine.

@ghaerr
Copy link

ghaerr commented Jul 22, 2024

given the gettimeofday 1us resolution
we can still keep the timer @ 32 bits, using the native gettimeofday resolution may be smart. It eliminates a long division to boot.

After reading your comments, I agree - using native gettimeofday will work fine for now, and I now realize that later, on 386 systems, RDTSC could be added to the kernel to enhance the resolution of gettimeofday without any changes to applications at all. I'll look into that separately.

this lack of flexibility in receiving makes 'helpers' like slow start and congestion avoidance even more important

Agreed. Perhaps the reason why ktcp works well when communicating with "full" implementations is because of slow start. I'll read up more on it.

where I got a race condition because TX complete interrupt kicked in instantly after the 'go' command to the NIC, before the completion of the put_packet routine.

Wow! So the TX complete interrupt occurred before the routine (put_packet) finished loading the transmit buffer? That sounds more like a "TX buffer available" interrupt than an actual transmit complete. Nonetheless, I can see how this might be tricky if the kernel were reentrant or the driver allowed multiple concurrent I/O requests.

@Mellvik
Copy link
Owner Author

Mellvik commented Jul 22, 2024

where I got a race condition because TX complete interrupt kicked in instantly after the 'go' command to the NIC, before the completion of the put_packet routine.

Wow! So the TX complete interrupt occurred before the routine (put_packet) finished loading the transmit buffer?

Not quite - the data were in place, but the globals weren't updated.

That almost sounds like a "TX buffer available" interrupt than actual transmit complete.

Yes, they become almost the same.
In this particular case, the fix was to move the updates of all globals to before the 'go'.
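The pattern can be illustrated with a small simulation (all names made up, not actual ee16 driver code): here nic_go() invokes the completion handler immediately, the way the real interrupt did, so every piece of shared bookkeeping must be current before the 'go' is issued:

```c
/* Simulation of the ee16-style race.  nic_go() fires the TX-complete
 * handler instantly, modeling an interrupt that arrives right after
 * the transmit command.  Illustrative names throughout. */
#include <assert.h>

static int tx_in_progress;       /* globals shared with the ISR */
static int tx_packets_done;

static void tx_complete_isr(void)
{
    assert(tx_in_progress);      /* ISR relies on globals being current */
    tx_in_progress = 0;
    tx_packets_done++;
}

/* Model of a NIC whose TX-complete fires the instant 'go' is issued. */
static void nic_go(void)
{
    tx_complete_isr();
}

void put_packet(void)
{
    /* ...copy the frame into the NIC's transmit buffer... */
    tx_in_progress = 1;          /* update ALL shared state first ... */
    nic_go();                    /* ... only then start the transmit */
}
```

With the assignment after nic_go() instead, the handler would see stale globals - exactly the race described above.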

I'm committing this PR now, creating a new one with the updated timers and the first pieces of slow start.

@Mellvik Mellvik merged commit 74c3852 into master Jul 22, 2024