Conversation
Wow, this sounds like a great fix, especially since it doesn't really change anything until the forced kernel wait has occurred. I notice though that the INITIAL_TIMEOUT_RTT is changed from 1 second to 1/4 second... does this help get things moving initially? I seem to remember having had to increase it to 1s, possibly for this same behavior, quite a while back. I'm interested in your comments on the initial timeout.
Thanks, @ghaerr.
Thanks for noticing - and mentioning. It has been part of the experimentation to improve convergence speed. It's definitely going back to 1s.
An update [developer notes really] pt I: As I removed the (now) small debug printfs from ktcp, the behaviour on the slowest systems changed. The printf delays apparently allow more ACKs to get through. Anyway, the consequence of removing them is that convergence - the time it takes to determine the speed of the remote system - is slower. In turn that means many more retransmits are sent before the speed index reaches the correct level. The difference between with and without the printf (which itself is quite small) is 5 retransmits with it, >50 without, sometimes 100s. So the mechanism for detecting peer speed is far from flawless. Initially it seemed strange that this printf would have such a big effect, until I realized that ... Anyway, not good enough. Off the bat it seemed like counting consecutive retransmits would make a more reliable metric than consecutive waits-for-buffer-space calls. But before even trying that, I moved the add_to_retrans call to after ip_sendpacket.

The 'almost' for XT systems is that in some cases the speed index ends up just below what it should have been for the retransmits to be eliminated completely after establishing the index ('convergence'). This happens ONLY on the very slow systems. We end up with retransmits at regular intervals, like every few seconds - not all that bad, but definitely not good. There is another problem too, although not related to this: if the peer, in addition to being slow, runs on floppies or has mounted floppies AND has ...

Indeed it does, and I don't (yet) understand why. A very high speed index (indicating a very slow system: normally 8-9 for a 12.5MHz 386, 15-18 for XT systems) turns out to cause frequent complete stops in sending. In the meanwhile, a cap on the speed index is required to avoid this situation, which I added. We're getting there. To be continued.
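For concreteness, here is a minimal sketch of the counting mechanism described above: consecutive window-closed loops become a per-connection speed index, with a cap to avoid the stall just mentioned. All names and the cap value are hypothetical, not the actual ktcp code.

```c
#include <assert.h>

/* Hypothetical sketch of the "consecutive wait loop" metric; none of
 * these names are from the real ktcp source. */
struct conn {
    int wait_loops;     /* consecutive window-full waits this round */
    int speed_index;    /* max consecutive waits seen = peer slowness */
};

#define SPEED_INDEX_CAP 12   /* arbitrary cap, per the stall issue above */

/* Called once per ktcp loop while the send window is closed. */
void window_closed(struct conn *c)
{
    c->wait_loops++;
    if (c->wait_loops > c->speed_index)
        c->speed_index = c->wait_loops;
    if (c->speed_index > SPEED_INDEX_CAP)
        c->speed_index = SPEED_INDEX_CAP;
}

/* Called when an ACK opens the window again. */
void window_opened(struct conn *c)
{
    c->wait_loops = 0;      /* the max (speed_index) is retained */
}
```

The index stabilizes at the longest observed run of waits, which is why it takes a few rounds (and a few retransmits) to converge.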
Wow @Mellvik, you're definitely in the middle of some complicated stuff. In general, I agree with the approach you're taking. I haven't looked to see if there are any implications in moving add_to_retrans to after ip_sendpacket, but sounds like that sends the packet first, then adds to retrans list. One would think it would always be better to send first, so not sure why the retrans add was coded that way in the first place. It appears the longstanding problem of ktcp getting "stuck" is still here... no doubt any solution will likely require a timeout-kickstart in order to get going again in bad sync loss; the question is how long to wait? Which leads to my next thoughts:
Actually, I believe you're on to a potentially very good solution, since the fix (with the printf) initially seemed to completely fix the problem. I'm thinking that we move away from "counting" loops inside ktcp (which is still a good idea and the basis for the fix) and move to comparing a real wall time timestamp, rather than adding a somewhat arbitrary loop count times a multiplier to the original timestamp. Perhaps this is essentially the same thing, but the algorithm selecting "what to do" on a retrans may need to be significantly expanded based on the real wall time passed when examining retrans packets, and possibly grouping them for different actions. The same examination of real time passed could also be done when calculating a timeout-kickstart: the kickstart timeout could be higher or lower based on the last averaged throughput speed, so that fast transfers restart fast, whilst slower ones time out later, fitting the systems involved. Just some thoughts. The ktcp problem of getting stuck has been around since the beginning. We may need to read up on exactly how send window management is supposed to work, other than the current way of just stopping and waiting for ACKs or a timeout, with too little else. IIRC, there were some important articles by Van Jacobson(?) about sender backoff and RTT calculation back in the day that avoided too-quick changes. I can try to find those; they are mentioned in other TCP sources as well.
This points to possibly having to keep an "average" calculation, rather than a direct calculation only adding to the last summed value. I ran into this when computing the CPU % usage for
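The smoothed-average idea above is exactly what Van Jacobson's RTT estimator does. A sketch in integer arithmetic (as a real-mode implementation would want it), following the classic scheme later codified in RFC 6298; this is not ktcp's actual code, and units are whatever clock ticks are in use:

```c
/* Jacobson/Karels RTT estimation, integer form:
 *   SRTT   += (RTT - SRTT) / 8
 *   RTTVAR += (|RTT - SRTT| - RTTVAR) / 4
 *   RTO     = SRTT + 4 * RTTVAR
 */
struct rtt_est {
    int srtt;       /* smoothed round-trip time */
    int rttvar;     /* smoothed mean deviation */
};

/* Feed one RTT measurement, get back the new retransmit timeout. */
int rtt_update(struct rtt_est *e, int rtt)
{
    int err, rto;

    if (e->srtt == 0) {              /* first measurement */
        e->srtt = rtt;
        e->rttvar = rtt / 2;
    } else {
        err = rtt - e->srtt;
        e->srtt += err / 8;          /* gain 1/8: avoids quick changes */
        if (err < 0)
            err = -err;
        e->rttvar += (err - e->rttvar) / 4;
    }
    rto = e->srtt + 4 * e->rttvar;
    return rto < 1 ? 1 : rto;        /* never below one tick */
}
```

The divisions by 8 and 4 are shifts in practice, so this costs almost nothing even on an 8086.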
Thanks @ghaerr, this is very useful - both the discussion itself and the way it forces an understandable description of the challenges and potential solutions. It incites double checking, seeing more alternatives, filtering out stupidities - you know the drill.
This is interesting. I was on a somewhat similar tangent the other day, not caring much for the inaccuracy introduced by checking wall time only once per ktcp loop, but being unsure about the cost of calling
That has to be looked at, evaluating complexity against value, remembering that the problem at hand - ignoring the old 'hang stuff' - is only with XT type systems. This brings up this other itch that keeps growing on me: why does ...

Actually - in our case, and in particular related to the very slow systems, there are several situations in which slow start would come in handy in addition to the 'bootstrap' of a connection which my work has been concentrating on so far: like when the NIC is overrun and discards packets, floppy syncing as was mentioned before, even disk sync on a not-so-slow system with thousands of XMS buffers.
True. And it may seem that we now have a setting in which the problem is repeatable to the extent it may actually be debugged – a good time to finally nail this one.
Indeed! The RFC mentioned above seems like a great starting point. Let me get back to that when I have bored myself with some endless packet traces :-)
You make a good point to change the timing resolution, perhaps to match the HW clock interrupt at 1/100s (10ms), instead of 1/32 seconds. That should bring in the best resolution possible, and it seems there's no real good reason to use 1/4, 1/8 etc divisions. I haven't checked the ktcp code but
It would be interesting to know whether the retrans buffer timestamps would be the same with a 1/100s timer... I don't have ethernet specs handy, nor am I up on how long a typical ktcp socket-read/process/network-write cycle takes. How many packets can be sent per second, given your current testing throughput?
I would treat making a single system call per ktcp cycle as very fast. The system call overhead is almost identical to the hw interrupt processing overhead (both go through the same
True, but I would say it would be good to get an accurate reading on how much time is actually being spent performing basic ktcp cycles (e.g. read packet/process, input/process/write packet, etc). This can then be compared with max ethernet layer throughput speed to get another reading on value vs complexity and what a "max efficiency" would look like on real mode PCs.
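Measuring a basic ktcp cycle needs nothing more than two gettimeofday() samples around it. A trivial helper, sketched here for illustration (not existing ktcp code):

```c
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() samples; enough
 * resolution to time one read-packet/process/write-packet cycle. */
long elapsed_usec(const struct timeval *a, const struct timeval *b)
{
    return (b->tv_sec - a->tv_sec) * 1000000L + (b->tv_usec - a->tv_usec);
}

/* Usage sketch:
 *   struct timeval t0, t1;
 *   gettimeofday(&t0, NULL);
 *   ... one ktcp cycle ...
 *   gettimeofday(&t1, NULL);
 *   printf("cycle: %ld usec\n", elapsed_usec(&t0, &t1));
 */
```

The accuracy of the result is of course bounded by the resolution the kernel clock actually delivers, which is part of the question above.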
That's a great question! I believe that ktcp is likely deficient in a core area that's never been addressed. Coming to an understanding of how real (Linux) TCP works in this area would really help. I'm almost certain that the reason ktcp<->Linux works well is because of Linux TCP's better implementation, but I don't actually know what we're missing.
Exactly. But when I have thought about this before (and fixed some big problems), I realized that some issues are tied up in the basic paradigm of having ktcp run in user mode, with the kernel having to schedule/swap requests, both incoming and outgoing, between applications and yet another application, ktcp itself. For instance, ktcp can never be sent a request that actually gets queued without generating an immediate reply. (I'm not talking about network requests queued into CBs, but rather direct kernel read/write requests to/from ktcp - these must return immediately in order to allow another application to run, specifically another network application that might need to run. Ktcp can't sleep the calling process; the kernel has to.) Sometimes I think that, with the current ability to run medium model (far) code, it might be better to move TCP into the kernel. But all this is lots of work and for another day. I also wonder whether the issue(s) with TCP would go away with more buffering, or whether we're dealing with a fundamental scheduling issue between applications or just ktcp's basic processing cycle. If the network congestion happens with only a single process on TLVC running, then the bigger issue for now is likely the old problem of not being able to handle network resyncs, which IMO has never actually worked as well as it needs to.
Totally agree.
Any thoughts on whether this problem can be duplicated within an emulator, rather than having to use real hardware? That would allow me to jump in much more easily to help.
Thank you @ghaerr, now we're really getting somewhere. This is good.
It's the old-fashioned attachment to powers of 2, thinking it is somehow faster. Thinking about it, I can't find any relevant case where that is indeed true. Let's go for 100,
I can read that off of the tcpdump logs. [COMING]
I agree. Given this discussion and the various issues encountered working on this issue (and even before I've concluded the packet tracing exercise), I'm leaning towards ripping out all the new counting stuff and adding slowstart+congestion control. I had forgotten all about the
Exactly - and I'm coming to suspect that slowstart/congestion control is at least part of it (it has been mandatory since forever by now). Seeing how the big guys do the startup phase is first on my list now, pending getting my new hardware setup going (below).
Indeed. We've had this discussion before and it had to come back. Now we're even better prepared for it. You may be right - there may be issues with the architecture, task switching, waits, timeouts, locks that end up interrelating in mysterious ways, creating situations that would not exist in a kernel implementation. I also agree that a kernel TCP implementation is a big project for another time and day. That said, I have - on and off - been thinking about moving ICMP processing, actually everything except tcp, into the kernel. Then add UDP, also in the kernel. A natural progression would be to top off the show with tcp.
I would love to have the answer to that. I've played quite a bit with NIC level buffering and found - remembering you warned me about this before I started - that it has little effect. Packets are coming down the layers from ktcp at a rate that is lower than the 10Mbps (old) Ethernet rate and there is never any congestion on the local segment, so they leave right away and there is never a queue. As to incoming packets the story is different, but not much (and depending on the speed of the system). I'm seeing 1, sometimes 2, rcv buffers being used, but rarely compared to the total # of packets. I need to add a counter for the ... to be displayed by ... Where (what layer) would you suggest we could experiment with more buffers?
ToDo list:
I'll keep that in mind. If your QEMU setup allows ELKS to telnet out onto the Internet, I might open a path for you to telnet to one of my TLVC systems. That should give you ample delay to play with. I would need your IP address, and I'd allocate some odd port number you could use. Side note: While all the regular debug stuff in ktcp (and tcpdev) is useless to me when fighting timing related problems, this setting may actually allow for using them since they're so cheap in QEMU. Or maybe I'm not thinking straight.

I've been delayed in diving into the packet tracing exercise. I wanted to add another slow system to the setup, in order to expand the width of the testing and to eliminate the IBM5155 and the WD8003 driver from the list of suspects. Well, the Compaq Plus (4.77MHz 8088) didn't fire up (PS problems), so I decided to get Dream System II (aka Dream XT) going now that the components are in place. It took some time (old interfaces, no docs, many jumpers etc. - hardware is fun dept.) but I'm in operation as of this morning, floppy (720k) only and ne1k. Along with the process came a handful of bugs that need fixing in TLVC - floppy, RTC, IDE and more. That's for later. It's running at 8MHz with a V20 processor - about twice as fast as the XT - which is ideal. IOW the setup now has no less than 4 TLVC systems spanning a really useful range of speeds, hardware and other capabilities. Eventually it'll come down to two, but right now this is useful. How do the XT and the Dream XT communicate? I haven't tested that yet - will be really interesting!
Whether you choose 1, 10 or 100ms probably isn't a big deal if we stick with the jiffies-related (1/100s) hw timer as the base real time, but the lower the better - the base "tick" rate ought to be something that's less than half of the time spent sending or receiving an ethernet packet. How much time does that take with, say, the NE2K card? For much more accurate timing, I have developed some debug code on ELKS that uses the 64-bit nanosecond real time derived from the RDTSC instruction, available on 386+ CPUs. While this is obviously overkill on 8086s, a single function call could be made that wraps either RDTSC and/or the gettimeofday syscall and returns real time in, say, microsecond accuracy. This would give much more accurate timing on 386 CPUs and allow for timing the system calls themselves, which could still be useful on 8086s, and use the same constants in the TCP code. That is, obviously 1000 microseconds (usecs) = 1 ms (millisecond), and usecs would be the base. Another idea would be to use a defined TBASE value which might be 1 or 1000 based on the system you're testing. Take a look at struct timeval returned by gettimeofday: it uses microseconds, so using a 1 usec base would mean no multiplication or division after a gettimeofday syscall, and be faster.
This is a good question, and got me thinking about ktcp and its deficiencies. We ought to have a list of its non-implementations somewhere. For instance, the following are not supported:
What does "real" TCP do? Ktcp literally discards any not-in-order packets. A real implementation buffers them, allowing processing of subsequent packets, or not? When a packet needs resending, the receiver TCP will ACK the last good packet; does that mean a sender always resends the entire outstanding window, and receiver TCPs should drop previously properly received but out-of-order packets? I don't recall exactly how this should work. Anyway, buffering received but unprocessed packets might be good. One might think buffering output packets would be a good idea too, but we need to read the spec to determine what happens when an application returns from ...

Regarding the above, kernel tcp buffers and ktcp: my original idea when I started improving TCP was to increase the buffers to at least a full ethernet buffer, then decrease the packet size and send window such that the system would somehow find a recognizable sweet spot in processing multiple requests between apps, kernel, ktcp, drivers, etc. In general, I believe this worked, except that the remote systems sometimes blew past any of our TCPs' lower limits, and sent a large window of outstanding packets. IIRC we finally got the receive window size notification worked out, but this is another consideration when looking at congestion issues.
I'm actually thinking of the idea of loopback within QEMU (usually a bad idea as the implementation doesn't use the NIC drivers at all), or perhaps trying to duplicate file transfer between QEMU sessions using some pre-setup scripts.
That's very cool!!
Geez. Keep me posted if these are also likely in ELKS!
Update on timings:
When a 286/12.5MHz receives a packet (the packet is on the wire), the ack is visible within 5 to 10ms. When the target is the V20/8MHz (which is about twice as fast as the XT), the time is between 9 and 20ms, sometimes as much as 39ms. (9ms is the ping rtt for the V20 on the local segment; add 2ms if a router is in between. Similarly the ping rtt for the 286 is 3ms, for the 386sx/40 it's 1.55ms.) (Different interfaces, i.e. different drivers, which accounts for something too.) That gives approx 11 packets/sec for the V20 (ne1k), 666 p/s for the 386 (ee16).

I think the ping data provide a reasonable clue as to the real capacity of a given system/NIC combination. These are minimal (64 byte) ping packets, and the local segment rtt is pretty much equivalent to the target processing time - read-process-return, the return being trivial in that it's just copying the same data back into a new packet and kicking it off. What I'm seeing on the wire between the 286 and the V20 confirms that: 7-9 p/s - which covers packet processing, serial I/O etc.

As to the time a get_packet call inside a driver takes, jiffies don't have the resolution to measure that. @ghaerr, what's the smart way to get that, given TLVC doesn't have the more advanced timing mechanisms you've added to ELKS?
Thanks for the info on timings. It seems a 1/100s timer will perhaps work for measuring ktcp cycles on a slow system, but won't give us much new information on normal packet throughput. It would be nice to have microsecond timing information available, but the only way I can think of getting it is by executing RDTSC on a 386 system, unless the PIT could be reprogrammed to read small increments of elapsed time from an unused timer.
Actually I haven't added anything more advanced to the ELKS kernel. The libc/debug code for --ftrace instrumentation has code for using the results of the 386+ RDTSC instruction in ...

If you think it worthwhile, I can look into wrapping RDTSC, gettimeofday, and/or a reprogrammed PIT register in order to get simple access to a microsecond timer. This could then possibly be automatically configured to give as accurate an elapsed time as possible. (Although I'm not quite sure how this might work on QEMU.)
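For the PIT option, no reprogramming is strictly needed to get sub-tick resolution: channel 0 can be latched and read mid-countdown. A sketch, assuming the timer is programmed for 100 Hz (the port access is shown only as comments, since it is hardware-dependent; only the conversion is live code):

```c
/* On a PC, writing 0x00 to port 0x43 latches PIT channel 0's countdown,
 * and two successive reads of port 0x40 return it LSB-first. At 100 Hz
 * the counter runs from PIT_RELOAD down to 0 at 1.193182 MHz, so the
 * countdown position converts to microseconds within the current tick. */
#define PIT_RELOAD 11932U   /* 1193182 Hz / 100 Hz */

/* unsigned pit_read(void)
 * {
 *     outb(0x00, 0x43);              // latch channel 0
 *     unsigned lo = inb(0x40);
 *     unsigned hi = inb(0x40);
 *     return (hi << 8) | lo;         // remaining count, 11932..0
 * }
 */

/* Microseconds elapsed since the start of the current 10 ms tick. */
unsigned long pit_usec_in_tick(unsigned count)
{
    return (PIT_RELOAD - count) * 10000UL / PIT_RELOAD;
}
```

Combined with jiffies for the whole ticks, this gives roughly 0.8 usec granularity on any PC, not just 386+.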
Thank you @ghaerr. I'm inclined to let the timing/instrumentation issue rest for now given the number of other issues piling up - and while helpful, I'm not seeing it as a dealbreaker towards hunting down the ...

That said, and in light of our discussion about the timer resolution in ...

About buffering and out of order packets ...
I like to call it 'full' implementation rather than 'real' :-), and to me
No, the sender will resend the un-acked packet only - unless a timeout triggers sending more. Duplicates are just discarded.
Yes, it seems like the best place to start adding buffers, something
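For illustration, a minimal sketch of what receive-side buffering of out-of-order segments could look like - hypothetical names, sequence numbers simplified to plain integers with no wraparound handling, and no actual data attached:

```c
#include <stdlib.h>

/* Out-of-order segment queue, kept sorted by sequence number. */
struct seg {
    unsigned seq, len;
    struct seg *next;
};

static struct seg *ooo_q;

/* Stash a segment that arrived ahead of rcv_next. */
void ooo_insert(unsigned seq, unsigned len)
{
    struct seg **p = &ooo_q;
    struct seg *s;

    while (*p && (*p)->seq < seq)
        p = &(*p)->next;
    s = malloc(sizeof *s);
    s->seq = seq;
    s->len = len;
    s->next = *p;
    *p = s;
}

/* When the missing segment arrives: deliver the contiguous run that
 * now starts at rcv_next, returning the advanced rcv_next. */
unsigned ooo_deliver(unsigned rcv_next)
{
    while (ooo_q && ooo_q->seq == rcv_next) {
        struct seg *s = ooo_q;
        rcv_next += s->len;     /* hand data up, advance the window */
        ooo_q = s->next;
        free(s);
    }
    return rcv_next;
}
```

The payoff is that a single lost packet costs one retransmit instead of a retransmit of everything behind it; the cost is the buffer memory, which is the real constraint here.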
Buffering output packets is good (and it's just like writing to a disk buffer: once in the buffer, the data is the OS's responsibility and the application continues), but currently of limited value as I've mentioned before. The network is generally faster than the machines and the chance of packets piling up for sending is minimal. We don't even get to use the tx buffers on the NICs where available. On the ne2k, sending (put_packet) is (IIRC) synchronous - we're actually waiting for completion - but my added transmit buffer never got used. This corresponds with the experience from the ee16 driver development, where I got a race condition because the TX complete interrupt kicked in instantly after the 'go' command to the NIC, before the completion of the put_packet routine.
After reading your comments, I agree - using native gettimeofday will work fine for now, and I now realize that later, on 386 systems, RDTSC could be added to the kernel to enhance the resolution of gettimeofday without any changes to applications at all. I'll look into that separately.
Agreed. Perhaps the reason why ktcp works well when communicating with "full" implementations is because of slow start. I'll read up more on it.
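For reference while reading up: textbook slow start plus congestion avoidance (RFC 5681 flavor), sketched in whole-segment units. This is what ktcp might add, not existing code:

```c
struct cwnd {
    unsigned cwnd;      /* congestion window, in segments */
    unsigned ssthresh;  /* slow start threshold */
    unsigned acked;     /* ACK counter for congestion avoidance */
};

void cwnd_init(struct cwnd *c)
{
    c->cwnd = 1;             /* start by sending a single segment */
    c->ssthresh = 64;        /* effectively unlimited at first */
    c->acked = 0;
}

void cwnd_on_ack(struct cwnd *c)     /* an ACK of new data arrived */
{
    if (c->cwnd < c->ssthresh) {
        c->cwnd++;                   /* slow start: doubles per RTT */
    } else if (++c->acked >= c->cwnd) {
        c->acked = 0;                /* congestion avoidance: +1/RTT */
        c->cwnd++;
    }
}

void cwnd_on_timeout(struct cwnd *c) /* retransmit timeout fired */
{
    c->ssthresh = c->cwnd / 2;       /* remember half the window */
    if (c->ssthresh < 2)
        c->ssthresh = 2;
    c->cwnd = 1;                     /* restart from slow start */
    c->acked = 0;
}
```

The sender only ever keeps min(cwnd, peer's advertised window) segments outstanding, which is what lets a fast server avoid burying an XT at connection start.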
Wow! So the TX complete interrupt occurred before the routine (put_packet) finished loading the transmit buffer? That sounds more like a "TX buffer available" interrupt than an actual transmit complete. Nonetheless I can see how this might be tricky if the kernel were reentrant or the driver allowed multiple I/O requests concurrently.
Not quite, the data were in place but the globals weren't updated.
Yes, they become almost the same. I'm committing this PR now, creating a new one with the updated timers and the first pieces of slow start.

A new fix for the ktcp problem discussed at length in #68: excessive retransmits when talking to very slow systems (like TLVC talking to TLVC or ELKS).

The scenario is this: A very slow system, say an XT or a 286 AT, telnets to another TLVC/ELKS host and does something that creates a lot of output, like hd /bin/ps for example. The server will deliver reliably but slowly because the ktcp retransmit timeouts don't work with such slow clients. A similar pattern may be observed when sending files to such a slow system (ftp).

After understanding the details of the problem (again, see #68), and testing several fixes that worked but had drawbacks, such as not being able to adapt to the degree of slowness on the client side, here's a solution that seems to satisfy reasonable demands, including not breaking anything(!).
The primary challenge was to find a reliable metric for the (network) performance of the client. The solution turned out to be hiding right under my nose: when the send window closes (the retransmit buffer is full), ktcp will 'loop' that particular connection until the window opens up again. Counting the number of times it loops turns out to be the metric we're looking for.

The fix monitors the loop and establishes a performance index for the client using the max # of consecutive loops recorded. It usually takes 2-3 attempts to get there, and a few retransmits will be sent in the meantime, but the cost seems acceptable given the dynamics the solution brings to the table. For example, a 286 AT will get an index of 6 give or take 1, while an XT (4.77MHz) will end up between 13 and 15. Most systems will have a speed index of 0 - or, if on a long or slow connection, 1 - which does not affect normal operation. The speed factor is then used to increase the retransmit timeout for the connection without touching the regular RTT and RTO values.
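To illustrate how a speed index can stretch only the retransmit timeout while leaving the RTT/RTO bookkeeping untouched, here is a sketch; the scaling formula is invented for the example, not taken from the actual patch:

```c
/* Hypothetical: scale the per-connection retransmit timeout by the
 * peer speed index. Index 0 (normal systems) changes nothing; each
 * step adds half an RTO, so an XT-class index of 14 waits 8x longer. */
unsigned retrans_timeout(unsigned rto, unsigned speed_index)
{
    return rto + rto * speed_index / 2;
}
```

Because the RTT/RTO state itself is untouched, the connection's normal adaptive behaviour is preserved once the peer catches up.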
The price for the fix is 2 additional bytes in the CB struct and a little bit of code.