Bits Up!: congestion control

Showing posts with label congestion control. Show all posts

Thursday, April 6, 2017

On The Merits of QUIC for HTTP

I am often asked why the Internet HTTP community is working on an IETF based QUIC when HTTP/2 (RFC 7540) is less than 2 years old. There are good answers! This work is essentially part two of a planned evolution to improve speed, reliability, security, responsiveness, and architecture. These needs were understood when HTTP/2 was standardized but the community decided to defer larger architectural changes to the next version. That next version is arriving now in the form of QUIC.

Flashback to HTTP/2 Development

HTTP/2 development consciously constrained its own work in two ways. One was to preserve HTTP/1 semantics, and the other was to limit the work to what could be done within a traditional TLS/TCP stack.

The choice to use TLS/TCP in HTTP/2 was not a forgone conclusion. During the pre-HTTP/2 bakeoff phase there was quite a bit of side channel chatter about whether anyone would propose a different approach such as the SCTP/UDP/[D]TLS stack RTC DataChannels was contemporaneously considering for standardization.

In the end folks felt that experience and running code were top tier properties for the next revision of HTTP. That argued for TCP/TLS. Using the traditional stack lowered the risk of boiling the ocean, improved the latency of getting something deployed, and captured the meaningful low hanging fruit of multiplexing and priority that HTTP/2 was focused on.

Issues that could not be addressed with TCP/TLS 1.2 were deemed out of scope for that round of development. The mechanisms of ALPN and Alternative Services were created to facilitate successor protocols.

I think our operational experience has proven HTTP/2's choices to be a good decision. It is a big win for a lot of cases, it is now the majority HTTPS protocol, and I don't think we could have done too much better within the constraints applied.

The successor protocol turns out to be QUIC. It is a good sign for the vibrancy of HTTP/2 that the QUIC charter explicitly seeks to map HTTP/2 into its new ecosystem. QUIC is taking on exactly the items that were foreseen when scoping HTTP/2.

This post aims to highlight the benefits of QUIC as it applies to the HTTP ecosystem. I hope it is useful even for those that already understand the protocol mechanics. It, however, does not attempt to fully explain QUIC. The IETF editor's drafts or Jana's recent tutorial might be good references for that.

Fixing the TCP In-Order Penalty

The chief performance frustration with HTTP/2 happens during higher than normal packet loss. The in-order property of TCP spans multiplexed HTTP messages. A single packet loss in one message prevents subsequent unrelated messages from being delivered until the loss is repaired. This is because TCP delays received data in order to provide in-order delivery of the whole stream.

For a simple example imagine images A and B each delivered in two parts in this order: A1, A2, B1, and B2. If only A1 were to suffer a packet loss under TCP that would also delay A2, B1, and B2. While image A is unavoidably damaged by this loss, image B is also impacted even though all of its data was successfully transferred the first time.

This is something that was understood during the development of RFC 7540 and was correctly identified as a tradeoff favoring connections with lower loss rates. Our community has seen some good data on how this has played out "in the wild" recently from both Akamai and Fastly. For most HTTP/2 connections this strategy has been beneficial, but there is a tail population that actually regresses in performance compared to HTTP/1 under high levels of loss because of this.

QUIC fixes this problem through multistreaming onto one connection in a way very familiar to 7540 but with an added twist. It also gives each stream its own ordering context analagous to a TCP sequence number. These streams can be delivered independently to the application because in-order only applies to each stream instead of the whole connection in QUIC.

I believe fixing this issue is the highest impact feature of QUIC.

Starting Faster

HTTP/2 starts much faster than HTTP/1 due to its ability to send multiple requests in the first round trip and its superior connection management results in fewer connection establishments. However, new connections using the TCP/TLS stack still incur 2 or 3 round trips of delay before any HTTP/2 data can be sent.

In order to address this QUIC eschews layers in favor of a component architecture that allows sending encrypted application data immediately. QUIC still uses TLS for security establishment and has a transport connection concept, but these components are not forced into layers that require their own round trips to be initialized. Instead the transport session, the HTTP requests, and the TLS context are all combined into the first packet flight when doing session resumption (i.e. you are returning to a server you have seen before). A key part of this is integration with TLS 1.3 and in particular the 0-RTT (aka Early Data) handshake feature.

The HTTP/2 world, given enough time, will be able to capture some of the same benefits using both TLS 1.3 and TCP Fast Open. Some of that work is made impractical by dependencies on Operating System configurations and the occasional interference from middleboxes unhappy with TCP extensions.

However, even at full deployment of TLS 1.3 and TCP Fast Open, that approach will lag QUIC performance because QUIC can utilize the full flight of data in the first round trip while Fast Open limits the amount of data that can be carried to the roughly 1460 bytes available in a single TCP SYN packet. That packet also needs to include the TLS Client Hello and HTTP SETTINGS information along with any HTTP requests. That single packet runs out of room quickly if you need to encode more than one request or any message body. Any excess needs to wait a round trip.

Harmonizing with TLS

When used with HTTP/1 and HTTP/2, TLS generally operates as a simple pipe. During encryption cleartext streams of bytes go in one side and a stream of encrypted bytes come out the other and are then fed to TCP. The reverse happens when decrypting. Unfortunately, the TLS layer operates internally on multi-byte records instead of a byte stream and the mismatch creates a significant performance problem.

The records can be up to 64KB and a wide variety of sizes are used in practice. In order to enforce data integrity, one of the fundamental security properties of TLS, the entire record must be received before it can be decoded. When the record spans multiple packets a problem similar to the "TCP in-order penalty" discussed earlier appears.

A loss to any packet in the record delays decoding and delivery of the other correctly delivered packets while the loss is repaired. In this case the problem is actually a bit worse as any loss impacts the whole record not just the portion of the stream following the loss. Further, because application delivery of the first byte of the record is always dependent on the receipt of the last byte of the record simple serialization delays or common TCP congestion-control stalls add latency to application delivery even with 0% packet loss.

The 'obvious fix' of placing an independent record in each packet turns out to work much better with QUIC than TCP. This is because TCP's API is a simple byte stream. Applications, including TLS, have no sense of where packets begin or end and have no reliable control over it. Furthermore, TCP proxies or even HTTP proxies commonly rearrange TCP packets while leaving the byte stream in tact (a proof of the practical value of end to end integrity protection!).

Even the absurd-um solution of 1 byte records does not work because the record overhead creates multibyte sequences that will still span packet boundaries. Such a naive approach would also drown in its own overhead.

QUIC shines here by using its component architecture rather than the traditional layers. The QUIC transport layer receives plaintext from the application and consults its own transport information regarding packet numbers, PMTU, and the TLS keying information. It combines all of this to form the encrypted packets that can be decrypted atomically with the equivalent of one record per UDP packet. Intermediaries are unable to mess with the framing because even the transport layer is integrity protected in QUIC during this same process - a significant security bonus! Any loss events will only impact delivery of the lost packet.

No More TCP RST Data Loss

As many HTTP developers will tell you, TCP RST is one of the most painful parts of the existing ecosystem. Its pain comes in many forms, but the data loss is the worst.

The circumstances for an operating system generating a RST and how they respond to them can vary by implementation. One common scenario is a server close()ing a connection that has received another request that the HTTP server has not yet read and is unaware of. This is a routine case for HTTP/1 and HTTP/2 applications. Most kernels will react to the closing of a socket with unconsumed data by sending a RST to the peer.

That RST will be processed out of order when received. In practice this means if the original client does a recv() to consume ordinary data that was sent by the server before the server invoked close() the client will incur an unrecoverable failure if the RST has also already arrived and that data cannot ever be read. This is true even if the kernel has sent a TCP ack for it! The problem gets worse when combined with larger TLS record sizes as often the last bit of data is what is needed to decode the whole record and substantial data loss of up to 64KB occurs.

The QUIC RST equivalent is not part of the orderly shutdown of application streams and it is not expected to ever force the loss of already acknowledged data.

Better Responsiveness through Buffer Management

The primary goal of HTTP/2 was the introduction of multiplexing into a single connection and it was understood that you cannot have meaningful multiplexing without also introducing a priority scheme. HTTP/1 illustrates the problem well - it multiplexes the path through unprioritized TCP parallelism which routinely gives poor results. The final RFC contained both multiplexing and priority mechanisms which for the most part work well.

However, successful prioritization requires you to buffer before serializing the byte stream into TLS and TCP because once sent to TCP those messages cannot be reordered in the case of higher priority data presenting itself. Unfortunately high latency TCP, requires a significant amount of buffering at the socket layer in order to run as fast as possible. These two competing interests make it difficult to judge how much buffering an HTTP/2 sender should use. While there are some Operating System specific oracles that give some clues, TCP itself does not provide any useful guidance to the application for reasonably sizing its socket buffers.

This combination has made it challenging for applications to determine the appropriate level of socket buffering and in turn they sometimes have overbuffered in order to make TCP run at line rate. This results in poor responsiveness to the priority schedule and the inability for a server to recognize individual streams being canceled (which happens more than you may think) because they have already been buffered.

The blending of transport and application components creates the opportunity for QUIC implementations to do a better job on priority. They do this by buffering application data with its priority information outside of the transmission layer. This allows the late binding of the packet transmission to the data that is highest priority at that moment.

Relatedly, whenever a retransmission is required QUIC retransmits the original data in one or more new packets (with new packet identifiers) instead of retransmitting a copy of the lost packet as TCP does. This creates an opportunity to reprioritize, or even drop canceled streams, during retransmission. This compares favorably to TCP which is sentenced to retransmitting the oldest (and perhaps now irrelevant) data first due to its single sequence number and in-order properties.

UDP means Universal DePloyment

QUIC is not inherently either a user space or kernel space protocol - it is quite possible to deploy it in either configuration. However, UDP based applications are often deployed in userspace configurations and do not require special configurations or permissions to run there. It is fair to expect a number of user space based QUIC implementations.

Time will tell exactly what that looks like, but I anticipate it will be a combination of self-updating evergreen applications such as web servers and browsers and also a small set of well maintained libraries akin to the role openssl plays in distributions.

Decoupling functionality traditionally performed by TCP from the operating system creates an opportunity for deploying software faster, updating it more regularly, and iterating on its algorithms in a tight loop. The long replacement and maintenance schedules of operating systems, sometimes measured in decades, create barriers to deploying new approaches to networking.

This new freedom applies both QUIC itself, but also to some of its embedded techniques that have equivalents in the TCP universe that have traditionally been difficult to deploy. Thanks to user space distribution packet pacing, fast open, and loss discovery improvements like RACK will see greater deployment than ever before.

Userspace will mean faster evolution in networking and greater distribution of the resulting best practices.

---

The IETF QUIC effort is informed by Google's efforts on its own preceding protocol. While it is not the same effort it does owe a debt to a number of Google folk. I'm not privy to all of the internal machinations at G but, at the inevitable risk of omitting an important contribution, it is worth calling out Jim Roskind, Jana Iyengar, Ian Swett, Adam Langley, and Ryan Hamilton both for their work and their willingness to evangelize and discuss it with me. Thanks! We're making a better Internet together.

This post was originally drafted as a post to the IETF HTTP Working Group by Patrick McManus ,

Wednesday, December 5, 2012

Smarter Network Page Load for Firefox

I just landed some interesting code for bug 792438. It should be in the December 6th nightly, and If it sticks it will be part of Firefox 20.

The new feature keeps the network clear of media transfers while a page's basic html ,css, and js are loaded. This results in a faster first paint as the elements that block initial rendering are loaded without competition. Total page load is potentially slightly regressed, but the responsiveness more than makes up for it. Similar algorithms report first paint improvements of 30% with pageload regressions around 5%. There are some cases where the effect is a lot more obvious than 30% - try pinterest.com, 163.com, www.skyrock.com, or www.tmz.com - all of these sites have large amounts of highly parallel media on the page that, in previous versions, competed with the required css.

Any claim of benefit is going to depend on bandwidth, latency, and the contents of your cache (e.g. if you've already got the js and css cached, this is moot.. or if you have a localhost connection like talos tp5 uses it is also moot because bandwidth and latency are essentially ideal there).

Before I get to the nitty gritty I think its worth a paragraph of Inside-Mozilla-Baseball to mention what a fun project this was. I say that in particular because it involved a lot of cross team participation on many different aspects (questions, advice, data, code in multiple modules, and reviews). I think we can all agree that when many different people are involved on the same effort efficiency is typically the first casualty. Perhaps this project is the exception needed to prove the rule - it went from a poorly understood use case to commited code very quickly. Ehsan, Boris, Dave Mandelin, Josh Aas, Taras, Henri, Honza Bambas, Joe Drew, and Chromium's Will Chan helped one and all - its not too often you get the rush of everyone rowing in sync; but it happened here and was awesome to behold and good for the web.

In some ways this feature is not intuitive. Web performance is generally improved by adding parallelization due to large amounts of unused bandwidth left over by the single session HTTP/1 model. In this case we are actually reducing parallelization which you wouldn't think would be a good thing. Indeed that is why it can regress total page load time for some use cases. However, parallelization is a double edged sword.

When parallelization works well it is because it helps cover up moments of idle bandwidth in a single HTTP transaction (e.g. latency in the handshake, latency during the header phase, or even latency pauses involved in growing the TCP congestion window) and this is critically important to overall performance. Servers engage in hostname sharding just to opt into dozens of parallel connections for performance reasons.

On the other hand, when it works poorly parallelization kills you in a couple of different ways. The more spectacular failure mode I'll be talking about in a different post (bug 813715) but briefly the over subscription of the link creates induced queues and TCP losses that interact badly with TCP congestion control and recovery. But the issue at hand here is more pedestrian - if you share a connection 50 ways and all 50 sessions are busy then each one gets just 2% of the bandwidth. They are inherently fair with each other and without priority even though that doesn't reflect their importance to your page.

If the required js and css is only getting 10% of the bandwidth while the images are getting 90% then the first meaningful paint is woefully delayed. The reason you do the parallelism at all is because many of those connections will be going through one of the aforementioned idle moments and aren't all simultaneously busy - so its a reasonable strategy as long as maximizing total bandwidth utilization is your goal.. but in the case of an HTML page load some resources are more important than others and it isn't worth sacrificing that ordering to perfectly optimize total page load. So this patch essentially breaks page load into two phases to sort out that problem.

The basic approach here is the same as used by Webkit's ResourceLoadScheduler. So Webkit browsers already do the basic idea (and have validated it). I decided we wanted to do this at the network layer instead of in content or layout to enable a couple extra bonuses that Webkit can't give because it operates on a higher layer:

If we know apriori that a piece of work is highly sensitive to latency but is very low in bandwidth then we can avoid holding it back even if that work is just part of an HTTP transaction. As part of this patchset I have included the ability to preconnect a small quota of 6 media TCP sessions at a time while css/js is loading. More than 6 can complete, its just limited to 6 outstanding at one instant to bound the bandwidth consumption. This results in a nice little hot pool of connections all ready for use by the image transfers when they are cleared for takeoff. You could imagine this being slightly expanded to a small number of parallel HTTP media transfers that were bandwidth throttled in the future.
The decision on holding back can be based on whether or not SPDY is in use if you wait until you have a connection to make the decision - spdy is a prioritized and muxed-on-one-tcp-session protocol that doesn't need to use this kind of workaround to do the right thing. In its case we should just send the requests as soon as possible with appropriate priorities attached to them and let the server do the right thing. The best of both worlds!

This issue illustrates again how dependent HTTP/1 is on TCP behavior and how that is a shortcoming of the protocol. Reasonable performance demands parallelism, but how much is dependent on the semantics of the content, the conditions of the network, and the size of the content. These things are essentially unknowable and even characterizations of the network and typical content change very quickly. Its essential for the health of the Internet that we migrate away from this model onto HTTP/2.

Comments over here please: https://plus.google.com/100166083286297802191

Thursday, December 8, 2011

SPDY, Bufferbloat, HTTP, and Real-Time Networking

Long router queue sizes on the web continue to be a hot networking topic - Jim Gettys has a long interview in ACM queue. Large unmanaged queues destroy low latency applications - just ask Randell Jessup

A paper like this does a good job of showing just how bad the situation can be - experimentally driving router buffering delay from 10ms to ~1000ms on many common broadband cable and DSL modems. I wish the paper had been able to show me the range and frequency of that queue delay under normal conditions.

I'm concerned that decreasing the router buffer size, thereby increasing the drop rate, is detrimental to the current HTTP/1.x web. A classic HTTP/1.x flow is pretty short - giving it a signal to backoff doesn't save you much - it has sent much of what it needs to send already anyhow. Unless you drop almost all of that flow from your buffers you haven't achieved much. Further, a loss event has a high chance of damaging the flow more seriously than you intended - dropping a SYN or the last packet of the data train is a packet that will have very slow retry timers, and short flows are comprised of high percentages of these kinds of packets. non-drop based loss notification like connex/ecn do less damage but are ineffective again because the short flow is more or less complete when the notification arrives so it cannot adapt the sending rate.

The problem is all of those other parallel HTTP sessions going on that didn't get the message. Its the aggregate that causes the buffer build up. Many sites commonly use 60-90 separate uncoordinated TCP flows just to load one page.

Making web transport more adaptable on this front is a big goal of my spdy work. When spdy consolidates resources onto the same tcp flow it means the remaining larger flows will be much more tcp friendly. Loss indicators will have a fighting chance of hitting the flow that can still backoff, and we won't have windows growing independently of each other. (Do you like the sound of IW=10 times 90? That's what 90 uncorrelated flows mean. IW=10 of a small number of flows otoh is excellent.). That ought to keep router queue sizes down and give things like rtcweb a fighting chance.

It also opens up the possibility of the browser identifying queue growth through delay based analysis and possibly helping the situation out by managing inside the browser our bulk tcp download rate (and definitely the upload rate) by munging the rwin or something like that. If it goes right it really shouldn't hurt throughput while giving better latency to other applications. It's all very pie in the sky and down the road, but its kind of hard to imagine in the current HTTP/1.x world.

Friday, September 23, 2011

SPDY: What I Like About You.

I've been working on implementing SPDY as an experiment in Firefox lately. We'll have to see how it plays out, but so far I really like it.

Development and benchmarking is still a work in progress, though interop seems to be complete. There are several significant to-do items left that have the potential to improve things even further. The couple of anecdotal benchmarks I have collected are broadly similar to the page load time based reports Google has shared at the IETF and velocity conf over the last few months.

tl;dr; Faster is all well and good (and I mean that!) but I'm going to make a long argument that SPDY is good for the Internet beyond faster page load times. Compared to HTTP, it is more scalable, plays nicer with other Internet traffic and brings web security forward.

SPDY: What I Like About You

#1: Infinite Parallelism with Shared Congestion Control.

You probably know that SPDY allows multiplexing of multiple HTTP resources inside one TCP stream. Unlike the related HTTP mechanisms of pipelining concurrent requests on one TCP stream, the SPDY resources can be returned in any order and even mixed together in small chunks so that head of line blocking is never an issue and you never need more than one connection to each real server. This is great for high latency environments because a resource never needs to be queued on either the client or the server for any reason other than network congestion limits.

Normal HTTP achieves transaction parallelism through parallel TCP connections. Browsers limit you to 6 parallel connections per host. Servers achieve greater parallelism by sharding their resources across a variety of host names. Often these host names are aliases for the same host, implemented explicitly to bypass the 6 connection limitation. For example, lh3.googleusercontent.com and lh4.googleusercontent.com are actually DNS CNAMEs for the same server. It is not uncommon to see performance oriented sites, like the Google properties, shard things over as many as 6 host names in order to allow 36 parallel HTTP sessions.

Parallelism is a must-have for performance. I'm looking at a trace right now that uses the afore mentioned 36 parallel HTTP sessions and its page load completes in 16.5 seconds. If I restrict it to just 1 connection per host (i.e. 6 overall), the same page takes 27.7 seconds to load. If I restrict that even further to just 1 connection in total in takes a mind numbing 94 seconds to load. And this is on 40ms RTT broadband - high latency environments such as mobile would suffer much much worse! Keep this in mind when I start saying bad things about parallel connections below, they really do great things and the web we have with them enables much more impressive applications than a single connection HTTP web ever could.

Of course using multiple parallel HTTP connections is not perfect - if they were perfect we wouldn't try to limit them to 6 at a time. There are two main problems. The first is that each connection requires a TCP handshake which incurs an extra RTT (or maybe 3 if you are using SSL) before the connection can be used. The TCP handshake is also relatively computationally hard compared to moving data (servers easily move millions of packets per second, while connection termination is generally measured in the tens of thousands), the SSL handshake even harder. Reducing the number of connections reduces this burden. But in all honesty this is becoming less of a problem over time - the cost of maintaining persistent connections is going down (which amortizes the handshake cost) and servers are getting pretty good at executing the handshakes (both SSL and vanilla) sometimes by employing the help of multi-tiered architectures for busy deployments.

The architectural problem lies in HTTP's interaction with TCP congestion control. HTTP flows are generally pretty short (a few packets per transaction), tend to stop and start a lot, and more or less play poorly with the congestion control model. The model works really well for long flows like a FTP download - that TCP stream will automatically adapt to the available bandwidth of the network and transfer at a fairly steady rate for its duration after a little bit of acclimation time. HTTP flows are generally too short to ever acclimate properly.

A SPDY flow, being the aggregation of all the parallel HTTP connections, looks to be a lot longer, busier, and more consistent than any of the individual parallel HTTP flows would be. Simply put - that makes it work better because all of that TCP congestion logic is applied to one flow instead of being repeated independently across all the parallel HTTP mini flows.

Less simply, when an idle HTTP session begins to send a response it has to guess at how much data should be put onto the wire. It does this without awareness of all the other flows. Let's say it guesses "4 packets" but there are no other active flows. In this case 4 packets is way too few and the network is under utilized and the page loads poorly. But what if 35 other flows are activated at the same time - this means 140 packets get injected into the network at the same time which is way too many. Under that scenario one of two things happen - both of them are bad:

Packet Loss. TCP reacts poorly to packet loss, especially on short flows. While 140 packets in aggregate is a pretty decent flow, remember that total transmission is made up of 35 different congestion control blocks - each one covering a packet flow of only 4 packets. A loss is devastating to performance because most of the TCP recovery strategies don't work well in that environment.
Over Buffering. This is what Jim Gettys calls bufferbloat. The giant fast moving 140 packet burst arrives at your cable modem head where the bandwidth is stepped down and most of those packets get in a long buffer to wait for their turn on your LAN. That works OK, certainly better than packet loss recovery does in practice, but the deep queue creates a giant problem for any interactive traffic that is sharing that link. Packets for those other applications (such as VOIP, gaming, video chat, etc..) now have to sit in this long queue resulting in interactive lag. Lag sucks for the Internet. The HTTP streams themselves also become non-responsive to cancel events because the only way to clear those queues is to wait them out - so clicking on a link to a new page is significantly delayed while the old page that you have already abandoned continues to consume your bandwidth.

This describes a real dilemma - if you guess more aggressive send windows then you will have a better chance of filling the network but you will also have a better chance of packet loss or over buffering. If you guess more conservative windows then loss and buffering happens less often but nothing ever runs very quickly. In the face of all those flows with independent congestion control blocks, there just isn't enough information available. (This of course is related to the famous Initial Window 10 proposal, which I support, but that's another non SPDY story.)

I'm sure you can see where this is going now. SPDY's parallelism, by virtue of being on a single TCP stream, leverages one busy shared congestion control block instead of dealing with 36 independent tiny ones. Because the stream is much busier it rarely has to guess at how much to send (you only need to guess when you're idle, SPDY is more likely to be getting active feedback), if it should drop a packet it reacts to that loss much better via the various fast recovery mechanisms of TCP, and when it is competing for bandwidth at a choke point it is much more responsive to the signals of other streams - reducing the over buffering problem.

It is for these reasons that SPDY is really exciting to me. Parallel connections work great - they work so great that it is hard to have SPDY significantly improve on the page load time of highly sharded site unless there is a very high amount of latency present. But the structural advantages of SPDY enable important efforts like RTCWeb as well as provide better network utilization and help servers scale when compared to HTTP. Even if page load times only stay at par, those other good for the Internet attributes make it worth deploying.

#2: SPDY is over SSL every time.

I greatly lament that I am late to the school of SSL-all-the-time. I spent many years trying to eek the greatest amount of server responses per watt that was possible. I looked at SSL and saw impediments. That stayed with me.

I was right about the impediments, and I've learned a lot about dealing with them, but what I didn't get was that it is simply worth the cost. As we have all seen lately, SSL isn't perfect - but having a layer of protection against an entire class of eavesdropping attacks is a property that should be able to be relied upon in every protocol as generic as HTTP. HTTP does not provide that guarantee - but SPDY does.

huzzah.

As a incentive to make the transition to SSL all the time, this makes it worth deploying by itself.

#3:Header compression.

SPDY compresses all the HTTP-equivalent headers using a specialized dictionary and a compression context that is reserved only for the headers so it does not get diluted with non-header references. This specialized scheme performs very well.

At first I really didn't think that would matter very much - but it is really a significant savings. HTTP's statelessness had its upsides, but the resulting on the wire redundancy was really over the top.

As a sample, I am looking right now at a trace of 1900 resources (that's about 40 pages). 760KB of total downstream plain text header bytes were received as 88KB compressed bytes, and upstream 949KB of plain text headers were compressed as just 65KB. I'll take 1.56MB (82%) in total overhead savings! I even have a todo item that will make this slightly better.

Tuesday, January 11, 2011

Firefox Idle Connection Selection via CWND

When choosing between more than 1 idle persistent connection FF <= 4.0 goes with a FIFO approach. I was thinking about ways to tune this.

I was going to experiment with making that a LIFO. A LIFO afterall should have better cache behavior and would allow the cache size to shrink to a more natural size as the older members remain untouched and timeout. The FIFO will basically keep the size pinned at its maximum while it cycles through all the connections which wastes system RAM with both the client and server maintaining extra idle TCP sessions. It is also the worst possible processor cache behavior. The possible argument in favor of FIFO is actually that connections are expensive to open and we've already opened these so maybe we want to keep it pinned to its maximum size just in case we need it again - it isn't obvious what to do or if it matters much.

Thinking a little further I realized that the major differentiator between these sockets is not a timestamp at all - it is the CWND associated with them on the server. Many web servers at least partially ignore the TCP suggestion to return to slow start after an RTO of idle activity so it is reasonable to assume that some of the connections have larger CWNDs than others as long as we aren't in an environment where the previous transfers have been actually bottlenecked by bandwidth - and on the web that almost never happens, RTT and document size bottleneck most transfers. Even if available bandwidth is the real bottleneck that just moots our metric, it doesn't provide any information that steers us the wrong way.

When choosing which connection to use we want to choose the one that has the largest CWND on the server. Unfortunately we cannot see directly into the congestion information on the server, but assuming that previous transfers have been bottlenecked by RTT (approximately a constant for all connections to the same server) and transfer size then we can just use the one with the history of moving the largest amount of data in one burst as a proxy for it because that burst is what opens the congestion window.

I say one burst instead of one document because we want to include pipelines of multiple HTTP transactions as long as there wasn't an RTT between them. This is another reason to want pipelines - they will more effectively open up the CWND for us. We also want to use the maximum burst seen on the connection, not the necessarily the last burst - the servers we are interested in don't shrink the CWND just because it isn't all being used for each burst.

The implementation is easy - the idleconnection list is changed from being sorted as a FIFO to being sorted according to this "maxburst" metric. The code and bugzilla are here.

Using an experiment designed to show the best case, the results are better than I expected for such a minor tweak. This was my process:

construct a base page mixed with several small and several large images plus a link to a 25KB image. There are 6 objects on the base page.
load the base page - FF4 will use six parallel connections to do it
click on the link to the 25KB image - this will use an idle persistent connection. Measure the performance of this load.

There is 250ms of RTT and several megabits of bandwidth between the client and server in my test.

As expected, vanilla FF 4.0 (beta 9) loads the target image on an idle persistent connection that was used to transfer one of the smallest images on the base page. It is FIFO afterall, and the small image load was going to finish first and be put into the persistent connection pool first.

My patched browser, using the sort by CWND algorithm, loads the target image on the idle persistent connection that was used to transfer the largest image (2MB+) from the base page.

Their history is the only difference between the two connections - at the time of requesting the 25KB image they are both connected and idle. There is a 250ms RTT between my host and the server.

The difference is huge. The default algorithm takes 3 round trips to transfer the document as it works its way through growing the congestion window. That adds up to 793ms. The sort-by-cwnd algorithm inherits a congestion window large enough for the task at hand and moves it all in one stream - just 363ms.

This is a nice tweak, but it has its limitations - by definition you cannot meaningfully multiply the gain across a large number of transactions. If you have a large number of transactions then you probably are using all your idle connections in a burst and there is no point in discriminating between them if you are just going to use them all.

I would argue if you have that much pressure on the connection pool then you are probably queueing requests and should be using pipelining. If you don't have that much pressure, then this probably helps you.

Tuesday, November 30, 2010

The Value of HTTP Pipelines

For the past few months I've been on a personal quest to implement a safe and effective HTTP pipelining strategy for Mozilla Firefox. The pipeline concept is part of HTTP/1.1 but for various reasons has not been widely deployed on the web.

I'm going to make a series of three posts. This one will be basic background information - what pipelines are in the abstract and why they are more relevant to today's web than they have been in years. The second post will be about the details of my Firefox implementation and its mechanisms for dealing with the realities of the occasional pipeline-hostile piece of infrastructure. The last post will share some performance metrics of those patches in common use cases.

HTTP Pipelines - 101

A client forms a pipeline simply by sending a second HTTP request down an HTTP connection before it has received the first response. A pipeline may be more than two transactions deep. HTTP responses are still required to be returned in the order their requests arrived.

This is a simple concept, but the benefits are profound.

Chiefly, the pipelined transactions benefit by eliminating the latency involved in sending their request. The greater the latency, the greater the benefit.

This crude diagram shows 2 normal HTTP transactions without pipelining. In it each request is shown with a rightward arrow and the response is shown with a leftward arrow. The empty space represents the network latency. When the first response is received, the second is sent.

When pipelining is used, the picture becomes different. Even though the request and response sizes are the same, and the round trip time between the hosts is the same, the overall transaction completes sooner. It is obvious that the round trip latency between the client and the server has been mitigated as a problem.

The greater the latency and the deeper the pipeline the more benefit is captured.

While bandwidth continues to improve, latency is not keeping up. The trend is toward mobile networks and those have latencies 5x to 20x worse than broadband WAN connections. This increased latency is an opportunity for pipelining.

HTTP Pipelines - 201

Conceptually, parallel HTTP connections can garner similar benefits. Afterall, independent HTTP connections do not need to wait for each other to finish before beginning their work.

Up to a point, that is completely true. Parallelism has been the mainstay for many years and served us well. But it has its limits.

The chief limit is actually specified. No more than 6 (or 4, or 2 depending on the point in history) simultaneous connections are supposed to be allowed from the user agent to the server. This is a serious handicap - Firefox may easily have a request queue of over 100 images, stylesheets, and javascript items to fetch upon parsing a Facebook HTML page full of photos, emoticons, ads, and widgets. A single digit number of connections is far too small to effectively parallelize that download.

This brings us to the reason there is a limit on the number of connections. Making a new TCP connection sucks. Before it can be used at all it requires a high latency three way handshake and if it is over TLS an expensive RSA operation on the server (and yet another round trip). Making the new connection requires access to shared data structures on the server in a way that using an existing connection does not and harms scalability. Servers that can pump multiple gigabits a second of TCP data in their sleep through a few connections, can still only initiate on the order of tens of thousands of connections a second.

Maintaining all those connections is expensive for the server too. Each one takes state (i.e. RAM in the kernel, perhaps a thread stack in the application), and the server now has to deal with sorting through more control blocks and application states everytime a packet comes in in order to match it with the right one. Because of the increased diversity of TCP control blocks in use, the L2/L3 cache is severely polluted which is the number one factor in high performance TCP.

Even if pipelines only mean browser performance equal to parallel connections but accomplished using fewer connections then that is a win for Web scalability.

HTTP Pipelines - 301

The details of TCP start to weigh in heavily in favor of pipelining when parallelism and pipelining are compared.

Most significantly, TCP slow-start performs poorly in the parallelized environment. To recap what you already know: Slow start requires the sender to send a conservative amount of data at first (typically 3 or 4 packets) and then wait a full round trip time to receive positive feedback in the form of an ACK from the recipient. At that time it can grow the window a little bit and send another burst and then wait again. After enough iterations of this the sender is generating enough traffic in each burst to "fill the window" and it no longer has to wait.

Pipelining, of course, does not eliminate slow start. But it does use fewer connections to move the same amount of data and the slow-start penalty scales with the number of connections not the amount of data. Fewer connections mean less data is transferred under the artificially slow conditions of slow start.

There is another wrinkle that applies to even parallel connections that have paid their start-up dues and progressed past slow start. TCP connections that have not sent data within an RTO (think of an RTO as a round trip time plus a grace period) are supposed to go back to slow-start! From a TCP point of view this makes some sense - the inactivity means the TCP stack has lost the ack-clock that it needs to use the connection aggressively. But for even a persistent use of a parallel HTTP connection this is a disaster. Effectively each transaction response must go through slow start because it takes more than a round trip for the last packet of the previous response to travel to the client, the client to form the next request, the request to travel to the server and the server to form the response. Pipelined connections do not have that problem - one response follows immediately on the heels of the previous response which ensures optimal TCP use of the available bandwidth. Huzzah.

Lastly, it is worth talking about Google's push for larger initial windows during slow start. They support a value of 10 instead of the current 3 or 4. Basically I think this is a good thing - the Internet can support larger bursts than we are currently sending. Unfortunately, this has a terrible potential interaction with parallel connections. 6 connections would essentially mean 60 congestion control packet credits available to be sent in a burst at any time. Just as 3 is too few for a single connection environment, 60 is too many for a multiple connection environment. Whereas pipelines reduce the reliance on parallel connections, they bring down the total number of packet credits outstanding at any time while still allowing for more effective slow start capacity probing. That's a win win in my book.

Friday, August 20, 2010

Characterizing Delays Caused by TCP in-order

If packet N+1 of a TCP flow arrives before packet N, the receiving application does not see any data until packet N gets there. That's what we mean when we say TCP guarantees in-order delivery. That is also true if N+1 through N+100 get there before N - nobody gets through until they can all be delivered in-order.

At least using the BSD socket API.

I got to thinking about the impact of this when discussing a multiplexing implementation of various logical streams on top of TCP. For instance, SPDY and BEEP do things along those lines in order to create efficiencies in terms of more accurate congestion control data. But as someone objected, that creates a certain amount of fate sharing between the different streams that wouldn't exist if they were on separate TCP channels. A packet loss in one of them creates a delay for them both even though throughput might very well be maintained using some variation of fast-retransmit and large windows.

So the question: how often are packets received but the data in them is delayed by he kernel because the stream isn't yet in order? And how long are those delays?

I don't know yet, but I wrote some crude linux kernel patches to find out. When a skb is moved out of the out of order queue a structure with 2 timestamps (in queue, out of queue) is passed to userspace through the netlink connector mechanism. It also reports the total number of received data packets on each TCP stream. That way we can find out, how often and how long.

I'm running the hack on my desktop now.

Tuesday, February 16, 2010

Googling Harder

A while back I mentioned that Google thinks TCP ought to be more aggressive.

I must admit, this matches my own bias. I can barely count the number of applications I have watched wait for network I/O when there was plenty of CPU and idle bandwidth available. It's maddening. Sometimes it's slow start or another aspect of congestion control, sometimes it is outdated things like the nagle algorithm.

Well, Google is back at it with this slide set. (PDF)

They make the argument for increasing the initial cwnd. More provocatively, they argue that the Web has already done so in a defacto way by going to aggressive numbers of independent parallel HTTP connections (where you essentially get new cwnd credits just for opening a new TCP stream). Clever argument. Maybe you want to pace the data after 3 or 4 packets based on the RTT of the handshake - so you don't overrun any buffers un-necessarily.

Frankly, this kind of thing can be implemented on the server side without ever telling the peer. It would make some sense for Google to just do this for a few different values of cwnd on a tiny fraction of their traffic and see if the packet loss rates change and then publish that.

Monday, July 20, 2009

What is eating those Google SYN-ACKs?

In this post, I mentioned google was seeing huge packet loss on syn-acks from their servers. At the time it looked like 2%. That sounded nuts.

It still sounds nuts.

Someone else on the mailing list posted about that, and Jerry Chu of Google confirmed it:

Our overall pkt retransmission rate often goes over 1%. I was
wondering if SYN/SYN-ACK pkts are less likely to be dropped
by some routers due to their smaller size so we collected traces
and computed SYN-ACK retransmissions rate on some servers.
We confirmed it to be consistent with the overall pkt drop rate,
i.e., > 1% often.

You could imagine why the overall retransmission rate might be higher than the real drop rate due to jitter and various fast retransmit algorithms that might retransmit things that just hadn't been acknowledged quite yet. Even SYNs might be dropped at the host (instead of the network) due to queue overflows and such.. but we're talking about SYN-ACKs from busy servers towards what one would expect would be pretty idle google-searching clients. And these SYN-ACKs have giant timeouts (3 seconds - which is why Jerry was writing in the first place) so it certainly isn't a matter of over-aggressive retransmit. The only explanation seems to be packet loss. At greater than 1%

wow.

This probably has more to do with the global nature of google's audience than anything else. But still, TCP can really suck at loss rates that high. It must be very different than the desktop Internet I know (which is a fair-to-middlin cable service, not a fancy Fiber-To-The-Home setup which is becoming more common.)

I wonder exactly where those losses happen.

Tuesday, July 14, 2009

Google Thinks TCP should be more Aggressive by Default

Really interesting post from Jerry Chu of Google. He says Google has data which shows that we ought to lower the initial RTO, increase the initial CWND, drop the min RTO, and reduced the delayed ack time out in TCP.

Based on my own anecdotal data, I've done stuff like that in products I've worked on. Let's face it - 3 seconds is a freaking eternity. Processors, networks, and busses have all scaled but these constants remain the same. Jerry says Google has data that shows this is important. As the google data set is no doubt much more extensive than any I worked with, that's a really welcome post.

Probably the most important data point Jerry shares is that "up to a few percentage points" in his data set exhibit a SYN-ACK retransmission from the google servers. Wow. (at least) 1 in 50 syn-acks needs to be retransmitted? That's not my experience at all, and if true on google scale it is absolutely fascinating. Are they generally seeing 2% packet loss on google tx? There's no way that they are seeing that.. google would appear to suck! So what's going on... ? Why is syn-ack rexmitted more than anything else? (and I'm assuming they are indeed lost, because otherwise lowering the timeout wouldn't be the right remedy..)

Friday, February 6, 2009

Increasing Upload Speed from Firefox on Windows

Sometimes bugs are more interesting to work on than features - they have that mysterious quality about them and give a satisfying feeling when you figure it out.

This one was brought to my attention by Mark Finkle.

It basically boiled down to HTTP POSTs from Firefox on Windows being slower than they are in Internet Explorer, and also slower than they are in Firefox on OS X or Linux. (IE on windows and the non-windows platforms all perform about the same, with FF on windows lagging behind).

The culprit turned out to be the TCP congestion window. Firefox never had more than 8KB of un-acked data outstanding. If you have a network path with a high bandwidth-delay product, that isn't going to cut it.

Windows (up to and including Vista) has an 8KB default sending window. Or so I found out thanks to Google.

Autotuning that buffer size is standard practice on OS X and Linux and has been for a long while. Vista autotunes the receive buffer (but not XP according to what I read), but the send buffer is a small fixed value. IE, realizing that its a web 2.0 kinda world out there full of User Generated Content, must increase that value from its default - because I can look at the IE tcpdump traces and see >80KB of un-acknowledged data (there would be more, but the max window size is not the limit at whatever value they have it set to) in the same way I do with a trace of Firefox on Linux.

The Linux default is 128KB for any reasonably modern machine.

Fortunately that can be controlled on a system wide basis through a registry preference, or on a per socket basis by setting SO_SNDBUF. I submitted a patch that does the latter if the network.tcp.sendbuffer preference is set - the patch also sets the pref for windows.

If you would suffer from this, I see three options:
1] Wait until my patch (or a later rev of it) ends up in an official build
2] Set the registry property to change it for your whole Windows install - KB 950326.
3] you might be able to build a k3wl binary add-on that does the same thing as my patch in a crazy way. Fame, fortune, and faster flickr and picasa uploads await you.

Saturday, April 5, 2008

Linux Selective Acknowledgment (SACK) CPU Overhead

Last year I tossed an e-mail from the Linux kernel networking list in my "projtodo" folder.

The mail talked about how the Linux TCP stack in particular, and likely all TCP stacks in general, likely had an excessive-CPU attack exposure when confronted with malicious SACK options. I found the mail intriguing but unsatisfying. It was well informed speculation but didn't have any hard data, nor was there any simple way to gather some. Readers of the other posts on this blog will know I really dig measurements. The issue at hand was pretty obviously is a problem - but how much of one?

A few weeks ago I had the chance to develop some testing code and find out for myself - and IBM DeveloperWorks has published the summary of my little project. The executive summary is "its kinda bad, but not a disaster, and hope is on the way". There is had data and some pretty pictures in the article itself.

The coolest part of the whole endeavor, other than scratching the "I wonder" itch, was getting to conjure up a userspace TCP stack from raw sockets. It was, of course, woefully incomplete as it was just meant to trigger a certain behavior in its peer instead of being generally useful or reliable - but nonetheless entertaining.

Saturday, August 4, 2007

Lamenting ECN deployments

Explicit Congestion Control (ECN) has had lots of papers written about it. It has also been in various stages of deployment for almost 10 years now. I generally agree with those that feel it is a good thing. Beyond the scientific data, always important of course, at a gut level separating data loss from congestion notification is an obviously good thing to do - they are simply different things and the implicit overload TCP currently uses results in creating un-necessary overflows and over-conservative backoffs.

But the sad fact is that ECN just isn't a relevant to the real world Internet. If you can't become relevant in 8 years or so, it is probably time to try something else. For a long time the issue was getting over interop problems with NATs and Firewalls. Then there was the matter of getting widespread client and router deployments. Linux has had client support for a long time, other Unixes more recently, and Microsoft Vista is the first MS OS to include ECN support at all - but it ships disabled by default.

However, it seems that even as clients are catching up, the routing infrastructure still isn't playing along.

I took a sample from my home network to see if ECN was relevant at all to my computing life. The summary is that ECN is irrelevant in my data set.

My network runs a lot of Linux with ECN enabled, so if anything ECN will be over-represented on my network compared to the Internet at large. The sample covered

8.5 days
1,227,473 IP packets
40,914 TCP flows

Of those 41K flows, almost 8 percent (3268) negotiated ECN on between peers. As I already mentioned, I suspect this pretty meager number is higher that the Internet at large.

8% - that's great and might make a real contribution. But without router support, those eligible flows won't actually use ECN in any meaningful sense. There are two signs of ECN actually taking root with an intermediate router:

The TCP ECN ECHO flag bit is set on an incoming packet.. this tells the receiver that an earlier packet it sent was marked by a router on its way to the destination. The peer is setting this flag in order to tell the original sender to slow down so that doesn't keep happening.
The IP header ECN codepoint is set to 3 on an incoming packet. This indicates to the receiver that the packet hit some congestion on the way and a router set this bit to mark that fact.

There were no ECN ECHO flags in the entire trace, except those set on SYN packets where it is used in order to negotiate endpoint knowledge of ECN. When appearing on a SYN it is not a sign of router support.

There were 87 packets (over 8 different TCP flows) with the ECN codepoint of 3 - normally indicating congestion marks added by a router. However, that is a bogus conclusion in this case and it does not appear that in 1.2 million packets I have a single one that was marked as congestion-influenced.

I can say conclusively that the 87 codepoint 3 packets are false positives because none of the 8 flows containing those packets were included in the 8% of flows that had successfully negotiated the use of ECN. The peer must have been using the codepoint to indicate something different entirely. Each flow was with a different host, though they were all SMTP MTA based in Europe (7 of them in Germany).

Sally Floyd says on her webpage:

"David Moore from CAIDA reports that in measurements at one link, 0.1% of the packets had the CE codepoint set. Either this codepoint was being used for some other purpose, or there is some deployment of ECN capability in routers as well as in TCP stacks."

My data at least hints that the hope that there is some deployment of ECN capability in routers is over-optimistic.

The final wrapup - my traces are characterized by meager (< 10%) client support no indication of router support at all.

I was inspired to take a look at this by the work of Bob Briscoe, who is championing a IETF BoF on Re-ECN, which is an attempt to make lemonade from lemons and re-invigorate the basic underlying technology.

Friday, July 6, 2007

In Support of Math Based Computer Science

Earlier in my day, I ran across a book review for Computer Science Reconsidered: The Invocation Model of Process Expression. The premise of the book, at least garnered second hand from the review:

Mathematicians and computer scientists are pursuing fundamentally different aims, and the mathematician's tools are not as appropriate as was once supposed to the questions of the computer scientist

I had a number of reactions to that immediately. Most of them were, frankly, emotional. Certainly most of the code that gets churned out these days has very little conscious basis in Mathematics. But I would argue it doesn't have much of a conscious basis in Computer Science, Algorithms, or Finite Automata either which are all definitely critical some of the time. That is largely because so much of it is so well understood, and so abstracted away from first principles, that the underlying rigor isn't required to get the day to day bits churned out. The more important the code is, the more it moves down that spectrum of rigor.

But when we're really reaching for something new, something interesting, and something that isn't just an incremental change from the conventional wisdom, then my instincts say that Math provides a very valuable framework for describing something out of nothingness.

My gut was validated just hours later when reading an IEEE journal article on the so-called FastTCP active queue management TCP congestion control algorithm. The approach looks at congestion control as "a distributed algorithm over the Internet to solve a global optimization problem [.. to ..] determine the equilibrium and performance of the network"

Moreover, the underlying optimization problem has a simple structure that allows us to efficiently compute these equilibrium properties numerically, even for a large network that is hard to simulate.

Specifically, we can regard each source as having a utility function that measures its “happiness” as a function of its data rate. Consider the problem of maximizing the sum of all source utility functions over their rates, subject to link capacity constraints. This is a standard constrained optimization problem for which many iterative solutions exist. The challenge in our context is to solve for the optimal source rates in a distributed manner using only local information. A key feature we exploit is the duality theory. It says that associated with our (primal) utility maximization problem is a dual minimization problem. Whereas the primal variables over which utility is to be maximized are source rates, the dual variables for the dual problem are congestion measures at the links. Moreover, solving the dual problem is equivalent to solving the primal problem. There is a class of optimization algorithms that iteratively solve for both the primal and dual problems at once.

TCP/AQM can be interpreted as such a primal-dual algorithm that is distributed and decentralized, and solves both the primal and dual problems. TCP iterates on the source rates (a source increases or decreases its window in response to congestion in its path), and AQM iterates on the congestion measures (e.g., loss probability at a link increases or decreases as sources traversing that link increase or decrease their rates). They cooperate to determine iteratively the network operating point that maximizes aggregate utility. When this iterative process converges, the equilibrium source rates are optimal solutions of the primal problem and the equilibrium congestion measures are optimal solutions of the dual problem. The throughput and fairness of the network are thus determined by the TCP algorithm and the associated utility function, whereas utilization, loss, and delay are determined by the AQM algorithm.

It seems clear here that math provides very strong underpinning for what the article needs to describe and achieve. To be fair to the author of the original book, he was trying to promote another basis for expressing key Computer Science thoughts: "the invocation model of process expression". Which from casual glance looks interesting, I just don't get why you have to tear down something old (e.g. "The Problem: Why the underlying theory of contemporary computer science is not helpful") in order to build up something new.

Maybe being shocking is good for selling books. Though, I'm not sure the labeling of math as not helpful is all that shocking to the general book buying population.

Check out www.fastsoft.com where some of the authors of that paper have created a clever hardware bridge to seamlessly migrate a legacy TCP data center into one that sends with FastTCP congestion control algorithm.