Bits Up!: performance

Showing posts with label performance. Show all posts

Thursday, April 6, 2017

On The Merits of QUIC for HTTP

I am often asked why the Internet HTTP community is working on an IETF based QUIC when HTTP/2 (RFC 7540) is less than 2 years old. There are good answers! This work is essentially part two of a planned evolution to improve speed, reliability, security, responsiveness, and architecture. These needs were understood when HTTP/2 was standardized but the community decided to defer larger architectural changes to the next version. That next version is arriving now in the form of QUIC.

Flashback to HTTP/2 Development

HTTP/2 development consciously constrained its own work in two ways. One was to preserve HTTP/1 semantics, and the other was to limit the work to what could be done within a traditional TLS/TCP stack.

The choice to use TLS/TCP in HTTP/2 was not a forgone conclusion. During the pre-HTTP/2 bakeoff phase there was quite a bit of side channel chatter about whether anyone would propose a different approach such as the SCTP/UDP/[D]TLS stack RTC DataChannels was contemporaneously considering for standardization.

In the end folks felt that experience and running code were top tier properties for the next revision of HTTP. That argued for TCP/TLS. Using the traditional stack lowered the risk of boiling the ocean, improved the latency of getting something deployed, and captured the meaningful low hanging fruit of multiplexing and priority that HTTP/2 was focused on.

Issues that could not be addressed with TCP/TLS 1.2 were deemed out of scope for that round of development. The mechanisms of ALPN and Alternative Services were created to facilitate successor protocols.

I think our operational experience has proven HTTP/2's choices to be a good decision. It is a big win for a lot of cases, it is now the majority HTTPS protocol, and I don't think we could have done too much better within the constraints applied.

The successor protocol turns out to be QUIC. It is a good sign for the vibrancy of HTTP/2 that the QUIC charter explicitly seeks to map HTTP/2 into its new ecosystem. QUIC is taking on exactly the items that were foreseen when scoping HTTP/2.

This post aims to highlight the benefits of QUIC as it applies to the HTTP ecosystem. I hope it is useful even for those that already understand the protocol mechanics. It, however, does not attempt to fully explain QUIC. The IETF editor's drafts or Jana's recent tutorial might be good references for that.

Fixing the TCP In-Order Penalty

The chief performance frustration with HTTP/2 happens during higher than normal packet loss. The in-order property of TCP spans multiplexed HTTP messages. A single packet loss in one message prevents subsequent unrelated messages from being delivered until the loss is repaired. This is because TCP delays received data in order to provide in-order delivery of the whole stream.

For a simple example imagine images A and B each delivered in two parts in this order: A1, A2, B1, and B2. If only A1 were to suffer a packet loss under TCP that would also delay A2, B1, and B2. While image A is unavoidably damaged by this loss, image B is also impacted even though all of its data was successfully transferred the first time.

This is something that was understood during the development of RFC 7540 and was correctly identified as a tradeoff favoring connections with lower loss rates. Our community has seen some good data on how this has played out "in the wild" recently from both Akamai and Fastly. For most HTTP/2 connections this strategy has been beneficial, but there is a tail population that actually regresses in performance compared to HTTP/1 under high levels of loss because of this.

QUIC fixes this problem through multistreaming onto one connection in a way very familiar to 7540 but with an added twist. It also gives each stream its own ordering context analagous to a TCP sequence number. These streams can be delivered independently to the application because in-order only applies to each stream instead of the whole connection in QUIC.

I believe fixing this issue is the highest impact feature of QUIC.

Starting Faster

HTTP/2 starts much faster than HTTP/1 due to its ability to send multiple requests in the first round trip and its superior connection management results in fewer connection establishments. However, new connections using the TCP/TLS stack still incur 2 or 3 round trips of delay before any HTTP/2 data can be sent.

In order to address this QUIC eschews layers in favor of a component architecture that allows sending encrypted application data immediately. QUIC still uses TLS for security establishment and has a transport connection concept, but these components are not forced into layers that require their own round trips to be initialized. Instead the transport session, the HTTP requests, and the TLS context are all combined into the first packet flight when doing session resumption (i.e. you are returning to a server you have seen before). A key part of this is integration with TLS 1.3 and in particular the 0-RTT (aka Early Data) handshake feature.

The HTTP/2 world, given enough time, will be able to capture some of the same benefits using both TLS 1.3 and TCP Fast Open. Some of that work is made impractical by dependencies on Operating System configurations and the occasional interference from middleboxes unhappy with TCP extensions.

However, even at full deployment of TLS 1.3 and TCP Fast Open, that approach will lag QUIC performance because QUIC can utilize the full flight of data in the first round trip while Fast Open limits the amount of data that can be carried to the roughly 1460 bytes available in a single TCP SYN packet. That packet also needs to include the TLS Client Hello and HTTP SETTINGS information along with any HTTP requests. That single packet runs out of room quickly if you need to encode more than one request or any message body. Any excess needs to wait a round trip.

Harmonizing with TLS

When used with HTTP/1 and HTTP/2, TLS generally operates as a simple pipe. During encryption cleartext streams of bytes go in one side and a stream of encrypted bytes come out the other and are then fed to TCP. The reverse happens when decrypting. Unfortunately, the TLS layer operates internally on multi-byte records instead of a byte stream and the mismatch creates a significant performance problem.

The records can be up to 64KB and a wide variety of sizes are used in practice. In order to enforce data integrity, one of the fundamental security properties of TLS, the entire record must be received before it can be decoded. When the record spans multiple packets a problem similar to the "TCP in-order penalty" discussed earlier appears.

A loss to any packet in the record delays decoding and delivery of the other correctly delivered packets while the loss is repaired. In this case the problem is actually a bit worse as any loss impacts the whole record not just the portion of the stream following the loss. Further, because application delivery of the first byte of the record is always dependent on the receipt of the last byte of the record simple serialization delays or common TCP congestion-control stalls add latency to application delivery even with 0% packet loss.

The 'obvious fix' of placing an independent record in each packet turns out to work much better with QUIC than TCP. This is because TCP's API is a simple byte stream. Applications, including TLS, have no sense of where packets begin or end and have no reliable control over it. Furthermore, TCP proxies or even HTTP proxies commonly rearrange TCP packets while leaving the byte stream in tact (a proof of the practical value of end to end integrity protection!).

Even the absurd-um solution of 1 byte records does not work because the record overhead creates multibyte sequences that will still span packet boundaries. Such a naive approach would also drown in its own overhead.

QUIC shines here by using its component architecture rather than the traditional layers. The QUIC transport layer receives plaintext from the application and consults its own transport information regarding packet numbers, PMTU, and the TLS keying information. It combines all of this to form the encrypted packets that can be decrypted atomically with the equivalent of one record per UDP packet. Intermediaries are unable to mess with the framing because even the transport layer is integrity protected in QUIC during this same process - a significant security bonus! Any loss events will only impact delivery of the lost packet.

No More TCP RST Data Loss

As many HTTP developers will tell you, TCP RST is one of the most painful parts of the existing ecosystem. Its pain comes in many forms, but the data loss is the worst.

The circumstances for an operating system generating a RST and how they respond to them can vary by implementation. One common scenario is a server close()ing a connection that has received another request that the HTTP server has not yet read and is unaware of. This is a routine case for HTTP/1 and HTTP/2 applications. Most kernels will react to the closing of a socket with unconsumed data by sending a RST to the peer.

That RST will be processed out of order when received. In practice this means if the original client does a recv() to consume ordinary data that was sent by the server before the server invoked close() the client will incur an unrecoverable failure if the RST has also already arrived and that data cannot ever be read. This is true even if the kernel has sent a TCP ack for it! The problem gets worse when combined with larger TLS record sizes as often the last bit of data is what is needed to decode the whole record and substantial data loss of up to 64KB occurs.

The QUIC RST equivalent is not part of the orderly shutdown of application streams and it is not expected to ever force the loss of already acknowledged data.

Better Responsiveness through Buffer Management

The primary goal of HTTP/2 was the introduction of multiplexing into a single connection and it was understood that you cannot have meaningful multiplexing without also introducing a priority scheme. HTTP/1 illustrates the problem well - it multiplexes the path through unprioritized TCP parallelism which routinely gives poor results. The final RFC contained both multiplexing and priority mechanisms which for the most part work well.

However, successful prioritization requires you to buffer before serializing the byte stream into TLS and TCP because once sent to TCP those messages cannot be reordered in the case of higher priority data presenting itself. Unfortunately high latency TCP, requires a significant amount of buffering at the socket layer in order to run as fast as possible. These two competing interests make it difficult to judge how much buffering an HTTP/2 sender should use. While there are some Operating System specific oracles that give some clues, TCP itself does not provide any useful guidance to the application for reasonably sizing its socket buffers.

This combination has made it challenging for applications to determine the appropriate level of socket buffering and in turn they sometimes have overbuffered in order to make TCP run at line rate. This results in poor responsiveness to the priority schedule and the inability for a server to recognize individual streams being canceled (which happens more than you may think) because they have already been buffered.

The blending of transport and application components creates the opportunity for QUIC implementations to do a better job on priority. They do this by buffering application data with its priority information outside of the transmission layer. This allows the late binding of the packet transmission to the data that is highest priority at that moment.

Relatedly, whenever a retransmission is required QUIC retransmits the original data in one or more new packets (with new packet identifiers) instead of retransmitting a copy of the lost packet as TCP does. This creates an opportunity to reprioritize, or even drop canceled streams, during retransmission. This compares favorably to TCP which is sentenced to retransmitting the oldest (and perhaps now irrelevant) data first due to its single sequence number and in-order properties.

UDP means Universal DePloyment

QUIC is not inherently either a user space or kernel space protocol - it is quite possible to deploy it in either configuration. However, UDP based applications are often deployed in userspace configurations and do not require special configurations or permissions to run there. It is fair to expect a number of user space based QUIC implementations.

Time will tell exactly what that looks like, but I anticipate it will be a combination of self-updating evergreen applications such as web servers and browsers and also a small set of well maintained libraries akin to the role openssl plays in distributions.

Decoupling functionality traditionally performed by TCP from the operating system creates an opportunity for deploying software faster, updating it more regularly, and iterating on its algorithms in a tight loop. The long replacement and maintenance schedules of operating systems, sometimes measured in decades, create barriers to deploying new approaches to networking.

This new freedom applies both QUIC itself, but also to some of its embedded techniques that have equivalents in the TCP universe that have traditionally been difficult to deploy. Thanks to user space distribution packet pacing, fast open, and loss discovery improvements like RACK will see greater deployment than ever before.

Userspace will mean faster evolution in networking and greater distribution of the resulting best practices.

---

The IETF QUIC effort is informed by Google's efforts on its own preceding protocol. While it is not the same effort it does owe a debt to a number of Google folk. I'm not privy to all of the internal machinations at G but, at the inevitable risk of omitting an important contribution, it is worth calling out Jim Roskind, Jana Iyengar, Ian Swett, Adam Langley, and Ryan Hamilton both for their work and their willingness to evangelize and discuss it with me. Thanks! We're making a better Internet together.

This post was originally drafted as a post to the IETF HTTP Working Group by Patrick McManus ,

Thursday, March 7, 2013

Firefox Nightly on WebPageTest

Thanks to Google's amazing Pat Meenan, Firefox nightly has been added to webpagetest.org - just make sure to select the Dulles VA location to be able to use it.

He shared the news on twitter. Thanks Pat!

WPT is a gold standard tool and to be able to see it reflect (and validate/repudiate) changes in nightly is really terrific addition.

Wednesday, December 5, 2012

Smarter Network Page Load for Firefox

I just landed some interesting code for bug 792438. It should be in the December 6th nightly, and If it sticks it will be part of Firefox 20.

The new feature keeps the network clear of media transfers while a page's basic html ,css, and js are loaded. This results in a faster first paint as the elements that block initial rendering are loaded without competition. Total page load is potentially slightly regressed, but the responsiveness more than makes up for it. Similar algorithms report first paint improvements of 30% with pageload regressions around 5%. There are some cases where the effect is a lot more obvious than 30% - try pinterest.com, 163.com, www.skyrock.com, or www.tmz.com - all of these sites have large amounts of highly parallel media on the page that, in previous versions, competed with the required css.

Any claim of benefit is going to depend on bandwidth, latency, and the contents of your cache (e.g. if you've already got the js and css cached, this is moot.. or if you have a localhost connection like talos tp5 uses it is also moot because bandwidth and latency are essentially ideal there).

Before I get to the nitty gritty I think its worth a paragraph of Inside-Mozilla-Baseball to mention what a fun project this was. I say that in particular because it involved a lot of cross team participation on many different aspects (questions, advice, data, code in multiple modules, and reviews). I think we can all agree that when many different people are involved on the same effort efficiency is typically the first casualty. Perhaps this project is the exception needed to prove the rule - it went from a poorly understood use case to commited code very quickly. Ehsan, Boris, Dave Mandelin, Josh Aas, Taras, Henri, Honza Bambas, Joe Drew, and Chromium's Will Chan helped one and all - its not too often you get the rush of everyone rowing in sync; but it happened here and was awesome to behold and good for the web.

In some ways this feature is not intuitive. Web performance is generally improved by adding parallelization due to large amounts of unused bandwidth left over by the single session HTTP/1 model. In this case we are actually reducing parallelization which you wouldn't think would be a good thing. Indeed that is why it can regress total page load time for some use cases. However, parallelization is a double edged sword.

When parallelization works well it is because it helps cover up moments of idle bandwidth in a single HTTP transaction (e.g. latency in the handshake, latency during the header phase, or even latency pauses involved in growing the TCP congestion window) and this is critically important to overall performance. Servers engage in hostname sharding just to opt into dozens of parallel connections for performance reasons.

On the other hand, when it works poorly parallelization kills you in a couple of different ways. The more spectacular failure mode I'll be talking about in a different post (bug 813715) but briefly the over subscription of the link creates induced queues and TCP losses that interact badly with TCP congestion control and recovery. But the issue at hand here is more pedestrian - if you share a connection 50 ways and all 50 sessions are busy then each one gets just 2% of the bandwidth. They are inherently fair with each other and without priority even though that doesn't reflect their importance to your page.

If the required js and css is only getting 10% of the bandwidth while the images are getting 90% then the first meaningful paint is woefully delayed. The reason you do the parallelism at all is because many of those connections will be going through one of the aforementioned idle moments and aren't all simultaneously busy - so its a reasonable strategy as long as maximizing total bandwidth utilization is your goal.. but in the case of an HTML page load some resources are more important than others and it isn't worth sacrificing that ordering to perfectly optimize total page load. So this patch essentially breaks page load into two phases to sort out that problem.

The basic approach here is the same as used by Webkit's ResourceLoadScheduler. So Webkit browsers already do the basic idea (and have validated it). I decided we wanted to do this at the network layer instead of in content or layout to enable a couple extra bonuses that Webkit can't give because it operates on a higher layer:

If we know apriori that a piece of work is highly sensitive to latency but is very low in bandwidth then we can avoid holding it back even if that work is just part of an HTTP transaction. As part of this patchset I have included the ability to preconnect a small quota of 6 media TCP sessions at a time while css/js is loading. More than 6 can complete, its just limited to 6 outstanding at one instant to bound the bandwidth consumption. This results in a nice little hot pool of connections all ready for use by the image transfers when they are cleared for takeoff. You could imagine this being slightly expanded to a small number of parallel HTTP media transfers that were bandwidth throttled in the future.
The decision on holding back can be based on whether or not SPDY is in use if you wait until you have a connection to make the decision - spdy is a prioritized and muxed-on-one-tcp-session protocol that doesn't need to use this kind of workaround to do the right thing. In its case we should just send the requests as soon as possible with appropriate priorities attached to them and let the server do the right thing. The best of both worlds!

This issue illustrates again how dependent HTTP/1 is on TCP behavior and how that is a shortcoming of the protocol. Reasonable performance demands parallelism, but how much is dependent on the semantics of the content, the conditions of the network, and the size of the content. These things are essentially unknowable and even characterizations of the network and typical content change very quickly. Its essential for the health of the Internet that we migrate away from this model onto HTTP/2.

Comments over here please: https://plus.google.com/100166083286297802191

Saturday, January 7, 2012

Using SPDY for more responsive interfaces

RST_STREAM turns out to be a feature of spdy that I under appreciated for a long time. The concept is simple enough - either end of the connection can cancel an individual stream without impacting the other streams that are multiplexed on the same connection.

This fills a conceptual gap left by HTTP/1.x. - In HTTP when you want to cancel a transaction about all you can do is close the connection.

Consider the case of quickly clicking through a webmail mailbox - just scanning the contents and rapidly clicking 'next'. Typically the pages will be only partly loaded before you move on to the next one. Assuming you have used up all your parallelism in HTTP/1, the new click will either have to wait for the old transactions to complete (wasting time and bandwidth) or cancel the old ones by closing those connections and then open fresh connections for the new requests. New connections add 2 or 3 round trip times to reopen the SSL connection (you are reading email over SSL, right?) before they can be used to send the new requests. Either way - that is not a good experience.

An interactive map application has similar attributes - as you scan along the map, zooming in and out, lots of tiles are loaded and are often irrelevant before they are drawn. I'm sure you can think of other scenarios that have cancellations.

Spdy solves this simply - with its inherently much greater parallelism the new requests can be sent immediately and at the same time cancel notifications can go out for the old ones. That saves the bandwidth and gets the new requests going as fast as possible without interfering with either the established connection or any other transactions also in progress.

A page load time metric won't really show this to you but the increased responsiveness is very obvious when working with these kinds of use cases - especially under high latency conditions that make connection establishment slower.

Thursday, December 8, 2011

SPDY, Bufferbloat, HTTP, and Real-Time Networking

Long router queue sizes on the web continue to be a hot networking topic - Jim Gettys has a long interview in ACM queue. Large unmanaged queues destroy low latency applications - just ask Randell Jessup

A paper like this does a good job of showing just how bad the situation can be - experimentally driving router buffering delay from 10ms to ~1000ms on many common broadband cable and DSL modems. I wish the paper had been able to show me the range and frequency of that queue delay under normal conditions.

I'm concerned that decreasing the router buffer size, thereby increasing the drop rate, is detrimental to the current HTTP/1.x web. A classic HTTP/1.x flow is pretty short - giving it a signal to backoff doesn't save you much - it has sent much of what it needs to send already anyhow. Unless you drop almost all of that flow from your buffers you haven't achieved much. Further, a loss event has a high chance of damaging the flow more seriously than you intended - dropping a SYN or the last packet of the data train is a packet that will have very slow retry timers, and short flows are comprised of high percentages of these kinds of packets. non-drop based loss notification like connex/ecn do less damage but are ineffective again because the short flow is more or less complete when the notification arrives so it cannot adapt the sending rate.

The problem is all of those other parallel HTTP sessions going on that didn't get the message. Its the aggregate that causes the buffer build up. Many sites commonly use 60-90 separate uncoordinated TCP flows just to load one page.

Making web transport more adaptable on this front is a big goal of my spdy work. When spdy consolidates resources onto the same tcp flow it means the remaining larger flows will be much more tcp friendly. Loss indicators will have a fighting chance of hitting the flow that can still backoff, and we won't have windows growing independently of each other. (Do you like the sound of IW=10 times 90? That's what 90 uncorrelated flows mean. IW=10 of a small number of flows otoh is excellent.). That ought to keep router queue sizes down and give things like rtcweb a fighting chance.

It also opens up the possibility of the browser identifying queue growth through delay based analysis and possibly helping the situation out by managing inside the browser our bulk tcp download rate (and definitely the upload rate) by munging the rwin or something like that. If it goes right it really shouldn't hurt throughput while giving better latency to other applications. It's all very pie in the sky and down the road, but its kind of hard to imagine in the current HTTP/1.x world.

Friday, September 23, 2011

SPDY: What I Like About You.

I've been working on implementing SPDY as an experiment in Firefox lately. We'll have to see how it plays out, but so far I really like it.

Development and benchmarking is still a work in progress, though interop seems to be complete. There are several significant to-do items left that have the potential to improve things even further. The couple of anecdotal benchmarks I have collected are broadly similar to the page load time based reports Google has shared at the IETF and velocity conf over the last few months.

tl;dr; Faster is all well and good (and I mean that!) but I'm going to make a long argument that SPDY is good for the Internet beyond faster page load times. Compared to HTTP, it is more scalable, plays nicer with other Internet traffic and brings web security forward.

SPDY: What I Like About You

#1: Infinite Parallelism with Shared Congestion Control.

You probably know that SPDY allows multiplexing of multiple HTTP resources inside one TCP stream. Unlike the related HTTP mechanisms of pipelining concurrent requests on one TCP stream, the SPDY resources can be returned in any order and even mixed together in small chunks so that head of line blocking is never an issue and you never need more than one connection to each real server. This is great for high latency environments because a resource never needs to be queued on either the client or the server for any reason other than network congestion limits.

Normal HTTP achieves transaction parallelism through parallel TCP connections. Browsers limit you to 6 parallel connections per host. Servers achieve greater parallelism by sharding their resources across a variety of host names. Often these host names are aliases for the same host, implemented explicitly to bypass the 6 connection limitation. For example, lh3.googleusercontent.com and lh4.googleusercontent.com are actually DNS CNAMEs for the same server. It is not uncommon to see performance oriented sites, like the Google properties, shard things over as many as 6 host names in order to allow 36 parallel HTTP sessions.

Parallelism is a must-have for performance. I'm looking at a trace right now that uses the afore mentioned 36 parallel HTTP sessions and its page load completes in 16.5 seconds. If I restrict it to just 1 connection per host (i.e. 6 overall), the same page takes 27.7 seconds to load. If I restrict that even further to just 1 connection in total in takes a mind numbing 94 seconds to load. And this is on 40ms RTT broadband - high latency environments such as mobile would suffer much much worse! Keep this in mind when I start saying bad things about parallel connections below, they really do great things and the web we have with them enables much more impressive applications than a single connection HTTP web ever could.

Of course using multiple parallel HTTP connections is not perfect - if they were perfect we wouldn't try to limit them to 6 at a time. There are two main problems. The first is that each connection requires a TCP handshake which incurs an extra RTT (or maybe 3 if you are using SSL) before the connection can be used. The TCP handshake is also relatively computationally hard compared to moving data (servers easily move millions of packets per second, while connection termination is generally measured in the tens of thousands), the SSL handshake even harder. Reducing the number of connections reduces this burden. But in all honesty this is becoming less of a problem over time - the cost of maintaining persistent connections is going down (which amortizes the handshake cost) and servers are getting pretty good at executing the handshakes (both SSL and vanilla) sometimes by employing the help of multi-tiered architectures for busy deployments.

The architectural problem lies in HTTP's interaction with TCP congestion control. HTTP flows are generally pretty short (a few packets per transaction), tend to stop and start a lot, and more or less play poorly with the congestion control model. The model works really well for long flows like a FTP download - that TCP stream will automatically adapt to the available bandwidth of the network and transfer at a fairly steady rate for its duration after a little bit of acclimation time. HTTP flows are generally too short to ever acclimate properly.

A SPDY flow, being the aggregation of all the parallel HTTP connections, looks to be a lot longer, busier, and more consistent than any of the individual parallel HTTP flows would be. Simply put - that makes it work better because all of that TCP congestion logic is applied to one flow instead of being repeated independently across all the parallel HTTP mini flows.

Less simply, when an idle HTTP session begins to send a response it has to guess at how much data should be put onto the wire. It does this without awareness of all the other flows. Let's say it guesses "4 packets" but there are no other active flows. In this case 4 packets is way too few and the network is under utilized and the page loads poorly. But what if 35 other flows are activated at the same time - this means 140 packets get injected into the network at the same time which is way too many. Under that scenario one of two things happen - both of them are bad:

Packet Loss. TCP reacts poorly to packet loss, especially on short flows. While 140 packets in aggregate is a pretty decent flow, remember that total transmission is made up of 35 different congestion control blocks - each one covering a packet flow of only 4 packets. A loss is devastating to performance because most of the TCP recovery strategies don't work well in that environment.
Over Buffering. This is what Jim Gettys calls bufferbloat. The giant fast moving 140 packet burst arrives at your cable modem head where the bandwidth is stepped down and most of those packets get in a long buffer to wait for their turn on your LAN. That works OK, certainly better than packet loss recovery does in practice, but the deep queue creates a giant problem for any interactive traffic that is sharing that link. Packets for those other applications (such as VOIP, gaming, video chat, etc..) now have to sit in this long queue resulting in interactive lag. Lag sucks for the Internet. The HTTP streams themselves also become non-responsive to cancel events because the only way to clear those queues is to wait them out - so clicking on a link to a new page is significantly delayed while the old page that you have already abandoned continues to consume your bandwidth.

This describes a real dilemma - if you guess more aggressive send windows then you will have a better chance of filling the network but you will also have a better chance of packet loss or over buffering. If you guess more conservative windows then loss and buffering happens less often but nothing ever runs very quickly. In the face of all those flows with independent congestion control blocks, there just isn't enough information available. (This of course is related to the famous Initial Window 10 proposal, which I support, but that's another non SPDY story.)

I'm sure you can see where this is going now. SPDY's parallelism, by virtue of being on a single TCP stream, leverages one busy shared congestion control block instead of dealing with 36 independent tiny ones. Because the stream is much busier it rarely has to guess at how much to send (you only need to guess when you're idle, SPDY is more likely to be getting active feedback), if it should drop a packet it reacts to that loss much better via the various fast recovery mechanisms of TCP, and when it is competing for bandwidth at a choke point it is much more responsive to the signals of other streams - reducing the over buffering problem.

It is for these reasons that SPDY is really exciting to me. Parallel connections work great - they work so great that it is hard to have SPDY significantly improve on the page load time of highly sharded site unless there is a very high amount of latency present. But the structural advantages of SPDY enable important efforts like RTCWeb as well as provide better network utilization and help servers scale when compared to HTTP. Even if page load times only stay at par, those other good for the Internet attributes make it worth deploying.

#2: SPDY is over SSL every time.

I greatly lament that I am late to the school of SSL-all-the-time. I spent many years trying to eek the greatest amount of server responses per watt that was possible. I looked at SSL and saw impediments. That stayed with me.

I was right about the impediments, and I've learned a lot about dealing with them, but what I didn't get was that it is simply worth the cost. As we have all seen lately, SSL isn't perfect - but having a layer of protection against an entire class of eavesdropping attacks is a property that should be able to be relied upon in every protocol as generic as HTTP. HTTP does not provide that guarantee - but SPDY does.

huzzah.

As a incentive to make the transition to SSL all the time, this makes it worth deploying by itself.

#3:Header compression.

SPDY compresses all the HTTP-equivalent headers using a specialized dictionary and a compression context that is reserved only for the headers so it does not get diluted with non-header references. This specialized scheme performs very well.

At first I really didn't think that would matter very much - but it is really a significant savings. HTTP's statelessness had its upsides, but the resulting on the wire redundancy was really over the top.

As a sample, I am looking right now at a trace of 1900 resources (that's about 40 pages). 760KB of total downstream plain text header bytes were received as 88KB compressed bytes, and upstream 949KB of plain text headers were compressed as just 65KB. I'll take 1.56MB (82%) in total overhead savings! I even have a todo item that will make this slightly better.

Tuesday, January 11, 2011

Firefox Idle Connection Selection via CWND

When choosing between more than 1 idle persistent connection FF <= 4.0 goes with a FIFO approach. I was thinking about ways to tune this.

I was going to experiment with making that a LIFO. A LIFO afterall should have better cache behavior and would allow the cache size to shrink to a more natural size as the older members remain untouched and timeout. The FIFO will basically keep the size pinned at its maximum while it cycles through all the connections which wastes system RAM with both the client and server maintaining extra idle TCP sessions. It is also the worst possible processor cache behavior. The possible argument in favor of FIFO is actually that connections are expensive to open and we've already opened these so maybe we want to keep it pinned to its maximum size just in case we need it again - it isn't obvious what to do or if it matters much.

Thinking a little further I realized that the major differentiator between these sockets is not a timestamp at all - it is the CWND associated with them on the server. Many web servers at least partially ignore the TCP suggestion to return to slow start after an RTO of idle activity so it is reasonable to assume that some of the connections have larger CWNDs than others as long as we aren't in an environment where the previous transfers have been actually bottlenecked by bandwidth - and on the web that almost never happens, RTT and document size bottleneck most transfers. Even if available bandwidth is the real bottleneck that just moots our metric, it doesn't provide any information that steers us the wrong way.

When choosing which connection to use we want to choose the one that has the largest CWND on the server. Unfortunately we cannot see directly into the congestion information on the server, but assuming that previous transfers have been bottlenecked by RTT (approximately a constant for all connections to the same server) and transfer size then we can just use the one with the history of moving the largest amount of data in one burst as a proxy for it because that burst is what opens the congestion window.

I say one burst instead of one document because we want to include pipelines of multiple HTTP transactions as long as there wasn't an RTT between them. This is another reason to want pipelines - they will more effectively open up the CWND for us. We also want to use the maximum burst seen on the connection, not the necessarily the last burst - the servers we are interested in don't shrink the CWND just because it isn't all being used for each burst.

The implementation is easy - the idleconnection list is changed from being sorted as a FIFO to being sorted according to this "maxburst" metric. The code and bugzilla are here.

Using an experiment designed to show the best case, the results are better than I expected for such a minor tweak. This was my process:

construct a base page mixed with several small and several large images plus a link to a 25KB image. There are 6 objects on the base page.
load the base page - FF4 will use six parallel connections to do it
click on the link to the 25KB image - this will use an idle persistent connection. Measure the performance of this load.

There is 250ms of RTT and several megabits of bandwidth between the client and server in my test.

As expected, vanilla FF 4.0 (beta 9) loads the target image on an idle persistent connection that was used to transfer one of the smallest images on the base page. It is FIFO afterall, and the small image load was going to finish first and be put into the persistent connection pool first.

My patched browser, using the sort by CWND algorithm, loads the target image on the idle persistent connection that was used to transfer the largest image (2MB+) from the base page.

Their history is the only difference between the two connections - at the time of requesting the 25KB image they are both connected and idle. There is a 250ms RTT between my host and the server.

The difference is huge. The default algorithm takes 3 round trips to transfer the document as it works its way through growing the congestion window. That adds up to 793ms. The sort-by-cwnd algorithm inherits a congestion window large enough for the task at hand and moves it all in one stream - just 363ms.

This is a nice tweak, but it has its limitations - by definition you cannot meaningfully multiply the gain across a large number of transactions. If you have a large number of transactions then you probably are using all your idle connections in a burst and there is no point in discriminating between them if you are just going to use them all.

I would argue if you have that much pressure on the connection pool then you are probably queueing requests and should be using pipelining. If you don't have that much pressure, then this probably helps you.

Wednesday, December 1, 2010

Performance of Pipelining in HTTP Firefox

This post provides some performance measurements of my HTTP pipeline patches for Firefox.

The key benefits of pipelining are reduced transaction latency and potentially the use of fewer connections, so those are generally the metrics I will focus on here. For each of my 5 test cases we will look at the following statistics:

The percentage of requests that are delayed in the request queue inside Firefox waiting for a connection to be available.
The average amount of queue latency for each transaction. This is measured as the time the first byte of the request is given to the kernel minus the time at which the request was presented to Necko/Gecko. It is generally very low but greater than 0 even if the transaction can be placed directly on a pipeline because it takes a moment to construct the request and perhaps schedule the socket thread if other things are on the CPU. But a high value is an opportunity lost - that is time the request could be in flight and the server could be processing it. It includes the time necessary to create a new connection if that is necessary - that includes the three way TCP handshake but not a DNS lookup (which is cached in the test).
The type of connection used for each transaction (new connection, an idle persistent connection, or a pipelined connection)
The average amount of transaction latency for each transaction. This is measured as the time of the first byte of the response being received by firefox minus the time at which the request was presented to Necko/Gecko. It is possible for the average improvement to be greater than 1 RTT because of cumulative queueing delays in the non pipelined case.
The cumulative fraction of transactions completed before 3 different elapsed times in order to show improved execution time for the test case. The benchmark times are sized appropriately for each test case.

There are 4 data points for each criteria in each test case. Because pipelining is aimed at environments with significant latency and my broadband test connectivity has below average latency for much of the world and every mobile environment the first two data points have 200ms of latency added through a traffic shaper. The two datapoints compare pipelining on vs pipelining off. The other data points measure the same things but without the induced latency.

All tests are run with both a disk and memory cache enabled but empty at the beginning of the run. In order to measure the effectiveness of the pipelining, each of these sites has been put in the "green - pipelining ok" state which is normally auto discovered.

Facebook

The first test is on Facebook. It starts by logging in and then selecting a particular user profile and navigating from that page via lists of friends, occasionally pulling up individual profiles, generating lists of recent updates, and pressing the More link on busy Facebook walls. There are approximately 1400 HTTP transactions in each test run.

The first thing to consider is the percent of requests that are delayed (i.e. queued) within firefox. I think queuing is particularly bad because if the request hasn't been passed to the network there is no way for any advances in server technology to ever operate on it. For instance - servers are prevented from returning responses out of order but nothing would prevent them from processing requests out of order in order to overlap latencies in DB queries and disk I/O if the requests were not queued on the browser side.

Percent of Requests Queued
	With Pipeline	Without Pipeline
Moderate Latency	0	79.1
Low Latency	0	78.6

That is a stark contrast. It is possible by the way to see a request being queued with pipelining enabled - a default configuration limit of 32 governs the maximum depth of the pipeline and not all request types are pipeline eligible.

Average Queue Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	29	1630
Low Latency	6	285

You might expect to see 0ms for the pipelining case as we just illustrated above that no requests were delayed. But the queue latency covers the time from request submission to the time of putting the first byte of the request on the wire, so that includes any connection setup time when establishing a new connection. That is the primary source of the latency seen here for the pipeline enabled case.

That begs the question, when pipelining is enabled how many of the requests are pipelined?

Connection Type Pct
	Moderate Latency w/Pipeline	Moderate Latency wo/Pipeline	Low Latency w/Pipeline	Low Latency wo/Pipeline
New	2	4	2	3
Reused Idle	13	96	13	97
Pipeline	85	0	84	0

We see here a moderate reduction in the number of connections used when pipelining, but most of the effect is a transfer from idle persistent connections over to pipelines. While the percentage of new connections as a portion of the overall request stream has gone down just a tick with pipelining, the impact on the actual number of raw connections is significant - going from roughly 60 without pipelining to 30 with it enabled. That boils down to a 50% reduction in the number of connections created which is a significant provides a very busy site like Facebook a significant scalability boost.

The final criteria deal with transaction latency.

Average Transaction Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	702	1906
Low Latency	341	346

Yowza! Now there is a result. Under conditions with moderate latency the average transaction waits 1200ms less from the time the request is submitted to Necko to the time the first byte of the response header is received. The net effect is so much more than the approx ~250ms RTT because of aggregating queueing delays - without pipelining enabled you are placed in a deep queue which has to be totally cleared with a 1RTT overhead on each one before you are executed. The impact under low latency conditions is probably close to being noise.

Pct of Responses Rcvd in < Xms
	x=1500	x=1200	x=900
Moderate Latency w/Pipeline	97	91	75
Moderate Latency wo/Pipeline	45	38	32

Pct of Responses Rcvd in < Xms
	x=1000	x=700	x=400
Low Latency w/Pipeline	99	92	61
Low Latency wo/Pipeline	99	91	61

Facebook is a big success - probably the biggest success of any of the tests. 200+ms latency situations have performance significantly increased, and low latency scenarios perform similarly while using a few less TCP connections.

Amazon.com

The Amazon.com test walks through a basic window shopping experience at Amazon.com. The home page is loaded, the kindle link is clicked, a few more categories are clicked and the lists of products are generally browsed and sorted by "hot and new" and other similar things. This boils down to about 800 HTTP transactions.

Percent of Requests Queued
	With Pipeline	Without Pipeline
Moderate Latency	0	54.4
Low Latency	0	39.8

Right away you can see that amazon queues fewer requests than Facebook, so the potential improvement is less.

Average Queue Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	116	791
Low Latency	12	136

Connection Type Pct
	Moderate Latency w/Pipeline	Moderate Latency wo/Pipeline	Low Latency w/Pipeline	Low Latency wo/Pipeline
New	12	13	6	13
Reused Idle	20	96	28	87
Pipeline	68	0	66	0

The first thing to note is that less pipelining is going on that with facebook, so again there is less potential for improvement. How the pages are constructed has a lot to do with this (perhaps fewer images, etc..). But almost as interesting is the fact that the number of TCP connections (i.e. new connections) is halved in the low latency case. If the page can be transferred in the same amount of time using fewer connections that is still a win for the web overall.

Average Transaction Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	635	1083
Low Latency	266	204

An interesting result - 400ms off the average transaction in the ~250ms RTT environment, but a notable loss in the low latency scenario. All of the numbers here are averages across two test runs, but just inspecting the amazon test case in particular on some other ad-hoc runs showed quite a bit of variability. My suspicion is server load occasionally results in a single resource taking a long time to return. I have seen this disable pipelining for HTML pages, but leave it enabled for images, in the past.

Pct of Responses Rcvd in < Xms
	x=1200	x=900	x=600
Moderate Latency w/Pipeline	91	79	62
Moderate Latency wo/Pipeline	77	69	61

Pct of Responses Rcvd in < Xms
	x=1000	x=700	x=400
Low Latency w/Pipeline	93	90	84
Low Latency wo/Pipeline	99	94	85

Flickr

The flickr test is probably the simplest of the cases. It simply loads several galleries based on set names and tags. There are roughly 350 HTTP transactions in the test. Under normal conditions Flickr has a high variability in server response time.

Percent of Requests Queued
	With Pipeline	Without Pipeline
Moderate Latency	0	57
Low Latency	0	57

Average Queue Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	43	814
Low Latency	7	211

Connection Type Pct
	Moderate Latency w/Pipeline	Moderate Latency wo/Pipeline	Low Latency w/Pipeline	Low Latency wo/Pipeline
New	10	27	12	27
Reused Idle	19	73	19	73
Pipeline	71	0	69	0

As with the other tests, more than half of the new connections have been replaced when pipelining is enabled.

Average Transaction Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	859	1091
Low Latency	291	366

This result is more modest, but still positive, when compared to our other tests.

Pct of Responses Rcvd in < Xms
	x=2000	x=1500	x=1000
Moderate Latency w/Pipeline	95	87	67
Moderate Latency wo/Pipeline	88	74	60

Pct of Responses Rcvd in < Xms
	x=1000	x=700	x=400
Low Latency w/Pipeline	99	97	72
Low Latency wo/Pipeline	98	93	71

www.AsiaNewsPhoto.com

The test is photo journalism clearing house site located overseas and therefore the broadband low latency case has a starting RTT of closer to 100ms, while the moderate delay case adds 200ms to that. This is the smallest test case - just 175 transactions in each run.

Percent of Requests Queued
	With Pipeline	Without Pipeline
Moderate Latency	0	44
Low Latency	0	38

Average Queue Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	21	726
Low Latency	31	400

I am not yet certain how to explain the very modest rise in queue time for the pipeline case when the added 200ms delay is removed. It must involve an aberrant TCP connection as that is really the only component of queue time when the requests them selves are not delayed due to connection limits.

Connection Type Pct
	Moderate Latency w/Pipeline	Moderate Latency wo/Pipeline	Low Latency w/Pipeline	Low Latency wo/Pipeline
New	16	7	11	8
Reused Idle	33	93	33	92
Pipeline	51	0	56	0

This is the first time we actually see the new connection numbers moving in the wrong direction. In this case I believe the type scheduling restrictions placed on the connection manager are generating new connections that may have been un-necessary in the non pipelining scenario. I'm curious if the effect would fade in a test with more transactions.

Average Transaction Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	1310	1248
Low Latency	752	689

This seems to track the changes in connection types and maybe the test bears further examination to see if an adjustment can be made. The scheduling algorithm seems to be getting in the way of itself and has made performance just a tick worse than before, though not by very much. And certainly not by enough to discount the gains made in some other scenarios.

Pct of Responses Rcvd in < Xms
	x=2000	x=1500	x=1000
Moderate Latency w/Pipeline	84	82	52
Moderate Latency wo/Pipeline	82	76	54

Pct of Responses Rcvd in < Xms
	x=1500	x=1000	x=500
Low Latency w/Pipeline	88	82	46
Low Latency wo/Pipeline	84	82	57

MapQuest

This test is different in that it is driven almost exclusively through JS and XMLHttpRequest. Those elements are present in the Facebook and Amazon tests as well, but they dominate the MapQuest test. In this scenario a map is brought up on the screen and it is manipulated in the usual ways - panning in 4 directions, zooming in and out, and toggling between satelline and map mode. By the time it is done, 711 HTTP transactions have been made.

Percent of Requests Queued
	With Pipeline	Without Pipeline
Moderate Latency	0	42
Low Latency	0	44

Average Queue Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	52	381
Low Latency	10	198

For all cases, the queue latency is pretty low for this test. That means the number of documents requested in one burst is relatively modest.

Connection Type Pct
	Moderate Latency w/Pipeline	Moderate Latency wo/Pipeline	Low Latency w/Pipeline	Low Latency wo/Pipeline
New	12	16	11	14
Reused Idle	30	84	32	86
Pipeline	58	0	57	0

Marginally less new connections are used with pipelining. Hurrah.

Average Transaction Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	677	732
Low Latency	260	500

Pct of Responses Rcvd in < Xms
	x=1500	x=1200	x=900
Moderate Latency w/Pipeline	96	87	76
Moderate Latency wo/Pipeline	96	83	70

Pct of Responses Rcvd in < Xms
	x=900	x=600	x=300
Low Latency w/Pipeline	99	95	65
Low Latency wo/Pipeline	87	75	40

Tuesday, November 30, 2010

Implementing a Pipelined Client for Firefox

I have been working on a set of patches to revamp the Firefox HTTP pipeline implementation. The objective is to provide a fast and robust implementation of pipelines. There are lots of reasons to think this is a good thing.

The lowest level of the current implementation, which is disabled by default, receives a few minor tweaks. The primary change is to implement a continually fillable pipeline instead of the existing one where a pipeline is loaded up to whatever depth the request queue supports and then is not refilled until it has been emptied.

There are much more serious changes on top of that. The most significant is probably captured in the "Select Connection Based on Type and State" bug and patch. This code does four things: 1) it classifies each transaction into a particular type, 2) it keeps track of the history of each server with respect to pipelining, 3) it determines whether or not pipelining is appropriate based on the type of the transaction and the history of the server with that type in the past, and 4) if it decides to pipeline it places it on a pipeline filled with only transactions of the same classification.

That's a lot to think about. But the basic idea is to sort requests into control traffic (i.e. js and css), images, revalidations, htmlish things, and things known to be pipeline inappropriate such as video, most XMLHttpRequest, non-idempotent transactions, etc.. I call the latter classification "solo".

Classes of data such as CSS and revalidations that generally have very short responses, and thus benefit the most from pipelining, favor pipelines over parallel connections even when we have less than the maximum number of connections already open.

A variety of things can plague a successful pipelining session - for instance a site may have some documents with a large processing time on the server and that "think time" can really gum up the pipeline. By segregating the types of documents we can turn off pipelining for whatever is DB driven (e.g. XHTML) while still chasing down deep pipelines for images. Facebook is a terrific example of this.. the contents of your wall or your friend list probably have to be scooped out of a database computation and composed in real time, but the dozens of icons they reference are just whipped quickly right out when you have their direct key.

This concept of negative feedback (in the example above it would be a read of the HTML page with a latency well in excess of the RTT to the server) is what drives the server state. Very large responses, sub HTTP/1.1 responses, cancelled pipelines, closed connections, and server headers known to be associated with broken pipeline implementations all trigger negative feedback too.

Most feedback is tied to the classification of the particular request, but some is applied to all classes on the server. Corrupted data is one such case - if the response contains an invalid MD5 sum, or uses the proposed Assoc-Req header but fails to provide the correct information, or appears to have non-sensical framing information, all requests to that server are prevented from using pipelines for a long time. In the past such corruptions have happened due to buggy or compromised servers and intermediaries.

Much like congestion control, the determination of whether or not to proceed with a pipeline is based on both negative and positive feedback. Initially pipelines are not sent. We must first see a HTTP/1.1 response header from the server. After that has been accomplished we try and send a pipeline of only depth 2 and only on a single connection. If both of those responses are received OK then we transition from that tentative state (known internally as yellow) to a position where each connection can send pipelines of up to depth 4 instead of opening all the way. Even after the depth-of-2 test succeeds we can still not be certain the topology supports pipelines - what appears as a short pipeline at the client may not appear that way at the server due to race conditions inherent in network transfer, but the extension to a depth of 4 for every connection still represents significant additional capacity over a single connection being allowed a depth of 2. During this phase concurrent connections are of course also used, so nothing is slowed down over the historical pipelining disabled setting.

Once a transaction that was sent at at least a depth of 3 is successfully received the depth limits are removed from all connections to that host. The new maximum depth is only governed by a configuration preference with the default of 32. Should one of the negative events mentioned earlier occur, that class of transaction is placed in the red state (i.e. no pipelining is allowed for a period of time to that server) generally without interfering with the other transactions.

As mentioned, an unexpected delay of a few hundred milliseconds is considered to be think time and feedback is applied to keep things running smoothly with only a barely perceptible bump in the road. But a longer delay not only applies feedback for future use but it also cancels any transactions pipelined after the currently delayed one and moves them to new connections. This helps mitigate the head of line blocking problem if it is really severe and as a side effect no more pipelines will happen in the near future with that server. In that sense firefox is self correcting in a hostile environment.

All of this negative information expires over time but while it is still valid firefox will keep it persistently between restarts - its value was hard earned.

XMLHttpRequests are a special source of pain in traditional pipeline implementations. They often implement long polling patterns where a request is sent to the server and the server intentionally hangs until some external event occurs - it then shares that information with the client by forming a response. It is essentially server push from a data point of view implemented with a long polling request. For that reason, XHR is by default classified as solo, but meta data can be supplied in the form of an HTTP request header to allow the application to indicate the request is expected to be fulfilled quickly.

As they say on late night TV, "But Wait! There's more!" Sometimes the pipelining infrastructure problem isn't on the server side, sometimes it is a proxy that is part of the client's topology. To test for that there is a new startup test which initiates a deep pipeline to a known pipeline friendly resource on the Internet. This server intentionally defers its responses until a deep pipeline has been received there and only then leaks a small response. Both sides continue a pipeline on the same connection until it has been confirmed that an entire window of data can be supported by the HTTP devices on that path. The results of this test are cached for a very long time. This transfer also takes the opportunity to share with the client a list of host names that should be blacklisted with respect to pipelining because of known operability problems. (You can still use them - just not pipeline with them). The location (and enablement) of the test and hostname blacklist server are configurable.

All of this feedback and history tracking sounds intimidating, but the truth is that most of the web works just fine. There are enough problems that a feedback driven self correcting implementation helps smooth out the bumps, but most of the web opens right up without any problems.

Mail your comments to me or provide a talk back link.

The Value of HTTP Pipelines

For the past few months I've been on a personal quest to implement a safe and effective HTTP pipelining strategy for Mozilla Firefox. The pipeline concept is part of HTTP/1.1 but for various reasons has not been widely deployed on the web.

I'm going to make a series of three posts. This one will be basic background information - what pipelines are in the abstract and why they are more relevant to today's web than they have been in years. The second post will be about the details of my Firefox implementation and its mechanisms for dealing with the realities of the occasional pipeline-hostile piece of infrastructure. The last post will share some performance metrics of those patches in common use cases.

HTTP Pipelines - 101

A client forms a pipeline simply by sending a second HTTP request down an HTTP connection before it has received the first response. A pipeline may be more than two transactions deep. HTTP responses are still required to be returned in the order their requests arrived.

This is a simple concept, but the benefits are profound.

Chiefly, the pipelined transactions benefit by eliminating the latency involved in sending their request. The greater the latency, the greater the benefit.

This crude diagram shows 2 normal HTTP transactions without pipelining. In it each request is shown with a rightward arrow and the response is shown with a leftward arrow. The empty space represents the network latency. When the first response is received, the second is sent.

When pipelining is used, the picture becomes different. Even though the request and response sizes are the same, and the round trip time between the hosts is the same, the overall transaction completes sooner. It is obvious that the round trip latency between the client and the server has been mitigated as a problem.

The greater the latency and the deeper the pipeline the more benefit is captured.

While bandwidth continues to improve, latency is not keeping up. The trend is toward mobile networks and those have latencies 5x to 20x worse than broadband WAN connections. This increased latency is an opportunity for pipelining.

HTTP Pipelines - 201

Conceptually, parallel HTTP connections can garner similar benefits. Afterall, independent HTTP connections do not need to wait for each other to finish before beginning their work.

Up to a point, that is completely true. Parallelism has been the mainstay for many years and served us well. But it has its limits.

The chief limit is actually specified. No more than 6 (or 4, or 2 depending on the point in history) simultaneous connections are supposed to be allowed from the user agent to the server. This is a serious handicap - Firefox may easily have a request queue of over 100 images, stylesheets, and javascript items to fetch upon parsing a Facebook HTML page full of photos, emoticons, ads, and widgets. A single digit number of connections is far too small to effectively parallelize that download.

This brings us to the reason there is a limit on the number of connections. Making a new TCP connection sucks. Before it can be used at all it requires a high latency three way handshake and if it is over TLS an expensive RSA operation on the server (and yet another round trip). Making the new connection requires access to shared data structures on the server in a way that using an existing connection does not and harms scalability. Servers that can pump multiple gigabits a second of TCP data in their sleep through a few connections, can still only initiate on the order of tens of thousands of connections a second.

Maintaining all those connections is expensive for the server too. Each one takes state (i.e. RAM in the kernel, perhaps a thread stack in the application), and the server now has to deal with sorting through more control blocks and application states everytime a packet comes in in order to match it with the right one. Because of the increased diversity of TCP control blocks in use, the L2/L3 cache is severely polluted which is the number one factor in high performance TCP.

Even if pipelines only mean browser performance equal to parallel connections but accomplished using fewer connections then that is a win for Web scalability.

HTTP Pipelines - 301

The details of TCP start to weigh in heavily in favor of pipelining when parallelism and pipelining are compared.

Most significantly, TCP slow-start performs poorly in the parallelized environment. To recap what you already know: Slow start requires the sender to send a conservative amount of data at first (typically 3 or 4 packets) and then wait a full round trip time to receive positive feedback in the form of an ACK from the recipient. At that time it can grow the window a little bit and send another burst and then wait again. After enough iterations of this the sender is generating enough traffic in each burst to "fill the window" and it no longer has to wait.

Pipelining, of course, does not eliminate slow start. But it does use fewer connections to move the same amount of data and the slow-start penalty scales with the number of connections not the amount of data. Fewer connections mean less data is transferred under the artificially slow conditions of slow start.

There is another wrinkle that applies to even parallel connections that have paid their start-up dues and progressed past slow start. TCP connections that have not sent data within an RTO (think of an RTO as a round trip time plus a grace period) are supposed to go back to slow-start! From a TCP point of view this makes some sense - the inactivity means the TCP stack has lost the ack-clock that it needs to use the connection aggressively. But for even a persistent use of a parallel HTTP connection this is a disaster. Effectively each transaction response must go through slow start because it takes more than a round trip for the last packet of the previous response to travel to the client, the client to form the next request, the request to travel to the server and the server to form the response. Pipelined connections do not have that problem - one response follows immediately on the heels of the previous response which ensures optimal TCP use of the available bandwidth. Huzzah.

Lastly, it is worth talking about Google's push for larger initial windows during slow start. They support a value of 10 instead of the current 3 or 4. Basically I think this is a good thing - the Internet can support larger bursts than we are currently sending. Unfortunately, this has a terrible potential interaction with parallel connections. 6 connections would essentially mean 60 congestion control packet credits available to be sent in a burst at any time. Just as 3 is too few for a single connection environment, 60 is too many for a multiple connection environment. Whereas pipelines reduce the reliance on parallel connections, they bring down the total number of packet credits outstanding at any time while still allowing for more effective slow start capacity probing. That's a win win in my book.

Wednesday, September 1, 2010

Long Out Of Order Queueing Delays

Last week I posted about a kernel patch that records how long out of order TCP packets are kept hidden from userspace while the kernel tries to fill in the necessary holes to create an in-order stream.

These packets are especially frustrating - they have arrived at the host but the application does not have access to them until the kernel can create an in-order stream. Some applications that are really doing messaging over TCP (which might be sensible if you're looking for congestion control, maybe layered security, maybe multiplexing different semantic streams onto one TCP stream and the loss is localized to one of them, etc..) might be capable of moving on with their lives (and their data) more quickly if they had access to the missing sequence numbers. So the question is, how long are these kinds of applications waiting for in-order data when out-of-order data is already at their host?

I ran that code for about 10 days on my desktop which runs typical American broadband with normal RTTs anywhere from 40 to 150ms. The host is even SACK enabled. Here is what I found:

* 38,881 web flows (port 80 or 443)

* 164 flows that contained reordered packets. That is 4.1 per thousand.

* 915K total packets. 18,169 of them reordered. That is 19.8 per thousand.

* The flows with reordering account for just .4% of the flows, but 40% of the total packets. Obviously, the bigger you are the more likely you are to experience a reordering event.. but furthermore small flows sometimes don't reorder at all because any loss that impacts them is more likely to be repaired with an empty window and a timeout.

* If you select for just the .41% of flows that experience reordering a whopping 4.6% of the packets in that flow are reordered on average. Indeed this average flow is pretty large - 2418 packets and 110 of them are delayed due to being out of order. The average size of a flow that did not experience any reordering was just 24 data packets long. The fact that reordering events are such large clusters is probably good news - it likely means that we were seeing big windows of data on the wire and just a small amount of the early packets in that window were lost.. the rest of the window is counted as out-of-order. We want to see big windows in flight - so I'm good with that.

* The average length of a reordered packet is 1424 bytes. Over 97% of these reordered packets are at least 1300 bytes long. This isn't that interesting, but I wrote it down - so here it is.

When I talk about "N packets long" I mean my host received N packets with data in them.. bare acks and control packets are not counted.

So far, that's not too bad. Big flows have this happen all the time. Reordering is basically a pre-requisite for doing any kind of TCP fast recovery in the face of packet loss so it looks good. If we assume that the reordering is due to small packet losses which can be repaired with fast-retransmit algorithms, which seems to make sense, then the out of order problem should be repaired in a little over 1 RTT, right?

Unfortunately, when I plotted the delays incurred they look a lot bigger than the distribution of RTTs I see from my dekstop. A lot bigger.

There is a really long tail on that graph - and it only captures the best 97% of the data points. The longest I saw any packet wait in the reorder queue (and make it out again) was a full 2.5 minutes.

Even though 2.5 minutes is an aberration, the normal cases are still pretty awful. The median time out of order packets spend queued in the kernel while waiting for an in-order stream is 293 milliseconds! Ouch.

Let's zoom in on that graph - this shows the 90% of the packets that waited the shortest amount of time:

That's pretty ugly, you need to budget a full second wait to get 80% of the reordered packets out of their kernel limbo.

It is much much uglier than I expected.

It's not the reordering that bothers me - big reordering runs are to be perfectly expected in the face of a little packet loss and it is good to use that bandwidth while the loss is repaired.

But why is it taking so long to repair?

Friday, August 20, 2010

Characterizing Delays Caused by TCP in-order

If packet N+1 of a TCP flow arrives before packet N, the receiving application does not see any data until packet N gets there. That's what we mean when we say TCP guarantees in-order delivery. That is also true if N+1 through N+100 get there before N - nobody gets through until they can all be delivered in-order.

At least using the BSD socket API.

I got to thinking about the impact of this when discussing a multiplexing implementation of various logical streams on top of TCP. For instance, SPDY and BEEP do things along those lines in order to create efficiencies in terms of more accurate congestion control data. But as someone objected, that creates a certain amount of fate sharing between the different streams that wouldn't exist if they were on separate TCP channels. A packet loss in one of them creates a delay for them both even though throughput might very well be maintained using some variation of fast-retransmit and large windows.

So the question: how often are packets received but the data in them is delayed by he kernel because the stream isn't yet in order? And how long are those delays?

I don't know yet, but I wrote some crude linux kernel patches to find out. When a skb is moved out of the out of order queue a structure with 2 timestamps (in queue, out of queue) is passed to userspace through the netlink connector mechanism. It also reports the total number of received data packets on each TCP stream. That way we can find out, how often and how long.

I'm running the hack on my desktop now.

Wednesday, February 24, 2010

Forwarding Decisions with Bloom Filters

Really neat paper: "Hash, Don't Cache: fast packet forwarding for enterprise edge routers". The paper and abstract are both available on line. The authors are Minlan Yu, and Jennifer Rexford - both of Princeton.

The authors share a lament of mine - caching forwarding decisions is attractive but no longer realistic as an implementation. The diversity of addresses (including those generated randomly by attackers) pollutes the cache and sends way too much traffic to slow fallback processing paths. So we end up with routers with one class of memory and no caches. Generally their memory is made totally from the really expensive fast stuff.

But the suggestion in this paper of using hashed bloom filters instead of caches is a really cool one. Essentially maintain one filter per interface in fast memory (sram probably) and evaluate those at forwarding time in parallel if you can. You probably just get a single hit and can forward the packet onwards.. you do this with only needing enough fast ram for the hash filters (i.e. not much).. whereas all the traditional trie data and routing update code can live in slow and cheap dram.

of course, due to the false positive nature of bloomies it is possible that you'll get more than one match that is going to require some kind of (probably slower) fallback plan for that packet. But the problem is nowhere near as dire as it is with caches.. with caches the misses force all the valid entries out of the cache and then all traffic is slowed down as the cache rate drops - with bloomies only the packets with the false positives are impacted, almost all of the traffic goes through the fast path unimpacted. The paper puts the false positive rate somewhere on the order of 1 in tens of thousands (depending on a bunch of factors - but that gives a feel for it.) Furthermore a cache can be intentionally attacked with a diversity of addresses in order to flood the cache and impact service for everyone - where under bloom filters such an attack is no more or less likely to impact service than any other kind of packet.

What can't you apply a bloom filter to? It's all pretty cool. The paper has a number of other details on how to minimize false positives and efficiently process route updates.

There is a related work from the same authors on a "buffalo" architecture which, after just a quick glance, appears to apply similar principles to LAN switching.

J Rexford is also an author of Web Protocols and Practice: HTTP/1.1, Networking Protocols, Caching, and Traffic Measurement which is probably the most practical description of the major web protocol I've ever seen.

Tuesday, February 16, 2010

Googling Harder

A while back I mentioned that Google thinks TCP ought to be more aggressive.

I must admit, this matches my own bias. I can barely count the number of applications I have watched wait for network I/O when there was plenty of CPU and idle bandwidth available. It's maddening. Sometimes it's slow start or another aspect of congestion control, sometimes it is outdated things like the nagle algorithm.

Well, Google is back at it with this slide set. (PDF)

They make the argument for increasing the initial cwnd. More provocatively, they argue that the Web has already done so in a defacto way by going to aggressive numbers of independent parallel HTTP connections (where you essentially get new cwnd credits just for opening a new TCP stream). Clever argument. Maybe you want to pace the data after 3 or 4 packets based on the RTT of the handshake - so you don't overrun any buffers un-necessarily.

Frankly, this kind of thing can be implemented on the server side without ever telling the peer. It would make some sense for Google to just do this for a few different values of cwnd on a tiny fraction of their traffic and see if the packet loss rates change and then publish that.

Monday, August 10, 2009

Making the switch

Well that's funny.

Just a few hours after my last post, which suggested that virtio based networking might be getting bested by the not-in-userspace v-bus, Michael Tsirkin posts an in-kernel backend to virtio. Which puts the two on more or less the same procedural footing.

Fire up the benchmarks?

(not) switching contexts.

a lot of bits have been spilled over virtual network performance in v-bus vs virtio-net/virtio-pci.. (aka alacrity vm vs traditional kvm/qemu).. this includes some pretty sensational(ist?) performance graphs: here.

There are lots of details (and details do matter) but the first-order issue can probably be summed up thusly, from Avi Kivity on lklm:

The current conjecture is that ioq outperforms virtio because the host side of ioq is implemented in the host kernel, while the host side of virtio is implemented in userspace.

Perhaps context switching isn't such a minor detail afterall.