Bits Up!: algorithms

Showing posts with label algorithms. Show all posts

Thursday, April 6, 2017

On The Merits of QUIC for HTTP

I am often asked why the Internet HTTP community is working on an IETF based QUIC when HTTP/2 (RFC 7540) is less than 2 years old. There are good answers! This work is essentially part two of a planned evolution to improve speed, reliability, security, responsiveness, and architecture. These needs were understood when HTTP/2 was standardized but the community decided to defer larger architectural changes to the next version. That next version is arriving now in the form of QUIC.

Flashback to HTTP/2 Development

HTTP/2 development consciously constrained its own work in two ways. One was to preserve HTTP/1 semantics, and the other was to limit the work to what could be done within a traditional TLS/TCP stack.

The choice to use TLS/TCP in HTTP/2 was not a forgone conclusion. During the pre-HTTP/2 bakeoff phase there was quite a bit of side channel chatter about whether anyone would propose a different approach such as the SCTP/UDP/[D]TLS stack RTC DataChannels was contemporaneously considering for standardization.

In the end folks felt that experience and running code were top tier properties for the next revision of HTTP. That argued for TCP/TLS. Using the traditional stack lowered the risk of boiling the ocean, improved the latency of getting something deployed, and captured the meaningful low hanging fruit of multiplexing and priority that HTTP/2 was focused on.

Issues that could not be addressed with TCP/TLS 1.2 were deemed out of scope for that round of development. The mechanisms of ALPN and Alternative Services were created to facilitate successor protocols.

I think our operational experience has proven HTTP/2's choices to be a good decision. It is a big win for a lot of cases, it is now the majority HTTPS protocol, and I don't think we could have done too much better within the constraints applied.

The successor protocol turns out to be QUIC. It is a good sign for the vibrancy of HTTP/2 that the QUIC charter explicitly seeks to map HTTP/2 into its new ecosystem. QUIC is taking on exactly the items that were foreseen when scoping HTTP/2.

This post aims to highlight the benefits of QUIC as it applies to the HTTP ecosystem. I hope it is useful even for those that already understand the protocol mechanics. It, however, does not attempt to fully explain QUIC. The IETF editor's drafts or Jana's recent tutorial might be good references for that.

Fixing the TCP In-Order Penalty

The chief performance frustration with HTTP/2 happens during higher than normal packet loss. The in-order property of TCP spans multiplexed HTTP messages. A single packet loss in one message prevents subsequent unrelated messages from being delivered until the loss is repaired. This is because TCP delays received data in order to provide in-order delivery of the whole stream.

For a simple example imagine images A and B each delivered in two parts in this order: A1, A2, B1, and B2. If only A1 were to suffer a packet loss under TCP that would also delay A2, B1, and B2. While image A is unavoidably damaged by this loss, image B is also impacted even though all of its data was successfully transferred the first time.

This is something that was understood during the development of RFC 7540 and was correctly identified as a tradeoff favoring connections with lower loss rates. Our community has seen some good data on how this has played out "in the wild" recently from both Akamai and Fastly. For most HTTP/2 connections this strategy has been beneficial, but there is a tail population that actually regresses in performance compared to HTTP/1 under high levels of loss because of this.

QUIC fixes this problem through multistreaming onto one connection in a way very familiar to 7540 but with an added twist. It also gives each stream its own ordering context analagous to a TCP sequence number. These streams can be delivered independently to the application because in-order only applies to each stream instead of the whole connection in QUIC.

I believe fixing this issue is the highest impact feature of QUIC.

Starting Faster

HTTP/2 starts much faster than HTTP/1 due to its ability to send multiple requests in the first round trip and its superior connection management results in fewer connection establishments. However, new connections using the TCP/TLS stack still incur 2 or 3 round trips of delay before any HTTP/2 data can be sent.

In order to address this QUIC eschews layers in favor of a component architecture that allows sending encrypted application data immediately. QUIC still uses TLS for security establishment and has a transport connection concept, but these components are not forced into layers that require their own round trips to be initialized. Instead the transport session, the HTTP requests, and the TLS context are all combined into the first packet flight when doing session resumption (i.e. you are returning to a server you have seen before). A key part of this is integration with TLS 1.3 and in particular the 0-RTT (aka Early Data) handshake feature.

The HTTP/2 world, given enough time, will be able to capture some of the same benefits using both TLS 1.3 and TCP Fast Open. Some of that work is made impractical by dependencies on Operating System configurations and the occasional interference from middleboxes unhappy with TCP extensions.

However, even at full deployment of TLS 1.3 and TCP Fast Open, that approach will lag QUIC performance because QUIC can utilize the full flight of data in the first round trip while Fast Open limits the amount of data that can be carried to the roughly 1460 bytes available in a single TCP SYN packet. That packet also needs to include the TLS Client Hello and HTTP SETTINGS information along with any HTTP requests. That single packet runs out of room quickly if you need to encode more than one request or any message body. Any excess needs to wait a round trip.

Harmonizing with TLS

When used with HTTP/1 and HTTP/2, TLS generally operates as a simple pipe. During encryption cleartext streams of bytes go in one side and a stream of encrypted bytes come out the other and are then fed to TCP. The reverse happens when decrypting. Unfortunately, the TLS layer operates internally on multi-byte records instead of a byte stream and the mismatch creates a significant performance problem.

The records can be up to 64KB and a wide variety of sizes are used in practice. In order to enforce data integrity, one of the fundamental security properties of TLS, the entire record must be received before it can be decoded. When the record spans multiple packets a problem similar to the "TCP in-order penalty" discussed earlier appears.

A loss to any packet in the record delays decoding and delivery of the other correctly delivered packets while the loss is repaired. In this case the problem is actually a bit worse as any loss impacts the whole record not just the portion of the stream following the loss. Further, because application delivery of the first byte of the record is always dependent on the receipt of the last byte of the record simple serialization delays or common TCP congestion-control stalls add latency to application delivery even with 0% packet loss.

The 'obvious fix' of placing an independent record in each packet turns out to work much better with QUIC than TCP. This is because TCP's API is a simple byte stream. Applications, including TLS, have no sense of where packets begin or end and have no reliable control over it. Furthermore, TCP proxies or even HTTP proxies commonly rearrange TCP packets while leaving the byte stream in tact (a proof of the practical value of end to end integrity protection!).

Even the absurd-um solution of 1 byte records does not work because the record overhead creates multibyte sequences that will still span packet boundaries. Such a naive approach would also drown in its own overhead.

QUIC shines here by using its component architecture rather than the traditional layers. The QUIC transport layer receives plaintext from the application and consults its own transport information regarding packet numbers, PMTU, and the TLS keying information. It combines all of this to form the encrypted packets that can be decrypted atomically with the equivalent of one record per UDP packet. Intermediaries are unable to mess with the framing because even the transport layer is integrity protected in QUIC during this same process - a significant security bonus! Any loss events will only impact delivery of the lost packet.

No More TCP RST Data Loss

As many HTTP developers will tell you, TCP RST is one of the most painful parts of the existing ecosystem. Its pain comes in many forms, but the data loss is the worst.

The circumstances for an operating system generating a RST and how they respond to them can vary by implementation. One common scenario is a server close()ing a connection that has received another request that the HTTP server has not yet read and is unaware of. This is a routine case for HTTP/1 and HTTP/2 applications. Most kernels will react to the closing of a socket with unconsumed data by sending a RST to the peer.

That RST will be processed out of order when received. In practice this means if the original client does a recv() to consume ordinary data that was sent by the server before the server invoked close() the client will incur an unrecoverable failure if the RST has also already arrived and that data cannot ever be read. This is true even if the kernel has sent a TCP ack for it! The problem gets worse when combined with larger TLS record sizes as often the last bit of data is what is needed to decode the whole record and substantial data loss of up to 64KB occurs.

The QUIC RST equivalent is not part of the orderly shutdown of application streams and it is not expected to ever force the loss of already acknowledged data.

Better Responsiveness through Buffer Management

The primary goal of HTTP/2 was the introduction of multiplexing into a single connection and it was understood that you cannot have meaningful multiplexing without also introducing a priority scheme. HTTP/1 illustrates the problem well - it multiplexes the path through unprioritized TCP parallelism which routinely gives poor results. The final RFC contained both multiplexing and priority mechanisms which for the most part work well.

However, successful prioritization requires you to buffer before serializing the byte stream into TLS and TCP because once sent to TCP those messages cannot be reordered in the case of higher priority data presenting itself. Unfortunately high latency TCP, requires a significant amount of buffering at the socket layer in order to run as fast as possible. These two competing interests make it difficult to judge how much buffering an HTTP/2 sender should use. While there are some Operating System specific oracles that give some clues, TCP itself does not provide any useful guidance to the application for reasonably sizing its socket buffers.

This combination has made it challenging for applications to determine the appropriate level of socket buffering and in turn they sometimes have overbuffered in order to make TCP run at line rate. This results in poor responsiveness to the priority schedule and the inability for a server to recognize individual streams being canceled (which happens more than you may think) because they have already been buffered.

The blending of transport and application components creates the opportunity for QUIC implementations to do a better job on priority. They do this by buffering application data with its priority information outside of the transmission layer. This allows the late binding of the packet transmission to the data that is highest priority at that moment.

Relatedly, whenever a retransmission is required QUIC retransmits the original data in one or more new packets (with new packet identifiers) instead of retransmitting a copy of the lost packet as TCP does. This creates an opportunity to reprioritize, or even drop canceled streams, during retransmission. This compares favorably to TCP which is sentenced to retransmitting the oldest (and perhaps now irrelevant) data first due to its single sequence number and in-order properties.

UDP means Universal DePloyment

QUIC is not inherently either a user space or kernel space protocol - it is quite possible to deploy it in either configuration. However, UDP based applications are often deployed in userspace configurations and do not require special configurations or permissions to run there. It is fair to expect a number of user space based QUIC implementations.

Time will tell exactly what that looks like, but I anticipate it will be a combination of self-updating evergreen applications such as web servers and browsers and also a small set of well maintained libraries akin to the role openssl plays in distributions.

Decoupling functionality traditionally performed by TCP from the operating system creates an opportunity for deploying software faster, updating it more regularly, and iterating on its algorithms in a tight loop. The long replacement and maintenance schedules of operating systems, sometimes measured in decades, create barriers to deploying new approaches to networking.

This new freedom applies both QUIC itself, but also to some of its embedded techniques that have equivalents in the TCP universe that have traditionally been difficult to deploy. Thanks to user space distribution packet pacing, fast open, and loss discovery improvements like RACK will see greater deployment than ever before.

Userspace will mean faster evolution in networking and greater distribution of the resulting best practices.

---

The IETF QUIC effort is informed by Google's efforts on its own preceding protocol. While it is not the same effort it does owe a debt to a number of Google folk. I'm not privy to all of the internal machinations at G but, at the inevitable risk of omitting an important contribution, it is worth calling out Jim Roskind, Jana Iyengar, Ian Swett, Adam Langley, and Ryan Hamilton both for their work and their willingness to evangelize and discuss it with me. Thanks! We're making a better Internet together.

This post was originally drafted as a post to the IETF HTTP Working Group by Patrick McManus ,

Wednesday, December 5, 2012

Smarter Network Page Load for Firefox

I just landed some interesting code for bug 792438. It should be in the December 6th nightly, and If it sticks it will be part of Firefox 20.

The new feature keeps the network clear of media transfers while a page's basic html ,css, and js are loaded. This results in a faster first paint as the elements that block initial rendering are loaded without competition. Total page load is potentially slightly regressed, but the responsiveness more than makes up for it. Similar algorithms report first paint improvements of 30% with pageload regressions around 5%. There are some cases where the effect is a lot more obvious than 30% - try pinterest.com, 163.com, www.skyrock.com, or www.tmz.com - all of these sites have large amounts of highly parallel media on the page that, in previous versions, competed with the required css.

Any claim of benefit is going to depend on bandwidth, latency, and the contents of your cache (e.g. if you've already got the js and css cached, this is moot.. or if you have a localhost connection like talos tp5 uses it is also moot because bandwidth and latency are essentially ideal there).

Before I get to the nitty gritty I think its worth a paragraph of Inside-Mozilla-Baseball to mention what a fun project this was. I say that in particular because it involved a lot of cross team participation on many different aspects (questions, advice, data, code in multiple modules, and reviews). I think we can all agree that when many different people are involved on the same effort efficiency is typically the first casualty. Perhaps this project is the exception needed to prove the rule - it went from a poorly understood use case to commited code very quickly. Ehsan, Boris, Dave Mandelin, Josh Aas, Taras, Henri, Honza Bambas, Joe Drew, and Chromium's Will Chan helped one and all - its not too often you get the rush of everyone rowing in sync; but it happened here and was awesome to behold and good for the web.

In some ways this feature is not intuitive. Web performance is generally improved by adding parallelization due to large amounts of unused bandwidth left over by the single session HTTP/1 model. In this case we are actually reducing parallelization which you wouldn't think would be a good thing. Indeed that is why it can regress total page load time for some use cases. However, parallelization is a double edged sword.

When parallelization works well it is because it helps cover up moments of idle bandwidth in a single HTTP transaction (e.g. latency in the handshake, latency during the header phase, or even latency pauses involved in growing the TCP congestion window) and this is critically important to overall performance. Servers engage in hostname sharding just to opt into dozens of parallel connections for performance reasons.

On the other hand, when it works poorly parallelization kills you in a couple of different ways. The more spectacular failure mode I'll be talking about in a different post (bug 813715) but briefly the over subscription of the link creates induced queues and TCP losses that interact badly with TCP congestion control and recovery. But the issue at hand here is more pedestrian - if you share a connection 50 ways and all 50 sessions are busy then each one gets just 2% of the bandwidth. They are inherently fair with each other and without priority even though that doesn't reflect their importance to your page.

If the required js and css is only getting 10% of the bandwidth while the images are getting 90% then the first meaningful paint is woefully delayed. The reason you do the parallelism at all is because many of those connections will be going through one of the aforementioned idle moments and aren't all simultaneously busy - so its a reasonable strategy as long as maximizing total bandwidth utilization is your goal.. but in the case of an HTML page load some resources are more important than others and it isn't worth sacrificing that ordering to perfectly optimize total page load. So this patch essentially breaks page load into two phases to sort out that problem.

The basic approach here is the same as used by Webkit's ResourceLoadScheduler. So Webkit browsers already do the basic idea (and have validated it). I decided we wanted to do this at the network layer instead of in content or layout to enable a couple extra bonuses that Webkit can't give because it operates on a higher layer:

If we know apriori that a piece of work is highly sensitive to latency but is very low in bandwidth then we can avoid holding it back even if that work is just part of an HTTP transaction. As part of this patchset I have included the ability to preconnect a small quota of 6 media TCP sessions at a time while css/js is loading. More than 6 can complete, its just limited to 6 outstanding at one instant to bound the bandwidth consumption. This results in a nice little hot pool of connections all ready for use by the image transfers when they are cleared for takeoff. You could imagine this being slightly expanded to a small number of parallel HTTP media transfers that were bandwidth throttled in the future.
The decision on holding back can be based on whether or not SPDY is in use if you wait until you have a connection to make the decision - spdy is a prioritized and muxed-on-one-tcp-session protocol that doesn't need to use this kind of workaround to do the right thing. In its case we should just send the requests as soon as possible with appropriate priorities attached to them and let the server do the right thing. The best of both worlds!

This issue illustrates again how dependent HTTP/1 is on TCP behavior and how that is a shortcoming of the protocol. Reasonable performance demands parallelism, but how much is dependent on the semantics of the content, the conditions of the network, and the size of the content. These things are essentially unknowable and even characterizations of the network and typical content change very quickly. Its essential for the health of the Internet that we migrate away from this model onto HTTP/2.

Comments over here please: https://plus.google.com/100166083286297802191

Thursday, December 8, 2011

SPDY, Bufferbloat, HTTP, and Real-Time Networking

Long router queue sizes on the web continue to be a hot networking topic - Jim Gettys has a long interview in ACM queue. Large unmanaged queues destroy low latency applications - just ask Randell Jessup

A paper like this does a good job of showing just how bad the situation can be - experimentally driving router buffering delay from 10ms to ~1000ms on many common broadband cable and DSL modems. I wish the paper had been able to show me the range and frequency of that queue delay under normal conditions.

I'm concerned that decreasing the router buffer size, thereby increasing the drop rate, is detrimental to the current HTTP/1.x web. A classic HTTP/1.x flow is pretty short - giving it a signal to backoff doesn't save you much - it has sent much of what it needs to send already anyhow. Unless you drop almost all of that flow from your buffers you haven't achieved much. Further, a loss event has a high chance of damaging the flow more seriously than you intended - dropping a SYN or the last packet of the data train is a packet that will have very slow retry timers, and short flows are comprised of high percentages of these kinds of packets. non-drop based loss notification like connex/ecn do less damage but are ineffective again because the short flow is more or less complete when the notification arrives so it cannot adapt the sending rate.

The problem is all of those other parallel HTTP sessions going on that didn't get the message. Its the aggregate that causes the buffer build up. Many sites commonly use 60-90 separate uncoordinated TCP flows just to load one page.

Making web transport more adaptable on this front is a big goal of my spdy work. When spdy consolidates resources onto the same tcp flow it means the remaining larger flows will be much more tcp friendly. Loss indicators will have a fighting chance of hitting the flow that can still backoff, and we won't have windows growing independently of each other. (Do you like the sound of IW=10 times 90? That's what 90 uncorrelated flows mean. IW=10 of a small number of flows otoh is excellent.). That ought to keep router queue sizes down and give things like rtcweb a fighting chance.

It also opens up the possibility of the browser identifying queue growth through delay based analysis and possibly helping the situation out by managing inside the browser our bulk tcp download rate (and definitely the upload rate) by munging the rwin or something like that. If it goes right it really shouldn't hurt throughput while giving better latency to other applications. It's all very pie in the sky and down the road, but its kind of hard to imagine in the current HTTP/1.x world.

Wednesday, February 2, 2011

HTTP Parallel Connections (Firefox edition!)

Parallelism helps when

It hides network idleness during TCP Handshakes though persistent connections help with this too.
It hides network idleness during the first byte phase transactions, though pipelining can address this too.
It hides network idleness during TCP slow start wait-for-ack periods. This is a big one.
It provides a mechanism to prioritize and avoid head of line blocking problems.
It steals bandwidth from competing "tcp friendly" flows by simply increasing the number of flows in one application. That's an arms race that most people think should be avoided.

Parallelism hurts when

It increases the number of TCP Handshakes which are both slow and CPU intensive (at least compared to regular data packets) to execute - this assumes persistent connections are an alternative.
It increases the overhead of normal data processing because more flows have to be considered typically via longer hash chains
It increases the impact of memory overhead and processor cache pollution by increasing the number of simultaneous TCP control blocks that have to managed on both the client and the server.
The resulting reduced amount of data per flow makes it harder to fully open sender congestion windows.
Packet loss is increased due to the non correlated fluctuations of data to be sent between the parallel connections. Two competing flows that are both sending from infinite data sources will quickly adapt to share the bandwidth, but two flows that have a fluctuating demand (e.g. parallel persistent HTTP connections that periodically go idle and alive) will inherently have patterns of underutilizing and overutilizing the path. Overutilization results in either packet loss or excess buffering in the network, which leads to poor interactive response times.

When should we open a new parallel connection?

when I don't have an idle connection and I need the answer with minimum latency
when I expect existing connections are experiencing idleness and therefore not using all of the available bandwidth

The approach HTTP implementations, including firefox, take to solving this quandary? They crudely enforce a constant number of connections per host and open them until they hit that limit. Variously across time that limit has commonly been 2, 4, and 6. As server technology has evolved to the point many years ago where the impact of idleness was a bigger deal than the CPU overhead on the server we saw servers actually publish their resources under several virtual host names, even though it was all the same server, for the exclusive purpose of circumenting that per-host limit in the client.

I wonder if we can't do better in Firefox.. First, lets deal with the case of a low latency request. Right now all we do with them is to put them at the top of the waiting queue if the request cannot be dispatched immediately (because the limit of 6 has already been reached). But there are really two cases to consider:

What to do when the network is not already saturated
What to do when the network is saturated

In both cases the first step for a truly low latency request is the same - open a new connection assuming there isn't an idle one available. However, note that establishing that connection is going to take at least 1 RTT for normal HTTP and 2 RTT for HTTPs - so we should actually watch for any existing HTTP transactions to complete on a different reusable persistent connection in between the time we start opening the new connection and the time the handshake is complete. If that happens the persistent connection should be used instead - that will require a change in the current logic where nsHttpConnection opens the sockets after it has been assigned a transaction. Instead nsHttpConnectionMgr should be opening the sockets as well as receiving the returned persistent connections and then should dispatch to them as they become available.

In the case of a saturated network some of the existing parallel connections should be stalled while the low latency request is satisfied in order to provide the most bandwidth for that important transaction. We can do this by temporarily slamming their recv windows to something close to 1 packet of data which will slow them down to a trickle. This can be done commensurately with the transmission of the prioritized request as it should take 1/2 RTT for the window change to reach the sender.

But what about the more common case where all transactions are of equal priority - how do we make the decision then about opening a new connection vs queueing a new transaction? Assuming we aren't concerned about head of line blocking issues (which we should be able to wrap up in a definition of priorty somehow), then we want to do this only when there is network idleness that can be covered up by parallelism. This approach is radically different than "open up to N" connections.

It isn't obvious exactly how to determine that in Necko. But then again, you are looking for data bursts followed by idleness - and its pretty obvious when you see it graphed out. This is the transfer pattern of a single http response I looked at a couple of weeks ago - it could happily overlap with another flow in order to more effectively utilize the whole pipe. (of course, if the server used a larger initial CWND, the problem would be massively reduced.)

Separating HTTP Connections from TCP Connections

The firefox http connection implementation, and most others I have seen, binds the http connection and the tcp connection together 1 to 1 something like this:

Check pconn pool for idle http connection
if that succeeds, dispatch
if limits allow, open a tcp connection and when that completes dispatch on it
otherwise queue it and goto 1 when an existing transaction completes

As part of the refactoring of bugzilla 623948 I have separated this logic out a little bit down to its tcp roots:

Check pconn pool for idle http connection
if that succeeds, dispatch
queue transaction
if limits allow, start a tcp connection which will be added to the pconn pool when it completes
whenever a pconn is available or a transaction completes check queue for dispatch

The major difference is that when a transaction in the first scenario decides to open an http connection it always waits for that connection to complete and then dispatches on it. In the second scenario the two actions are taken independently, if another connection frees up before the newly demanded TCP connection is ready we can use that instead (and then cache the connection when it does complete in the pconn pool).

It turns out this happens, anecdotally, a whole freaking lot. My test network has a delay of about 250ms between Firefox and each server.

Loading the NY Times required 219 TCP connections of which 141 (64%) were able to be served on a different persistent connection that became available between the time we started to open the connection and the time the handshake actually completed.

Loading the wall of one of my facebook pals saw this behavior on 9 of 27 connections (33%), and the cnn home page performed similarly to NYT - 76/117 (65%).

This makes some intuitive sense - if you have a piece of HTML that includes an img it is entirely likely that we will start the request for the img before we have finished transferring the HTML, but the HTML will finish transferring before the new handshake (which will take a full RTT) completes. This algorithm just moves the request for the img over to the HTML persistent connection which can proceed as soon as the HTML is done.

The amount of latency saved is variable - indeed it is probably at least somewhat uniformly random with 0 and RTT as its bounds. I was seeing a mean of around 80ms on each one. Of course this doesn't apply to all transactions - just ones that are opening TCP connections to servers that also do persistent connections.

but its still cool.

Tuesday, January 11, 2011

Firefox Idle Connection Selection via CWND

When choosing between more than 1 idle persistent connection FF <= 4.0 goes with a FIFO approach. I was thinking about ways to tune this.

I was going to experiment with making that a LIFO. A LIFO afterall should have better cache behavior and would allow the cache size to shrink to a more natural size as the older members remain untouched and timeout. The FIFO will basically keep the size pinned at its maximum while it cycles through all the connections which wastes system RAM with both the client and server maintaining extra idle TCP sessions. It is also the worst possible processor cache behavior. The possible argument in favor of FIFO is actually that connections are expensive to open and we've already opened these so maybe we want to keep it pinned to its maximum size just in case we need it again - it isn't obvious what to do or if it matters much.

Thinking a little further I realized that the major differentiator between these sockets is not a timestamp at all - it is the CWND associated with them on the server. Many web servers at least partially ignore the TCP suggestion to return to slow start after an RTO of idle activity so it is reasonable to assume that some of the connections have larger CWNDs than others as long as we aren't in an environment where the previous transfers have been actually bottlenecked by bandwidth - and on the web that almost never happens, RTT and document size bottleneck most transfers. Even if available bandwidth is the real bottleneck that just moots our metric, it doesn't provide any information that steers us the wrong way.

When choosing which connection to use we want to choose the one that has the largest CWND on the server. Unfortunately we cannot see directly into the congestion information on the server, but assuming that previous transfers have been bottlenecked by RTT (approximately a constant for all connections to the same server) and transfer size then we can just use the one with the history of moving the largest amount of data in one burst as a proxy for it because that burst is what opens the congestion window.

I say one burst instead of one document because we want to include pipelines of multiple HTTP transactions as long as there wasn't an RTT between them. This is another reason to want pipelines - they will more effectively open up the CWND for us. We also want to use the maximum burst seen on the connection, not the necessarily the last burst - the servers we are interested in don't shrink the CWND just because it isn't all being used for each burst.

The implementation is easy - the idleconnection list is changed from being sorted as a FIFO to being sorted according to this "maxburst" metric. The code and bugzilla are here.

Using an experiment designed to show the best case, the results are better than I expected for such a minor tweak. This was my process:

construct a base page mixed with several small and several large images plus a link to a 25KB image. There are 6 objects on the base page.
load the base page - FF4 will use six parallel connections to do it
click on the link to the 25KB image - this will use an idle persistent connection. Measure the performance of this load.

There is 250ms of RTT and several megabits of bandwidth between the client and server in my test.

As expected, vanilla FF 4.0 (beta 9) loads the target image on an idle persistent connection that was used to transfer one of the smallest images on the base page. It is FIFO afterall, and the small image load was going to finish first and be put into the persistent connection pool first.

My patched browser, using the sort by CWND algorithm, loads the target image on the idle persistent connection that was used to transfer the largest image (2MB+) from the base page.

Their history is the only difference between the two connections - at the time of requesting the 25KB image they are both connected and idle. There is a 250ms RTT between my host and the server.

The difference is huge. The default algorithm takes 3 round trips to transfer the document as it works its way through growing the congestion window. That adds up to 793ms. The sort-by-cwnd algorithm inherits a congestion window large enough for the task at hand and moves it all in one stream - just 363ms.

This is a nice tweak, but it has its limitations - by definition you cannot meaningfully multiply the gain across a large number of transactions. If you have a large number of transactions then you probably are using all your idle connections in a burst and there is no point in discriminating between them if you are just going to use them all.

I would argue if you have that much pressure on the connection pool then you are probably queueing requests and should be using pipelining. If you don't have that much pressure, then this probably helps you.

Wednesday, December 1, 2010

Performance of Pipelining in HTTP Firefox

This post provides some performance measurements of my HTTP pipeline patches for Firefox.

The key benefits of pipelining are reduced transaction latency and potentially the use of fewer connections, so those are generally the metrics I will focus on here. For each of my 5 test cases we will look at the following statistics:

The percentage of requests that are delayed in the request queue inside Firefox waiting for a connection to be available.
The average amount of queue latency for each transaction. This is measured as the time the first byte of the request is given to the kernel minus the time at which the request was presented to Necko/Gecko. It is generally very low but greater than 0 even if the transaction can be placed directly on a pipeline because it takes a moment to construct the request and perhaps schedule the socket thread if other things are on the CPU. But a high value is an opportunity lost - that is time the request could be in flight and the server could be processing it. It includes the time necessary to create a new connection if that is necessary - that includes the three way TCP handshake but not a DNS lookup (which is cached in the test).
The type of connection used for each transaction (new connection, an idle persistent connection, or a pipelined connection)
The average amount of transaction latency for each transaction. This is measured as the time of the first byte of the response being received by firefox minus the time at which the request was presented to Necko/Gecko. It is possible for the average improvement to be greater than 1 RTT because of cumulative queueing delays in the non pipelined case.
The cumulative fraction of transactions completed before 3 different elapsed times in order to show improved execution time for the test case. The benchmark times are sized appropriately for each test case.

There are 4 data points for each criteria in each test case. Because pipelining is aimed at environments with significant latency and my broadband test connectivity has below average latency for much of the world and every mobile environment the first two data points have 200ms of latency added through a traffic shaper. The two datapoints compare pipelining on vs pipelining off. The other data points measure the same things but without the induced latency.

All tests are run with both a disk and memory cache enabled but empty at the beginning of the run. In order to measure the effectiveness of the pipelining, each of these sites has been put in the "green - pipelining ok" state which is normally auto discovered.

Facebook

The first test is on Facebook. It starts by logging in and then selecting a particular user profile and navigating from that page via lists of friends, occasionally pulling up individual profiles, generating lists of recent updates, and pressing the More link on busy Facebook walls. There are approximately 1400 HTTP transactions in each test run.

The first thing to consider is the percent of requests that are delayed (i.e. queued) within firefox. I think queuing is particularly bad because if the request hasn't been passed to the network there is no way for any advances in server technology to ever operate on it. For instance - servers are prevented from returning responses out of order but nothing would prevent them from processing requests out of order in order to overlap latencies in DB queries and disk I/O if the requests were not queued on the browser side.

Percent of Requests Queued
	With Pipeline	Without Pipeline
Moderate Latency	0	79.1
Low Latency	0	78.6

That is a stark contrast. It is possible by the way to see a request being queued with pipelining enabled - a default configuration limit of 32 governs the maximum depth of the pipeline and not all request types are pipeline eligible.

Average Queue Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	29	1630
Low Latency	6	285

You might expect to see 0ms for the pipelining case as we just illustrated above that no requests were delayed. But the queue latency covers the time from request submission to the time of putting the first byte of the request on the wire, so that includes any connection setup time when establishing a new connection. That is the primary source of the latency seen here for the pipeline enabled case.

That begs the question, when pipelining is enabled how many of the requests are pipelined?

Connection Type Pct
	Moderate Latency w/Pipeline	Moderate Latency wo/Pipeline	Low Latency w/Pipeline	Low Latency wo/Pipeline
New	2	4	2	3
Reused Idle	13	96	13	97
Pipeline	85	0	84	0

We see here a moderate reduction in the number of connections used when pipelining, but most of the effect is a transfer from idle persistent connections over to pipelines. While the percentage of new connections as a portion of the overall request stream has gone down just a tick with pipelining, the impact on the actual number of raw connections is significant - going from roughly 60 without pipelining to 30 with it enabled. That boils down to a 50% reduction in the number of connections created which is a significant provides a very busy site like Facebook a significant scalability boost.

The final criteria deal with transaction latency.

Average Transaction Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	702	1906
Low Latency	341	346

Yowza! Now there is a result. Under conditions with moderate latency the average transaction waits 1200ms less from the time the request is submitted to Necko to the time the first byte of the response header is received. The net effect is so much more than the approx ~250ms RTT because of aggregating queueing delays - without pipelining enabled you are placed in a deep queue which has to be totally cleared with a 1RTT overhead on each one before you are executed. The impact under low latency conditions is probably close to being noise.

Pct of Responses Rcvd in < Xms
	x=1500	x=1200	x=900
Moderate Latency w/Pipeline	97	91	75
Moderate Latency wo/Pipeline	45	38	32

Pct of Responses Rcvd in < Xms
	x=1000	x=700	x=400
Low Latency w/Pipeline	99	92	61
Low Latency wo/Pipeline	99	91	61

Facebook is a big success - probably the biggest success of any of the tests. 200+ms latency situations have performance significantly increased, and low latency scenarios perform similarly while using a few less TCP connections.

Amazon.com

The Amazon.com test walks through a basic window shopping experience at Amazon.com. The home page is loaded, the kindle link is clicked, a few more categories are clicked and the lists of products are generally browsed and sorted by "hot and new" and other similar things. This boils down to about 800 HTTP transactions.

Percent of Requests Queued
	With Pipeline	Without Pipeline
Moderate Latency	0	54.4
Low Latency	0	39.8

Right away you can see that amazon queues fewer requests than Facebook, so the potential improvement is less.

Average Queue Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	116	791
Low Latency	12	136

Connection Type Pct
	Moderate Latency w/Pipeline	Moderate Latency wo/Pipeline	Low Latency w/Pipeline	Low Latency wo/Pipeline
New	12	13	6	13
Reused Idle	20	96	28	87
Pipeline	68	0	66	0

The first thing to note is that less pipelining is going on that with facebook, so again there is less potential for improvement. How the pages are constructed has a lot to do with this (perhaps fewer images, etc..). But almost as interesting is the fact that the number of TCP connections (i.e. new connections) is halved in the low latency case. If the page can be transferred in the same amount of time using fewer connections that is still a win for the web overall.

Average Transaction Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	635	1083
Low Latency	266	204

An interesting result - 400ms off the average transaction in the ~250ms RTT environment, but a notable loss in the low latency scenario. All of the numbers here are averages across two test runs, but just inspecting the amazon test case in particular on some other ad-hoc runs showed quite a bit of variability. My suspicion is server load occasionally results in a single resource taking a long time to return. I have seen this disable pipelining for HTML pages, but leave it enabled for images, in the past.

Pct of Responses Rcvd in < Xms
	x=1200	x=900	x=600
Moderate Latency w/Pipeline	91	79	62
Moderate Latency wo/Pipeline	77	69	61

Pct of Responses Rcvd in < Xms
	x=1000	x=700	x=400
Low Latency w/Pipeline	93	90	84
Low Latency wo/Pipeline	99	94	85

Flickr

The flickr test is probably the simplest of the cases. It simply loads several galleries based on set names and tags. There are roughly 350 HTTP transactions in the test. Under normal conditions Flickr has a high variability in server response time.

Percent of Requests Queued
	With Pipeline	Without Pipeline
Moderate Latency	0	57
Low Latency	0	57

Average Queue Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	43	814
Low Latency	7	211

Connection Type Pct
	Moderate Latency w/Pipeline	Moderate Latency wo/Pipeline	Low Latency w/Pipeline	Low Latency wo/Pipeline
New	10	27	12	27
Reused Idle	19	73	19	73
Pipeline	71	0	69	0

As with the other tests, more than half of the new connections have been replaced when pipelining is enabled.

Average Transaction Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	859	1091
Low Latency	291	366

This result is more modest, but still positive, when compared to our other tests.

Pct of Responses Rcvd in < Xms
	x=2000	x=1500	x=1000
Moderate Latency w/Pipeline	95	87	67
Moderate Latency wo/Pipeline	88	74	60

Pct of Responses Rcvd in < Xms
	x=1000	x=700	x=400
Low Latency w/Pipeline	99	97	72
Low Latency wo/Pipeline	98	93	71

www.AsiaNewsPhoto.com

The test is photo journalism clearing house site located overseas and therefore the broadband low latency case has a starting RTT of closer to 100ms, while the moderate delay case adds 200ms to that. This is the smallest test case - just 175 transactions in each run.

Percent of Requests Queued
	With Pipeline	Without Pipeline
Moderate Latency	0	44
Low Latency	0	38

Average Queue Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	21	726
Low Latency	31	400

I am not yet certain how to explain the very modest rise in queue time for the pipeline case when the added 200ms delay is removed. It must involve an aberrant TCP connection as that is really the only component of queue time when the requests them selves are not delayed due to connection limits.

Connection Type Pct
	Moderate Latency w/Pipeline	Moderate Latency wo/Pipeline	Low Latency w/Pipeline	Low Latency wo/Pipeline
New	16	7	11	8
Reused Idle	33	93	33	92
Pipeline	51	0	56	0

This is the first time we actually see the new connection numbers moving in the wrong direction. In this case I believe the type scheduling restrictions placed on the connection manager are generating new connections that may have been un-necessary in the non pipelining scenario. I'm curious if the effect would fade in a test with more transactions.

Average Transaction Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	1310	1248
Low Latency	752	689

This seems to track the changes in connection types and maybe the test bears further examination to see if an adjustment can be made. The scheduling algorithm seems to be getting in the way of itself and has made performance just a tick worse than before, though not by very much. And certainly not by enough to discount the gains made in some other scenarios.

Pct of Responses Rcvd in < Xms
	x=2000	x=1500	x=1000
Moderate Latency w/Pipeline	84	82	52
Moderate Latency wo/Pipeline	82	76	54

Pct of Responses Rcvd in < Xms
	x=1500	x=1000	x=500
Low Latency w/Pipeline	88	82	46
Low Latency wo/Pipeline	84	82	57

MapQuest

This test is different in that it is driven almost exclusively through JS and XMLHttpRequest. Those elements are present in the Facebook and Amazon tests as well, but they dominate the MapQuest test. In this scenario a map is brought up on the screen and it is manipulated in the usual ways - panning in 4 directions, zooming in and out, and toggling between satelline and map mode. By the time it is done, 711 HTTP transactions have been made.

Percent of Requests Queued
	With Pipeline	Without Pipeline
Moderate Latency	0	42
Low Latency	0	44

Average Queue Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	52	381
Low Latency	10	198

For all cases, the queue latency is pretty low for this test. That means the number of documents requested in one burst is relatively modest.

Connection Type Pct
	Moderate Latency w/Pipeline	Moderate Latency wo/Pipeline	Low Latency w/Pipeline	Low Latency wo/Pipeline
New	12	16	11	14
Reused Idle	30	84	32	86
Pipeline	58	0	57	0

Marginally less new connections are used with pipelining. Hurrah.

Average Transaction Latency (ms)
	With Pipeline	Without Pipeline
Moderate Latency	677	732
Low Latency	260	500

Pct of Responses Rcvd in < Xms
	x=1500	x=1200	x=900
Moderate Latency w/Pipeline	96	87	76
Moderate Latency wo/Pipeline	96	83	70

Pct of Responses Rcvd in < Xms
	x=900	x=600	x=300
Low Latency w/Pipeline	99	95	65
Low Latency wo/Pipeline	87	75	40

Tuesday, November 30, 2010

Implementing a Pipelined Client for Firefox

I have been working on a set of patches to revamp the Firefox HTTP pipeline implementation. The objective is to provide a fast and robust implementation of pipelines. There are lots of reasons to think this is a good thing.

The lowest level of the current implementation, which is disabled by default, receives a few minor tweaks. The primary change is to implement a continually fillable pipeline instead of the existing one where a pipeline is loaded up to whatever depth the request queue supports and then is not refilled until it has been emptied.

There are much more serious changes on top of that. The most significant is probably captured in the "Select Connection Based on Type and State" bug and patch. This code does four things: 1) it classifies each transaction into a particular type, 2) it keeps track of the history of each server with respect to pipelining, 3) it determines whether or not pipelining is appropriate based on the type of the transaction and the history of the server with that type in the past, and 4) if it decides to pipeline it places it on a pipeline filled with only transactions of the same classification.

That's a lot to think about. But the basic idea is to sort requests into control traffic (i.e. js and css), images, revalidations, htmlish things, and things known to be pipeline inappropriate such as video, most XMLHttpRequest, non-idempotent transactions, etc.. I call the latter classification "solo".

Classes of data such as CSS and revalidations that generally have very short responses, and thus benefit the most from pipelining, favor pipelines over parallel connections even when we have less than the maximum number of connections already open.

A variety of things can plague a successful pipelining session - for instance a site may have some documents with a large processing time on the server and that "think time" can really gum up the pipeline. By segregating the types of documents we can turn off pipelining for whatever is DB driven (e.g. XHTML) while still chasing down deep pipelines for images. Facebook is a terrific example of this.. the contents of your wall or your friend list probably have to be scooped out of a database computation and composed in real time, but the dozens of icons they reference are just whipped quickly right out when you have their direct key.

This concept of negative feedback (in the example above it would be a read of the HTML page with a latency well in excess of the RTT to the server) is what drives the server state. Very large responses, sub HTTP/1.1 responses, cancelled pipelines, closed connections, and server headers known to be associated with broken pipeline implementations all trigger negative feedback too.

Most feedback is tied to the classification of the particular request, but some is applied to all classes on the server. Corrupted data is one such case - if the response contains an invalid MD5 sum, or uses the proposed Assoc-Req header but fails to provide the correct information, or appears to have non-sensical framing information, all requests to that server are prevented from using pipelines for a long time. In the past such corruptions have happened due to buggy or compromised servers and intermediaries.

Much like congestion control, the determination of whether or not to proceed with a pipeline is based on both negative and positive feedback. Initially pipelines are not sent. We must first see a HTTP/1.1 response header from the server. After that has been accomplished we try and send a pipeline of only depth 2 and only on a single connection. If both of those responses are received OK then we transition from that tentative state (known internally as yellow) to a position where each connection can send pipelines of up to depth 4 instead of opening all the way. Even after the depth-of-2 test succeeds we can still not be certain the topology supports pipelines - what appears as a short pipeline at the client may not appear that way at the server due to race conditions inherent in network transfer, but the extension to a depth of 4 for every connection still represents significant additional capacity over a single connection being allowed a depth of 2. During this phase concurrent connections are of course also used, so nothing is slowed down over the historical pipelining disabled setting.

Once a transaction that was sent at at least a depth of 3 is successfully received the depth limits are removed from all connections to that host. The new maximum depth is only governed by a configuration preference with the default of 32. Should one of the negative events mentioned earlier occur, that class of transaction is placed in the red state (i.e. no pipelining is allowed for a period of time to that server) generally without interfering with the other transactions.

As mentioned, an unexpected delay of a few hundred milliseconds is considered to be think time and feedback is applied to keep things running smoothly with only a barely perceptible bump in the road. But a longer delay not only applies feedback for future use but it also cancels any transactions pipelined after the currently delayed one and moves them to new connections. This helps mitigate the head of line blocking problem if it is really severe and as a side effect no more pipelines will happen in the near future with that server. In that sense firefox is self correcting in a hostile environment.

All of this negative information expires over time but while it is still valid firefox will keep it persistently between restarts - its value was hard earned.

XMLHttpRequests are a special source of pain in traditional pipeline implementations. They often implement long polling patterns where a request is sent to the server and the server intentionally hangs until some external event occurs - it then shares that information with the client by forming a response. It is essentially server push from a data point of view implemented with a long polling request. For that reason, XHR is by default classified as solo, but meta data can be supplied in the form of an HTTP request header to allow the application to indicate the request is expected to be fulfilled quickly.

As they say on late night TV, "But Wait! There's more!" Sometimes the pipelining infrastructure problem isn't on the server side, sometimes it is a proxy that is part of the client's topology. To test for that there is a new startup test which initiates a deep pipeline to a known pipeline friendly resource on the Internet. This server intentionally defers its responses until a deep pipeline has been received there and only then leaks a small response. Both sides continue a pipeline on the same connection until it has been confirmed that an entire window of data can be supported by the HTTP devices on that path. The results of this test are cached for a very long time. This transfer also takes the opportunity to share with the client a list of host names that should be blacklisted with respect to pipelining because of known operability problems. (You can still use them - just not pipeline with them). The location (and enablement) of the test and hostname blacklist server are configurable.

All of this feedback and history tracking sounds intimidating, but the truth is that most of the web works just fine. There are enough problems that a feedback driven self correcting implementation helps smooth out the bumps, but most of the web opens right up without any problems.

Mail your comments to me or provide a talk back link.

The Value of HTTP Pipelines

For the past few months I've been on a personal quest to implement a safe and effective HTTP pipelining strategy for Mozilla Firefox. The pipeline concept is part of HTTP/1.1 but for various reasons has not been widely deployed on the web.

I'm going to make a series of three posts. This one will be basic background information - what pipelines are in the abstract and why they are more relevant to today's web than they have been in years. The second post will be about the details of my Firefox implementation and its mechanisms for dealing with the realities of the occasional pipeline-hostile piece of infrastructure. The last post will share some performance metrics of those patches in common use cases.

HTTP Pipelines - 101

A client forms a pipeline simply by sending a second HTTP request down an HTTP connection before it has received the first response. A pipeline may be more than two transactions deep. HTTP responses are still required to be returned in the order their requests arrived.

This is a simple concept, but the benefits are profound.

Chiefly, the pipelined transactions benefit by eliminating the latency involved in sending their request. The greater the latency, the greater the benefit.

This crude diagram shows 2 normal HTTP transactions without pipelining. In it each request is shown with a rightward arrow and the response is shown with a leftward arrow. The empty space represents the network latency. When the first response is received, the second is sent.

When pipelining is used, the picture becomes different. Even though the request and response sizes are the same, and the round trip time between the hosts is the same, the overall transaction completes sooner. It is obvious that the round trip latency between the client and the server has been mitigated as a problem.

The greater the latency and the deeper the pipeline the more benefit is captured.

While bandwidth continues to improve, latency is not keeping up. The trend is toward mobile networks and those have latencies 5x to 20x worse than broadband WAN connections. This increased latency is an opportunity for pipelining.

HTTP Pipelines - 201

Conceptually, parallel HTTP connections can garner similar benefits. Afterall, independent HTTP connections do not need to wait for each other to finish before beginning their work.

Up to a point, that is completely true. Parallelism has been the mainstay for many years and served us well. But it has its limits.

The chief limit is actually specified. No more than 6 (or 4, or 2 depending on the point in history) simultaneous connections are supposed to be allowed from the user agent to the server. This is a serious handicap - Firefox may easily have a request queue of over 100 images, stylesheets, and javascript items to fetch upon parsing a Facebook HTML page full of photos, emoticons, ads, and widgets. A single digit number of connections is far too small to effectively parallelize that download.

This brings us to the reason there is a limit on the number of connections. Making a new TCP connection sucks. Before it can be used at all it requires a high latency three way handshake and if it is over TLS an expensive RSA operation on the server (and yet another round trip). Making the new connection requires access to shared data structures on the server in a way that using an existing connection does not and harms scalability. Servers that can pump multiple gigabits a second of TCP data in their sleep through a few connections, can still only initiate on the order of tens of thousands of connections a second.

Maintaining all those connections is expensive for the server too. Each one takes state (i.e. RAM in the kernel, perhaps a thread stack in the application), and the server now has to deal with sorting through more control blocks and application states everytime a packet comes in in order to match it with the right one. Because of the increased diversity of TCP control blocks in use, the L2/L3 cache is severely polluted which is the number one factor in high performance TCP.

Even if pipelines only mean browser performance equal to parallel connections but accomplished using fewer connections then that is a win for Web scalability.

HTTP Pipelines - 301

The details of TCP start to weigh in heavily in favor of pipelining when parallelism and pipelining are compared.

Most significantly, TCP slow-start performs poorly in the parallelized environment. To recap what you already know: Slow start requires the sender to send a conservative amount of data at first (typically 3 or 4 packets) and then wait a full round trip time to receive positive feedback in the form of an ACK from the recipient. At that time it can grow the window a little bit and send another burst and then wait again. After enough iterations of this the sender is generating enough traffic in each burst to "fill the window" and it no longer has to wait.

Pipelining, of course, does not eliminate slow start. But it does use fewer connections to move the same amount of data and the slow-start penalty scales with the number of connections not the amount of data. Fewer connections mean less data is transferred under the artificially slow conditions of slow start.

There is another wrinkle that applies to even parallel connections that have paid their start-up dues and progressed past slow start. TCP connections that have not sent data within an RTO (think of an RTO as a round trip time plus a grace period) are supposed to go back to slow-start! From a TCP point of view this makes some sense - the inactivity means the TCP stack has lost the ack-clock that it needs to use the connection aggressively. But for even a persistent use of a parallel HTTP connection this is a disaster. Effectively each transaction response must go through slow start because it takes more than a round trip for the last packet of the previous response to travel to the client, the client to form the next request, the request to travel to the server and the server to form the response. Pipelined connections do not have that problem - one response follows immediately on the heels of the previous response which ensures optimal TCP use of the available bandwidth. Huzzah.

Lastly, it is worth talking about Google's push for larger initial windows during slow start. They support a value of 10 instead of the current 3 or 4. Basically I think this is a good thing - the Internet can support larger bursts than we are currently sending. Unfortunately, this has a terrible potential interaction with parallel connections. 6 connections would essentially mean 60 congestion control packet credits available to be sent in a burst at any time. Just as 3 is too few for a single connection environment, 60 is too many for a multiple connection environment. Whereas pipelines reduce the reliance on parallel connections, they bring down the total number of packet credits outstanding at any time while still allowing for more effective slow start capacity probing. That's a win win in my book.

Wednesday, February 24, 2010

Forwarding Decisions with Bloom Filters

Really neat paper: "Hash, Don't Cache: fast packet forwarding for enterprise edge routers". The paper and abstract are both available on line. The authors are Minlan Yu, and Jennifer Rexford - both of Princeton.

The authors share a lament of mine - caching forwarding decisions is attractive but no longer realistic as an implementation. The diversity of addresses (including those generated randomly by attackers) pollutes the cache and sends way too much traffic to slow fallback processing paths. So we end up with routers with one class of memory and no caches. Generally their memory is made totally from the really expensive fast stuff.

But the suggestion in this paper of using hashed bloom filters instead of caches is a really cool one. Essentially maintain one filter per interface in fast memory (sram probably) and evaluate those at forwarding time in parallel if you can. You probably just get a single hit and can forward the packet onwards.. you do this with only needing enough fast ram for the hash filters (i.e. not much).. whereas all the traditional trie data and routing update code can live in slow and cheap dram.

of course, due to the false positive nature of bloomies it is possible that you'll get more than one match that is going to require some kind of (probably slower) fallback plan for that packet. But the problem is nowhere near as dire as it is with caches.. with caches the misses force all the valid entries out of the cache and then all traffic is slowed down as the cache rate drops - with bloomies only the packets with the false positives are impacted, almost all of the traffic goes through the fast path unimpacted. The paper puts the false positive rate somewhere on the order of 1 in tens of thousands (depending on a bunch of factors - but that gives a feel for it.) Furthermore a cache can be intentionally attacked with a diversity of addresses in order to flood the cache and impact service for everyone - where under bloom filters such an attack is no more or less likely to impact service than any other kind of packet.

What can't you apply a bloom filter to? It's all pretty cool. The paper has a number of other details on how to minimize false positives and efficiently process route updates.

There is a related work from the same authors on a "buffalo" architecture which, after just a quick glance, appears to apply similar principles to LAN switching.

J Rexford is also an author of Web Protocols and Practice: HTTP/1.1, Networking Protocols, Caching, and Traffic Measurement which is probably the most practical description of the major web protocol I've ever seen.

Tuesday, February 16, 2010

Googling Harder

A while back I mentioned that Google thinks TCP ought to be more aggressive.

I must admit, this matches my own bias. I can barely count the number of applications I have watched wait for network I/O when there was plenty of CPU and idle bandwidth available. It's maddening. Sometimes it's slow start or another aspect of congestion control, sometimes it is outdated things like the nagle algorithm.

Well, Google is back at it with this slide set. (PDF)

They make the argument for increasing the initial cwnd. More provocatively, they argue that the Web has already done so in a defacto way by going to aggressive numbers of independent parallel HTTP connections (where you essentially get new cwnd credits just for opening a new TCP stream). Clever argument. Maybe you want to pace the data after 3 or 4 packets based on the RTT of the handshake - so you don't overrun any buffers un-necessarily.

Frankly, this kind of thing can be implemented on the server side without ever telling the peer. It would make some sense for Google to just do this for a few different values of cwnd on a tiny fraction of their traffic and see if the packet loss rates change and then publish that.

Monday, August 10, 2009

Making the switch

Well that's funny.

Just a few hours after my last post, which suggested that virtio based networking might be getting bested by the not-in-userspace v-bus, Michael Tsirkin posts an in-kernel backend to virtio. Which puts the two on more or less the same procedural footing.

Fire up the benchmarks?

(not) switching contexts.

a lot of bits have been spilled over virtual network performance in v-bus vs virtio-net/virtio-pci.. (aka alacrity vm vs traditional kvm/qemu).. this includes some pretty sensational(ist?) performance graphs: here.

There are lots of details (and details do matter) but the first-order issue can probably be summed up thusly, from Avi Kivity on lklm:

The current conjecture is that ioq outperforms virtio because the host side of ioq is implemented in the host kernel, while the host side of virtio is implemented in userspace.

Perhaps context switching isn't such a minor detail afterall.

Tuesday, July 21, 2009

Caution - (S)Low Bridge Ahead

This post will not be satisfying. Someone has posted some great datapoints about virtualized packet forwarding, which is great. But they don't make a lot of sense. Which is not great. Nor is it satisfying.

Oh well, I'm sure there will be a followup sometime in the future.

In this thread, Or Gerlitz posts a new networking type for qemu (and by extension) kvm which are of course popular linux host virtualization packages. The networking type is "raw" and the driver couldn't be more simple - a (v)lan interface on the host is opened with a AF_PACKET socket and all of the packets that appear there are shoved through to the guest interface, and vice versa.

This is a pretty direct way of doing things, but it has the unfortunate side effect that all of the guests and the host itself are aggregated onto one upstream switch port without any kind of bridge, switch, or router in between. This means that unless the upstream switch can do a u-turn when forwarding (and most of them will not), all of the guests and the host are isolated from each other. The normal way of doing things is to attach the guests and host together with a tun/tap socket and run a bridge on host. This bridge will do all the necessary forwarding so that everybody has full connectivity, and it lets you run iptables and ebtables on the host to boot.

That's all well and good, but the really interesting part was the motivation for running around tun/tap/bridge anyhow: the poster runs a test with short udp transmissions over gige.. running it between two real (non-vm) hosts he sees 450K packets per second. The post doesn't mention what hardware is involved, so we'll just take it as a black box baseline. Switching the sender to be a qemu guest with traditional tap/bridge networking it plummets to just 195K. The "raw" interface gets that back up to 240K - which is still a far cry from 450, eh?

Tap mode has 3 times the context switches than the raw version. I don't think I saw a number for the non-vm test. Other than that nothing, including the profiles, really jumps out.

The whole thread is worth reading - but the main data points are here and here

Monday, July 20, 2009

What is eating those Google SYN-ACKs?

In this post, I mentioned google was seeing huge packet loss on syn-acks from their servers. At the time it looked like 2%. That sounded nuts.

It still sounds nuts.

Someone else on the mailing list posted about that, and Jerry Chu of Google confirmed it:

Our overall pkt retransmission rate often goes over 1%. I was
wondering if SYN/SYN-ACK pkts are less likely to be dropped
by some routers due to their smaller size so we collected traces
and computed SYN-ACK retransmissions rate on some servers.
We confirmed it to be consistent with the overall pkt drop rate,
i.e., > 1% often.

You could imagine why the overall retransmission rate might be higher than the real drop rate due to jitter and various fast retransmit algorithms that might retransmit things that just hadn't been acknowledged quite yet. Even SYNs might be dropped at the host (instead of the network) due to queue overflows and such.. but we're talking about SYN-ACKs from busy servers towards what one would expect would be pretty idle google-searching clients. And these SYN-ACKs have giant timeouts (3 seconds - which is why Jerry was writing in the first place) so it certainly isn't a matter of over-aggressive retransmit. The only explanation seems to be packet loss. At greater than 1%

wow.

This probably has more to do with the global nature of google's audience than anything else. But still, TCP can really suck at loss rates that high. It must be very different than the desktop Internet I know (which is a fair-to-middlin cable service, not a fancy Fiber-To-The-Home setup which is becoming more common.)

I wonder exactly where those losses happen.

Tuesday, July 14, 2009

Google Thinks TCP should be more Aggressive by Default

Really interesting post from Jerry Chu of Google. He says Google has data which shows that we ought to lower the initial RTO, increase the initial CWND, drop the min RTO, and reduced the delayed ack time out in TCP.

Based on my own anecdotal data, I've done stuff like that in products I've worked on. Let's face it - 3 seconds is a freaking eternity. Processors, networks, and busses have all scaled but these constants remain the same. Jerry says Google has data that shows this is important. As the google data set is no doubt much more extensive than any I worked with, that's a really welcome post.

Probably the most important data point Jerry shares is that "up to a few percentage points" in his data set exhibit a SYN-ACK retransmission from the google servers. Wow. (at least) 1 in 50 syn-acks needs to be retransmitted? That's not my experience at all, and if true on google scale it is absolutely fascinating. Are they generally seeing 2% packet loss on google tx? There's no way that they are seeing that.. google would appear to suck! So what's going on... ? Why is syn-ack rexmitted more than anything else? (and I'm assuming they are indeed lost, because otherwise lowering the timeout wouldn't be the right remedy..)

Sunday, June 28, 2009

'Violation' is so prejorative

Ben Hutchings says:

>[...] we also have architectural issues in violating layered
> software design

Meanwhile, in the real world, we want to avoid copying data, so an skb doesn't belong to any specific protocol layer.

Thank You Ben!

Abstractions aren't inherently good. They are great if they help you build or maintain things that are otherwise too complex to understand or too tediuos to work on - but we have to vigilantly remember that losing those details also sometimes restricts the quality of what we can build too as somethings are just inherently complex.

Saturday, November 8, 2008

DNS Prefetching for Firefox

Recently I implemented patches to implement DNS prefetching in Firefox. I am primarily interested in their impact on Fennec (aka Mobile Firefox), but it looks like they will land first in Firefox 3.1 beta2. The, hopefully final, glitches are being shaken out of the patch now.

Google Chrome has a feature like this too.

DNS resolutions are always dominated by latency instead of bandwidth. Particularly on mobile networks the latencies are very high. That makes them perfect candidates for speculative pre-fetching. The advantage is in the latency improvement - instead of waiting for a hostname lookup when you click a link, do that lookup while you're reading the page the link is embedded in. Because the lookups are so small (generally one runt packet in and out) the cost of any wasted over optimistic lookups really doesn't impact the performance of browsing. Good payoff at low cost, the best of both worlds.

The basic benefit is simple: if you click on a link using a new hostname, you save a round trip time. On some networks this can be a substantial improvement (800ms or more) in responsiveness. Some describe this simply as "figuring out the IP address of every link before you click on it".

The Firefox implementation takes this approach one step further than just pre-resolving anchor href hostnames. It uses the prefetch logic on URLs that are being included in the current document. By this I mean that it uses the prefetch logic on things like images, css, and jscript that are being loaded right away, in addition to anchor links which might be clicked on at a slightly later time.

At first that seems non-sensical. How can you pre-fetch the DNS for something you are fetching right now? Where does the "pre" come in? The answer is not so much in the definition of "pre" as it is in the definition of "right now". Most HTTP User Agents, Firefox being no exception, limit their number of simultaneous connections and hosts. Typical pages embed quite a few objects and it is easy to run into these limits. When this happens the browser queues some of the requests. The Firefox pre-fetch DNS implementation allows those queued requests to overlap the high latency host resolution with whatever transfers might be going on without creating an excess level of parallelization.

While this is just a secondary benefit, it can be meaningful. For example, on the day I grabbed a snapshot of http://planet.mozilla.org/ it required 23 unique DNS resolutions in order to render the base page. Most of these were in img URLs. When loading the page with the prefetch patches, even with a cold cache, 16 of them were either fully completed when needed for the first connection or at least already in progress. The result, measured on an EDGE network, was a 4% overall improvement in page load time. Not bad for something that does not reduce bandwidth consumption in any way.

Configuration

Basically, it just works. You don't need to do anything. But there are a few configurables out there for both browsers and content providers.

First, as a browser you might want no part of this. Fair enough - its your browser. If you set the preference network.dns.disablePrefetch to true the prefetch code will never take effect, no matter what any other configuration is set to.

Furthermore, as a security measure, prefetching of embedded link hostnames is not done from documents loaded over https. If you want to allow it in that context too, just set the preference network.dns.disablePrefetchFromHTTPS to true.

Content providers have a couple neat tricks available too. These are meant to be compatible with Chrome.

For content to opt out of DNS pre-resolution it can be served with the x-dns-prefetch-control HTTP header set to off. The equivalent meta http-equiv element can be used instead of a response header too:
<meta http-equiv="x-dns-prefetch-control" content="off">

Setting content to on will reverse the effect. You can never turn pre-fetching on in a browser that has it disabled by preference, but you can undo the impact of a previous x-dns-prefetch control command. In this way, different content provider policies can apply to different portions of the document.

The last configuraton possibility allows the content provider to force the lookup of a particular hostname without providing an anchor using that name. This is done with the link tag:
<link rel="dns-prefetch" href="http://www.spreadfirefox.com/">

The href attribute can contain a full URL, or just a hostname. Hostname only attributes should preceed the hostname with two slashes:
<link rel="dns-prefetch" href="//www.spreadfirefox.com">

Content providers might use the link notation in a site-wide home page in order to preload hostnames that are widely used throughout the site but perhaps not on the home page.