
Tuesday, September 22, 2015

Brotli Content-Encoding for Firefox 44

The best way to make data appear to move faster over the Web is to move less of it, and lossless compression has always been a core tenet of good web design.

Sometimes that is done by applying gzip on top of text resources (HTML, JS, CSS), but other times it is accomplished via the compression inherent in the file formats of media elements. Modern sites apply gzip to all of their text as a best practice.

Time marches on, and it turns out we can often do a better job than the venerable gzip. Until recently, new formats struggled with matching the decoding rates of gzip, but lately a new contender named brotli has shown impressive results. It has been able to improve on gzip anywhere from 20% to 40% in terms of compression ratios while keeping up on the decoding rate. Have a look at the author's recent comparative results.

The deployed WOFF2 font file format already uses brotli internally.

If all goes well in testing, Firefox 44 (ETA January 2016) will negotiate brotli as a content-encoding for https resources. The negotiation will be done in the usual way via the Accept-Encoding request header and the token "br". Servers that wish to encode a response with brotli can do so by adding "br" to the Content-Encoding response header. Firefox won't decode brotli outside of https - so make sure to use the HTTP content negotiation framework instead of doing user agent sniffing.

[edit note - around Oct 6 2015 the token was changed to br from brotli. The token brotli was only ever deployed on nightly builds of Firefox 44.]
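For illustration, here is a minimal sketch of the server side of that negotiation, assuming the Python "brotli" bindings; the Accept-Encoding parsing is simplified and ignores q-values:

```python
# A minimal sketch, assuming the Python "brotli" bindings (pip install brotli).
# A real server should also honor q-values in Accept-Encoding.
import brotli

def encode_response(request_headers, body):
    accepted = request_headers.get("Accept-Encoding", "")
    tokens = [t.split(";")[0].strip() for t in accepted.split(",")]
    if "br" in tokens:
        # Client advertised brotli; compress and label the response.
        return brotli.compress(body), {"Content-Encoding": "br",
                                       "Vary": "Accept-Encoding"}
    return body, {"Vary": "Accept-Encoding"}
```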

We expect Chrome will deploy something compatible in the near future.

The brotli format is defined by this document working its way through the IETF process. We will work with the authors to make sure the IANA registry for content codings is updated to reference it.

You can get tools to create brotli-compressed content here, and there is a Windows executable I can't vouch for linked here.

Monday, January 14, 2013

On Induced Latency and Loss

The estimable Mark Allman has a new paper out in the latest ACM CCR dealing with data collection around buffering in the Internet. If you respect a good characterization paper as much as I do - you should read it and a couple related mailing list threads here and here. The paper is only 7 pages; go read it -- this blog post will wait for you.

The paper's title, Comments on Bufferbloat, in my mind undersells the contributions of the paper. Colloquially I consider bufferbloat to refer specifically to deeply filled queues at or below the IP layer that create latency problems for real-time applications using different transport layer flows. (e.g. FTP/Torrent/WebBrowser creating problems for perhaps unrelated VOIP). But the paper provides valuable data on real world levels of induced latency where it has implications for TCP Initial Window (IW) sending sizes and the web-browser centric concern of managing connection parallelism in a world with diverse buffer sizes and IWs. That's why I'm writing about it here.

Servers that induce large latency and loss through parallelized sending are right now a bit of a corner-case problem in the web space, but one that seems to be growing. My Chrome counterpart Will Chan talks a bit about it too over here.

For what it's worth, I'm still looking for some common amounts of network buffering to use as configurations in simulations on the downstream path. The famous Netalyzer paper has some numbers on upstream. Let me know if you can help.

The traffic Mark looked at was not all web traffic (undoubtedly lots of P2P), but I think there are some interesting takeaways for the web space. These are my interpretations - don't let me put words in the author's mouth; the paper is linked above:

  • There is commonly some induced delay due to buffering. The data in the paper shows residential RTTs of 78ms typically bloated to ~120ms. That's not going to mess with real time too badly, but it's an interesting data point.
  • The data set shows bufferbloat existence at the far end of the distribution: at the 99th percentile of round trips on residential connections you see over 900ms of induced latency.
  • The data set shows that bufferbloat is not a constant condition - the amount of induced latency is commonly changing (for better and for worse) and most of the time doesn't fluctuate into dangerous ranges.
  • There are real questions of whether other data sets would show the same thing, and even if they did it isn't clear to me what the acceptable frequency of these blips would be to a realtime app like VOIP or RTC-Web.
  • IW 10 seems reasonably safe from the data in Mark's paper, but his characterizations don't necessarily match what we see in the web space. In particular, our flows are less often under 3 packets in size than the flows in the paper's data (which is not all web traffic). There are clearly deployed corner cases on the web where servers send way too much data (typically through sharding) and induce train wrecks on the web transfer. How do we deal with that gracefully in a browser without sharding or IW=10 for hosts that use them in a more coordinated way? That's a big question for me right now.

[ Comments on this are best done at https://plus.google.com/100166083286297802191/posts/6XB59oaQzDL]

Wednesday, April 25, 2012

Making Firefox Search Snappier

The Firefox 15 development window just opened and I checked into inbound a cool feature that had been sitting in my queue for a little while. Let's see if it sticks!

The feature basically lets non-network code hint to the networking layer that it will probably send an HTTP transaction to a particular site soon, but isn't ready to do so yet. The network code can take the hint and begin to set up a TCP and (if necessary) SSL handshake right away, because these are high-latency operations.

The primary initial user of this is the search box in Firefox. When you focus on that box we will probably make a connection to the search provider right away. While that happens you type your search terms - when your search is ready (or partly ready if you are using search suggestions) it can be submitted to the search service without any waiting for network setup.

This can be a significant win. The average Internet round trip time is about 100ms (with a lot of variation). It takes 1 RTT to set up TCP and (likely) 2 more for SSL. 300ms is a notable pause, but overlapped in the background it essentially becomes free, resulting in snappier searches that still use a secured transport. If you're using a SPDY-enabled search provider such as Google or Twitter, this is done over SPDY, so the one TCP session now established will be able to carry all of your search results - no more setup overhead to worry about with additional parallel connections.

The other user of this feature that got checked in as part of this merge is actually internal to the networking code, just before the cache I/O is done. The amount of time it takes to check the disk cache is extremely variable - as far as I can tell it is generally pretty fast, but the tail can be really awful depending on hardware, other system activity, OS, etc. So we overlap the network handshakes with the activity that figures out the values of the If-Modified-Since (and similar) request headers.

There is an IDL for providing the hint (nsISpeculativeConnect) - so if you can think of other areas ripe for this kind of optimization let's get to it!
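If you want to picture the mechanics, here is a rough sketch of the same idea outside of Gecko - nothing Firefox-specific, and the host name is made up. The handshake starts on a background thread when the hint arrives, and the transaction joins it later:

```python
import socket
import ssl
import threading

class SpeculativeConnection:
    """Pre-warm TCP + TLS so the handshakes overlap the user's typing."""
    def __init__(self, host, port=443):
        self.host, self.port = host, port
        self.sock = None
        self._thread = threading.Thread(target=self._connect, daemon=True)
        self._thread.start()       # hint received: start handshaking now

    def _connect(self):
        ctx = ssl.create_default_context()
        raw = socket.create_connection((self.host, self.port))
        self.sock = ctx.wrap_socket(raw, server_hostname=self.host)

    def take(self):
        self._thread.join()         # usually already finished by now
        return self.sock

# On search-box focus (hypothetical host):
conn = SpeculativeConnection("search.example.com")
# ... the user types the query; when it is ready the socket is hot:
sock = conn.take()
```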

[The best place for comments is probably here: https://plus.google.com/100166083286297802191/posts ]


Saturday, January 7, 2012

Using SPDY for more responsive interfaces

RST_STREAM turns out to be a feature of SPDY that I under-appreciated for a long time. The concept is simple enough - either end of the connection can cancel an individual stream without impacting the other streams that are multiplexed on the same connection.

This fills a conceptual gap left by HTTP/1.x: in HTTP, when you want to cancel a transaction, about all you can do is close the connection.

Consider the case of quickly clicking through a webmail mailbox - just scanning the contents and rapidly clicking 'next'. Typically the pages will be only partly loaded before you move on to the next one. Assuming you have used up all your parallelism in HTTP/1, the new click will either have to wait for the old transactions to complete (wasting time and bandwidth) or cancel the old ones by closing those connections and then open fresh connections for the new requests. New connections add 2 or 3 round trip times to reopen the SSL connection (you are reading email over SSL, right?) before they can be used to send the new requests. Either way - that is not a good experience.

An interactive map application has similar attributes - as you scan along the map, zooming in and out, lots of tiles are loaded and are often irrelevant before they are drawn. I'm sure you can think of other scenarios that have cancellations.

SPDY solves this simply - with its inherently much greater parallelism, the new requests can be sent immediately while cancel notifications go out for the old ones. That saves bandwidth and gets the new requests going as fast as possible without interfering with either the established connection or any other transactions in progress.
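For the curious, a cancel is tiny on the wire. This is a sketch of a SPDY v3 RST_STREAM control frame as I read the draft - type 3, an 8-byte payload of stream ID plus status code, with status 5 meaning CANCEL:

```python
import struct

def rst_stream(stream_id, status=5):       # status 5 = CANCEL
    head = struct.pack(">HHI",
                       0x8003,             # control bit + version 3
                       0x0003,             # frame type RST_STREAM
                       8)                  # flags (0) + 24-bit length
    body = struct.pack(">II", stream_id & 0x7FFFFFFF, status)
    return head + body                     # 16 bytes cancels the stream
```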

A page load time metric won't really show this to you but the increased responsiveness is very obvious when working with these kinds of use cases - especially under high latency conditions that make connection establishment slower.


Thursday, December 8, 2011

SPDY, Bufferbloat, HTTP, and Real-Time Networking

Long router queue sizes on the web continue to be a hot networking topic - Jim Gettys has a long interview in ACM Queue. Large unmanaged queues destroy low latency applications - just ask Randell Jessup.

A paper like this does a good job of showing just how bad the situation can be - experimentally driving router buffering delay from 10ms to ~1000ms on many common broadband cable and DSL modems. I wish the paper had been able to show me the range and frequency of that queue delay under normal conditions.

I'm concerned that decreasing the router buffer size, thereby increasing the drop rate, is detrimental to the current HTTP/1.x web. A classic HTTP/1.x flow is pretty short - giving it a signal to back off doesn't save you much because it has already sent most of what it needs to send. Unless you drop almost all of that flow from your buffers you haven't achieved much. Further, a loss event has a high chance of damaging the flow more seriously than you intended - dropping a SYN or the last packet of the data train means a retransmission governed by very slow retry timers, and short flows are composed of high percentages of exactly those kinds of packets. Non-drop-based congestion notification like ConEx/ECN does less damage but is ineffective for the same reason: the short flow is more or less complete by the time the notification arrives, so it cannot adapt its sending rate.

The problem is all of those other parallel HTTP sessions going on that didn't get the message. It's the aggregate that causes the buffer buildup. Many sites commonly use 60-90 separate uncoordinated TCP flows just to load one page.

Making web transport more adaptable on this front is a big goal of my SPDY work. When SPDY consolidates resources onto the same TCP flow it means the remaining, larger flows will be much more TCP-friendly. Loss indicators will have a fighting chance of hitting a flow that can still back off, and we won't have windows growing independently of each other. (Do you like the sound of IW=10 times 90? That's what 90 uncorrelated flows mean. IW=10 on a small number of flows, on the other hand, is excellent.) That ought to keep router queue sizes down and give things like RTCWeb a fighting chance.
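The back-of-envelope math behind that parenthetical, assuming ~1460-byte segments:

```python
# What an uncoordinated page load can inject into the bottleneck
# queue before any ACK feedback arrives.
flows, iw, mss = 90, 10, 1460
print(flows * iw * mss)   # 1314000 bytes, i.e. ~1.3MB in one burst
```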

It also opens up the possibility of the browser identifying queue growth through delay-based analysis and possibly helping the situation by managing, inside the browser, our bulk TCP download rate (and definitely the upload rate) by munging the rwin or something like that. If it goes right it really shouldn't hurt throughput while giving better latency to other applications. It's all very pie in the sky and down the road, but it's kind of hard to imagine in the current HTTP/1.x world.

Friday, September 23, 2011

SPDY: What I Like About You.

I've been working on implementing SPDY as an experiment in Firefox lately. We'll have to see how it plays out, but so far I really like it.

Development and benchmarking are still works in progress, though interop seems to be complete. There are several significant to-do items left that have the potential to improve things even further. The couple of anecdotal benchmarks I have collected are broadly similar to the page-load-time-based reports Google has shared at the IETF and Velocity conferences over the last few months.

tl;dr: Faster is all well and good (and I mean that!) but I'm going to make a long argument that SPDY is good for the Internet beyond faster page load times. Compared to HTTP, it is more scalable, plays nicer with other Internet traffic, and brings web security forward.

SPDY: What I Like About You

#1: Infinite Parallelism with Shared Congestion Control.

You probably know that SPDY allows multiplexing of multiple HTTP resources inside one TCP stream. Unlike the related HTTP mechanism of pipelining concurrent requests on one TCP stream, SPDY resources can be returned in any order and even mixed together in small chunks, so head-of-line blocking is never an issue and you never need more than one connection to each real server. This is great for high latency environments because a resource never needs to be queued on either the client or the server for any reason other than network congestion limits.

Normal HTTP achieves transaction parallelism through parallel TCP connections. Browsers limit you to 6 parallel connections per host. Servers achieve greater parallelism by sharding their resources across a variety of host names. Often these host names are aliases for the same host, implemented explicitly to bypass the 6 connection limitation. For example, lh3.googleusercontent.com and lh4.googleusercontent.com are actually DNS CNAMEs for the same server. It is not uncommon to see performance oriented sites, like the Google properties, shard things over as many as 6 host names in order to allow 36 parallel HTTP sessions.

Parallelism is a must-have for performance. I'm looking at a trace right now that uses the aforementioned 36 parallel HTTP sessions and its page load completes in 16.5 seconds. If I restrict it to just 1 connection per host (i.e. 6 overall), the same page takes 27.7 seconds to load. If I restrict that even further to just 1 connection in total, it takes a mind-numbing 94 seconds to load. And this is on 40ms RTT broadband - high latency environments such as mobile would suffer much, much worse! Keep this in mind when I start saying bad things about parallel connections below; they really do great things, and the web we have with them enables much more impressive applications than a single-connection HTTP web ever could.

Of course using multiple parallel HTTP connections is not perfect - if they were perfect we wouldn't try to limit them to 6 at a time. There are two main problems. The first is that each connection requires a TCP handshake which incurs an extra RTT (or maybe 3 if you are using SSL) before the connection can be used. The TCP handshake is also relatively computationally hard compared to moving data (servers easily move millions of packets per second, while connection termination is generally measured in the tens of thousands per second), and the SSL handshake is harder still. Reducing the number of connections reduces this burden. But in all honesty this is becoming less of a problem over time - the cost of maintaining persistent connections is going down (which amortizes the handshake cost) and servers are getting pretty good at executing the handshakes (both SSL and vanilla), sometimes by employing multi-tiered architectures for busy deployments.

The architectural problem lies in HTTP's interaction with TCP congestion control. HTTP flows are generally pretty short (a few packets per transaction), tend to stop and start a lot, and more or less play poorly with the congestion control model. The model works really well for long flows like an FTP download - that TCP stream will automatically adapt to the available bandwidth of the network and transfer at a fairly steady rate for its duration after a little bit of acclimation time. HTTP flows are generally too short to ever acclimate properly.

A SPDY flow, being the aggregation of all the parallel HTTP connections, looks to be a lot longer, busier, and more consistent than any of the individual parallel HTTP flows would be. Simply put - that makes it work better because all of that TCP congestion logic is applied to one flow instead of being repeated independently across all the parallel HTTP mini flows.

Less simply, when an idle HTTP session begins to send a response it has to guess at how much data should be put onto the wire. It does this without awareness of all the other flows. Let's say it guesses "4 packets", and there are no other active flows. In this case 4 packets is way too few, the network is underutilized, and the page loads poorly. But what if 35 other flows are activated at the same time? That means 140 packets get injected into the network at once, which is way too many. Under that scenario one of two things happens - both of them bad:
  1. Packet Loss. TCP reacts poorly to packet loss, especially on short flows. While 140 packets in aggregate is a pretty decent flow, remember that the total transmission is made up of 35 different congestion control blocks - each one covering a flow of only 4 packets. A loss is devastating to performance because most TCP recovery strategies don't work well in that environment.
  2. Over Buffering. This is what Jim Gettys calls bufferbloat. The giant fast-moving 140 packet burst arrives at your cable modem head-end where the bandwidth is stepped down, and most of those packets sit in a long buffer waiting for their turn on your LAN. That works OK - certainly better than packet loss recovery does in practice - but the deep queue creates a giant problem for any interactive traffic sharing that link. Packets for those other applications (such as VOIP, gaming, video chat, etc.) now have to sit in this long queue, resulting in interactive lag. Lag sucks for the Internet. The HTTP streams themselves also become non-responsive to cancel events, because the only way to clear those queues is to wait them out - so clicking on a link to a new page is significantly delayed while the old page that you have already abandoned continues to consume your bandwidth.
This describes a real dilemma - if you guess more aggressive send windows then you will have a better chance of filling the network but you will also have a better chance of packet loss or over buffering. If you guess more conservative windows then loss and buffering happens less often but nothing ever runs very quickly. In the face of all those flows with independent congestion control blocks, there just isn't enough information available. (This of course is related to the famous Initial Window 10 proposal, which I support, but that's another non SPDY story.)

I'm sure you can see where this is going now. SPDY's parallelism, by virtue of being on a single TCP stream, leverages one busy shared congestion control block instead of dealing with 36 independent tiny ones. Because the stream is much busier it rarely has to guess at how much to send (you only need to guess when you're idle, SPDY is more likely to be getting active feedback), if it should drop a packet it reacts to that loss much better via the various fast recovery mechanisms of TCP, and when it is competing for bandwidth at a choke point it is much more responsive to the signals of other streams - reducing the over buffering problem.

It is for these reasons that SPDY is really exciting to me. Parallel connections work great - they work so great that it is hard for SPDY to significantly improve on the page load time of a highly sharded site unless there is a very high amount of latency present. But the structural advantages of SPDY enable important efforts like RTCWeb, as well as providing better network utilization and helping servers scale when compared to HTTP. Even if page load times only stay at par, those other good-for-the-Internet attributes make it worth deploying.


#2: SPDY is over SSL every time.

I greatly lament that I am late to the school of SSL-all-the-time. I spent many years trying to eke out the greatest number of server responses per watt possible. I looked at SSL and saw impediments. That stayed with me.

I was right about the impediments, and I've learned a lot about dealing with them, but what I didn't get was that it is simply worth the cost. As we have all seen lately, SSL isn't perfect - but having a layer of protection against an entire class of eavesdropping attacks is a property that should be able to be relied upon in every protocol as generic as HTTP. HTTP does not provide that guarantee - but SPDY does.

huzzah.

As an incentive to make the transition to SSL all the time, this makes it worth deploying by itself.

#3: Header compression.

SPDY compresses all the HTTP-equivalent headers using a specialized dictionary and a compression context that is reserved only for the headers so it does not get diluted with non-header references. This specialized scheme performs very well.

At first I really didn't think that would matter very much - but it is really a significant savings. HTTP's statelessness had its upsides, but the resulting on the wire redundancy was really over the top.

As a sample, I am looking right now at a trace of 1900 resources (that's about 40 pages). 760KB of total downstream plain-text header bytes were received as 88KB of compressed bytes, and upstream 949KB of plain-text headers were compressed to just 65KB. I'll take 1.56MB (91%) in total overhead savings! I even have a todo item that will make this slightly better.
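You can reproduce the flavor of this with stock zlib - one long-lived compression context per connection, seeded with a preset dictionary. The dictionary below is an abbreviated stand-in; SPDY specifies the real one:

```python
import zlib

# Abbreviated stand-in; the SPDY draft defines the actual dictionary.
HEADER_DICT = b"GETPOSThostuser-agentacceptaccept-encodingcookiecontent-type"

comp = zlib.compressobj(zdict=HEADER_DICT)

def compress_headers(headers):
    # One shared context per connection: header names and values
    # repeated across requests compress down to almost nothing.
    raw = "".join(f"{k}: {v}\r\n" for k, v in headers.items()).encode()
    return comp.compress(raw) + comp.flush(zlib.Z_SYNC_FLUSH)
```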

Tuesday, February 15, 2011

Note to Self: The Web is Slow

I've made a living dealing with fast networks and servers that run at really impressive transaction rates using all manner of nifty interconnects and parallelism. Sometimes I forget that the day to day web isn't all that fast in comparison.

My local copy of Firefox is annotated to dump a bunch of network stats when shutting down. One of them is a CDF of HTTP handshake times. This is from my desktop, which is connected to a premium cable broadband consumer Internet service. It's not as awesome as FIOS, but it's still at the top portion of what a home consumer will have in the US, which in turn has certain geographic advantages when connecting to many hosting companies. It's fair to say my performance is going to be at least a bit better than the average Internet user's. And it is still slow. (We of course need to work to be able to better characterize what the real spectrum of experience is.)

This isn't scientific. It is where I happen to browse, and it's just one datapoint, although I can tell you my gut says it is pretty typical output - gathered over 15,000 connections.


Only about half of my handshakes are where I want them to be: < 100ms. Most of the rest fall in the next 300ms. To be fair there is a little skew in here because the code doesn't separate https from http, and SSL has an extra RTT in there. But SSL is a small fraction of the overall sample.

And this is the desktop. Think mobile and wireless.

Latency matters.

Monday, February 14, 2011

The Apex of Pipelines

Every once in a while I'm still surprised at the potential upside of pipelines.

I stumbled across a great example recently: Women In Technology International. That home page is set up in a pretty typical newsletter format. It has 159 resources, 145 of which are images, along with about a half dozen pieces of JS and CSS. Most of the images are small, with over 2/3 of them loading in less than 20ms of transfer time (time to first byte removed).

What is striking about this page is how large an advantage pipelining can give even on a well connected broadband desktop with a 100ms RTT to the WITI hosting facility. The average latency to receive the first byte of a resource dropped from 1697ms to 626ms, and the average elapsed time per transaction overall dropped from 1719ms to 652ms. Aggregate that over 159 different resources and you have some serious gains!

But why stop there? The pipeline sweet spot is in high latency situations such as mobile or transcontinental data transfer. This is what happens when we add 200ms of latency to the connection:

That's right - 3300ms of improvement on each transaction! That seems absurdly good given we only added 200ms of latency, but what you're seeing is the aggregate queueing effect - Firefox wants 150 resources more or less simultaneously and can only parallelize them across 6 connections. If you are 25 positions deep in that queue you will have to wait at least 7500ms just for the back and forth of each transaction in front of you to complete. Obviously not everyone is queued that deeply, so the average effect is somewhat less, but it is still overwhelming.
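A toy model of that queueing arithmetic, under the assumptions above (150 queued resources, 6 connections, a uniform ~300ms per round trip):

```python
resources, conns, rtt_ms = 150, 6, 300
queue_ahead = resources / conns        # ~25 transactions ahead of you
print(queue_ahead * rtt_ms)            # 7500ms of round trips to drain them
print(queue_ahead / 2 * rtt_ms)        # ~3750ms average wait across the queue
```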

Tuesday, January 11, 2011

Firefox Idle Connection Selection via CWND

When choosing between more than 1 idle persistent connection FF <= 4.0 goes with a FIFO approach. I was thinking about ways to tune this.

I was going to experiment with making that a LIFO. A LIFO, after all, should have better cache behavior and would allow the pool to shrink to a more natural size as the older members remain untouched and time out. The FIFO will basically keep the size pinned at its maximum while it cycles through all the connections, which wastes system RAM with both the client and server maintaining extra idle TCP sessions. It is also the worst possible processor cache behavior. The possible argument in favor of FIFO is that connections are expensive to open and we've already opened these, so maybe we want to keep the pool pinned at its maximum size just in case we need it again - it isn't obvious what to do or whether it matters much.

Thinking a little further, I realized that the major differentiator between these sockets is not a timestamp at all - it is the CWND associated with them on the server. Many web servers at least partially ignore the TCP suggestion to return to slow start after an RTO of idle activity, so it is reasonable to assume that some of the connections have larger CWNDs than others, as long as we aren't in an environment where the previous transfers have actually been bottlenecked by bandwidth - and on the web that almost never happens; RTT and document size bottleneck most transfers. Even if available bandwidth is the real bottleneck, that just moots our metric; it doesn't provide any information that steers us the wrong way.

When choosing which connection to use we want to choose the one that has the largest CWND on the server. Unfortunately we cannot see directly into the congestion information on the server, but assuming that previous transfers have been bottlenecked by RTT (approximately a constant for all connections to the same server) and transfer size then we can just use the one with the history of moving the largest amount of data in one burst as a proxy for it because that burst is what opens the congestion window.

I say one burst instead of one document because we want to include pipelines of multiple HTTP transactions as long as there wasn't an RTT between them. This is another reason to want pipelines - they will more effectively open up the CWND for us. We also want to use the maximum burst seen on the connection, not necessarily the last burst - the servers we are interested in don't shrink the CWND just because it isn't all being used for each burst.

The implementation is easy - the idle connection list is changed from being sorted as a FIFO to being sorted according to this "maxburst" metric. The code and bugzilla are here.
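In toy form the data structure change looks something like this (all names illustrative):

```python
import heapq
import itertools

_counter = itertools.count()   # tie-breaker so the heap never compares conns

class IdlePool:
    """Idle persistent connections, ordered by observed max burst."""
    def __init__(self):
        self._heap = []

    def put(self, conn, max_burst):
        # Negate: the largest burst (best CWND proxy) pops first.
        heapq.heappush(self._heap, (-max_burst, next(_counter), conn))

    def take_best(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```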

Using an experiment designed to show the best case, the results are better than I expected for such a minor tweak. This was my process:
  • construct a base page mixed with several small and several large images plus a link to a 25KB image. There are 6 objects on the base page.
  • load the base page - FF4 will use six parallel connections to do it
  • click on the link to the 25KB image - this will use an idle persistent connection. Measure the performance of this load.
There is 250ms of RTT and several megabits of bandwidth between the client and server in my test.

As expected, vanilla FF 4.0 (beta 9) loads the target image on an idle persistent connection that was used to transfer one of the smallest images on the base page. It is FIFO after all, and the small image load was going to finish first and be put into the persistent connection pool first.



My patched browser, using the sort by CWND algorithm, loads the target image on the idle persistent connection that was used to transfer the largest image (2MB+) from the base page.


Their history is the only difference between the two connections - at the time of requesting the 25KB image they are both connected and idle. There is a 250ms RTT between my host and the server.
 

The difference is huge. The default algorithm takes 3 round trips to transfer the document as it works its way through growing the congestion window. That adds up to 793ms. The sort-by-cwnd algorithm inherits a congestion window large enough for the task at hand and moves it all in one stream - just 363ms.

This is a nice tweak, but it has its limitations - by definition you cannot meaningfully multiply the gain across a large number of transactions. If you have a large number of transactions then you probably are using all your idle connections in a burst and there is no point in discriminating between them if you are just going to use them all.

I would argue that if you have that much pressure on the connection pool then you are probably queueing requests and should be using pipelining. If you don't have that much pressure, then this probably helps you.

Wednesday, December 1, 2010

Performance of Pipelining in HTTP Firefox

This post provides some performance measurements of my HTTP pipeline patches for Firefox.

The key benefits of pipelining are reduced transaction latency and potentially the use of fewer connections, so those are generally the metrics I will focus on here. For each of my 5 test cases we will look at the following statistics:
  • The percentage of requests that are delayed in the request queue inside Firefox waiting for a connection to be available.
  • The average amount of queue latency for each transaction. This is measured as the time the first byte of the request is given to the kernel minus the time at which the request was presented to Necko/Gecko. It is generally very low, but greater than 0 even if the transaction can be placed directly on a pipeline, because it takes a moment to construct the request and perhaps schedule the socket thread if other things are on the CPU. But a high value is an opportunity lost - that is time the request could be in flight and the server could be processing it. It includes the time needed to create a new connection when one is required - the three-way TCP handshake but not a DNS lookup (which is cached in the test).
  • The type of connection used for each transaction (new connection, an idle persistent connection, or a pipelined connection)
  • The average amount of transaction latency for each transaction. This is measured as the time the first byte of the response is received by Firefox minus the time at which the request was presented to Necko/Gecko. It is possible for the average improvement to be greater than 1 RTT because of cumulative queueing delays in the non-pipelined case.
  • The cumulative fraction of transactions completed before 3 different elapsed times in order to show improved execution time for the test case. The benchmark times are sized appropriately for each test case.

There are 4 data points for each criterion in each test case. Because pipelining is aimed at environments with significant latency, and my broadband test connectivity has below-average latency compared to much of the world and every mobile environment, the first two data points have 200ms of latency added through a traffic shaper. They compare pipelining on vs pipelining off. The other two data points measure the same things without the induced latency.

All tests are run with both a disk and memory cache enabled but empty at the beginning of the run. In order to measure the effectiveness of the pipelining, each of these sites has been put in the "green - pipelining ok" state which is normally auto discovered.

Facebook


The first test is on Facebook. It starts by logging in and then selecting a particular user profile and navigating from that page via lists of friends, occasionally pulling up individual profiles, generating lists of recent updates, and pressing the More link on busy Facebook walls. There are approximately 1400 HTTP transactions in each test run.

The first thing to consider is the percentage of requests that are delayed (i.e. queued) within Firefox. I think queuing is particularly bad because if the request hasn't been passed to the network there is no way for any advances in server technology to ever operate on it. For instance, servers are prevented from returning responses out of order, but nothing would prevent them from processing requests out of order to overlap latencies in DB queries and disk I/O - if the requests were not queued on the browser side.

Percent of Requests Queued
                     With Pipeline   Without Pipeline
  Moderate Latency   0               79.1
  Low Latency        0               78.6

That is a stark contrast. It is possible by the way to see a request being queued with pipelining enabled - a default configuration limit of 32 governs the maximum depth of the pipeline and not all request types are pipeline eligible.

Average Queue Latency (ms)
                     With Pipeline   Without Pipeline
  Moderate Latency   29              1630
  Low Latency        6               285

You might expect to see 0ms for the pipelining case, as we just illustrated above that no requests were delayed. But the queue latency covers the time from request submission to the time the first byte of the request is put on the wire, so it includes any connection setup time when establishing a new connection. That is the primary source of the latency seen here for the pipeline-enabled case.

That begs the question: when pipelining is enabled, how many of the requests are pipelined?
Connection Type Pct
                 Moderate Latency            Low Latency
                 w/Pipeline  wo/Pipeline     w/Pipeline  wo/Pipeline
  New            2           4               2           3
  Reused Idle    13          96              13          97
  Pipeline       85          0               84          0

We see here a moderate reduction in the number of connections used when pipelining, but most of the effect is a transfer from idle persistent connections over to pipelines. While the percentage of new connections as a portion of the overall request stream has gone down just a tick with pipelining, the impact on the actual number of raw connections is significant - going from roughly 60 without pipelining to 30 with it enabled. That 50% reduction in the number of connections created provides a very busy site like Facebook a significant scalability boost.

The final criteria deal with transaction latency.

Average Transaction Latency (ms)
                     With Pipeline   Without Pipeline
  Moderate Latency   702             1906
  Low Latency        341             346

Yowza! Now there is a result. Under conditions with moderate latency the average transaction waits 1200ms less from the time the request is submitted to Necko to the time the first byte of the response header is received. The net effect is so much more than the ~250ms RTT because of aggregated queueing delays - without pipelining you are placed in a deep queue which has to be totally cleared, with a 1 RTT overhead on each transaction, before yours is executed. The impact under low latency conditions is probably close to being noise.

Pct of Responses Rcvd in < Xms
                                 x=1500   x=1200   x=900
  Moderate Latency w/Pipeline    97       91       75
  Moderate Latency wo/Pipeline   45       38       32

Pct of Responses Rcvd in < Xms
                            x=1000   x=700   x=400
  Low Latency w/Pipeline    99       92      61
  Low Latency wo/Pipeline   99       91      61

Facebook is a big success - probably the biggest success of any of the tests. 200+ms latency situations have performance significantly increased, and low latency scenarios perform similarly while using slightly fewer TCP connections.

Amazon.com


The Amazon.com test walks through a basic window shopping experience at Amazon.com. The home page is loaded, the kindle link is clicked, a few more categories are clicked and the lists of products are generally browsed and sorted by "hot and new" and other similar things. This boils down to about 800 HTTP transactions.

Percent of Requests Queued
                     With Pipeline   Without Pipeline
  Moderate Latency   0               54.4
  Low Latency        0               39.8

Right away you can see that Amazon queues fewer requests than Facebook, so the potential improvement is smaller.

Average Queue Latency (ms)
                     With Pipeline   Without Pipeline
  Moderate Latency   116             791
  Low Latency        12              136

Connection Type Pct
                 Moderate Latency            Low Latency
                 w/Pipeline  wo/Pipeline     w/Pipeline  wo/Pipeline
  New            12          13              6           13
  Reused Idle    20          96              28          87
  Pipeline       68          0               66          0

The first thing to note is that less pipelining is going on than with Facebook, so again there is less potential for improvement. How the pages are constructed has a lot to do with this (perhaps fewer images, etc.). But almost as interesting is the fact that the number of TCP connections (i.e. new connections) is halved in the low latency case. If the page can be transferred in the same amount of time using fewer connections, that is still a win for the web overall.

Average Transaction Latency (ms)
                     With Pipeline   Without Pipeline
  Moderate Latency   635             1083
  Low Latency        266             204

An interesting result - 400ms off the average transaction in the ~250ms RTT environment, but a notable loss in the low latency scenario. All of the numbers here are averages across two test runs, but inspecting the Amazon test case in particular on some other ad-hoc runs showed quite a bit of variability. My suspicion is that server load occasionally results in a single resource taking a long time to return. I have seen this disable pipelining for HTML pages, but leave it enabled for images, in the past.

Pct of Responses Rcvd in < Xms
                                 x=1200   x=900   x=600
  Moderate Latency w/Pipeline    91       79      62
  Moderate Latency wo/Pipeline   77       69      61

Pct of Responses Rcvd in < Xms
                            x=1000   x=700   x=400
  Low Latency w/Pipeline    93       90      84
  Low Latency wo/Pipeline   99       94      85

Flickr


The flickr test is probably the simplest of the cases. It simply loads several galleries based on set names and tags. There are roughly 350 HTTP transactions in the test. Under normal conditions Flickr has a high variability in server response time.

Percent of Requests Queued
                     With Pipeline   Without Pipeline
  Moderate Latency   0               57
  Low Latency        0               57

Average Queue Latency (ms)
                     With Pipeline   Without Pipeline
  Moderate Latency   43              814
  Low Latency        7               211

Connection Type Pct
                 Moderate Latency            Low Latency
                 w/Pipeline  wo/Pipeline     w/Pipeline  wo/Pipeline
  New            10          27              12          27
  Reused Idle    19          73              19          73
  Pipeline       71          0               69          0

As with the other tests, more than half of the new connections have been replaced when pipelining is enabled.

Average Transaction Latency (ms)
                     With Pipeline   Without Pipeline
  Moderate Latency   859             1091
  Low Latency        291             366

This result is more modest, but still positive, when compared to our other tests.

Pct of Responses Rcvd in < Xms
                                 x=2000   x=1500   x=1000
  Moderate Latency w/Pipeline    95       87       67
  Moderate Latency wo/Pipeline   88       74       60

Pct of Responses Rcvd in < Xms
                            x=1000   x=700   x=400
  Low Latency w/Pipeline    99       97      72
  Low Latency wo/Pipeline   98       93      71

www.AsiaNewsPhoto.com


The test site is a photojournalism clearinghouse located overseas, so the broadband low latency case has a starting RTT of closer to 100ms, while the moderate delay case adds 200ms to that. This is the smallest test case - just 175 transactions in each run.

Percent of Requests Queued
                     With Pipeline   Without Pipeline
  Moderate Latency   0               44
  Low Latency        0               38

Average Queue Latency (ms)
                     With Pipeline   Without Pipeline
  Moderate Latency   21              726
  Low Latency        31              400

I am not yet certain how to explain the very modest rise in queue time for the pipeline case when the added 200ms delay is removed. It must involve an aberrant TCP connection, as that is really the only component of queue time when the requests themselves are not delayed due to connection limits.

Connection Type Pct
                 Moderate Latency            Low Latency
                 w/Pipeline  wo/Pipeline     w/Pipeline  wo/Pipeline
  New            16          7               11          8
  Reused Idle    33          93              33          92
  Pipeline       51          0               56          0

This is the first time we actually see the new connection numbers moving in the wrong direction. In this case I believe the type-scheduling restrictions placed on the connection manager are generating new connections that may have been unnecessary in the non-pipelining scenario. I'm curious whether the effect would fade in a test with more transactions.

Average Transaction Latency (ms)
                     With Pipeline   Without Pipeline
  Moderate Latency   1310            1248
  Low Latency        752             689

This seems to track the changes in connection types, and maybe the test bears further examination to see if an adjustment can be made. The scheduling algorithm seems to be getting in its own way and has made performance just a tick worse than before, though not by very much - and certainly not by enough to discount the gains made in the other scenarios.

Pct of Responses Rcvd in < Xms
                                 x=2000   x=1500   x=1000
  Moderate Latency w/Pipeline    84       82       52
  Moderate Latency wo/Pipeline   82       76       54

Pct of Responses Rcvd in < Xms
                            x=1500   x=1000   x=500
  Low Latency w/Pipeline    88       82       46
  Low Latency wo/Pipeline   84       82       57

MapQuest


This test is different in that it is driven almost exclusively through JS and XMLHttpRequest. Those elements are present in the Facebook and Amazon tests as well, but they dominate the MapQuest test. In this scenario a map is brought up on the screen and manipulated in the usual ways - panning in 4 directions, zooming in and out, and toggling between satellite and map mode. By the time it is done, 711 HTTP transactions have been made.

Percent of Requests Queued
                     With Pipeline   Without Pipeline
  Moderate Latency   0               42
  Low Latency        0               44

Average Queue Latency (ms)
                     With Pipeline   Without Pipeline
  Moderate Latency   52              381
  Low Latency        10              198

For all cases, the queue latency is pretty low for this test. That means the number of documents requested in one burst is relatively modest.

Connection Type Pct
                 Moderate Latency            Low Latency
                 w/Pipeline  wo/Pipeline     w/Pipeline  wo/Pipeline
  New            12          16              11          14
  Reused Idle    30          84              32          86
  Pipeline       58          0               57          0

Marginally fewer new connections are used with pipelining. Hurrah.

Average Transaction Latency (ms)
                     With Pipeline   Without Pipeline
  Moderate Latency   677             732
  Low Latency        260             500

Pct of Responses Rcvd in < Xms
                                 x=1500   x=1200   x=900
  Moderate Latency w/Pipeline    96       87       76
  Moderate Latency wo/Pipeline   96       83       70

Pct of Responses Rcvd in < Xms
                            x=900   x=600   x=300
  Low Latency w/Pipeline    99      95      65
  Low Latency wo/Pipeline   87      75      40

Tuesday, November 30, 2010

Implementing a Pipelined Client for Firefox

I have been working on a set of patches to revamp the Firefox HTTP pipeline implementation. The objective is to provide a fast and robust implementation of pipelines. There are lots of reasons to think this is a good thing.

The lowest level of the current implementation, which is disabled by default, receives a few minor tweaks. The primary change is to implement a continually fillable pipeline instead of the existing one where a pipeline is loaded up to whatever depth the request queue supports and then is not refilled until it has been emptied.

There are much more serious changes on top of that. The most significant is probably captured in the "Select Connection Based on Type and State" bug and patch. This code does four things: 1) it classifies each transaction into a particular type, 2) it keeps track of the history of each server with respect to pipelining, 3) it determines whether or not pipelining is appropriate based on the type of the transaction and the server's history with that type in the past, and 4) if it decides to pipeline, it places the transaction on a pipeline filled with only transactions of the same classification.

That's a lot to think about. But the basic idea is to sort requests into control traffic (i.e. JS and CSS), images, revalidations, HTML-ish things, and things known to be pipeline-inappropriate such as video, most XMLHttpRequests, non-idempotent transactions, etc. I call the latter classification "solo".
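In pseudocode terms the sort looks roughly like this - a sketch only, with made-up attribute names; the real logic is richer:

```python
def classify(txn):
    # "solo" means never pipeline: video, long-poll XHR, non-idempotent.
    if txn.is_xhr or not txn.is_idempotent or txn.type.startswith("video/"):
        return "solo"
    if txn.is_revalidation:
        return "revalidation"
    if txn.type in ("text/css", "application/javascript"):
        return "control"
    if txn.type.startswith("image/"):
        return "image"
    return "html"
```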

Classes of data such as CSS and revalidations that generally have very short responses, and thus benefit the most from pipelining, favor pipelines over parallel connections even when we have less than the maximum number of connections already open.

A variety of things can plague a successful pipelining session - for instance, a site may have some documents with a large processing time on the server, and that "think time" can really gum up the pipeline. By segregating the types of documents we can turn off pipelining for whatever is DB-driven (e.g. XHTML) while still chasing down deep pipelines for images. Facebook is a terrific example of this: the contents of your wall or your friend list probably have to be scooped out of a database computation and composed in real time, but the dozens of icons they reference are whipped right out when you have their direct key.

This concept of negative feedback (in the example above it would be a read of the HTML page with a latency well in excess of the RTT to the server) is what drives the server state. Very large responses, sub HTTP/1.1 responses, cancelled pipelines, closed connections, and server headers known to be associated with broken pipeline implementations all trigger negative feedback too.

Most feedback is tied to the classification of the particular request, but some is applied to all classes on the server. Corrupted data is one such case - if the response contains an invalid MD5 sum, or uses the proposed Assoc-Req header but fails to provide the correct information, or appears to have non-sensical framing information, all requests to that server are prevented from using pipelines for a long time. In the past such corruptions have happened due to buggy or compromised servers and intermediaries.

Much like congestion control, the determination of whether or not to proceed with a pipeline is based on both negative and positive feedback. Initially pipelines are not sent. We must first see an HTTP/1.1 response header from the server. After that has been accomplished, we try to send a pipeline of only depth 2, and only on a single connection. If both of those responses are received OK then we transition from that tentative state (known internally as yellow) to a position where each connection can send pipelines of up to depth 4 instead of opening all the way. Even after the depth-of-2 test succeeds we still cannot be certain the topology supports pipelines - what appears as a short pipeline at the client may not appear that way at the server due to race conditions inherent in network transfer - but the extension to a depth of 4 for every connection still represents significant additional capacity over a single connection being allowed a depth of 2. During this phase concurrent connections are of course also used, so nothing is slowed down relative to the historical pipelining-disabled setting.

Once a transaction sent at a depth of at least 3 is successfully received, the depth limits are removed from all connections to that host. The new maximum depth is governed only by a configuration preference with a default of 32. Should one of the negative events mentioned earlier occur, that class of transaction is placed in the red state (i.e. no pipelining is allowed to that server for a period of time), generally without interfering with the other transactions.
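A condensed sketch of that feedback ladder, with the state names from this post and the depths described above; the details are illustrative:

```python
class PipelineFeedback:
    DEPTHS = {"red": 1, "yellow": 2, "green": 32}

    def __init__(self):
        self.state = "yellow"      # tentative until the depth-2 test passes
        self.probing = False       # passed depth 2, now probing at depth 4

    def max_depth(self):
        if self.state == "yellow" and self.probing:
            return 4
        return self.DEPTHS[self.state]

    def on_responses_ok(self, depth):
        if self.state != "yellow":
            return
        if not self.probing:
            self.probing = True    # the single depth-2 pipeline succeeded
        elif depth >= 3:
            self.state = "green"   # confirmed: remove the depth limits

    def on_negative_event(self):
        # Cancelled pipeline, corrupt framing, sub-HTTP/1.1 response, etc.
        self.state = "red"         # no pipelining for a while
```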

As mentioned, an unexpected delay of a few hundred milliseconds is considered think time, and feedback is applied to keep things running smoothly with only a barely perceptible bump in the road. But a longer delay not only applies feedback for future use, it also cancels any transactions pipelined after the currently delayed one and moves them to new connections. This helps mitigate the head-of-line blocking problem when it is really severe, and as a side effect no more pipelines will happen in the near future with that server. In that sense Firefox is self-correcting in a hostile environment.

All of this negative information expires over time, but while it is still valid Firefox will keep it persistently between restarts - its value was hard earned.

XMLHttpRequests are a special source of pain in traditional pipeline implementations. They often implement long-polling patterns where a request is sent to the server and the server intentionally hangs until some external event occurs - it then shares that information with the client by forming a response. It is essentially server push, from a data point of view, implemented with a long-polling request. For that reason XHR is by default classified as solo, but metadata can be supplied in the form of an HTTP request header to allow the application to indicate the request is expected to be fulfilled quickly.

As they say on late night TV, "But Wait! There's more!" Sometimes the pipelining infrastructure problem isn't on the server side; sometimes it is a proxy that is part of the client's topology. To test for that there is a new startup test which initiates a deep pipeline to a known pipeline-friendly resource on the Internet. That server intentionally defers its responses until a deep pipeline has been received there, and only then leaks a small response. Both sides continue a pipeline on the same connection until it has been confirmed that an entire window of data can be supported by the HTTP devices on that path. The results of this test are cached for a very long time. The transfer also takes the opportunity to share with the client a list of host names that should be blacklisted with respect to pipelining because of known operability problems. (You can still use them - just not pipeline with them.) The location (and enablement) of the test and hostname blacklist server are configurable.

All of this feedback and history tracking sounds intimidating, but the truth is that most of the web works just fine. There are enough problems that a feedback driven self correcting implementation helps smooth out the bumps, but most of the web opens right up without any problems.

Mail your comments to me or provide a talk back link.

The Value of HTTP Pipelines

For the past few months I've been on a personal quest to implement a safe and effective HTTP pipelining strategy for Mozilla Firefox. The pipeline concept is part of HTTP/1.1 but for various reasons has not been widely deployed on the web.

I'm going to make a series of three posts. This one will be basic background information - what pipelines are in the abstract and why they are more relevant to today's web than they have been in years. The second post will be about the details of my Firefox implementation and its mechanisms for dealing with the realities of the occasional pipeline-hostile piece of infrastructure. The last post will share some performance metrics of those patches in common use cases.

HTTP Pipelines - 101

A client forms a pipeline simply by sending a second HTTP request down an HTTP connection before it has received the first response. A pipeline may be more than two transactions deep. HTTP responses are still required to be returned in the order their requests arrived.

This is a simple concept, but the benefits are profound.

Chiefly, the pipelined transactions benefit by eliminating the latency involved in sending their request. The greater the latency, the greater the benefit.

This crude diagram shows 2 normal HTTP transactions without pipelining. In it each request is shown with a rightward arrow and the response is shown with a leftward arrow. The empty space represents the network latency. When the first response is received, the second is sent.

When pipelining is used, the picture becomes different. Even though the request and response sizes are the same, and the round trip time between the hosts is the same, the overall transaction completes sooner. It is obvious that the round trip latency between the client and the server has been mitigated as a problem.

The greater the latency and the deeper the pipeline the more benefit is captured.
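The whole idea fits in a few lines of socket code. A bare-bones sketch, with example.com standing in for a pipeline-friendly server:

```python
import socket

sock = socket.create_connection(("example.com", 80))
request = ("GET {path} HTTP/1.1\r\n"
           "Host: example.com\r\n"
           "\r\n")
# Write the second request before the first response has arrived.
sock.sendall(request.format(path="/a").encode() +
             request.format(path="/b").encode())
# The server must answer in order: /a's response bytes, then /b's.
reply = b""
while b"\r\n\r\n" not in reply:    # read at least the first header block
    chunk = sock.recv(4096)
    if not chunk:
        break
    reply += chunk
```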

While bandwidth continues to improve, latency is not keeping up. The trend is toward mobile networks and those have latencies 5x to 20x worse than broadband WAN connections. This increased latency is an opportunity for pipelining.

HTTP Pipelines - 201

Conceptually, parallel HTTP connections can garner similar benefits. After all, independent HTTP connections do not need to wait for each other to finish before beginning their work.

Up to a point, that is completely true. Parallelism has been the mainstay for many years and served us well. But it has its limits.

The chief limit is actually specified: no more than 6 (or 4, or 2, depending on the point in history) simultaneous connections are supposed to be allowed from the user agent to the server. This is a serious handicap - Firefox may easily have a request queue of over 100 images, stylesheets, and javascript items to fetch upon parsing a Facebook HTML page full of photos, emoticons, ads, and widgets. A single-digit number of connections is far too small to effectively parallelize that download.

This brings us to the reason there is a limit on the number of connections. Making a new TCP connection sucks. Before it can be used at all it requires a high-latency three-way handshake, and if it is over TLS, an expensive RSA operation on the server (and yet another round trip). Making the new connection requires access to shared data structures on the server in a way that using an existing connection does not, and that harms scalability. Servers that can pump multiple gigabits a second of TCP data in their sleep through a few connections can still only initiate on the order of tens of thousands of connections a second.

Maintaining all those connections is expensive for the server too. Each one takes state (i.e. RAM in the kernel, perhaps a thread stack in the application), and the server now has to sort through more control blocks and application states every time a packet comes in, in order to match it with the right one. The increased diversity of TCP control blocks in use severely pollutes the L2/L3 cache, and cache behavior is the number one factor in high performance TCP.

Even if pipelines only deliver browser performance equal to parallel connections while using fewer of them, that is a win for Web scalability.

HTTP Pipelines - 301

The details of TCP start to weigh in heavily in favor of pipelining when parallelism and pipelining are compared.

Most significantly, TCP slow start performs poorly in the parallelized environment. To recap what you already know: slow start requires the sender to send a conservative amount of data at first (typically 3 or 4 packets) and then wait a full round trip time to receive positive feedback in the form of an ACK from the recipient. At that point it can grow the window a little, send another burst, and wait again. After enough iterations the sender is generating enough traffic in each burst to "fill the window" and it no longer has to wait.
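As a rough illustration of the cost, you can count the round trips a fresh connection burns before its window is big enough to keep a pipe full, assuming the window roughly doubles each RTT:

```python
def rtts_until(cwnd_target, iw=3):
    cwnd, rtts = iw, 0
    while cwnd < cwnd_target:
        cwnd *= 2              # classic slow start: roughly double per RTT
        rtts += 1
    return rtts

# A 64-packet window (~93KB of 1460-byte segments) starting from IW=3:
print(rtts_until(64))          # 5 round trips of underutilized network
```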

Pipelining, of course, does not eliminate slow start. But it does use fewer connections to move the same amount of data and the slow-start penalty scales with the number of connections not the amount of data. Fewer connections mean less data is transferred under the artificially slow conditions of slow start.

There is another wrinkle that applies even to parallel connections that have paid their start-up dues and progressed past slow start. TCP connections that have not sent data within an RTO (think of an RTO as a round trip time plus a grace period) are supposed to go back to slow start! From a TCP point of view this makes some sense - the inactivity means the TCP stack has lost the ack clock that it needs to use the connection aggressively. But for even a persistent parallel HTTP connection this is a disaster. Effectively each transaction response must go through slow start, because it takes more than a round trip for the last packet of the previous response to travel to the client, the client to form the next request, the request to travel to the server, and the server to form the response. Pipelined connections do not have that problem - one response follows immediately on the heels of the previous one, which ensures optimal TCP use of the available bandwidth. Huzzah.

Lastly, it is worth talking about Google's push for larger initial windows during slow start. They support a value of 10 instead of the current 3 or 4. Basically I think this is a good thing - the Internet can support larger bursts than we are currently sending. Unfortunately, it has a terrible potential interaction with parallel connections. 6 connections would essentially mean 60 congestion control packet credits available to be sent in a burst at any time. Just as 3 is too few for a single connection environment, 60 is too many for a multiple connection environment. Because pipelines reduce the reliance on parallel connections, they bring down the total number of packet credits outstanding at any time while still allowing more effective slow-start capacity probing. That's a win-win in my book.

Wednesday, September 1, 2010

Long Out Of Order Queueing Delays

Last week I posted about a kernel patch that records how long out-of-order TCP packets are kept hidden from userspace while the kernel tries to fill in the necessary holes to create an in-order stream.

These packets are especially frustrating - they have arrived at the host, but the application does not have access to them until the kernel can create an in-order stream. Some applications that are really doing messaging over TCP (which might be sensible if you're looking for congestion control, maybe layered security, maybe multiplexing different semantic streams onto one TCP stream where the loss is localized to just one of them, etc.) might be capable of moving on with their lives (and their data) more quickly if they had access to the data queued behind the missing sequence numbers. So the question is: how long are these kinds of applications waiting for in-order data when out-of-order data is already at their host?

I ran that code for about 10 days on my desktop which runs typical American broadband with normal RTTs anywhere from 40 to 150ms. The host is even SACK enabled. Here is what I found:

* 38,881 web flows (port 80 or 443)

* 164 flows that contained reordered packets. That is 4.1 per thousand.

* 915K total packets. 18,169 of them reordered. That is 19.8 per thousand.

* The flows with reordering account for just .4% of the flows, but 40% of the total packets. Obviously, the bigger you are the more likely you are to experience a reordering event. Furthermore, small flows sometimes don't reorder at all, because any loss that impacts them tends to empty the window and be repaired by a timeout rather than leave out-of-order data queued behind a hole.

* If you select for just the .41% of flows that experience reordering, a whopping 4.6% of the packets in those flows are reordered on average. Indeed the average such flow is pretty large - 2418 packets, 110 of them delayed due to being out of order. The average size of a flow that did not experience any reordering was just 24 data packets long. The fact that reordering events come in such large clusters is probably good news - it likely means that we were seeing big windows of data on the wire and just a small number of the early packets in each window were lost; the rest of the window is counted as out-of-order. We want to see big windows in flight - so I'm good with that.

* The average length of a reordered packet is 1424 bytes. Over 97% of these reordered packets are at least 1300 bytes long. This isn't that interesting, but I wrote it down - so here it is.

When I talk about "N packets long" I mean my host received N packets with data in them; bare ACKs and control packets are not counted.

So far, that's not too bad. Big flows have this happen all the time, and reordering is basically a prerequisite for doing any kind of TCP fast recovery in the face of packet loss, so far that looks fine. If we assume that the reordering is due to small packet losses which can be repaired with fast-retransmit algorithms, which seems to make sense, then the out-of-order problem should be repaired in a little over 1 RTT, right?

Unfortunately, when I plotted the delays incurred they look a lot bigger than the distribution of RTTs I see from my desktop. A lot bigger.

[graph: distribution of time packets spent in the kernel reorder queue]

There is a really long tail on that graph - and it only captures the best 97% of the data points. The longest I saw any packet wait in the reorder queue (and make it out again) was a full 2.5 minutes.

Even though 2.5 minutes is an aberration, the normal cases are still pretty awful. The median time out-of-order packets spend queued in the kernel while waiting for an in-order stream is 293 milliseconds! Ouch.

Let's zoom in on that graph - this shows the 90% of the packets that waited the shortest amount of time:

[graph: the same distribution, zoomed to the fastest 90% of packets]

That's pretty ugly: you need to budget a full second of waiting to get 80% of the reordered packets out of their kernel limbo.

It is much much uglier than I expected.

It's not the reordering that bothers me - big reordering runs are to be perfectly expected in the face of a little packet loss and it is good to use that bandwidth while the loss is repaired.

But why is it taking so long to repair?

Friday, August 20, 2010

Characterizing Delays Caused by TCP in-order

If packet N+1 of a TCP flow arrives before packet N, the receiving application does not see any data until packet N gets there. That's what we mean when we say TCP guarantees in-order delivery. That is also true if N+1 through N+100 get there before N - nobody gets through until they can all be delivered in-order.

At least using the BSD socket API.

I got to thinking about the impact of this when discussing an implementation that multiplexes various logical streams on top of one TCP connection. For instance, SPDY and BEEP do things along those lines, partly for the efficiency of more accurate congestion control data. But as someone objected, that creates a certain amount of fate sharing between the different streams that wouldn't exist if they were on separate TCP channels. A packet loss in one of them creates a delay for both, even though throughput might very well be maintained using some variation of fast retransmit and large windows.

So the question: how often are packets received but their data delayed by the kernel because the stream isn't yet in order? And how long are those delays?

I don't know yet, but I wrote some crude Linux kernel patches to find out. When an skb is moved out of the out-of-order queue, a structure with two timestamps (into queue, out of queue) is passed to userspace through the netlink connector mechanism. It also reports the total number of received data packets on each TCP stream. That way we can find out how often, and how long.
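
The per-event record can be tiny. A minimal sketch of what it might look like (the field names are my invention, not necessarily what the patch uses):

#include <stdint.h>

/* One record per skb leaving the TCP out-of-order queue, delivered to
 * userspace over the netlink connector. Names are illustrative only. */
struct ooo_event {
    uint64_t enqueue_ns;   /* when the out-of-order skb arrived       */
    uint64_t dequeue_ns;   /* when the hole was filled and it drained */
    uint32_t flow_id;      /* identifies the TCP stream               */
    uint32_t rx_data_pkts; /* total data packets seen on this stream  */
};

/* The delay of interest is simply dequeue_ns - enqueue_ns. */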

I'm running the hack on my desktop now.

Tuesday, February 16, 2010

Googling Harder

A while back I mentioned that Google thinks TCP ought to be more aggressive.

I must admit, this matches my own bias. I can barely count the number of applications I have watched wait for network I/O when there was plenty of CPU and idle bandwidth available. It's maddening. Sometimes it's slow start or another aspect of congestion control; sometimes it is outdated things like the Nagle algorithm.

Well, Google is back at it with this slide set. (PDF)

They make the argument for increasing the initial cwnd. More provocatively, they argue that the Web has already done so in a de facto way by going to aggressive numbers of independent parallel HTTP connections (where you essentially get new cwnd credits just for opening a new TCP stream). Clever argument. Maybe you want to pace the data after 3 or 4 packets based on the RTT of the handshake - so you don't overrun any buffers unnecessarily.

Frankly, this kind of thing can be implemented on the server side without ever telling the peer. It would make some sense for Google to just do this for a few different values of cwnd on a tiny fraction of their traffic and see if the packet loss rates change and then publish that.

Tuesday, July 21, 2009

Caution - (S)Low Bridge Ahead

This post will not be satisfying. Someone has posted some great datapoints about virtualized packet forwarding, which is great. But they don't make a lot of sense. Which is not great. Nor is it satisfying.

Oh well, I'm sure there will be a followup sometime in the future.

In this thread, Or Gerlitz posts a new networking type for qemu (and by extension kvm), which are of course popular Linux host virtualization packages. The networking type is "raw" and the driver couldn't be more simple - a (v)lan interface on the host is opened with an AF_PACKET socket and all of the packets that appear there are shoved through to the guest interface, and vice versa.

This is a pretty direct way of doing things, but it has the unfortunate side effect that all of the guests and the host itself are aggregated onto one upstream switch port without any kind of bridge, switch, or router in between. This means that unless the upstream switch can do a u-turn when forwarding (and most of them will not), all of the guests and the host are isolated from each other. The normal way of doing things is to attach the guests and host together with a tun/tap socket and run a bridge on the host. This bridge does all the necessary forwarding so that everybody has full connectivity, and it lets you run iptables and ebtables on the host to boot.
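
For flavor, the heart of such a raw backend is just a packet socket bound to one interface. A hedged sketch (error paths trimmed; this is not the actual qemu code):

#include <arpa/inet.h>
#include <net/if.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open an AF_PACKET socket that sees every frame on one host NIC;
 * a "raw" backend shuttles frames between this fd and the guest. */
int open_raw(const char *ifname)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    struct sockaddr_ll sll;
    memset(&sll, 0, sizeof(sll));
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex  = if_nametoindex(ifname);
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        close(fd);
        return -1;
    }
    return fd; /* read() frames from the wire, write() frames to it */
}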

That's all well and good, but the really interesting part was the motivation for doing an end run around tun/tap/bridge in the first place: the poster runs a test with short UDP transmissions over GigE. Running it between two real (non-vm) hosts he sees 450K packets per second. The post doesn't mention what hardware is involved, so we'll just take it as a black-box baseline. Switching the sender to a qemu guest with traditional tap/bridge networking, it plummets to just 195K. The "raw" interface gets that back up to 240K - which is still a far cry from 450K, eh?

Tap mode shows three times as many context switches as the raw version. I don't think I saw a number for the non-vm test. Other than that, nothing (including the profiles) really jumps out.

The whole thread is worth reading - but the main data points are here and here.

Tuesday, July 14, 2009

Google Thinks TCP should be more Aggressive by Default

Really interesting post from Jerry Chu of Google. He says Google has data showing that we ought to lower the initial RTO, increase the initial CWND, drop the min RTO, and reduce the delayed-ack timeout in TCP.

Based on my own anecdotal data, I've done stuff like that in products I've worked on. Let's face it - 3 seconds is a freaking eternity. Processors, networks, and busses have all scaled, but these constants remain the same. Jerry says Google has data that shows this is important. As the Google data set is no doubt much more extensive than any I worked with, that's a really welcome post.

Probably the most important data point Jerry shares is that in "up to a few percentage points" of his data set, connections exhibit a SYN-ACK retransmission from the Google servers. Wow. (At least) 1 in 50 SYN-ACKs needs to be retransmitted? That's not my experience at all, and if true at Google scale it is absolutely fascinating. Are they generally seeing 2% packet loss on Google transmissions? There's no way that they are seeing that; Google would appear to suck! So what's going on? Why is the SYN-ACK retransmitted more than anything else? (And I'm assuming they are indeed lost, because otherwise lowering the timeout wouldn't be the right remedy.)

Friday, February 6, 2009

Increasing Upload Speed from Firefox on Windows

Sometimes bugs are more interesting to work on than features - they have that mysterious quality about them and give a satisfying feeling when you figure it out.

This one was brought to my attention by Mark Finkle.

It basically boiled down to HTTP POSTs from Firefox on Windows being slower than they are in Internet Explorer, and also slower than they are in Firefox on OS X or Linux. (IE on Windows and the non-Windows platforms all perform about the same, with FF on Windows lagging behind.)

The culprit turned out to be the TCP congestion window. Firefox never had more than 8KB of un-acked data outstanding. If you have a network path with a high bandwidth-delay product, that isn't going to cut it.

Windows (up to and including Vista) has an 8KB default send buffer, which caps the sending window. Or so I found out thanks to Google.

Autotuning that buffer size is standard practice on OS X and Linux and has been for a long while. Vista autotunes the receive buffer (but not XP, according to what I read), but the send buffer is a small fixed value. IE, realizing that it's a web 2.0 kinda world out there full of User Generated Content, must increase that value from its default - because I can look at the IE tcpdump traces and see >80KB of un-acknowledged data, in the same way I do with a trace of Firefox on Linux. (There would be more, but whatever value they have it set to, the max window size is not the limit.)

The Linux default is 128KB for any reasonably modern machine.

Fortunately that can be controlled on a system-wide basis through a registry preference, or on a per-socket basis by setting SO_SNDBUF. I submitted a patch that does the latter when the network.tcp.sendbuffer preference is set - the patch also sets a default value for that pref on Windows.
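
For illustration, the per-socket version is essentially a one-liner. A minimal sketch (not the actual Firefox patch; on Windows the same call exists in Winsock with a (const char *) cast on the option value):

#include <stdio.h>
#include <sys/socket.h>

/* Raise the send buffer so the congestion window, not an 8KB default
 * buffer, limits how much un-acked data can be in flight. 128KB
 * matches the Linux default mentioned above. */
int bump_sndbuf(int fd)
{
    int size = 128 * 1024;

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0) {
        perror("setsockopt(SO_SNDBUF)");
        return -1;
    }
    return 0;
}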

If you would suffer from this, I see three options:
1] Wait until my patch (or a later rev of it) ends up in an official build
2] Set the registry property to change it for your whole Windows install - KB 950326.
3] you might be able to build a k3wl binary add-on that does the same thing as my patch in a crazy way. Fame, fortune, and faster flickr and picasa uploads await you.

Saturday, November 8, 2008

DNS Prefetching for Firefox

Recently I wrote patches implementing DNS prefetching in Firefox. I am primarily interested in their impact on Fennec (aka Mobile Firefox), but it looks like they will land first in Firefox 3.1 beta2. The (hopefully final) glitches are being shaken out of the patch now.

Google Chrome has a feature like this too.

DNS resolutions are always dominated by latency rather than bandwidth. Particularly on mobile networks the latencies are very high. That makes them perfect candidates for speculative pre-fetching. The advantage is in the latency improvement - instead of waiting for a hostname lookup when you click a link, do that lookup while you're reading the page the link is embedded in. Because the lookups are so small (generally one runt packet in and out), the cost of any wasted, overly optimistic lookups really doesn't impact the performance of browsing. Good payoff at low cost - the best of both worlds.

The basic benefit is simple: if you click on a link using a new hostname, you save a round trip time. On some networks this can be a substantial improvement (800ms or more) in responsiveness. Some describe this simply as "figuring out the IP address of every link before you click on it".

The Firefox implementation takes this approach one step further than just pre-resolving anchor href hostnames: it also applies the prefetch logic to URLs being included in the current document - things like images, css, and jscript that are being loaded right away, in addition to anchor links which might be clicked on at a slightly later time.

At first that seems nonsensical. How can you pre-fetch the DNS for something you are fetching right now? Where does the "pre" come in? The answer is not so much in the definition of "pre" as it is in the definition of "right now". Most HTTP User Agents, Firefox being no exception, limit their number of simultaneous connections, both per host and overall. Typical pages embed quite a few objects and it is easy to run into these limits. When this happens the browser queues some of the requests. The Firefox DNS prefetch implementation allows those queued requests to overlap the high-latency host resolution with whatever transfers might be going on, without creating an excess level of parallelization.

While this is just a secondary benefit, it can be meaningful. For example, on the day I grabbed a snapshot of http://planet.mozilla.org/ it required 23 unique DNS resolutions in order to render the base page. Most of these were in img URLs. When loading the page with the prefetch patches, even with a cold cache, 16 of them were either fully completed when needed for the first connection or at least already in progress. The result, measured on an EDGE network, was a 4% overall improvement in page load time. Not bad for something that does not reduce bandwidth consumption in any way.

Configuration

Basically, it just works. You don't need to do anything. But there are a few configurables out there for both browsers and content providers.

First, as a browser user you might want no part of this. Fair enough - it's your browser. If you set the preference network.dns.disablePrefetch to true, the prefetch code will never take effect, no matter what any other configuration is set to.

Furthermore, as a security measure, prefetching of embedded link hostnames is not done from documents loaded over https. If you want to allow it in that context too, just set the preference network.dns.disablePrefetchFromHTTPS to false.

Content providers have a couple neat tricks available too. These are meant to be compatible with Chrome.

For content to opt out of DNS pre-resolution it can be served with the x-dns-prefetch-control HTTP header set to off. The equivalent meta http-equiv element can be used instead of a response header too:
<meta http-equiv="x-dns-prefetch-control" content="off">

Setting content to on will reverse the effect. You can never turn pre-fetching on in a browser that has it disabled by preference, but you can undo the impact of a previous x-dns-prefetch-control command. In this way, different content provider policies can apply to different portions of the document.

The last configuration possibility allows the content provider to force the lookup of a particular hostname without providing an anchor using that name. This is done with the link tag:
<link rel="dns-prefetch" href="http://www.spreadfirefox.com/">

The href attribute can contain a full URL, or just a hostname. Hostname-only attributes should precede the hostname with two slashes:
<link rel="dns-prefetch" href="//www.spreadfirefox.com">

Content providers might use the link notation in a site-wide home page in order to preload hostnames that are widely used throughout the site but perhaps not on the home page.

Sunday, September 28, 2008

Asynchronous DNS lookups on Linux with getaddrinfo_a()

A little while back I posted about a bug in the glibc getaddrinfo() implementation which resulted in many CNAME lookups having to be repeated. At that time I teased a future post on the topic of the little known getaddrinfo_a() function - here it is.

type "man getaddrinfo_a". I expect you will get nothing. That was the case for me. Linux is full of non-portable, under-documented, but very powerful interfaces and tools. The upside of these tools is great - I recently referred to netem and ifb which can be used for easy network shaping, and interfaces like tee(), splice() and epoll() are also hugely powerful, but woefully underutilized. I always get a thrill when I stumble across one of these.

Part of the reason for their low profile is portability. And there are times when that matters - though I think it is cited as a bedrock principle more than is really necessary. I think the larger reason is that some of these techniques lack the documentation, web pages, and references in programmer pop-culture necessary to be ubiquitously useful.

Maybe this post will help getaddrinfo_a find its mojo.

This little jewel ships with glibc (you link it in with -lanl) and has for many years - you can be assured that it will be present in the runtime of any distribution of the last several years.

getaddrinfo_a() is an asynchronous interface to the DNS resolution routine - getaddrinfo(). Instead of sitting there blocked while getaddrinfo() does its thing, control is returned to your code immediately and your code is interrupted at a later time with the result when it is complete.

Most folks will realize that this is a common need when dealing with DNS resolution. It is a high latency operation and when processing log files, etc, you often have a need to do a lot of them at a time. The asynchronous interface lets you do them in parallel - other than the waiting-for-the-network time, there is very little CPU or even bandwidth overhead involved in a DNS lookup. As such, it is a perfect thing to do in parallel. You really do get linear scaling.

The best documentation seems to be in the design document from Ulrich Drepper. This closely reflects the reality of what was implemented. Adam Langley also has an excellent blog post with an illustration on how to use it. Actually, the header files are more or less enough info too, if you know that getaddrinfo_a() even exists in the first place.

The good news about the API is that you can submit lookups in big batches with one call.

The bad news about the API is that it offers callback either via POSIX signal handling, or by spawning a new thread and running a caller supplied function on it. My attitude is generally to avoid making signal handling a core part of any application, so that's right out. Having libraries spawn threads is also a little disconcerting, but the fact that that mechanism is used here for the callback is really minor compared to how many threads getaddrinfo_a() spawns internally.
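
For what it's worth, you can sidestep both callback mechanisms: submit the batch with GAI_NOWAIT and a NULL sigevent, then wait with gai_suspend(). A sketch (hostnames are placeholders; build with -lanl):

#define _GNU_SOURCE
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>

/* Resolve a batch of names concurrently, polling with gai_suspend()
 * instead of taking a signal or thread callback. */
int main(void)
{
    const char *names[] = { "www.mozilla.org", "www.kernel.org" };
    enum { N = sizeof(names) / sizeof(names[0]) };
    struct gaicb *reqs[N];

    for (int i = 0; i < N; i++) {
        reqs[i] = calloc(1, sizeof(struct gaicb));
        reqs[i]->ar_name = names[i];   /* service and hints left NULL */
    }

    /* Submit the whole batch in one call; returns immediately. */
    if (getaddrinfo_a(GAI_NOWAIT, reqs, N, NULL) != 0)
        return 1;

    /* Block until at least one request finishes, then recount. */
    for (int done = 0; done < N; ) {
        gai_suspend((const struct gaicb * const *)reqs, N, NULL);
        done = 0;
        for (int i = 0; i < N; i++)
            if (gai_error(reqs[i]) != EAI_INPROGRESS)
                done++;
    }

    for (int i = 0; i < N; i++)
        printf("%s: %s\n", names[i],
               gai_error(reqs[i]) ? gai_strerror(gai_error(reqs[i]))
                                  : "resolved");
    /* addrinfo results live in reqs[i]->ar_result; freeaddrinfo() them */
    return 0;
}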

I had assumed that the invocation thread would send a few DNS packets out onto the wire and then spawn a single thread listening for and multiplexing the responses... or maybe the listening thread would send out the requests as well and then multiplex the responses. But reading the code shows it actually creates a pretty sizable thread pool wherein each thread calls and blocks on getaddrinfo().

This is more or less the technique most folks roll together by hand, and it works OK - so it is certainly nice to have it predone and ubiquitously available in libc rather than rolling it by hand. And it is ridiculous to code it yourself when you are already linking to a library that does it that way. But it seems to have some room for internal improvement in the future... if that happens, it's nice to know that at least the API is settled and upgrades should be seamless.

One last giant caveat - in glibc 2.7 on 64-bit builds, getaddrinfo_a() appears to overflow the stack and crash immediately on just about any input. This is because the thread spawned internally is created with a 16KB stack, which is not enough to initialize the name resolver when using 64-bit data types. Oy! The fix is easy, but be aware that some users may bump into this until fixed libcs are deployed.