Friday, August 20, 2010

If packet N+1 of a TCP flow arrives before packet N, the receiving application does not see any data until packet N gets there. That's what we mean when we say TCP guarantees in-order delivery. The same is true if N+1 through N+100 arrive before N - none of it gets through until it can all be delivered in order.
At least using the BSD socket API.
I got to thinking about the impact of this when discussing implementations that multiplex various logical streams on top of a single TCP connection. SPDY and BEEP, for instance, do things along those lines, in part because the shared connection yields more accurate congestion control data. But as someone objected, that creates a certain amount of fate sharing between the different streams that wouldn't exist if they were on separate TCP connections. A packet loss in one of them delays all of them, even though throughput might well be maintained using some variation of fast retransmit and large windows.
So the question: how often is a packet received but its data delayed by the kernel because the stream isn't yet in order? And how long are those delays?
I don't know yet, but I wrote some crude Linux kernel patches to find out. When an skb is moved out of the out-of-order queue, a structure with 2 timestamps (into the queue, out of the queue) is passed to userspace through the netlink connector mechanism. It also reports the total number of received data packets on each TCP stream. That way we can find out how often, and how long.
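The userspace side of that is mostly netlink connector boilerplate. A rough sketch of a reader - the connector index and the event layout here are illustrative stand-ins, not the actual definitions from my patch:

    /* Sketch of a userspace reader for the out-of-order queue events.
     * CN_OOO_IDX and struct ooo_event are hypothetical placeholders
     * for whatever the kernel patch actually registers and emits. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>
    #include <linux/connector.h>

    #define CN_OOO_IDX (CN_NETLINK_USERS + 1)   /* assumed connector index */

    struct ooo_event {
        unsigned long long enqueue_ns;  /* when the skb entered the ooo queue */
        unsigned long long dequeue_ns;  /* when it finally left */
        unsigned long long total_pkts;  /* data packets seen on the stream */
    };

    int main(void)
    {
        int fd = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
        struct sockaddr_nl sa = {
            .nl_family = AF_NETLINK,
            .nl_pid    = getpid(),
            .nl_groups = CN_OOO_IDX,    /* subscribe to the patch's group */
        };
        char buf[4096];

        bind(fd, (struct sockaddr *)&sa, sizeof(sa));
        for (;;) {
            ssize_t n = recv(fd, buf, sizeof(buf), 0);
            if (n <= 0)
                break;
            struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
            struct cn_msg *cn = NLMSG_DATA(nlh);
            struct ooo_event *ev = (struct ooo_event *)cn->data;
            printf("held %.3f ms (stream total: %llu pkts)\n",
                   (ev->dequeue_ns - ev->enqueue_ns) / 1e6,
                   ev->total_pkts);
        }
        return 0;
    }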
I'm running the hack on my desktop now.
Sunday, June 28, 2009
'Violation' is so pejorative
> [...] we also have architectural issues in violating layered
> software design

Ben Hutchings says:

> Meanwhile, in the real world, we want to avoid copying data, so an skb doesn't belong to any specific protocol layer.
Thank You Ben!
Abstractions aren't inherently good. They are great if they help you build or maintain things that are otherwise too complex to understand or too tedious to work on - but we have to vigilantly remember that losing those details also sometimes restricts the quality of what we can build, as some things are just inherently complex.
Friday, May 1, 2009
VOIP Recorder: Listen Live
I've had the opportunity to add a few new features to my Vonage call recording application, VOIP Recorder.
The most entertaining feature is "Listen Live". That will stream the audio from any active phone call to your desktop in more or less real time. That's neat.
I have also added easy buttons on the "at-a-glance" screen to toggle the recording of an individual call on or off. These buttons complement the touch tone sequences or Caller-ID based programmable filters that provide similar functionality.
Feedback on the first preview release has started to come in. Generally, it has been quite positive. A few people had trouble with the auto-discover portion of the program. I have made some updates to those algorithms to deal with more topologies, and it seems more robust now. If you tried out VOIP Recorder earlier and had problems auto-discovering your ATA, try downloading the new release (revision 1-M or greater) and see if that helps. All accounts have been updated with the new release. If you have a problem, please be sure to write me so we can make VOIP Recorder even better.
Also, thanks to an idea from Steve, I have added optional courtesy beeps. These are short beeps played periodically to remind everyone about the call recording. You can configure if they are played and, if so, how often they are played. They are off by default. I like the way they sound - they make a nice alternative to the full "recording" announcement insertion.
Last in the new feature department is the addition of a simple "*" filter which matches everything. This lets you write filters that, for instance, whitelist some specific phone numbers but block everyone else. Thanks to Chad for pointing out that omission.
So there is lots going on in the world of VOIP Recorder. You should check out the new release at http://www.penbaynetworks.com/ - Linux, Macs, and Windows are all supported for recording calls made with Vonage(tm), as well as orchestrating pop-up notifications and call blocking based on Caller-ID information.
Friday, January 30, 2009
Getting Vonage Caller-ID display notifications on Linux & Mac without a soft phone
(Update - April 2009: See also http://bitsup.blogspot.com/2009/04/recording-calls-made-with-vonage.html and http://www.penbaynetworks.com/ for a one-stop answer to this problem on windows, mac, and linux)
I use Vonage. What they really sell you is a POTS<-VOIP->POTS tunnel, where they provide one of the POTS/VOIP bridges that you install in your house to bring your old traditional phones online. They also sell a soft-phone that does not include this bridge, but that isn't what I use.
It's a good service - unmetered calling for the places I call, and it comes with a bunch of phone features for a flat $28/month. The VOIP bits are done with SIP in the usual way.
So that's lovely, but by default it doesn't provide any access to the SIP data beyond the POTS bridge and that presents a challenge to unlocking your data.
What I would like is desktop display notifications of the caller id data when the phone rings. This is pretty standard stuff when dealing with soft phones, but it seems to be a bit trickier in the Vonage case.
So I rolled my own for KDE4 and OS X, which are the screens I spend my time staring at.

Step 1: Find the SIP invitations.
The SIP protocol is UDP unicast to the Vonage "router". If you install the router (in my case a Motorola VT2142) doing double duty as your broadband gateway router, then it will consume those packets without ever sending them onto your LAN. If they're not on the LAN, then you can't really capture them and display the precious info inside, so a different arrangement is required.
I put the Vonage box behind a Linux bridge. The bridge is just a Linux box (in this case my file, email, and print server) with 2 interfaces. Those interfaces don't have IP addresses; instead they are brought together into a logical interface commonly called br0:

    brctl addbr br0; brctl addif br0 eth0; brctl addif br0 eth1

Once you have done that, the machine will act like an Ethernet switch, forwarding packets between interfaces as necessary. You could set it up as an IP router instead, but then you would need different subnets and all manner of other duplicated architecture. The bridge is fine. The server doesn't need an IP address to be a bridge, but it does need one to keep doing those file/print things - I just ran dhcp as normal on the new br0 interface.

Now if you run tcpdump on eth1 (or more specifically, whichever interface is "behind" the bridge with the Vonage device) you will see the Vonage traffic crossing the bridge. Reading that data, it is easy to see that my SIP control runs on UDP port 10000. I hear other routers typically use port 5061.
Step 2: Capture those invitations
Now that you've got access to the SIP data, let's do something with it. I used the NFQUEUE iptables interface. NFQUEUE lets you shunt packets to userspace for filtering while they are still in the network stack. I wrote a simple iptables rule that matches UDP traffic headed for port 10000 on the LAN and places those packets into queue number 5061 for consumption by a userspace program:

    /sbin/iptables -A FORWARD --protocol udp --dport 10000 -d 192.168.16.0/24 -j NFQUEUE --queue-num 5061
Step 3: Process the invitations and generate network notifications
I wrote a little C program that runs on the bridge and consumes the packets in the NFQUEUE. For each packet it tries to figure out whether this is a SIP invitation and, if it is, what the caller id info is. All packets are acknowledged back to netfilter/iptables so they are passed on to the Vonage router (which is what makes the phone ring!). If you wanted to do some automatic call blocking, this would be a good place to just drop the invite on the floor, and then the phone would never ring.
The producer-nfqueue program is available here.
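For flavor, here is roughly what such an NFQUEUE consumer looks like with libnetfilter_queue (a sketch, not the actual producer-nfqueue source; the SIP parsing is elided, and you build with -lnetfilter_queue):

    /* Minimal NFQUEUE consumer: read packets from queue 5061, inspect
     * them, and always ACCEPT so the invite still reaches the ATA. */
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <linux/netfilter.h>    /* NF_ACCEPT */
    #include <libnetfilter_queue/libnetfilter_queue.h>

    static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
                  struct nfq_data *nfa, void *data)
    {
        unsigned char *payload;
        int len = nfq_get_payload(nfa, &payload);
        struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
        uint32_t id = ph ? ntohl(ph->packet_id) : 0;

        if (len > 0) {
            /* scan payload for "INVITE sip:" and the From: header here,
             * then broadcast the caller-id info to the LAN */
        }
        /* ACCEPT makes the phone ring; NF_DROP here would block the call */
        return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
    }

    int main(void)
    {
        struct nfq_handle *h = nfq_open();
        struct nfq_q_handle *qh;
        char buf[65536];
        int fd, rv;

        nfq_unbind_pf(h, AF_INET);
        nfq_bind_pf(h, AF_INET);
        qh = nfq_create_queue(h, 5061, &cb, NULL);  /* --queue-num 5061 */
        nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);

        fd = nfq_fd(h);
        while ((rv = recv(fd, buf, sizeof(buf), 0)) > 0)
            nfq_handle_packet(h, buf, rv);

        nfq_destroy_queue(qh);
        nfq_close(h);
        return 0;
    }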
If a piece of caller-id info is found, it is broadcast to the local LAN in two different formats. The first format contains just a magic number to identify the format and the caller id strings; it is sent on UDP port 7651. The second broadcast is in Growl network format. Growl is a daemon commonly used on Mac OS X to display system notifications. Anybody running Growl with "listen for incoming notifications" and "allow remote application registration" enabled will see a popup as soon as this broadcast takes place.

Step 4: KDE applet.
On my Linux KDE4 environment, I wrote a kapplet that uses a QSystemTrayIcon overload to listen for the port 7651 broadcasts. The effect is nice, but I would rather have had something gnome/kde cross platform. From doing some reading it appears I can inject something into dbus and knotify4 will pop it up, as will gnotify, but I couldn't get that to work easily. It would also be a potential signal to things like pulseaudio to turn down the volume. Oh well, maybe next version. The applet is available here.
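If you want a consumer on some other desktop, the listening side is only a few lines. A sketch (the 4-byte magic prefix is my assumption here; the real layout is whatever the bridge program sends):

    /* Listen for the caller-id broadcasts on UDP port 7651. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in sin = {
            .sin_family = AF_INET,
            .sin_port   = htons(7651),
            .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        char buf[512];

        bind(fd, (struct sockaddr *)&sin, sizeof(sin));
        for (;;) {
            ssize_t n = recvfrom(fd, buf, sizeof(buf) - 1, 0, NULL, NULL);
            if (n < 0)
                break;
            if (n <= 4)         /* too short to hold magic + strings */
                continue;
            buf[n] = '\0';
            printf("incoming call: %s\n", buf + 4);  /* skip assumed magic */
        }
        return 0;
    }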
And now I can be lazy and find out that the ringing phone isn't one I want to answer without having to break my train of thought. Mission accomplished?
Sunday, September 28, 2008
Asynchronous DNS lookups on Linux with getaddrinfo_a()
A little while back I posted about a bug in the glibc getaddrinfo() implementation which resulted in many CNAME lookups having to be repeated. At that time I teased a future post on the topic of the little known getaddrinfo_a() function - here it is.
type "man getaddrinfo_a". I expect you will get nothing. That was the case for me. Linux is full of non-portable, under-documented, but very powerful interfaces and tools. The upside of these tools is great - I recently referred to netem and ifb which can be used for easy network shaping, and interfaces like tee(), splice() and epoll() are also hugely powerful, but woefully underutilized. I always get a thrill when I stumble across one of these.
Part of the reason for their low profile is portability. And there are times when that matters - though I think it is cited as a bedrock principle more than is really necessary. I think the larger reason is that some of these techniques lack the documentation, web pages, and references in programmer pop-culture necessary to be ubiquitously useful.
Maybe this post will help getaddrinfo_a find its mojo.
This little jewel is a standard part of libc, and has been for many years - you can be assured that it will be present in the runtime of any distribution of the last several years.
getaddrinfo_a() is an asynchronous interface to the DNS resolution routine - getaddrinfo(). Instead of sitting there blocked while getaddrinfo() does its thing, control is returned to your code immediately and your code is interrupted at a later time with the result when it is complete.
Most folks will realize that this is a common need when dealing with DNS resolution. It is a high latency operation, and when processing log files, etc., you often need to do a lot of lookups at a time. The asynchronous interface lets you do them in parallel - other than the waiting-for-the-network time, there is very little CPU or even bandwidth overhead involved in a DNS lookup. As such, it is a perfect thing to do in parallel. You really do get linear scaling.
The best documentation seems to be in the design document from Ulrich Drepper. This closely reflects the reality of what was implemented. Adam Langley also has an excellent blog post with an illustration on how to use it. Actually, the header files are more or less enough info too, if you know that getaddrinfo_a() even exists in the first place.
The good news about the API is that you can submit addresses in big batches with one call.
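For example, a minimal batch lookup is a sketch like this (link with -lanl; cleanup and error handling trimmed):

    /* Resolve every name on the command line in one getaddrinfo_a() batch. */
    #define _GNU_SOURCE
    #include <netdb.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int n = argc - 1;
        struct gaicb **reqs = calloc(n, sizeof(*reqs));

        for (int i = 0; i < n; i++) {
            reqs[i] = calloc(1, sizeof(*reqs[i]));
            reqs[i]->ar_name = argv[i + 1];
        }

        /* one call submits the whole batch and returns immediately */
        getaddrinfo_a(GAI_NOWAIT, reqs, n, NULL);

        /* ... do other useful work here, then harvest the results ... */
        for (int i = 0; i < n; i++) {
            while (gai_error(reqs[i]) == EAI_INPROGRESS)
                gai_suspend((const struct gaicb * const *)reqs, n, NULL);
            if (gai_error(reqs[i]) == 0) {
                char host[NI_MAXHOST];
                getnameinfo(reqs[i]->ar_result->ai_addr,
                            reqs[i]->ar_result->ai_addrlen,
                            host, sizeof(host), NULL, 0, NI_NUMERICHOST);
                printf("%s -> %s\n", reqs[i]->ar_name, host);
            } else {
                printf("%s: %s\n", reqs[i]->ar_name,
                       gai_strerror(gai_error(reqs[i])));
            }
        }
        return 0;
    }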
The bad news about the API is that it offers callback either via POSIX signal handling, or by spawning a new thread and running a caller supplied function on it. My attitude is generally to avoid making signal handling a core part of any application, so that's right out. Having libraries spawn threads is also a little disconcerting, but the fact that that mechanism is used here for the callback is really minor compared to how many threads getaddrinfo_a() spawns internally.
I had assumed that the invocation thread would send a few dns packets out onto the wire and then spawn a single thread listening for and multiplexing the responses.. or maybe the listening thread would send out the requests as well and then multiplex the responses. But reading the code shows it actually creates a pretty sizable thread pool wherein each thread calls and blocks on getaddrinfo().
This is more or less the technique most folks roll together by hand, and it works OK - so it is certainly nice to have it pre-built and ubiquitously available in libc rather than rolling it by hand. And it is ridiculous to code it yourself when you are already linking to a library that does it that way. But it seems to have some room for improvement internally in the future. If that happens, it's nice to know that at least the API for it is settled and upgrades should be seamless.
One last giant caveat - in libc 2.7 on 64 bit builds, getaddrinfo_a() appears to overflow the stack and crash immediately on just about any input. This is because the thread spawned internally is created with a 16KB stack which is not enough to initialize the name resolver when using 64 bit data types. Oy! The fix is easy, but be aware that some users may bump into this until fixed libcs are deployed.
type "man getaddrinfo_a". I expect you will get nothing. That was the case for me. Linux is full of non-portable, under-documented, but very powerful interfaces and tools. The upside of these tools is great - I recently referred to netem and ifb which can be used for easy network shaping, and interfaces like tee(), splice() and epoll() are also hugely powerful, but woefully underutilized. I always get a thrill when I stumble across one of these.
Part of the reason for their low profile is portability. And there are times when that matters - though I think it is cited as a bedrock principle more than is really necessary. I think the larger reason is that some of these techniques lack the documentation, web pages, and references in programmer pop-culture necessary to be ubiquitously useful.
Maybe this post will help getaddrinfo_a find its mojo.
This little jewel is a standard part of libc, and has been for many years - you can be assured that it will be present in the runtime of any distribution of the last several years.
getaddrinfo_a() is an asynchronous interface to the DNS resolution routine - getaddrinfo(). Instead of sitting there blocked while getaddrinfo() does its thing, control is returned to your code immediately and your code is interrupted at a later time with the result when it is complete.
Most folks will realize that this is a common need when dealing with DNS resolution. It is a high latency operation and when processing log files, etc, you often have a need to do a lot of them at a time. The asynchronous interface lets you do them in parallel - other than the waiting-for-the-network time, there is very little CPU or even bandwidth overhead involved in a DNS lookup. As such, it is a perfect thing to do in parallel. You really do get linear scaling.
The best documentation seems to be in the design document from Ulrich Drepper. This closely reflects the reality of what was implemented. Adam Langley also has an excellent blog post with an illustration on how to use it. Actually, the header files are more or less enough info too, if you know that getaddrinfo_a() even exists in the first place.
The good news about the API is that you can submit addresses in big batches with one call.
The bad news about the API is that it offers callback either via POSIX signal handling, or by spawning a new thread and running a caller supplied function on it. My attitude is generally to avoid making signal handling a core part of any application, so that's right out. Having libraries spawn threads is also a little disconcerting, but the fact that that mechanism is used here for the callback is really minor compared to how many threads getaddrinfo_a() spawns internally.
I had assumed that the invocation thread would send a few dns packets out onto the wire and then spawn a single thread listening for and multiplexing the responses.. or maybe the listening thread would send out the requests as well and then multiplex the responses. But reading the code shows it actually creates a pretty sizable thread pool wherein each thread calls and blocks on getaddrinfo().
This is more or less the technique most folks roll together by hand, and it works ok - so it is certainly nice to have predone and ubiquitously available in libc rather than rolling it by hand. And it is ridiculous to code it yourself when you are already linking to a library that does it that way. But it seems to have some room for improvement internally in the future.. if that happens, its nice to know that at least the API for it is settled and upgrades should be seamless.
One last giant caveat - in libc 2.7 on 64 bit builds, getaddrinfo_a() appears to overflow the stack and crash immediately on just about any input. This is because the thread spawned internally is created with a 16KB stack which is not enough to initialize the name resolver when using 64 bit data types. Oy! The fix is easy, but be aware that some users may bump into this until fixed libcs are deployed.
Thursday, September 11, 2008
The minutia of getaddrinfo() and 64 bits
I have been spending some time recently improving the network behavior of Firefox in mobile (i.e. really high latency, sort of low bandwidth) environments.
The manifestations du jour of that are some improvements to the mozilla DNS system.
In that pursuit, I was staring at some packet traces of the DNS traffic generated from my 64 bit linux build (using my LAN instead of the slow wireless net) and I saw this gem:
That is the same request (and response) duplicated. Doing it an extra time appears to cost all of .3ms on my LAN, but on a cell phone that could delay resolution (and therefore page load) time by a full second - very noticeable lag.
I started by combing through the firefox DNS code looking for the bug I assumed I had accidentally put in the caching layer. But I confirmed there was just one call to libc's getaddrinfo() being made for that name.
Then I figured it was some kind of large truncated DNS record from blogspot which necessitated a refetch. Looking further into it the record was really quite normal. The response was just 127 bytes and exactly the same each time - it contained 2 response records: one A record and one CNAME record. Both had reasonably sized names.
I found the same pattern with another CNAME too: 2 out of 6.
And so the debugging began in earnest. Cliff Stoll was looking for his 75 cents, and I was out to find my extra round trip time.
I did not find an international conspiracy, but after whipping together a debuggable libc build I determined that the DNS answer parser placed a "struct host" and some scratch space for parsing into a buffer passed in from lower on the stack. If the answer parser couldn't parse the response in that space, an "ERANGE" error was returned and the caller would increase the buffer size and try again. But the retry involved the whole network dance again, instead of just the parsing.
So what I was seeing was that the original buffer of 512 bytes was too small, but a second try with 1024 worked fine. Fair enough - it just seems like an undersized default to fail at such a common case.
And then it made sense. For most of its life, it hasn't been undersized - while the DNS response hasn't changed, the "struct host" did when I went to 64 bit libraries. struct host is comprised of 48 pointers and a 16 byte buffer. On a 32 bit arch that's 208 bytes, but with 8 byte pointers it is 400. With a 512 byte ceiling, 400 is a lot to give up.
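In code form, the arithmetic is simple - the struct below is an illustrative stand-in for glibc's internal parsing structure, not its actual definition:

    /* Illustrative stand-in for the resolver's internal scratch struct:
     * 48 pointers plus a 16 byte buffer. */
    struct host_sketch {
        char *slots[48];            /* 48 * 4 = 192 bytes on 32 bit,
                                       48 * 8 = 384 bytes on 64 bit  */
        unsigned char buffer[16];   /* + 16 bytes either way         */
    };
    /* sizeof(struct host_sketch): 208 bytes on 32 bit, 400 on 64 bit -
     * leaving only 112 of the default 512 byte buffer for the answer. */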
64 bit has a variety of advantages and disadvantages, but an extra RTT was a silent penalty I hadn't seen before.
This patch fixes things up nicely.
This is a good, concrete opportunity to praise the pragmatism of developing on open source. It is not about the bug (everyone has them, if this even is one - it is borderline); it is about the transparency. If this were a closed OS, the few avenues available to me would have been incredibly onerous and possibly expensive. Instead I was able to resolve it in an afternoon by myself (and mail off the patch to the outstanding glibc development team).
At some future time, I'll have a similarly thrilling story about the little known but widely deployed getaddrinfo_a().
Friday, May 30, 2008
Firefox Add-Ons - webfolder webDAV add-on updated for Firefox 3
I have had reason to do a little server side WebDAV work these days.
WebDav clients for Linux aren't all that common. There is a filesystem gateway (davfs), and cadaver. Cadaver is a decent command line app; like ncftp.
And more recently, I was introduced to webfolders. This is an add-on for firefox that does a perfectly good job of creating a file manager interface for DAV sites.
Point one of this post: webfolders is cool - download it yourself.
Point two of this post: webfolders as listed on the website does not support firefox 3. And of course you are using firefox 3.
Point three of this post: you really ought to be using firefox 3 - it is much faster than firefox 2.
Point four of this post: I want it all - so I have done the trivial work of updating webfolders for firefox 3. Just changed a few constants and paths, and life is good. It is here for download (any OS!)
Friday, April 18, 2008
Measuring performance of Linux Kernel likely() and unlikely()
A little while back I wrote about how prominent likely() and unlikely() are in the Linux kernel, and yet I could not find any performance measurements linked to them.
Today I made some measurements myself.
But first a quick review - likely and unlikely are just macros for gcc's __builtin_expect(), which in turn allows the compiler to generate code compatible with the target architecture's branch prediction scheme. The GCC documentation really warns against using this manually too often:

> You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.

The kernel certainly makes liberal use of it. According to LXR, 2.6.24 had 1608 uses of likely and 2075 uses of unlikely in the code. LXR didn't have an index of the just released 2.6.25 yet - but I'd bet it is likely to be more now.

My methodology was simple: I chose several benchmarks commonly used in kernel land and ran them against vanilla 2.6.25 and also against a copy I called "notlikely", which simply had the macros nullified using this piece of programming genius:
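For reference, here are the stock kernel definitions alongside the nullified pass-throughs (the latter reconstructed - presumably very close to what the "notlikely" kernel used):

    /* Stock definitions from include/linux/compiler.h: */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* The "notlikely" kernel: hints nullified, branches untouched. */
    #define likely(x)   (x)
    #define unlikely(x) (x)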
The tests I ran were lmbench, netperf, bonnie++, and the famous "how fast can I compile the kernel?" test.
The test hardware was an all 64 bit setup on a 2.6GHz Core 2 Duo with 2GB of ram and a SATA disk. Pretty standard desktop hardware.
The core 2 architecture has a pretty fine internal branch prediction engine without the help of these external hints. But with such extensive use of the macros (3500+ times!), I expected to see some difference shown by the numbers.
But I didn't see any measurable difference. Not at all.
Not a single one of those tests showed anything that I wouldn't consider overlapping noise. I had 3 data points for each test on each kernel (6 points per test), and each test had several different facets. Out of dozens of different facets, there wasn't a single criterion where the measurement was always better or worse on one kernel.
And this disappoints me. Because I like micro optimizations damn it! And in general this one seems to be a waste of time other than the nice self documenting code it produces. Perhaps the gcc advice is correct. Perhaps the Core-2 is so good that this doesn't matter. Perhaps there is a really compelling benchmark that I'm just not running.
I say it is a waste in general because I am sure there are specific circumstances and code paths where this makes a measurable difference. There certainly must be a benchmark that can show it - but none of these broad based benchmarks were able to show anything useful. That doesn't mean the macro is overused - it seems harmless enough - but it probably isn't worth thinking too hard about either.
hmm.
Labels: algorithms, characterization, hardware, linux, performance
Monday, April 14, 2008
Monitoring IP changes with NETLINK ROUTE sockets
Yesterday I read a mailing list query asking how to get event driven Linux IP address changes (i.e. without having to poll for them).
I agreed with the attitude of the post. The most important thing about scaling is to make sure the work your code does is proportional to the real event stream. Seems obvious enough, but lots of algorithms screw up that basic premise.
Any time based polling algorithm betrays this scaling philosophy because work is done every tick, independent of the events to be processed. You're always either doing unnecessary work or adding latency to real work by waiting for the next tick. The select() and poll() APIs also betray it, as they are proportional to the amount of potential work (number of file descriptors) instead of the amount of real work (number of active descriptors) - epoll() is a better answer there.
Event driven is the way to go.
Anyhow, back to the original poster. I knew netlink route sockets could do this - and I had used them in the past for similar purposes. I had to get "man 7 netlink" and google going to cobble together an example, and only then did I realize how difficult it is to get started with netlink - it just has not been very widely used and documented.
So the point of this post is to provide a little google juice documentation for event driven monitoring of new IPv4 addresses using netlink. At least this post has a full example - the code is below.
If you need to use this functionality, I recommend man sections 3 and 7 of both netlink and rtnetlink... and then go read the included header files and use my sample as a guide. In this basic way you can get address adds, removals, link state changes, route changes, interface changes, etc. - lots of good stuff. It is at the heart of the iproute tools (ip, ss, etc.) as well as most of the userspace routing software (zebra, xorp, vyatta, etc.).
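A minimal version of the idea - an event driven watcher for IPv4 address adds and removals - goes like this (a sketch along the lines of my sample, with error handling trimmed):

    /* Print IPv4 address additions/removals as they happen, via a
     * NETLINK_ROUTE socket subscribed to RTMGRP_IPV4_IFADDR. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <net/if.h>
    #include <linux/netlink.h>
    #include <linux/rtnetlink.h>

    int main(void)
    {
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        struct sockaddr_nl sa = {
            .nl_family = AF_NETLINK,
            .nl_groups = RTMGRP_IPV4_IFADDR,   /* address change events */
        };
        char buf[8192];

        bind(fd, (struct sockaddr *)&sa, sizeof(sa));
        for (;;) {
            int len = recv(fd, buf, sizeof(buf), 0);
            if (len <= 0)
                break;
            for (struct nlmsghdr *nh = (struct nlmsghdr *)buf;
                 NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
                if (nh->nlmsg_type != RTM_NEWADDR &&
                    nh->nlmsg_type != RTM_DELADDR)
                    continue;
                struct ifaddrmsg *ifa = NLMSG_DATA(nh);
                int rtl = IFA_PAYLOAD(nh);
                for (struct rtattr *rta = IFA_RTA(ifa);
                     RTA_OK(rta, rtl); rta = RTA_NEXT(rta, rtl)) {
                    if (rta->rta_type != IFA_LOCAL)
                        continue;
                    char name[IF_NAMESIZE], addr[INET_ADDRSTRLEN];
                    inet_ntop(AF_INET, RTA_DATA(rta), addr, sizeof(addr));
                    printf("%s %s/%d on %s\n",
                           nh->nlmsg_type == RTM_NEWADDR ? "added" : "removed",
                           addr, ifa->ifa_prefixlen,
                           if_indextoname(ifa->ifa_index, name));
                }
            }
        }
        return 0;
    }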
Saturday, April 5, 2008
Linux Selective Acknowledgment (SACK) CPU Overhead
Last year I tossed an e-mail from the Linux kernel networking list in my "projtodo" folder.
The mail talked about how the Linux TCP stack in particular, and probably all TCP stacks in general, had an excessive-CPU attack exposure when confronted with malicious SACK options. I found the mail intriguing but unsatisfying. It was well informed speculation, but it didn't have any hard data, nor was there any simple way to gather some. Readers of the other posts on this blog will know I really dig measurements. The issue at hand pretty obviously is a problem - but how much of one?
A few weeks ago I had the chance to develop some testing code and find out for myself - and IBM DeveloperWorks has published the summary of my little project. The executive summary is "it's kinda bad, but not a disaster, and hope is on the way". There is hard data and some pretty pictures in the article itself.
The coolest part of the whole endeavor, other than scratching the "I wonder" itch, was getting to conjure up a userspace TCP stack from raw sockets. It was, of course, woefully incomplete as it was just meant to trigger a certain behavior in its peer instead of being generally useful or reliable - but nonetheless entertaining.
Labels: algorithms, characterization, congestion control, linux, tcp
Tuesday, March 4, 2008
Calgary IOMMU - At What Price?
The Calgary IOMMU is a feature of most IBM X-Series (i.e. X86_64) blades and motherboards. If you aren't familiar with an IOMMU, it is strongly analogous to a regular MMU but applied to a DMA context. Their original primary use was for integrating 32 bit hardware into 64 bit systems. But another promising use for them is enforcing safe access in the same way an MMU can.
In normal userspace, if a program goes off the rails and accesses some memory it does not have permission for, a simple exception can be raised. This keeps the carnage restricted to the application that made the mistake. But if a hardware device does the same thing through DMA, whole-system disaster is sure to follow, as nothing prevents the accesses from happening. The IOMMU can provide that safety.
An IOMMU unit lets the kernel set up mappings much like normal memory page tables. Normal RAM mappings are cached with TLB entries, and IOMMU mappings are cached with TCE entries that play largely the same role.
By now, I've probably succeeded in rehashing what you already knew. At least it was just three paragraphs (well, now four).
The pertinent bit from a characterization standpoint is a paper from the 2007 Ottawa Linux Symposium. In The Price of Safety: Evaluating IOMMU Performance, Muli Ben-Yehuda of IBM and some co-authors from Intel and AMD do some measurements using the Calgary IOMMU, as well as the DART (which generally comes on Power based planars).
I love measurements! And it takes guts to post measurements like this - in its current incarnation on Linux the cost of safety from the IOMMU is a 30% to 60% increase in CPU! Gah!
Some drill down is required, and it turns out this is among the worst cases to measure. But still - 30 to 60%! The paper is short and well written - you should read it for yourself - but I will summarize the test more or less as "measure the CPU utilization while doing 1 Gbps of netperf network traffic - measure with and without iommu". The tests are also done with and without Xen, as IOMMU techniques are especially interesting to virtualization, but the basic takeaways are the same in virtualized or bare metal environments.
The "Why so Bad" conclusion is management of the TCE. The IOMMU, unlike the TLB cache of an MMU, only allows software to remove entries via a "flush it all" instruction. I have certainly measured that when TLBs need to be cleared during process switching that can be a very measurable event on overall system performance - it is one reason while an application broken into N threads runs faster than the same application broken into N processes.
But overall, this is actually an encouraging conclusion - hardware may certainly evolve to give more granular access to the TCE tables. And there are games that can be played on the management side in software that can reduce the number of flushes in return for giving up some of the safety guarantees.
Something to be watched.
Labels: characterization, hardware, linux, performance, recommendations
Thursday, June 14, 2007
Disk Drive Failure Characterization
I've admitted previously that I have a passion for characterization. When you really understand something you can be sure you are targeting the right problem, and the only way to do that with any certainty is data. Sometimes you've got to guess and make educated inferences, but way too many people guess when they should be measuring instead.
Val Henson highlights on lwn.net a couple of great hard drive failure rate characterization studies presented at the USENIX File Systems and Storage Technology Conference. They cast doubt on a couple pieces of conventional wisdom: hard drive infant mortality rates, and the effect of ambient temperature on drive lifetime. This isn't gospel - every characterization study is about a particular frame of reference - but it is still very, very interesting. Val Henson, as usual, does a fabulous job interpreting and showing us the most interesting stuff going on in the storage and file systems world.
Labels: characterization, disk, google, linux, lwn, performance