At 3:30 in the morning of January 10th, 2008, a shrill chirping woke up our system administrator, Michael Gorsuch, asleep at home in Brooklyn. It was a text message from Nagios, our network monitoring software, warning him that something was wrong.
He swung out of bed, accidentally knocking over (and waking up) the dog, sleeping soundly in her dog bed, who, angrily, staggered out to the hallway, peed on the floor, and then returned to bed. Meanwhile Michael logged onto his computer in the other room and discovered that one of the three data centers he runs, in downtown Manhattan, was unreachable from the Internet.
This particular data center is in a secure building in downtown Manhattan, in a large facility operated by Peer 1. It has backup generators, several days of diesel fuel, and racks and racks of batteries to keep the whole thing running for a few minutes while the generators can be started. It has massive amounts of air conditioning, multiple high speed connections to the Internet, and the kind of “right stuff” down-to-earth engineers who always do things the boring, plodding, methodical way instead of the flashy cool trendy way, so everything is pretty reliable.
Internet providers like Peer 1 like to guarantee the uptime of their services in terms of a Service Level Agreement, otherwise known as an SLA. A typical SLA might state something like “99.99% uptime.” When you do the math, let’s see, there are 525,949 minutes in a year (or 525,600 if you are in the cast of Rent), so that allows them 52.59 minutes of downtime per year. If they have any more downtime than that, the SLA usually provides for some kind of penalty, but honestly, it’s often rather trivial… like, you get your money back for the minutes they were down. I remember once getting something like $10 off the bill from a T1 provider because of a two day outage that cost us thousands of dollars. SLAs can be a little bit meaningless that way, and given how low the penalties are, a lot of network providers just started advertising 100% uptime.
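If you want to check that arithmetic for other uptime guarantees, here is a minimal sketch. The function name `downtime_budget_minutes` is made up for illustration, and the year length is simply the 525,949-minute figure used above.

```python
# Minutes in an average year, using the 525,949 figure from the text
# (365.2425 days * 24 hours * 60 minutes, rounded).
MINUTES_PER_YEAR = 525_949


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Return the minutes of downtime per year that an uptime guarantee allows."""
    allowed_fraction = 1 - uptime_percent / 100
    return MINUTES_PER_YEAR * allowed_fraction


if __name__ == "__main__":
    # Print the yearly downtime budget for a few common guarantees.
    for sla in (99.0, 99.9, 99.99, 100.0):
        print(f"{sla}% uptime allows {downtime_budget_minutes(sla):.2f} minutes/year")
```

For 99.99% this comes out to about 52.59 minutes, and for the "100% uptime" that providers like to advertise, the budget is exactly zero.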
Within 10 minutes everything seemed to be back to normal, and Michael went back to sleep.
Until about 5:00 a.m. This time Michael called the Peer 1 Network Operations Center (NOC) in Vancouver. They ran some tests, started investigating, couldn’t find anything wrong, and by 5:30 a.m. things seemed to be back to normal, but by this point, he was as nervous as a porcupine in a balloon factory.
At 6:15 a.m. the New York site lost all connectivity. Peer 1 couldn’t find anything wrong on their end.
Michael got dressed and took the subway into Manhattan. The server seemed to be up. The Peer 1 network connection was fine. The problem was something with the network switch. Michael temporarily took the switch out of the loop, connecting our router directly to Peer 1’s router, and lo and behold, we were back on the Internet.
By the time most of our American customers got to work in the morning, everything was fine. Our European customers had already started emailing us to complain. Michael spent some time doing a post-mortem, and discovered that the problem was a simple configuration problem on the switch. There are several possible speeds that a switch can use to communicate (10, 100, or 1000 megabits/second). You can either set the speed manually, or you can let the switch automatically negotiate the highest speed that both sides can work with. The switch that failed had been set to autonegotiate. This usually works, but not always, and on the morning of January 10th, it didn’t.
Michael knew this could be a problem, but when he installed the switch, he had forgotten to set the speed, so the switch was still in the factory-default autonegotiate mode, which seemed to work fine. Until it didn’t.
Michael wasn’t happy. He sent me an email:
I know that we don’t officially have an SLA for On Demand, but I would like us to define one for internal purposes (at least). It’s one way that I can measure if myself and the (eventual) sysadmin team are meeting the general goals for the business. I was in the slow process of writing up a plan for this, but want to expedite in light of this morning’s mayhem.
An SLA is generally defined in terms of ‘uptime’, so we need to define what ‘uptime’ is in the context of On Demand. Once that is made clear, it’ll get translated into policy, which will then be translated into a set of monitoring / reporting scripts, and will be reviewed on a regular interval to see if we are ‘doing what we say’.
Good idea!
But there are some problems with SLAs. The biggest one is the lack of statistical meaningfulness when outages are so rare. We’ve had, if I remember correctly, two unplanned outages, including this one, since going live with FogBugz on Demand six months ago. Only one was our fault. Most well-run online services will have two, maybe three outages a year. With so few data points, the length of the outage starts to become really significant, and that’s one of those things that’s wildly variable. Suddenly, you’re talking about how long it takes a human to get to the equipment and swap out a broken part. To get really high uptime, you can’t wait for a human to switch out failed parts. You can’t even wait for a human to figure out what went wrong: you have to have previously thought of every possible thing that can possibly go wrong, which is vanishingly improbable. It’s the unexpected unexpecteds, not the expected unexpecteds, that kill you.
Really high availability becomes extremely costly. The proverbial “six nines” availability (99.9999% uptime) means no more than about 31.5 seconds of downtime per year. That’s really kind of ridiculous. Even the people who claim that they have built some big multi-million dollar superduper ultra-redundant six nines system are gonna wake up one day, I don’t know when, but they will, and something completely unusual will have gone wrong in a completely unexpected way, three EMP bombs, one at each data center, and they’ll smack their heads and have fourteen days of outage.
Think of it this way: If your six nines system goes down mysteriously just once and it takes you an hour to figure out the cause and fix it, well, you’ve just blown your downtime budget for the next century. Even the most notoriously reliable systems, like AT&T’s long distance service, have had long outages (six hours in 1991) which put them at a rather embarrassing three nines … and AT&T’s long distance service is considered “carrier grade,” the gold standard for uptime.
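The numbers in this argument are easy to reproduce. Here is a rough sketch, assuming the same 525,949-minute year as before; the function names `budget_seconds` and `achieved_nines` are hypothetical helpers, not anything standard.

```python
import math

# The text's minutes-per-year figure, converted to seconds.
SECONDS_PER_YEAR = 525_949 * 60


def budget_seconds(nines: int) -> float:
    """Yearly downtime allowed by an availability of N nines (6 means 99.9999%)."""
    return SECONDS_PER_YEAR * 10 ** -nines


def achieved_nines(outage_seconds: float) -> int:
    """Whole nines of availability left after this much downtime in one year."""
    if outage_seconds <= 0:
        raise ValueError("outage_seconds must be positive")
    unavailability = outage_seconds / SECONDS_PER_YEAR
    return math.floor(-math.log10(unavailability))


if __name__ == "__main__":
    six_nines = budget_seconds(6)
    print(f"six nines budget: {six_nines:.1f} seconds/year")
    # How many years of six-nines budget a single one-hour outage uses up.
    print(f"one-hour outage: {3600 / six_nines:.0f} years of budget")
    # A six-hour outage in a single year, as in the AT&T example.
    print(f"six-hour outage leaves {achieved_nines(6 * 3600)} nines")
```

The six nines budget works out to roughly 31.6 seconds a year, so a single one-hour outage eats a bit more than a century of it, and a six-hour outage in one year leaves you at three nines.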
Keeping internet services online suffers from the problem of black swans. Nassim Taleb, who invented the term, defines it thus: “A black swan is an outlier, an event that lies beyond the realm of normal expectations.” Almost all internet outages are unexpected unexpecteds: extremely low-probability outlying surprises. They’re the kind of things that happen so rarely it doesn’t even make sense to use normal statistical methods like “mean time between failures.” What’s the “mean time between catastrophic floods in New Orleans?”
Measuring the number of minutes of downtime per year does not predict the number of minutes of downtime you’ll have the next year. It reminds me of commercial aviation today: the NTSB has done such a great job of eliminating all the common causes of crashes that nowadays, each commercial crash they investigate seems to be a crazy, one-off, black-swan outlier.
Somewhere between the “extremely unreliable” level of service, where it feels like stupid outages occur again and again and again, and the “extremely reliable” level of service, where you spend millions and millions of dollars getting an extra minute of uptime a year, there’s a sweet spot, where all the expected unexpecteds have been taken care of. A single hard drive failure, which is expected, doesn’t take you down. A single DNS server failure, which is expected, doesn’t take you down. But the unexpected unexpecteds might. That’s really the best we can hope for.
To reach this sweet spot, we borrowed an idea from Sakichi Toyoda, the founder of Toyota. He called it Five Whys. When something goes wrong, you ask why, again and again, until you ferret out the root cause. Then you fix the root cause, not the symptoms.
Since this fit well with our idea of fixing everything two ways, we decided to start using five whys ourselves. Here’s what Michael came up with:
- Our link to Peer1 NY went down
- Why? – Our switch appears to have put the port in a failed state
- Why? – After some discussion with the Peer1 NOC, we speculate that it was quite possibly caused by an Ethernet speed / duplex mismatch
- Why? – The switch interface was set to auto-negotiate instead of being manually configured
- Why? – We were fully aware of problems like this, and have been for many years. But – we do not have a written standard and verification process for production switch configurations.
- Why? – Documentation is often thought of as an aid for when the sysadmin isn’t around or for other members of the operations team, whereas, it should really be thought of as a checklist.
“Had we produced a written standard prior to deploying the switch and subsequently reviewed our work to match the standard, this outage would not have occurred,” Michael wrote. “Or, it would occur once, and the standard would get updated as appropriate.”
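To make the idea of a written standard plus a verification step concrete, here is a minimal sketch of what such a check could look like. Everything in it is an assumption for illustration: the setting names, the values in `PORT_STANDARD`, and the port inventory are made up, and it does not talk to a real switch.

```python
# Hypothetical written standard for production switch ports. The setting
# names and values are made up for illustration; they are not the actual standard.
PORT_STANDARD = {"speed_mbps": 1000, "duplex": "full", "autonegotiate": False}


def check_port(name: str, actual: dict) -> list[str]:
    """Return one human-readable finding per setting that deviates from the standard."""
    findings = []
    for setting, expected in PORT_STANDARD.items():
        found = actual.get(setting, "<not recorded>")
        if found != expected:
            findings.append(f"{name}: {setting} is {found!r}, standard says {expected!r}")
    return findings


if __name__ == "__main__":
    # Made-up inventory: one port left in the factory-default autonegotiate mode.
    ports = {
        "uplink-to-provider": {"speed_mbps": 1000, "duplex": "full", "autonegotiate": True},
        "router": {"speed_mbps": 1000, "duplex": "full", "autonegotiate": False},
    }
    for port_name, settings in ports.items():
        for finding in check_port(port_name, settings):
            print(finding)
```

The point of the sketch is the shape of the process, not the tooling: the standard is written down once, and every deployed configuration is reviewed against it, so a forgotten setting shows up as a finding instead of as a 3:30 a.m. text message.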
After some internal discussion we all agreed that rather than imposing a statistically meaningless measurement and hoping that the mere measurement of something meaningless would cause it to get better, what we really needed was a process of continuous improvement. Instead of setting up an SLA for our customers, we set up a blog where we would document every outage in real time, provide complete post-mortems, ask the five whys, get to the root cause, and tell our customers what we’re doing to prevent that problem in the future. In this case, the change is that our internal documentation will include detailed checklists for all operational procedures in the live environment.
Our customers can look at the blog to see what caused the problems and what we’re doing to make things better, and, hopefully, they can see evidence of steadily improving quality.
In the meantime, our customer service folks have the authority to credit customers’ accounts if they feel like they were affected by an outage. We let the customer decide how much they want to be credited, up to a whole month, because not every customer is even going to notice the outage, let alone suffer from it. I hope this system will improve our reliability to the point where the only outages we suffer are really the extremely unexpected black swans.

PS. Yes, we want to hire another system administrator so Michael doesn’t have to be the only one to wake up in the middle of the night.
This year when Neil approached me about co-sponsoring the conference, I thought, why not? It’s exactly the kind of conference I would organize if I were organizing a conference about the software business, which, thankfully, I’m not, but Neil is, and he’s doing a bang up job.
JavaSchools
I despaired of finding a company to work for where programmers were treated like talent and not like typists, and decided I would have to start my own. In those days, I was seeing lots of really dumb people with really dumb business plans making internet companies, and I thought, hey, if I can be, say, 10% less dumb than them, that should be easy, maybe I can make a company too, and in my company, we’d do things right for a change. We’d treat programmers with respect, we’d make high quality products, we wouldn’t take any shit from VCs or 24-year-olds playing President, we’d care about our customers and solve their problems when they called, instead of blaming everything on Microsoft, and we’d let our customers decide whether or not to pay us. At Fog Creek we’ll give anyone their money back with no questions asked under any circumstances whatsoever. Keeps us honest.
What I do on Joel on Software—writing articles about somewhat technical topics—is something I learned here in the CS department, too. Here’s the story behind that. In 1989 Yale was pretty good at AI, and one of the big name professors,
And despite the fact that CS115 didn’t count towards the major, all this experience writing about slightly technical topics turned out to be the most useful thing I got out of the CS department. Being able to write clearly on technical topics is the difference between being a grunt individual contributor programmer and being a leader. My first job at Microsoft was as a program manager on the Excel team, writing the technical specification for this huge programming system called Visual Basic for Applications. This document was something like 500 pages long, and every morning literally hundreds of people came into work and read my spec to figure out what to do next. That included programmers, testers, marketing people, documentation writers, and localizers around the world. I noticed that the really good program managers at Microsoft were the ones who could write really well. Microsoft flipped its corporate strategy 180 degrees based on a single compelling email that Steve Sinofsky wrote called
After a few years in Redmond, Washington, during which I completely failed to adapt to my environment, I beat a hasty retreat to New York City. I stayed on with Microsoft in New York for a few months, where I was a complete and utter failure as a consultant at Microsoft Consulting, and then I spent a few years in the mid-90s, when the Internet was first starting to happen, at Viacom. That’s this big corporate conglomerate which owned MTV, VH1, Nickelodeon, Blockbuster, Paramount Studios, Comedy Central, CBS, and a bunch of other entertainment companies. New York was the first place I got to see what most computer programmers do for a living. It’s this scary thing called “in house software.” It’s terrifying. You never want to do in house software. You’re a programmer for a big corporation that makes, oh, I don’t know, aluminum cans, and there’s nothing quite available off the shelf which does the exact kind of aluminum can processing that they need, so they have these in-house programmers, or they hire companies like Accenture and IBM to send them overpriced programmers, to write this software. And there are two reasons this is so frightening: one, because it’s not a very fulfilling career if you’re a programmer, for a list of reasons which I’ll enumerate in a moment, but two, it’s frightening because this is what probably 80% of programming jobs are like, and if you’re not very, very careful when you graduate, you might find yourself working on in-house software, by accident, and let me tell you, it can drain the life out of you.
Number one. You never get to do things the right way. You always have to do things the expedient way. It costs so much money to hire these programmers—typically a company like Accenture or IBM would charge $300 an hour for the services of some recent Yale PoliSci grad who took a 6 week course in dot net programming, and who is earning $47,000 a year and hoping that it’ll provide enough experience to get into business school—anyway, it costs so much to hire these programmers that you’re not going to be allowed to build things with Ruby on Rails no matter how cool Ruby is and no matter how spiffy the Ajax is going to be. You’re going into Visual Studio, you’re going to click on the wizard, you’re going to drag the little Grid control onto the page, you’re going to hook it up to the database, and presto, you’re done. It’s good enough. Get out of there and onto the next thing. That’s the second reason these jobs suck: as soon as your program gets good enough, you have to stop working on it. Once the core functionality is there, the main problem is solved, there is absolutely no return-on-investment, no business reason to make the software any better. So all of these in house programs look like a dog’s breakfast: because it’s just not worth a penny to make them look nice. Forget any pride in workmanship or craftsmanship you learned in
Number three: when you’re a programmer at a software company, the work you’re doing is directly related to the way the company makes money. That means, for one thing, that management cares about you. It means you get the best benefits and the nicest offices and the best chances for promotion. A programmer is never going to rise to become CEO of Viacom, but you might well rise to become CEO of a tech company.
Juno was, allegedly, supported by advertising. It turned out that advertising to the kinds of people who won’t pay $20 a month for AOL is not exactly the most lucrative business in the world, so in reality, Juno was supported by rich investors. But at least Juno was a product company where programmers were held in high regard, and I felt good about their mission to provide email to everyone. And indeed I worked there happily for about three years as a C++ programmer. Eventually, though, I started to discover that the management philosophy at Juno was
I graduated with a B.S. in Computer Science in 1991. Sixteen years ago. What I’m going to try to do today is relate my undergraduate years in the CS department to my career, which consists of developing software, writing about software, and starting a software company. And of course that’s a little bit absurd; there’s a famous part at the beginning of MIT’s Introduction to Computer Science where
For a moment there, I actually thought I’d get a PhD. Both my parents are professors. So many of their friends were academics that I grew up assuming that all adults eventually got PhDs. In any case, I was thinking pretty seriously of going on to graduate school in Computer Science. Until I tried to take a class in Dynamic Logic right here in this very department. It was taught by
And when all was said and done, she got to the end of the proof, and somehow was getting exactly the opposite result of the one that made sense, until that same graduate student pointed out where, 63 steps earlier, some bit had been accidentally flipped due to a little bit of dirt on the board, and all was well.
You will frequently hear the claim that software engineering is facing a quality crisis of some sort. I don’t happen to agree with that claim—the computer software most people use most of the time is of ridiculously high quality compared to everything else in their lives—but that’s beside the point. This claim about the “quality crisis” leads to a lot of proposals and research about making higher quality software. And at this point, the world divides into the geeks and the suits.
Now, the geeks are interested in the narrowly technical aspects of quality. They focus on things they can see in the code, rather than waiting for the users to judge. They’re programmers, so they try to automate everything in their life, and of course they try to automate the QA process. This is how you get unit testing, which is not a bad thing, don’t get me wrong, and it’s how you get all these attempts to mechanically “prove” that a program is “correct.” The trouble is that anything that can’t be automated has to be thrown out of the definition of quality. Even though we know that users prefer software that looks cooler, there’s no automated way to measure how cool looking a program is, so that gets left out of the automated QA process.
So, the bottom line is that if there really were a mechanical way to prove things about the correctness of a program, all you’d be able to prove is whether that program is identical to some other program that must contain the same amount of entropy as the first program, otherwise some of the behaviors are going to be undefined, and thus unproven. So now the spec writing is just as hard as writing a program, and all you’ve done is moved one problem from over here to over there, and accomplished nothing whatsoever.
I don’t hate Microsoft, really I don’t. In fact, my first job out of school was actually at Microsoft. In those days it was not really a respectable place to work. Sort of like taking a job in the circus. People looked at you funny. Really? Microsoft? On campus, in particular, it was perceived as corporate, boring, buttoned-down, making inferior software so that accountants can do, oh I don’t know, spreadsheets or whatever it is that accountants do. Perfectly miserable. And it all ran on a pathetic single-tasking operating system called MS-DOS full of arbitrary stupid limitations like 8-character file names and no email and no telnet and no Usenet. Well, MS-DOS is long gone, but the cultural