Posts by Joel Spolsky

In 2000 I co-founded Fog Creek Software, where we created lots of cool things like the FogBugz bug tracker, Trello, and Glitch. I also worked with Jeff Atwood to create Stack Overflow and served as CEO of Stack Overflow from 2010-2019. Today I serve as the chairman of the board for Stack Overflow, Glitch, and HASH.

January 22, 2008 by Joel Spolsky

Five whys

Fog Creek Software, News

At 3:30 in the morning of January 10th, 2008, a shrill chirping woke up our system administrator, Michael Gorsuch, asleep at home in Brooklyn. It was a text message from Nagios, our network monitoring software, warning him that something was wrong.

He swung out of bed, accidentally knocking over (and waking up) the dog, sleeping soundly in her dog bed, who, angrily, staggered out to the hallway, peed on the floor, and then returned to bed. Meanwhile Michael logged onto his computer in the other room and discovered that one of the three data centers he runs, in downtown Manhattan, was unreachable from the Internet.

This particular data center is in a secure building in downtown Manhattan, in a large facility operated by Peer 1. It has backup generators, several days of diesel fuel, and racks and racks of batteries to keep the whole thing running for a few minutes while the generators can be started. It has massive amounts of air conditioning, multiple high speed connections to the Internet, and the kind of “right stuff” down-to-earth engineers who always do things the boring, plodding, methodical way instead of the flashy cool trendy way, so everything is pretty reliable.

Internet providers like Peer 1 like to guarantee the uptime of their services in terms of a Service Level Agreement, otherwise known as an SLA. A typical SLA might state something like “99.99% uptime.” When you do the math, let’s see, there are 525,949 minutes in a year (or 525,600 if you are in the cast of Rent), so that allows them 52.59 minutes of downtime per year. If they have any more downtime than that, the SLA usually provides for some kind of penalty, but honestly, it’s often rather trivial… like, you get your money back for the minutes they were down. I remember once getting something like $10 off the bill once from a T1 provider because of a two day outage that cost us thousands of dollars. SLAs can be a little bit meaningless that way, and given how low the penalties are, a lot of network providers just started advertising 100% uptime.

Within 10 minutes everything seemed to be back to normal, and Michael went back to sleep.

Until about 5:00 a.m. This time Michael called the Peer 1 Network Operations Center (NOC) in Vancouver. They ran some tests, started investigating, couldn’t find anything wrong, and by 5:30 a.m. things seemed to be back to normal, but by this point, he was as nervous as a porcupine in a balloon factory.

At 6:15 a.m. the New York site lost all connectivity. Peer 1 couldn’t find anything wrong on their end. Michael got dressed and took the subway into Manhattan. The server seemed to be up. The Peer1 network connection was fine. The problem was something with the network switch. Michael temporarily took the switch out of the loop, connecting our router directly to Peer 1’s router, and lo and behold, we were back on the Internet.

By the time most of our American customers got to work in the morning, everything was fine. Our European customers had already started emailing us to complain. Michael spent some time doing a post-mortem, and discovered that the problem was a simple configuration problem on the switch. There are several possible speeds that a switch can use to communicate (10, 100, or 1000 megabits/second). You can either set the speed manually, or you can let the switch automatically negotiate the highest speed that both sides can work with. The switch that failed had been set to autonegotiate. This usually works, but not always, and on the morning of January 10th, it didn’t.

Michael knew this could be a problem, but when he installed the switch, he had forgotten to set the speed, so the switch was still in the factory-default autonegotiate mode, which seemed to work fine. Until it didn’t.

Michael wasn’t happy. He sent me an email:

I know that we don’t officially have an SLA for On Demand, but I would like us to define one for internal purposes (at least). It’s one way that I can measure if myself and the (eventual) sysadmin team are meeting the general goals for the business. I was in the slow process of writing up a plan for this, but want to expedite in light of this morning’s mayhem.

An SLA is generally defined in terms of ‘uptime’, so we need to define what ‘uptime’ is in the context of On Demand. Once that is made clear, it’ll get translated into policy, which will then be translated into a set of monitoring / reporting scripts, and will be reviewed on a regular interval to see if we are ‘doing what we say’.

Good idea!

But there are some problems with SLAs. The biggest one is the lack of statistical meaningfulness when outages are so rare. We’ve had, if I remember correctly, two unplanned outages, including this one, since going live with FogBugz on Demand six months ago. Only one was our fault. Most well-run online services will have two, maybe three outages a year. With so few data points, the length of the outage starts to become really significant, and that’s one of those things that’s wildly variable. Suddenly, you’re talking about how long it takes a human to get to the equipment and swap out a broken part. To get really high uptime, you can’t wait for a human to switch out failed parts. You can’t even wait for a human to figure out what went wrong: you have to have previously thought of every possible thing that can possibly go wrong, which is vanishingly improbable. It’s the unexpected unexpecteds, not the expected unexpecteds, that kill you.

Really high availability becomes extremely costly. The proverbial “six nines” availability (99.9999% uptime) means no more than 30 seconds downtime per year. That’s really kind of ridiculous. Even the people who claim that they have built some big multi-million dollar superduper ultra-redundant six nines system are gonna wake up one day, I don’t know when, but they will, and something completely unusual will have gone wrong in a completely unexpected way, three EMP bombs, one at each data center, and they’ll smack their heads and have fourteen days of outage.

Think of it this way: If your six nines system goes down mysteriously just once and it takes you an hour to figure out the cause and fix it, well, you’ve just blown your downtime budget for the next century. Even the most notoriously reliable systems, like AT&T’s long distance service, have had long outages (six hours in 1991) which put them at a rather embarrassing three nines … and AT&T’s long distance service is considered “carrier grade,” the gold standard for uptime.

Keeping internet services online suffers from the problem of black swans. Nassim Taleb, who invented the term, defines it thus: “A black swan is an outlier, an event that lies beyond the realm of normal expectations.” Almost all internet outages are unexpected unexpecteds: extremely low-probability outlying surprises. They’re the kind of things that happen so rarely it doesn’t even make sense to use normal statistical methods like “mean time between failure.” What’s the “mean time between catastrophic floods in New Orleans?”

Measuring the number of minutes of downtime per year does not predict the number of minutes of downtime you’ll have the next year. It reminds me of commercial aviation today: the NTSB has done such a great job of eliminating all the common causes of crashes that nowadays, each commercial crash they investigate seems to be a crazy, one-off, black-swan outlier.

Somewhere between the “extremely unreliable” level of service, where it feels like stupid outages occur again and again and again, and the “extremely reliable” level of service, where you spend millions and millions of dollars getting an extra minute of uptime a year, there’s a sweet spot, where all the expected unexpecteds have been taken care of. A single hard drive failure, which is expected, doesn’t take you down. A single DNS server failure, which is expected, doesn’t take you down. But the unexpected unexpecteds might. That’s really the best we can hope for.

To reach this sweet spot, we borrowed an idea from Sakichi Toyoda, the founder of Toyota. He calls it Five Whys. When something goes wrong, you ask why, again and again, until you ferret out the root cause. Then you fix the root cause, not the symptoms.

Since this fit well with our idea of fixing everything two ways, we decided to start using five whys ourselves. Here’s what Michael came up with:

Our link to Peer1 NY went down
Why? – Our switch appears to have put the port in a failed state
Why? – After some discussion with the Peer1 NOC, we speculate that it was quite possibly caused by an Ethernet speed / duplex mismatch
Why? – The switch interface was set to auto-negotiate instead of being manually configured
Why? – We were fully aware of problems like this, and have been for many years. But – we do not have a written standard and verification process for production switch configurations.
Why? – Documentation is often thought of as an aid for when the sysadmin isn’t around or for other members of the operations team, whereas, it should really be thought of as a checklist.

“Had we produced a written standard prior to deploying the switch and subsequently reviewed our work to match the standard, this outage would not have occurred,” Michael wrote. “Or, it would occur once, and the standard would get updated as appropriate.”

After some internal discussion we all agreed that rather than imposing a statistically meaningless measurement and hoping that the mere measurement of something meaningless would cause it to get better, what we really needed was a process of continuous improvement. Instead of setting up a SLA for our customers, we set up a blog where we would document every outage in real time, provide complete post-mortems, ask the five whys, get to the root cause, and tell our customers what we’re doing to prevent that problem in the future. In this case, the change is that our internal documentation will include detailed checklists for all operational procedures in the live environment.

Our customers can look at the blog to see what caused the problems and what we’re doing to make things better, and, hopefully, they can see evidence of steadily improving quality.

In the meantime, our customer service folks have the authority to credit customers’ accounts if they feel like they were affected by an outage. We let the customer decide how much they want to be credited, up to a whole month, because not every customer is even going to notice the outage, let alone suffer from it. I hope this system will improve our reliability to the point where the only outages we suffer are really the extremely unexpected black swans.

PS. Yes, we want to hire another system administrator so Michael doesn’t have to be the only one to wake up in the middle of the night.

January 20, 2008 by Joel Spolsky

This year’s Business of Software conference

News

The best conference I went to last year was the Business of Software Conference, organized by Neil Davidson of Red Gate Software over in Cambridge, England. It had a great lineup of speakers [PDF] including Guy Kawasaki, Eric Sink, Tim Lister (coauthor of Peopleware), Rick Chapman, Hugh MacLeod, and others, a great collection of real software businesses in attendance, and almost no fluff. Plus, my first 5.6 earthquake, experienced from the top of a highrise hotel. Great fun.

This year when Neil approached me about co-sponsoring the conference, I thought, why not? It’s exactly the kind of conference I would organize if I were organizing a conference about the software business, which, thankfully, I’m not, but Neil is, and he’s doing a bang up job.

So this year, it’s going to be called “Business of Software 2008: A Joel on Software Conference.” It’ll almost certainly be in Boston some time in the fall, but nothing is even remotely final yet. Sign up for the mailing list at that site, and they’ll let you know when a time and place are set.

January 8, 2008 by Joel Spolsky

Undergraduate programming

News

From CrossTalk, The Journal of Defense Software Engineering: “It is our view that Computer Science (CS) education is neglecting basic skills, in particular in the areas of programming and formal methods. We consider that the general adoption of Java as a first programming language is in part responsible for this decline.”

JavaSchools are not operating in a vacuum: they’re dumbing down their curriculum because they think it’s the only way to keep CS students. The real problem is that these schools are not doing anything positive to attract the kids who are really interesting in programming, not computer science. I think the solution would be to create a programming-intensive BFA in Software Development–a Julliard for programmers. Such a program would consist of a practical studio requirement developing significant works of software on teams with very experienced teachers, with a sprinkling of liberal arts classes for balance. It would be a huge magnet to the talented high school kids who love programming, but can’t get excited about proving theorums.

When I said BFA, Bachelor of Fine Arts, I meant it: software development is an art, and the existing Computer Science education, where you’re expected to learn a few things about NP completeness and Quicksort is singularly inadequate to training students how to develop software.

Imagine instead an undergraduate curriculum that consists of 1/3 liberal arts, and 2/3 software development work. The teachers are experienced software developers from industry. The studio operates like a software company. You might be able to major in Game Development and work on a significant game title, for example, and that’s how you spend most of your time, just like a film student spends a lot of time actually making films and the dance students spend most of their time dancing.

There are already several programs going in this direction: a lot of Canadian universities, notably Waterloo, have Software Engineering programs, and in Indiana, Rose-Hulman combines a good software engineering program with a co-op program called Rose-Hulman Ventures. These programs have no problem attracting lots of qualified students at a time when the Ivy League CS departments consider themselves lucky if they get a dozen majors a year.

In the meantime, think about how many computer science departments earned their reputation by writing an important piece of code: MIT’s X Window, Athena, and Lisp Machine; CMU’s Andrew File System, Mach, and Lycos; Berkeley’s Unix; the University of Kansas’ Lynx; Columbia’s Kermit. Where are those today? What have the universities given us lately? What’s the best college for a high school senior who really loves programming but isn’t so excited about lambda calculus?

January 5, 2008 by Joel Spolsky

Voting machine bugs

News

Clive Thompson, writing in The New York Times: “In 2005, the state of California complained that the machines were crashing. In tests, Diebold determined that when voters tapped the final “cast vote” button, the machine would crash every few hundred ballots. They finally intuited the problem: their voting software runs on top of Windows CE, and if a voter accidentally dragged his finger downward while touching “cast vote” on the screen, Windows CE interpreted this as a “drag and drop” command. The programmers hadn’t anticipated that Windows CE would do this, so they hadn’t programmed a way for the machine to cope with it. The machine just crashed.”

December 24, 2007 by Joel Spolsky

Travel Tips

News

Inc.com published my list of travel tips from the World Tour. You’ll learn how we completely avoided air travel snafus, what equipment we brought along, and more.

December 24, 2007 by Joel Spolsky

Getting Lookout to run on Outlook 2007 again

News

The search feature in Microsoft Outlook 2007, frankly, sucks big time.

It’s slow. Searches take about 30 seconds for me. (I have about 10 years of email).

You have to wait for it to fail to find things in your inbox before you’re permitted to search elsewhere, even if you know the message isn’t in your inbox.

The search quality is atrocious. I regularly get 50% garbage results mixed in that have nothing in common with my search terms, and the message I am looking for often doesn’t come up.

The patch helps, but it still takes around 30 seconds to do a search.

It didn’t use to be that way. A few years ago, there was a great add-in called Lookout for Outlook, based on Lucene.NET. Searches always took less than a second.

The tiny company that made Lookout was bought by Microsoft. It must have been one of those HR acquisitions, because the Lookout technology was thrown away. Mike Belshe only spent a couple of years at Microsoft before moving on.

When Outlook 2007 came out, it disabled Lookout, and allegedly this wasn’t supposed to be a big deal because Outlook 2007 has search “built in.” But the built-in search is, as mentioned, ghastly.

Last week I had finally had enough. I can’t work like this. I spent some time searching on the net and found that the original author of Lookout, Mike Belshe, had just posted instructions for getting Lookout to work on Outlook 2007.

They worked! Lookout is back!

It’s fast! The first search takes about a second. After that something seems to be cached in memory and further searches appear as fast as you hit the “enter” key.

December 6, 2007 by Joel Spolsky

Where there’s muck, there’s brass

Product manager, News

When I was a kid working in the bread factory, my nemesis was dough. It was sticky and hard to remove and it got everywhere. I got home with specks of dough in my hair. Every shift included a couple of hours of scraping dough off of machinery. I carried dough-scrapers in my back pocket. Sometimes a huge lump of dough would go flying someplace where it shouldn’t and gum up everything. I had dough nightmares.

I worked in the production side of the factory. The other side did packing and shipping. Their nemesis was crumbs. Crumbs got everywhere. The shipping crew went home with crumbs in their hair. Every shift included a couple of hours of brushing crumbs out of machinery. They carried little brushes in their back pockets. I’m sure they had crumb nightmares, too.

Pretty much any job that you can get paid for includes dealing with one gnarly problem. If you don’t have dough or crumbs to deal with, maybe you work in a razor blade factory and go home with little cuts all over your fingers. Maybe you work for VMWare and have nightmares about emulating bugs in sophisticated video cards that games rely on. Maybe you work on Windows, and your nightmare is that the simplest change can cause millions of old programs and hardware devices to stop working. That’s the gnarly part of your job.

One of our gnarly problems is getting FogBugz to run on our customers’ own servers. Jason Fried over at 37signals has a good summary of why this is no fun: “You have to deal with endless operating environment variations that are out of your control. When something goes wrong it’s a lot harder to figure out why if you aren’t in control of the OS or the third party software or hardware that may be interfering with the install, upgrade, or general performance of your product. This is even more complicated with remote server installs when there may be different versions of Ruby, Rails, MYSQL, etc. at play.” Jason concludes that if they had to sell installable software, they “definitely wouldn’t be as happy.” Yep. Work that makes you unhappy is what I mean by “a gnarly problem.”

The trouble is, the market pays for solutions to gnarly problems, not solutions to easy problems. As the Yorkshire lads say, “Where there’s muck, there’s brass.”

We offer both kinds of FogBugz–hosted and installable–and our customers opt 4 to 1 to install it at their own site. For us, the installable option gives us five times the sales. It costs us an extra salary or two (in tech support costs). It also means we have to use Wasabi, which has some serious disadvantages compared to off-the-shelf programming languages, but which we found to be the most cost-effective and efficient way, given our code base, to ship software that is installable on Windows, Linux, and Mac. Boy, I would love nothing more than to scrap installable FogBugz and run everything on our servers… we’ve got racks and racks of nice, well-managed Dell servers with plenty of capacity and our tech support costs for the hosted version are zero. Life would be much easier. But we’d be making so much less money we’d be out of business.

The one thing that so many of today’s cute startups have in common is that all they have is a simple little Ruby-on-Rails Ajax site that has no barriers to entry and doesn’t solve any gnarly problems. So many of these companies feel insubstantial and fluffy, because, out of necessity (the whole company is three kids and an iguana), they haven’t solved anything difficult yet. Until they do, they won’t be solving problems for people. People pay for solutions to their problems.

Making an elegantly-designed and easy-to-use application is just as gnarly, even though, like good ballet, it seems easy when done well. Jason and 37signals put effort into good design and get paid for that. Good design seems like the easiest thing to copy, but, watching Microsoft trying to copy the iPod, turns out to be not-so-easy. Great design is a gnarly problem, and can actually provide surprisingly sustainable competitive advantage.

Indeed Jason probably made a good choice by picking the gnarly problem where he has a lot of talent (design) to solve, because it doesn’t seem like a chore to him. I’ve been a Windows programmer for ages, so making a Windows Setup program for FogBugz, from scratch in C++ doing all kinds of gnarly COM stuff, doesn’t seem like a chore to me.

The only way to keep growing–as a person and as a company–is to keep expanding the boundaries of what you’re good at. At some point, the 37signals team might decide that hiring one person to write the Setup script and do installation support would pay for itself, and generate substantially more profit than it costs. So unless they deliberately want to keep the company small, which is a perfectly legitimate desire, they might eventually lose their reluctance to do things that seem gnarly.

Or maybe they won’t. There’s nothing wrong with choosing the fun part of your business to work on. I’ve certainly been guilty of that. And there’s nothing wrong with deciding that you only want to solve a specific set of problems for a small, select group of people. Salesforce has managed to become big enough by sticking to hosted software. And there are plenty of smaller software shops providing a fantastic lifestyle for their crew with no desire to get any bigger.

But the great thing is that as you solve each additional gnarly problem, your business and market grows substantially. Good marketing, good design, good sales, good support, and solving lots of problems for customers all amplify each other. You start out with good design, then you add some good features and triple your customer base by solving lots of problems, and then you do some marketing and triple your customer base again because now lots of people learn about your solution to their pain, and then you hire sales people and triple your customer base yet again because now the people who know about your solution are reminded to actually buy it, and then you add more features to solve more problems for even more people, and eventually you actually have a chance to reach enough people with your software to make the world a better place.

P.S. I’m not claiming here that 37signals would sell 5 times as many copies if they offered Installable Basecamp. First of all, one of the reasons we may sell so many more installable versions of FogBugz is that it appears, to some customers, to be cheaper. (It’s not cheaper in the long run because you have to pay for the server and administer it yourself, but that’s subjective.) Also, our support costs for the installable version are only as low as they are because 80% of our customers opt to run on Windows Server. Because Windows systems are so similar, it’s much easier for us to support the lowest common denominator. The vast majority of our tech support costs are caused by the diversity in Unix platforms out there–I’d guess that the 20% of our Unix sales result in 80% of our support incidents. If an installable version of Basecamp required Unix, the support cost would be disproportionately expensive compared to a hypothetical installable Windows version. Finally, another reason our experience might not translate to 37signals is that we’ve been selling installable software for seven years now; the hosted version has only been out for about six months. So we have a big installed base used to running FogBugz on their own servers. If you only look at new FogBugz customers, the ratio of installable to hosted goes down to 3 to 1.

December 5, 2007 by Joel Spolsky

Talk at Yale: Part 3 of 3

New developer, News

This is part three of the text of a talk delivered to the Yale Computer Science department on November 28. Part one and part two already appeared.

I despaired of finding a company to work for where programmers were treated like talent and not like typists, and decided I would have to start my own. In those days, I was seeing lots of really dumb people with really dumb business plans making internet companies, and I thought, hey, if I can be, say, 10% less dumb than them, that should be easy, maybe I can make a company too, and in my company, we’d do things right for a change. We’d treat programmers with respect, we’d make high quality products, we wouldn’t take any shit from VCs or 24-year-olds playing President, we’d care about our customers and solve their problems when they called, instead of blaming everything on Microsoft, and we’d let our customers decide whether or not to pay us. At Fog Creek we’ll give anyone their money back with no questions asked under any circumstances whatsoever. Keeps us honest.

So, it was the summer of 2000, and I had taken some time off from work while I hatched the plans for Fog Creek Software and went to the beach a lot. During that period I started writing up some of the things I had learned over the course of my career on a website called Joel on Software. In those early days before blogs were invented, a programmer named Dave Winer had set up a system called EditThisPage.com where anyone could post things to the web in a sort-of blog like format. Joel on Software grew quickly and gave me a pulpit where I could write about software development and actually get some people to pay attention to what I was saying. The site consists of fairly unoriginal thoughts, combined with jokes. It was successful because I used a slightly larger font than the average website, making it easy to read. It’s always hard to figure out how many people read the site, especially when you don’t bother counting them, but typical articles on that site get read by somewhere between 100,000 and a million people, depending on how popular the topic is.

What I do on Joel on Software—writing articles about somewhat technical topics—is something I learned here in the CS department, too. Here’s the story behind that. In 1989 Yale was pretty good at AI, and one of the big name professors, Roger Schank, came and gave a little talk at Hillel about some of his AI theories about scripts and schemas and slots and all that kind of stuff. Now essentially, I suspect from reading his work that it was the same speech he’d been giving for twenty years, and he had spent twenty years of his career writing little programs using these theories, presumably to test them, and they didn’t work, but somehow the theories never got discarded. He did seem like a brilliant man, and I wanted to take a course with him, but he was well known for hating undergraduates, so the only option was to take this course called Algorithmic Thinking—CS115—basically, a watered-down gut group IV class designed for poets. It was technically in the CS department, but the faculty was so completely unimpressed that you were not allowed to count it towards a CS major. Although it was the largest class by enrollment in the CS department, I cringed every time I heard my history-major friends referring to the class as “computer science.” A typical assignment was to write an essay on whether machines can think or not. You can see why we weren’t allowed to count it towards a CS degree. In fact, I would not be entirely surprised if you revoke my degree today, retroactively, upon learning that I took this class.

The best thing about Algorithmic Thinking was that you had to write a lot. There were 13 papers—one every week. You didn’t get grades. Well, you did. Well, ok, there’s a story there. One of the reasons Schank hated undergrads so much was that they were obsessed with grades. He wanted to talk about whether computers could think and all undergrads wanted to talk about was why their paper got a B instead of an A. At the beginning of the term, he made a big speech about how grades are evil, and decided that the only grade you could get on a paper was a little check mark to signify that some grad student read it. Over time, he wanted to recognize the really good papers, so he added check-PLUS, and then there were some really lame papers, so he started giving out check-minuses, and I think I got a check-plus-plus once. But grades: never.

And despite the fact that CS115 didn’t count towards the major, all this experience writing about slightly technical topics turned out to be the most useful thing I got out of the CS department. Being able to write clearly on technical topics is the difference between being a grunt individual contributor programmer and being a leader. My first job at Microsoft was as a program manager on the Excel team, writing the technical specification for this huge programming system called Visual Basic for Applications. This document was something like 500 pages long, and every morning literally hundreds of people came into work and read my spec to figure out what to do next. That included programmers, testers, marketing people, documentation writers, and localizers around the world. I noticed that the really good program managers at Microsoft were the ones who could write really well. Microsoft flipped its corporate strategy 180 degrees based on a single compelling email that Steve Sinofsky wrote called Cornell is Wired. The people who get to decide the terms of the debate are the ones who can write. The C programming language took over because The C Programming Language was such a great book.

So anyway, those were the highlights of CS. CS 115, in which I learned to write, one lecture in Dynamic Logic, in which I learned not to go to graduate school, and CS 322, in which I learned the rites and rituals of the Unix church and had a good time writing a lot of code. The main thing you don’t learn with a CS degree is how to develop software, although you will probably build up certain muscles in your brain that may help you later if you decide that developing software is what you want to do. The other thing you can do, if you want to learn how to develop software, is send your resume to [email protected], and apply for a summer internship, and we’ll teach you a thing or two about the subject.

Thank you very much for your time.

December 4, 2007 by Joel Spolsky

Talk at Yale: Part 2 of 3

New developer, News

This is part two of the text of a talk delivered to the Yale Computer Science department on November 28. Part one appeared yesterday.

After a few years in Redmond, Washington, during which I completely failed to adapt to my environment, I beat a hasty retreat to New York City. I stayed on with Microsoft in New York for a few months, where I was a complete and utter failure as a consultant at Microsoft Consulting, and then I spent a few years in the mid-90s, when the Internet was first starting to happen, at Viacom. That’s this big corporate conglomerate which owned MTV, VH1, Nickelodeon, Blockbuster, Paramount Studios, Comedy Central, CBS, and a bunch of other entertainment companies. New York was the first place I got to see what most computer programmers do for a living. It’s this scary thing called “in house software.” It’s terrifying. You never want to do in house software. You’re a programmer for a big corporation that makes, oh, I don’t know, aluminum cans, and there’s nothing quite available off the shelf which does the exact kind of aluminum can processing that they need, so they have these in-house programmers, or they hire companies like Accenture and IBM to send them overpriced programmers, to write this software. And there are two reasons this is so frightening: one, because it’s not a very fulfilling career if you’re a programmer, for a list of reasons which I’ll enumerate in a moment, but two, it’s frightening because this is what probably 80% of programming jobs are like, and if you’re not very, very careful when you graduate, you might find yourself working on in-house software, by accident, and let me tell you, it can drain the life out of you.

OK, so, why does it suck to be an in house programmer.

Number one. You never get to do things the right way. You always have to do things the expedient way. It costs so much money to hire these programmers—typically a company like Accenture or IBM would charge $300 an hour for the services of some recent Yale PoliSci grad who took a 6 week course in dot net programming, and who is earning $47,000 a year and hoping that it’ll provide enough experience to get into business school—anyway, it costs so much to hire these programmers that you’re not going to allowed to build things with Ruby on Rails no matter how cool Ruby is and no matter how spiffy the Ajax is going to be. You’re going into Visual Studio, you’re going to click on the wizard, you’re going to drag the little Grid control onto the page, you’re going to hook it up to the database, and presto, you’re done. It’s good enough. Get out of there and onto the next thing. That’s the second reason these jobs suck: as soon as your program gets good enough, you have to stop working on it. Once the core functionality is there, the main problem is solved, there is absolutely no return-on-investment, no business reason to make the software any better. So all of these in house programs look like a dog’s breakfast: because it’s just not worth a penny to make them look nice. Forget any pride in workmanship or craftsmanship you learned in CS323. You’re going to churn out embarrassing junk, and then, you’re going to rush off to patch up last year’s embarrassing junk which is starting to break down because it wasn’t done right in the first place, twenty-seven years of that and you get a gold watch. Oh, and they don’t give gold watches any more. 27 years and you get carpal tunnel syndrome. Now, at a product company, for example, if you’re a software developer working on a software product or even an online product like Google or Facebook, the better you make the product, the better it sells. The key point about in-house development is that once it’s “good enough,” you stop. When you’re working on products, you can keep refining and polishing and refactoring and improving, and if you work for Facebook, you can spend a whole month optimizing the Ajax name-choosing gizmo so that it’s really fast and really cool, and all that effort is worthwhile because it makes your product better than the competition. So, the number two reason product work is better than in-house work is that you get to make beautiful things.

Number three: when you’re a programmer at a software company, the work you’re doing is directly related to the way the company makes money. That means, for one thing, that management cares about you. It means you get the best benefits and the nicest offices and the best chances for promotion. A programmer is never going to rise to become CEO of Viacom, but you might well rise to become CEO of a tech company.

Anyway. After Microsoft I took a job at Viacom, because I wanted to learn something about the internet and Microsoft was willfully ignoring it in those days. But at Viacom, I was just an in-house programmer, several layers removed from anybody who did anything that made Viacom money in any way.

And I could tell that no matter how critical it was for Viacom to get this internet thing right, when it came time to assign people to desks, the in-house programmers were stuck with 3 people per cubicle in a dark part of the office with no line-of-sight to a window, and the “producers,” I don’t know what they did exactly but they were sort of the equivalent of Turtle on Entourage, the producers had their own big windowed offices overlooking the Hudson River. Once at a Viacom Christmas party I was introduced to the executive in charge of interactive strategy or something. A very lofty position. He said something vague and inept about how interactivity was very important. It was the future. It convinced me that he had no flipping idea whatsoever what it was that was happening and what the internet meant or what I did as a programmer, and he was a little bit scared of it all, but who cares, because he’s making 2 million dollars a year and I’m just a typist or “HTML operator” or whatever it is that I did, how hard can it be, his teenage daughter can do that.

So I moved across the street to Juno Online Services. This was an early internet provider that gave people free dial-up accounts that could only be use for email. It wasn’t like Hotmail or Gmail, which didn’t exist yet, because you didn’t need internet access to begin with, so it was really free.

Juno was, allegedly, supported by advertising. It turned out that advertising to the kinds of people who won’t pay $20 a month for AOL is not exactly the most lucrative business in the world, so in reality, Juno was supported by rich investors. But at least Juno was a product company where programmers were held in high regard, and I felt good about their mission to provide email to everyone. And indeed I worked there happily for about three years as a C++ programmer. Eventually, though, I started to discover that the management philosophy at Juno was old fashioned. The assumption there was that managers exist to tell people what to do. This is quite upside-down from the way management worked in typical west-coast high tech companies. What I was used to from the west coast was an attitude that management is just an annoying, mundane chore someone has to do so that the smart people can get their work done. Think of an academic department at a university, where being the chairperson of the department is actually something of a burden that nobody really wants to do; they’d much rather be doing research. That’s the Silicon Valley style of management. Managers exist to get furniture out of the way so the real talent can do brilliant work.

Juno was founded by very young, very inexperienced people—the president of the company was 24 years old and it was his first job, not just his first management job—and somewhere in a book or a movie or a TV show he had gotten the idea that managers exist to DECIDE.

If there’s one thing I know, it’s that managers have the least information about every technical issue, and they are the last people who should be deciding anything. When I was at Microsoft, Mike Maples, the head of the applications division, used to have people come to him to resolve some technical debate they were having. And he would juggle some bowling pins, tell a joke, and tell them to get the hell out of his office and solve their own damned problems instead of coming to him, the least qualified person to make a technical decision on its merits. That was, I thought, the only way to manage smart, highly qualified people. But the Juno managers, like George Bush, were the deciders, and there were too many decisions to be made, so they practiced something I started calling hit-and-run micromanagement: they dive in from nowhere, micromanage some tiny little issue, like how dates should be entered in a dialog box, overriding the opinions of all the highly qualified technical people on the team who had been working on that problem for weeks, and then they disappeared, so that’s the hit-and-run part, because there’s some other little brush fire elsewhere that needs micromanagement.

So, I quit, without a real plan.

(Part three will appear tomorrow).

December 3, 2007 by Joel Spolsky

Talk at Yale: Part 1 of 3

New developer, News

This is part one of the text of a talk delivered to the Yale Computer Science department on November 28. The rest of the talk will be published tomorrow and Wednesday.

I graduated with a B.S. in Computer Science in 1991. Sixteen years ago. What I’m going to try to do today is relate my undergraduate years in the CS department to my career, which consists of developing software, writing about software, and starting a software company. And of course that’s a little bit absurd; there’s a famous part at the beginning of MIT’s Introduction to Computer Science where Hal Abelson gets up and explains that Computer Science isn’t about computers and it isn’t a science, so it’s a little bit presumptuous of me to imply that CS is supposed to be training for a career in software development, any more than, say, Media Studies or Cultural Anthropology would be.

I’ll press ahead anyway. One of the most useful classes I took was a course that I dropped after the first lecture. Another one was a class given by Roger Schank that was so disdained by the CS faculty that it was not allowed to count towards a degree in computer science. But I’ll get to that in a minute.

The third was this little gut called CS 322, which you know of as CS 323. Back in my day, CS 322 took so much work that it was a 1½ credit class. And Yale’s rule is, that extra half credit could only be combined with other half credits from the same department. Apparently there were two other 1½ credit courses, but they could only be taken together. So through that clever trickery, the half credit was therefore completely useless, but it did justify those weekly problem sets that took 40 hours to complete. After years of students’ complaining, the course was adjusted to be a 1 credit class, it was renumbered CS 323, and still had weekly 40 hour problem sets. Other than that, it’s pretty much the same thing. I loved it, because I love programming. The best thing about CS323 is it teaches a lot of people that they just ain’t never gonna be programmers. This is a good thing. People that don’t have the benefit of Stan teaching them that they can’t be programmers have miserable careers cutting and pasting a lot of Java. By the way, if you took CS 323 and got an A, we have great summer internships at Fog Creek. See me afterwards.

As far as I can tell, the core curriculum hasn’t changed at all. 201, 223, 240, 323, 365, 421, 422, 424, 429 appear to be almost the same courses we took 16 years ago. The number of CS majors is actually up since I went to Yale, although a temporary peak during the dotcom days makes it look like it’s down. And there are a lot more interesting electives now than there were in my time. So: progress.

For a moment there, I actually thought I’d get a PhD. Both my parents are professors. So many of their friends were academics that I grew up assuming that all adults eventually got PhDs. In any case, I was thinking pretty seriously of going on to graduate school in Computer Science. Until I tried to take a class in Dynamic Logic right here in this very department. It was taught by Lenore Zuck, who is now at UIC.

I didn’t last very long, nor did I understand much of anything that was going on. From what I gather, Dynamic Logic is just like formal logic: Socrates is a man, all men are mortal, therefore Socrates is mortal. The difference is that in Dynamic Logic truth values can change over time. Socrates was a man, now he’s a cat, etc. In theory this should be an interesting way to prove things about computer programs, in which state, i.e., truth values, change over time.

In the first lecture Dr. Zuck presented a few axioms and some transformation rules and set about trying to prove a very simple thing. She had a computer program “f := not f,” f is a Boolean, that simply flipped a bit, and the goal was to prove that if you ran this program an even number of times, f would finish with the same value as it started out with.

The proof went on and on. It was in this very room, if I remember correctly, it looks like the carpet hasn’t been changed since then, and all of these blackboards were completely covered in the steps of the proof. Dr. Zuck used proof by induction, proof by reductio ad absurdum, proof by exhaustion—the class was late in the day and we were already running forty minutes over—and, in desperation, proof by graduate student, whereby, she says, “I can’t really remember how to prove this step,” and a graduate student in the front row says, “yes, yes, professor, that’s right.”

And when all was said and done, she got to the end of the proof, and somehow was getting exactly the opposite result of the one that made sense, until that same graduate student pointed out where, 63 steps earlier, some bit had been accidentally flipped due to a little bit of dirt on the board, and all was well.

For our homework, she told us to prove the converse: that if you run the program “f := not f” n times, and the bit is in the same state as it started, that n must be even.

I worked on that problem for hours and hours. I had her original proof in front of me, going in one direction, which, upon closer examination, turned out to have all kinds of missing steps that were “trivial,” but not to me. I read every word about Dynamic Logic that I could find in Becton, and I struggled with the problem late into the night. I was getting absolutely nowhere, and increasingly despairing of theoretical computer science. It occurred to me that when you have a proof that goes on for pages and pages, it’s far more likely to contain errors in the proof as our own intuition about the trivial statements that it’s trying to prove, and I decided that this Dynamic Logic stuff was really not a fruitful way of proving things about actual, interesting computer programs, because you’re more likely to make a mistake in the proof than you are to make a mistake in your own intuition about what the program “f := not f” is going to do. So I dropped the course, thank God for shopping period, but not only that, I decided on the spot that graduate school in Computer Science was just not for me, which made this the single most useful course I ever took.

Now this brings me to one of the important themes that I’ve learned in my career. Time and time again, you’ll see programmers redefining problems so that they can be solved algorithmically. By redefining the problem, it often happens that they’re left with something that can be solved, but which is actually a trivial problem. They don’t solve the real problem, because that’s intractable. I’ll give you an example.

You will frequently hear the claim that software engineering is facing a quality crisis of some sort. I don’t happen to agree with that claim—the computer software most people use most of the time is of ridiculously high quality compared to everything else in their lives—but that’s beside the point. This claim about the “quality crisis” leads to a lot of proposals and research about making higher quality software. And at this point, the world divides into the geeks and the suits.

The geeks want to solve the problem automatically, using software. They propose things like unit tests, test driven development, automated testing, dynamic logic and other ways to “prove” that a program is bug-free.

The suits aren’t really aware of the problem. They couldn’t care less if the software is buggy, as long as people are buying it.

Currently, in the battle between the geeks and the suits, the suits are winning, because they control the budget, and honestly, I don’t know if that’s such a bad thing. The suits recognize that there are diminishing returns to fixing bugs. Once the software hits a certain level of quality that allows it to solve someone’s problem, that person will pay for it and derive benefit out of it.

The suits also have a broader definition of “quality.” Their definition is about as mercenary as you can imagine: the quality of software is defined by how much it increases my bonus this year. Accidentally, this definition of quality incorporates a lot more than just making the software bug-free. For example, it places a lot of value on adding more features to solve more problems for more people, which the geeks tend to deride by calling it “bloatware.” It places value on aesthetics: a cool-looking program sells more copies than an ugly program. It places value on how happy a program makes its users feel. Fundamentally, it lets the users define their own concept of quality, and decide on their own if a given program meets their needs.

Now, the geeks are interested in the narrowly technical aspects of quality. They focus on things they can see in the code, rather than waiting for the users to judge. They’re programmers, so they try to automate everything in their life, and of course they try to automate the QA process. This is how you get unit testing, which is not a bad thing, don’t get me wrong, and it’s how you get all these attempts to mechanically “prove” that a program is “correct.” The trouble is that anything that can’t be automated has to be thrown out of the definition of quality. Even though we know that users prefer software that looks cooler, there’s no automated way to measure how cool looking a program is, so that gets left out of the automated QA process.

In fact what you’ll see is that the hard-core geeks tend to give up on all kinds of useful measures of quality, and basically they get left with the only one they can prove mechanically, which is, does the program behave according to specification. And so we get a very narrow, geeky definition of quality: how closely does the program correspond to the spec. Does it produce the defined outputs given the defined inputs.

The problem, here, is very fundamental. In order to mechanically prove that a program corresponds to some spec, the spec itself needs to be extremely detailed. In fact the spec has to define everything about the program, otherwise, nothing can be proven automatically and mechanically. Now, if the spec does define everything about how the program is going to behave, then, lo and behold, it contains all the information necessary to generate the program! And now certain geeks go off to a very dark place where they start thinking about automatically compiling specs into programs, and they start to think that they’ve just invented a way to program computers without programming.

Now, this is the software engineering equivalent of a perpetual motion machine. It’s one of those things that crackpots keep trying to do, no matter how much you tell them it could never work. If the spec defines precisely what a program will do, with enough detail that it can be used to generate the program itself, this just begs the question: how do you write the spec? Such a complete spec is just as hard to write as the underlying computer program, because just as many details have to be answered by spec writer as the programmer. To use terminology from information theory: the spec needs just as many bits of Shannon entropy as the computer program itself would have. Each bit of entropy is a decision taken by the spec-writer or the programmer.

So, the bottom line is that if there really were a mechanical way to prove things about the correctness of a program, all you’d be able to prove is whether that program is identical to some other program that must contain the same amount of entropy as the first program, otherwise some of the behaviors are going to be undefined, and thus unproven. So now the spec writing is just as hard as writing a program, and all you’ve done is moved one problem from over here to over there, and accomplished nothing whatsoever.

This seems like a kind of brutal example, but nonetheless, this search for the holy grail of program quality is leading a lot of people to a lot of dead ends. The Windows Vista team at Microsoft is a case in point. Apparently—and this is all based on blog rumors and innuendo—Microsoft has had a long term policy of eliminating all software testers who don’t know how to write code, replacing them with what they call SDETs, Software Development Engineers in Test, programmers who write automated testing scripts.

The old testers at Microsoft checked lots of things: they checked if fonts were consistent and legible, they checked that the location of controls on dialog boxes was reasonable and neatly aligned, they checked whether the screen flickered when you did things, they looked at how the UI flowed, they considered how easy the software was to use, how consistent the wording was, they worried about performance, they checked the spelling and grammar of all the error messages, and they spent a lot of time making sure that the user interface was consistent from one part of the product to another, because a consistent user interface is easier to use than an inconsistent one.

None of those things could be checked by automated scripts. And so one result of the new emphasis on automated testing was that the Vista release of Windows was extremely inconsistent and unpolished. Lots of obvious problems got through in the final product… none of which was a “bug” by the definition of the automated scripts, but every one of which contributed to the general feeling that Vista was a downgrade from XP. The geeky definition of quality won out over the suit’s definition; I’m sure the automated scripts for Windows Vista are running at 100% success right now at Microsoft, but it doesn’t help when just about every tech reviewer is advising people to stick with XP for as long as humanly possible. It turns out that nobody wrote the automated test to check if Vista provided users with a compelling reason to upgrade from XP.

I don’t hate Microsoft, really I don’t. In fact, my first job out of school was actually at Microsoft. In those days it was not really a respectable place to work. Sort of like taking a job in the circus. People looked at you funny. Really? Microsoft? On campus, in particular, it was perceived as corporate, boring, buttoned-down, making inferior software so that accountants can do, oh I don’t know, spreadsheets or whatever it is that accountants do. Perfectly miserable. And it all ran on a pathetic single-tasking operating system called MS-DOS full of arbitrary stupid limitations like 8-character file names and no email and no telnet and no Usenet. Well, MS-DOS is long gone, but the cultural gap between the Unixheads and the Windows users has never been wider. This is a culture war. The disagreements are very byzantine but very fundamental. To Yale, Microsoft was this place that made toy business operating systems using three-decades-old computer science. To Microsoft, “computer sciency” was a bad word used to make fun of new hires with their bizarre hypotheses about how Haskell is the next major programming language.

Just to give you one tiny example of the Unix-Windows cultural war. Unix has this cultural value of separating user interface from functionality. A righteous Unix program starts out with a command-line interface, and if you’re lucky, someone else will come along and write a pretty front end for it, with shading and transparency and 3D effects, and this pretty front end just launches the command-line interface in the background, which then fails in mysterious ways, which are then not reflected properly in the pretty front end which is now hung waiting for some input that it’s never going to get.

But the good news is that you can use the command line interface from a script.

Whereas the Windows culture would be to write a GUI app in the first place, and all the core functionality would be tangled up hopelessly with the user interface code, so you could have this gigantic application like Photoshop that’s absolutely brilliant for editing photos, but if you’re a programmer, and you want to use Photoshop to resize a folder of 1000 pictures so that each one fits in a 200 pixel box, you just can’t write that code, because it’s all very tightly bound to a particular user interface.

Anyway, the two cultures roughly correspond to highbrow vs. lowbrow, and in fact, it’s reflected accurately in the curriculum of computer science departments throughout the country. At Ivy League institutions, everything is Unix, functional programming, and theoretical stuff about state machines. As you move down the chain to less and less selective schools Java starts to appear. Move even lower and you literally start to see classes in topics like Microsoft Visual Studio 2005 101, three credits. By the time you get to the 2 year institutions, you see the same kind of SQL-Server-in-21-days “certification” courses you see advertised on the weekends on cable TV. Isn’t it time to start your career in (different voice) Java Enterprise Beans!

(Part two will appear tomorrow).