Corporate “DEI” is an imperfect vehicle for deeply meaningful ideals

February 10, 2025 · mipsytipsy · DEI, diversity, management, MEI, meritocracy, tech culture

I have not thought or said much about DEI (Diversity, Equity and Inclusion) over the years. Not because I don’t care about the espoused ideals — I suppose I do, rather a lot — but because corporate DEI efforts have always struck me as ineffective and bland; bolted on at best, if not actively compensating for evil behavior.

I know how crisis PR works. The more I hear a company natter on and on about how much it cares for the environment, loves diversity, values integrity, yada yada, the more I automatically assume they must be covering their ass for some truly heinous shit behind closed doors.

My philosophy has historically been that actions speak louder than words. I would one million times rather do the work, and let my actions speak for themselves, than spend a lot of time yapping about what I’m doing or why.

I also resent being treated like an expert in “diversity stuff”, which I manifestly am not. As a result, I have always shrugged off any idea that I might have some personal responsibility to speak up or defend these programs.

Recent events (the tech backlash, the govt purge) have forced me to sit down and seriously rethink my operating philosophy. It’s one thing to be cranky and take potshots at corporate DEI efforts when they seem ascendant and powerful; it’s another when they are being stamped out and reviled in the public mind.

Actually, my work does not speak for itself

It took all of about thirty seconds to spot my first mistake, which is that no, actually, my work does not and cannot speak for itself. 🤦 No one’s does, really, but especially not when your job literally consists of setting direction and communicating priorities.

Maybe this works ok at a certain scale, when pretty much anyone can still overhear or participate in any topic they care about. But at some point, not speaking up at the company level sends its own message.

If you don’t state what you care about, how are random employees supposed to guess whether the things they value about your culture are the result of hard work and careful planning, or simply…emergent properties? Even more importantly, how are they supposed to know if your failures and shortcomings are due to trying but failing or simply not giving a shit?

These distinctions are not the most important (results will always matter most), but they are probably pretty meaningful to a lot of your employees.

The problem isn’t the fact that companies talk about their values, it’s that they treat it like a branding exercise instead of an accountability mechanism.

Fallacy #1: “DEI is the opposite of excellence or high performance”

There are two big category errors I see out there in the world. To be clear, one is a lot more harmful (and a lot more common, and increasingly ascendant) than the other, but both of these errors do harm.

The first error is what I heard someone call the “seesaw fallacy”: the notion that DEI and high performance are somehow linked in opposition to each other, like a seesaw; getting more of one means getting less of the other.

This is such absolute horseshit. 🙄 It fails basic logic, as well as not remotely comporting with my experience. You can kind of see where they’re coming from, but only by conveniently forgetting that every team and every company is a system.

Nobody is born a great engineer, or a great designer, or a great employee of any type. Great contributors are not born, they are forged — over years upon years of compounding experiences: education, labor, hard work, opportunities and more.

So-called “merit-based” hiring processes act like outputs are the only thing that matter; as though the way people show up on your doorstep is the way they were fated to be and the way they will always be. They don’t see people as inputs to the system — people with potential to grow and develop, people who may have been held back or disregarded in the past, people who will achieve a wide range of divergent outcomes based on the range of different experiences they may have in your system.

Fallacy #2: “DEI is the definition of excellence or high performance”

There is a mirror image error on the other end of the spectrum, though. You sometimes hear DEI advocates talk as though if you juuuust build the most diverse teams and the most inclusive culture, you will magically build better products and achieve overwhelming success in all of your business endeavors.

This is also false. You still have to build the fucking business! Your values and culture need to serve your business and facilitate its continued existence and success.

With the small caveat that … DEI isn’t the way you define excellence unless the way you define excellence is diversity, equity and inclusion, because “excellence” is intrinsically a values statement of what you hold most dear. This definition of excellence would not make sense for a profit-driven company, but valuing diverse teams and an inclusive culture over money and efficiency is a perfectly valid and coherent stance for a person to take, and lots of people do feel this way!

There is no such thing as the “best” or “right” values. Values are a way of navigating territory and creating alignment where there IS no one right answer. People value what they value, and that is their right.

DEI gets caricatured in the media as though the goal of DEI is diverse teams and equitable outcomes. But DEI is better seen as a toolkit. Your company values ought to help you achieve your goals, and your goals as a business usually have some texture and nuance beyond just profit. At Honeycomb, for example, we talk about how we can “build a company people are proud to be part of”. DEI can help with this.

Let’s talk about MEI (Merit, Excellence and Intelligence)

Until last month I remained blissfully unaware of MEI, or “Merit, Excellence and Intelligence” (sic), and if you were too until just this moment, I apologize for ruining your party.

This idea that DEI is the opposite of MEI is particularly galling to me. I care a lot about high-performing teams and building an environment where people can do the best work of their lives. That is why I give a shit about building an inclusive culture.

An inclusive culture is one that sets as many people as possible up to soar and succeed, not just the narrow subset of folks who come pre-baked with all of life’s opportunities and advantages. When you get better at supporting folks and building a culture that foregrounds growth and learning, this both raises the bar for outcomes for everyone, and broadens the talent base you can draw from.

Honestly, I can’t think of anything less meritocratic than simply receiving and replicating all of society’s existing biases. Do you have any idea how much talent gets thrown away, in terms of unrealized potential? Let’s take a look at some of those stories from recent history.

If you actually give a shit about merit, you have to care about inclusion

Remember the Susan Fowler blog post that led to Travis Kalanick’s ouster as CEO of Uber in 2017? I suggest going back and skimming that post again, just to remind yourself what an absolutely jaw-dropping barrage of shit she went through, starting with being propositioned for sex by her very own manager on her very first day.

In “What You Do Is Who You Are”, investor Ben Horowitz wrote, “By all accounts Kalanick was furious about the incident, which he saw as a woman being judged on issues other than performance.” He believed that by treating her this way, his employees were failing to live up to their stated values around meritocracy.

I think that’s a flawed (but revealing) response to the situation at hand. Treating this like a question of “merit” suggests that they should be prioritizing the needs of whoever was most valuable to the company. And it kind of seems like that’s exactly what Kalanick’s employees were trying to do.

Susan was brilliant, yes; she was also young (25!), small, quiet, with a soft voice, in a corporate environment that valued aggression and bombast. She was early in her career and comparatively unproven; and when she reported her engineering manager’s persistent sexual advances and retaliatory actions to HR, she was told that HE was the high performer they couldn’t afford to lose.

Ask yourself this: would the manager’s behavior have been any more acceptable if Susan had been a total fuckup, instead of a certifiable genius? (NO. 😡)

Susan’s piece also noted that the percentage of women in Uber’s SRE org dropped from 25% to 3% across that same one-year interval. Alarm bells were going off all over the place for an entire year, and nobody gave a shit, because an inclusive culture was nowhere on their radar as a thing that mattered.

There is no rational conversation to be had about merit that does not start with inclusion

You might know (or think you know) who your highest performers are today, but you do not know who will be on that list in six months, one year, five years. Your company is a system, and the environment you build will drive behaviors that help determine who is on that list.

Maybe you have a Susan Fowler type onboarding at your company right now. How confident are you that she will be treated fairly and equitably, that she will feel like she belongs? Do you think she might be underestimated due to her gender or presentation? Do you think she would want to stick around for the long haul? Will she be motivated to do her best work in service of your mission? Why?

Can you say the same about all your employees, not just ones you already know to be certifiable geniuses?

That’s inclusion. That’s how you build a real fucking meritocracy. You start with “do not tolerate the things that kneecap your employees in their pursuit of excellence”, and ESPECIALLY not the things that subject them to the compounding tax of being targeted for who they are. In life as in finance, it’s the compound interest that kills you, more than the occasional expensive purchase.

There’s more to merit and excellence than just inclusion, obviously, but there’s no rational adult conversation to be had about merit or meritocracy that doesn’t start there.

Susan left the tech industry, by the way. She seems to be doing great, of course, but what a loss for us.

If you give a shit about merit, tell me what you are doing to counteract bias

Anyone who talks a big game about merit, but doesn’t grapple with how to identify or counteract the effects of bias in the system, doesn’t really care about merit at all. What they actually want is what Ijeoma Oluo calls “entitlement masquerading as meritocracy” (“Mediocre”).

The “just world fallacy” is one of those cognitive biases that will be with us forever, because we have such a deep craving for narrative coherence. On a personal level, we are embodied beings awash with intrinsic biases; on a societal level, obviously, structural inequities abound. No one is saying we should aim for equality of outcomes, despite what some nutbag MEI advocates seem to think.

But anyone who truly cares about merit should feel compelled to do at least some work to try and lean against the ways our biases cause us to systematically under-value, under-reward, under-recognize, and under-promote some people (and over-value others). Because these effects add up to something cumulatively massive.

In the Amazon book “Working Backwards”, chapter 2, they briefly mention an engineering director who “wanted to increase the gender diversity of their team”, and decided to give every application with a female-gendered name a screening call. The number of women hired into that org “increased dramatically”.

That’s it — that’s the only tweak they made. They didn’t change the interview process, they didn’t “lower the bar”, they didn’t do anything except skip the step where women’s resumes were getting filtered out due to the intrinsic biases of the hiring managers.

There’s no shame in having biases — we all have them. The shame is in making other people pay the price for your unexamined life.

DEI is an imperfect vehicle for deeply meaningful ideals

I am by no means trying to muster a blanket defense of everything that gets lumped under DEI, just to be clear. Some of it is performative, ham-handed, well-intentioned but ineffective, disconnected or a distraction from real problems; diversity theater; a steam valve to vent off any real pressure for change; nitpicky and authoritarian, flirts with thought policing, or just horrendously cringe.

I don’t know how much I really care whether corporate DEI programs live or die, because I never thought they were that effective to start with. Jay Caspian Kang wrote a great piece in the New Yorker that captured my feelings on the matter:

The problem, at a grand scale, is that D.E.I.’s malleability and its ability to survive in pretty much every setting, whether it’s a nearby public school or the C.I.A., means that it has to be generic and ultimately inoffensive, which means that, in the end, D.E.I. didn’t really satisfy anyone.

What it did was provide a safety valve (I am speaking about D.E.I. in the past tense because I do think it will quickly be expunged from the private sector as well) for institutions that were dealing with racial and social-justice problems. If you had a protest on campus over any issue having to do with “diverse students” who wanted “equity,” that now became the provenance of D.E.I. officers who, if they were doing their job correctly, would defuse the situation and find some solution—oftentimes involving a task force—that made the picket line go away.

~Jay Caspian Kang, “What’s the Point of Trump’s War on DEI?”

It’s a symbolic loss of something that was only ever a symbolic gain. Corporate DEI programs as we know them sprang up in the wake of the Black Lives Matter protests of 2020, but I haven’t exactly noticed the world getting substantially more diverse or inclusive since then.

Which is not to say that tech culture has not gotten more diverse or inclusive over the longer arc of my career; it absolutely, definitely has. I began working in tech when I was just a teenager, over 20 years ago, and it is actually hard to convey just how much the world has changed since then.

And not because of corporate DEI policies. So why? Great question. 🙌

Tech culture changed because hearts and minds were changed

I think social media explains a lot about why awareness suddenly exploded in the 2010s. People who might never have intentionally clicked a link about racism or sexism were nevertheless exposed to a lot of compelling stories and arguments, via retweets and stuff ending up in their feed. I know this, because I was one of them.

The 2010s were a ferment of commentary and consciousness-raising in tech. A lot of brave people started speaking up and sharing their experiences with harassment, abuse, employer retaliation, unfair wage practices, blatant discrimination, racism, predators…you name it. People were comparing notes with each other and realizing how common some of these experiences were, and developing new vocabulary to identify them — “missing stair”, “sandpaper feminism”, etc.

If you were in tech and you were paying attention at all, it got harder and harder to turn a blind eye. People got educated despite themselves, and in the end…many, many hearts and minds were changed.

This is what happened to me. I came from a religious and political background on the far right, but my eyes were opened. The more I looked around, the more evidence I saw in support of the moral and intellectual critiques I was reading online. I began waking up to some of the ways I had personally been complicit in doing harm to others.

The “unofficial affirmative action movement” in tech, circa 2010-2020

And I was not alone. Emily once offhandedly referred to an “unofficial affirmative action movement” in tech, and this really struck a chord with me. I know so many people whose hearts and minds were changed, who then took action.

They worked to diversify their personal networks of friends and acquaintances; to mentor, sponsor, and champion underrepresented folks in their workplaces; to recruit, promote, and refer women and people of color; to invite marginalized folks to speak at their conferences and on their panels; to support codes of conduct and unconscious bias training; and to educate themselves on how to be better allies in general.

All of this was happening for at least a decade leading up to 2020, when BLM shook up the industry and led to the creation of many corporate DEI initiatives. Kang, again:

What happened in many workplaces across the country after 2020 was that the people in charge were either genuinely moved by the Floyd protests or they were scared. Both the inspired and the terrified built out a D.E.I. infrastructure in their workplaces. These new employees would be given titles like chief diversity officer or C.D.O., which made it seem like it was part of the C-suite, and would be given a spot at every table, but much like at Stanford Law, their job was simply to absorb and handle any race stuff that happened.

The pivot from lobbying/persuading from the outside to holding the levers of formal power is a hard, hard one to execute well. History is littered with the shells of social movements that failed to make this leap.

You got here because you persuaded and earned credibility based on your stories and ideals, and now people are handing you the reins to make the rules. What do you do with them? Uh oh.

It’s easier to make rules and enforce them than it is to change hearts and minds

I think this happened to a lot of DEI advocates in the 2020-2024 era, when corporations briefly invested DEI programs and leaders with some amount of real corporate power, or at least the power to make petty rules. And I do not think it served our ideals well.

I just think…there’s only so much you can order people to do, before it backfires on you. Which doesn’t mean that laws and policies are useless; far from it. But they are limited. And they can trigger powerful backlash and resentment when they get overused as a means of policing people’s words and behaviors, especially in ways that seem petty or disconnected from actual impact.

When you lean on authority to drive compliance, you also stop giving people the opportunity to get on board and act from the heart.

MLK actually has a quote on this that I love, where he says “the law cannot make a man love me”:

“It may be true that the law cannot make a man love me, religion and education will have to do that, but it can restrain him from lynching me. And I think that’s pretty important also. And so that while legislation may not change the hearts of men, it does change the habits of men.”

~ Dr. Martin Luther King, Jr.

There are ways that the DEI movement really lost me around the time they got access to formal levers of power. It felt like there was a shift away from vulnerability and persuasion and towards mandates and speech policing.

Instead of taking the time to explain why something mattered, people were simply ordered to conform to an ever-evolving, opaque set of speech patterns as defined by social media. Worse, people sometimes got shamed or shut down for having legitimate questions.

There’s a big difference between saying that “marginalized people shouldn’t constantly have to defend their own existence and do the work of educating other people” (hard agree!), and saying that nobody should have to persuade or educate other folks and bring them along.

We do have to persuade, we do have to bring people along with us. We do have to fight for hearts and minds. I think we did a better job of this without the levers of formal power.

Don’t underestimate what a competitive advantage diversity can be

People have long marveled at the incredible amount of world class engineering talent we have always had at Honeycomb — long before we even had any customers, or a product to sell them. How did we manage this? The relative diversity of our teams has always been our chief recruiting asset.

There is a real hunger out there on the part of employees to work at a company that does more than the bare minimum in the realm of ethics. Especially as AI begins chewing away at historically white collar professions, people are desperate for evidence that you can be an ambitious, successful, money-making business that is unabashed about living its values and holding a humane, ethical worldview.

And increasingly, one of the main places people go to look for evidence that your company has ethical standards and takes them seriously is…the diversity of your teams.

Diversity is an imperfect proxy for corporate ethics, but it’s not a crazy one.

The diversity of your teams over the long run rests on your ability to build an inclusive culture and equitable policies. Which depends on your ability to infuse an ethical backbone throughout your entire company; to balance short-term and long-term investments, as you build a company that can win at business without losing its soul.

And I’m not actually talking about junior talent. Competition is so fierce lower on the ladder, those folks will usually take whatever they can get. (💔) I’m talking about senior folks, the kind of people who have their pick of roles, even in a weak job market. You might be shocked how many people out there will walk away from millions/year in comp at Netflix, Meta or Google, in order to work at a company where ethics are front and center, where diversity is table stakes, where their reporting chain and the executive team do not all look alike.

The longer you wait to build in equity and inclusion, the tougher it will be

Founders and execs come up to me rather often and ask what the secret is to hiring so many incredible contributors from underrepresented backgrounds. I answer: “It’s easy!…if you already have a diverse team.”

It is easier to build equitable programs and hire diverse teams early, and not drive yourself into a ditch, than it is to go full tilt with a monoculture and face years of recovery and repair. The longer you wait to do the work, the harder the work is going to be. Don’t put it off.

As I wrote a while back:

“If you don’t spend time, money, attention, or political capital on it, you don’t care about it, by definition. And it is a thousand times worse to claim you value something, and then demonstrate with your actions that you don’t care, than to never claim it in the first place.”

“You must remind yourself as you do, uneasily, queasily, how easily ‘I didn’t have a choice’ can slip from reason to excuse. How quickly ‘this isn’t the right time’ turns into ‘never the right time’. You know this, I know this, and I guarantee you every one of your employees knows this.”

~ Pragmatism, Neutrality and Leadership

It can be a massive competitive advantage if you build a company that knows how to develop a deep bench of talent and set people up for success.

Not only the preexisting elite, the smartest and most advantaged decile of talent — for whom competition will always be cutthroat — but people from broader walks of life.

Winning at business is what earns you the right to make bigger bets and longer-term investments

As the saying goes, “Nobody ever got fired for buying IBM” — and nobody ever had the failure of their startup blamed on the fact that they hired engineers away (or followed management practices) from Google, Netflix or Facebook, regardless of how good or bad those engineers (or practices) may be.

If you want to do something different, you need to succeed. People cargo cult the culture of places that make lots of money.

If you want your values and ideals to spread throughout the industry, the most impactful thing you can possibly do is win.

It’s a reality that when you’re a startup, your resources are scarce, your time horizons are short. You have to make smart decisions about where to invest them. Perfection is the enemy of success. Make good choices, so you can live to fight another day.

But fight another day.

If you don’t give a shit, don’t try and fake it

Finally let me say this: if you don’t give a shit about diversity or inclusion, don’t pretend you give a shit. It isn’t going to fool anyone. (If you “really care” but for some reason DEI loses every single bake-off for resources, friend, you don’t care.)

And honestly, as an employee, I would rather work for a soulless corporation that is honest with itself and its employees about how decisions get made, than for someone who claims to care about the things I value, but whose actions are unpredictable or inconsistent with those values.

Listen.. There is never just one true way to win. There are many paths up the mountain. There are many ways to win. (And there are many, many, many more ways to fail.)

Nothing that got imported or bolted on to your company operating system was ever going to work, anyway. 🤷 If it doesn’t live on in the hearts and minds of the people who are building the strategy and executing on it, it is just dead words.

When I look at the long list of companies who say they are rolling back mentions of DEI internally, I don’t get that depressed. I see a long list of companies who never really meant it anyway. I’m glad they decided to stop performing.

You need a set of operating practices and principles that are internally consistent and authentic to who you are. And you need to do the work to bring people along with you, hearts and minds and all.

So if we care about our ideals, let’s go fucking win.

On Versioning Observabilities (1.0, 2.0, 3.0…10.0?!?)

December 20, 2024 · mipsytipsy · instrumentation, observability 2.0, open telemetry, otel

Hazel Weakly, you little troublemaker.

As I whined to Hazel over text, after she sweetly sent me a preview draft of her post: “PLEASE don’t post this! I feel like I spend all my time trying to help bring clarity and context to what’s happening in the market, and this is NOT HELPING. Do you know how hard it is to try and socialize shared language around complex sociotechnical topics? Talking about ‘observability 3.0’ is just going to confuse everyone.”

That’s the problem with the internet, really; the way any asshole can go and name things (she said piteously, self-righteously, and with an astounding lack of self-awareness).

Semantic versioning is cheap and I kind of hate it

I’m complaining, because I feel sorry for myself (and because Hazel is a dear friend and can take it). But honestly, I actually kind of loathe the 1.0 vs 2.0 (or 3.0) framing myself. It’s helpful, it has explanatory power, I’m using it…but you’ll notice we aren’t slapping “Honeycomb is Observability 2.0” banners all over the website or anything.

Semantic versioning is a cheap and horrendously overused framing device in both technology and marketing. And it’s cheap for exactly these reasons…it’s too easy for anyone to come along and bump the counter again and say it happens to be because of whatever fucking thing they are doing.

I don’t love it, but I don’t have a better idea. In this case, the o11y 2.0 language describes a real, backwards-incompatible, behavioral and technical generational shift in the industry. This is not a branding exercise in search of technological justification, it’s a technical sea change reaching for clarification in the market.

One of the most exciting things that happened this year is that all the new observability startups have suddenly stopped looking like cheaper Datadogs (three pillars, many sources of truth) and started looking like cheaper Honeycombs (wide, structured log events, single source of truth, OTel-native, usually Clickhouse-based). As an engineer, this is so fucking exciting.

(I should probably allow that these technologies have been available for a long time; adoption has accelerated over the past couple of years in the wake of the ZIRP era, as the exploding cost multiplier of the three pillars model has become unsustainable for more and more teams.)

Some non-controversial “controversial claims”

“Firstly, I’m going to make a somewhat controversial claim in that you can get observability 2.0 just fine with “observability 1.0” vendors. The only thing you need from a UX standpoint is the ability to query correlations, which means any temporal data-structure, decorated with metadata, is sufficient.”

This is not controversial at all, in my book. You can get most of the way there, if you have enough time and energy and expertise, with 1.0 tooling. There are exceptions, and it’s really freaking hard. If all you have is aggregate buckets and random exemplars, your ability to slice and dice with precision will be dramatically limited.

This matters a lot, if you’re trying to e.g. break down by any combination of feature flags, build IDs, canaries, user IDs, app IDs, etc in an exploratory, open-ended fashion. As Hazel says, the whole point is to “develop the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.” A-yep.
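To make that kind of slicing concrete, here is a minimal sketch in Python. It assumes each request is kept as one raw, wide event (a plain dict) rather than being rolled up into buckets at write time. The field names (`build_id`, `feature_flag`, `user_id`, `status`) and the `break_down` helper are hypothetical, made up for illustration, and are not any vendor's API.

```python
from collections import defaultdict

# Hypothetical wide events: one dict per request, with arbitrary dimensions attached.
EVENTS = [
    {"build_id": "b101", "feature_flag": "new_cart", "user_id": "u1", "status": 200, "duration_ms": 40},
    {"build_id": "b101", "feature_flag": "new_cart", "user_id": "u2", "status": 500, "duration_ms": 900},
    {"build_id": "b102", "feature_flag": "new_cart", "user_id": "u1", "status": 200, "duration_ms": 35},
    {"build_id": "b102", "feature_flag": "old_cart", "user_id": "u3", "status": 200, "duration_ms": 50},
    {"build_id": "b101", "feature_flag": "old_cart", "user_id": "u3", "status": 200, "duration_ms": 45},
]

def break_down(events, *keys):
    """Group raw events by any combination of fields, chosen at query time."""
    groups = defaultdict(lambda: {"count": 0, "errors": 0})
    for event in events:
        group = groups[tuple(event.get(k) for k in keys)]
        group["count"] += 1
        if event["status"] >= 500:
            group["errors"] += 1
    return dict(groups)

# The combination of dimensions is decided when asking, not when instrumenting.
by_build_and_flag = break_down(EVENTS, "build_id", "feature_flag")
for key, stats in sorted(by_build_and_flag.items()):
    print(key, stats)
```

Because the raw events are kept, the same data can be re-cut by `user_id`, or by all three fields at once, without touching the instrumentation. A counter that was pre-aggregated by build alone could not answer those follow-up questions.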

However, any time your explanation takes more than 30 seconds, you’ve lost your audience. This is at least a three-minute answer. Therefore, I typically tell people they need structured log events.

“Observability 2.0” describes a sociotechnical sea change that is already well underway

Let’s stop talking about engineering for a moment, and talk about product marketing.

A key aspect of product marketing is simplification. That’s what the 2.0 language grew out of. About a year ago I started having a series of conversations with CTOs and VPEngs. All of them were like, “we already have observability, how is Honeycomb any different?” And I would launch off into a laundry list of features and capabilities, and a couple minutes later I could see their eyes glazing over.

You have to have some way of boiling it down and making it pithy and memorable. And any time you do that, you lose some precision. So I actually disagree with very little Hazel has said in this essay. I’ve made most of the same points, in various times and places.

Good product marketing is when you take a strong technical differentiator and try to find evocative, resonant ways of making it click for people. Bad product marketing — and oh my god is there a lot of that — is when you start with the justification and work backwards. Or start with “well we should create our own category” and then try to define and defend one for sales purposes.

Or worst of all — “what our competitors are saying seems to be really working, but building it would take a long time and be very hard, so what if we just say the same words out loud and confuse everyone into buying our shit instead?”

(Ask me how many times this has happened to us, I fucking dare you.)

Understanding your software in the language of your business

Here’s why I really hate the 3.0 framing: I feel like all the critical aspects that I really really care about are already part of 2.0. They have to be. It’s the whole freaking point of the generational change which is already underway.

We aren’t just changing data structures for the fun of it. The whole point is to be able to ask better questions, as Hazel correctly emphasizes in her piece.

Christine and I recently rewrote our company’s mission and vision. Our new vision states:

Understand your software in the language of your business.

Decades on, the promise of software and the software industry remains unfulfilled. Software engineering teams were supposed to be the innovative core of modern business; instead they are order-takers, cost centers, problem children. Honeycomb is here to shape a future where there is no divide between building software and building a business — a future where software engineers are truly the innovation engine of modern companies.

The beauty of high cardinality, high dimensionality data is that it gives you the power to pack dense quantities of application data, systems data, and business data all into the same blob of context, and then explore all three together.
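To make "the same blob of context" a little more concrete, here is a minimal sketch of one wide event for a single request, with application, systems, and business fields sitting side by side. Every field name and value below is invented for illustration, and the `build_wide_event` helper is hypothetical; this is not Honeycomb's schema or any real library's API.

```python
import json


def build_wide_event(app: dict, system: dict, business: dict) -> dict:
    """Merge three kinds of context into one flat, wide event.

    Keys get a namespace prefix so fields from different sources never collide.
    """
    event = {}
    for prefix, fields in (("app", app), ("sys", system), ("biz", business)):
        for key, value in fields.items():
            event[f"{prefix}.{key}"] = value
    return event


# Hypothetical values for one checkout request.
event = build_wide_event(
    app={"endpoint": "/checkout", "duration_ms": 412, "status": 200},
    system={"host": "web-17", "region": "us-east-1", "build_id": "a1b2c3"},
    business={"customer_id": "cust_8841", "plan": "enterprise", "cart_value_usd": 129.50},
)

print(json.dumps(event, indent=2))
```

The point of keeping it flat and wide is that any field can be a filter or a group-by later, without deciding up front which ones matter.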

Austin Parker wrote about this earlier this year (ironically, in response to yet another of Miss Weakly’s articles on observability):

Even if you’ve calculated the cost of downtime, you probably aren’t really thinking about the relationship between telemetry data and business data. Engineering stuff tends to stay in the engineering domain. Here’s some questions that I’d suggest most people can’t answer with their observability programs, but are absolutely fucking fascinating questions (emphasis mine):

What’s the relationship between system performance and conversions, by funnel stage? Break it down by geo, device, and intent signals.

What’s our cost of goods sold per request, per customer, with real-time pricing data of resources?

How much does each marginal API request to our enterprise data endpoint cost in terms of availability for lower-tiered customers? Enough to justify automation work?

Every truly interesting question we ask as engineers is some combination or intersection of business data + application data. We do no one any favors by chopping them up and siloing them off into different tools and data stores, for consumption by different teams.
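As a toy illustration of what "business data + application data" in one place buys you, here is a sketch that assumes every request is recorded as one event carrying both timing and business fields. The events, field names, and the 500 ms "slow" threshold are all made up; real answers would come from a query engine, not a Python loop.

```python
from collections import defaultdict

# Hypothetical events: one dict per request, mixing application and business fields.
events = [
    {"customer_id": "acme", "plan": "enterprise", "duration_ms": 120, "compute_cost_usd": 0.004, "converted": True},
    {"customer_id": "acme", "plan": "enterprise", "duration_ms": 900, "compute_cost_usd": 0.010, "converted": False},
    {"customer_id": "bobs", "plan": "free", "duration_ms": 150, "compute_cost_usd": 0.002, "converted": True},
    {"customer_id": "bobs", "plan": "free", "duration_ms": 1100, "compute_cost_usd": 0.006, "converted": False},
    {"customer_id": "bobs", "plan": "free", "duration_ms": 200, "compute_cost_usd": 0.002, "converted": False},
]


def cost_per_request_by(events, key):
    """Average compute cost per request, grouped by any field on the event."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [total cost, request count]
    for e in events:
        totals[e[key]][0] += e["compute_cost_usd"]
        totals[e[key]][1] += 1
    return {k: cost / n for k, (cost, n) in totals.items()}


def conversion_rate_by_speed(events, slow_threshold_ms):
    """Conversion rate for fast vs. slow requests (None if a bucket is empty)."""
    buckets = {"fast": [0, 0], "slow": [0, 0]}  # bucket -> [conversions, requests]
    for e in events:
        bucket = "slow" if e["duration_ms"] >= slow_threshold_ms else "fast"
        buckets[bucket][0] += int(e["converted"])
        buckets[bucket][1] += 1
    return {b: (conv / n if n else None) for b, (conv, n) in buckets.items()}


print(cost_per_request_by(events, "customer_id"))
print(cost_per_request_by(events, "plan"))
print(conversion_rate_by_speed(events, slow_threshold_ms=500))
```

Neither function is interesting on its own. What matters is that cost, latency, customer, and conversion live on the same record, so the group-by key is a choice you make at question time rather than at instrumentation time.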

Data lake ✅, query flexibility ✅, non-engineering functions…🚫

Hazel’s three predictions for what she calls “observability 3.0” are as follows:

Observability 3.0 backends are going to look a lot like a data lake-house architecture

Observability 3.0 will expand query capabilities to the point that it mostly erases the distinction between pay now / pay later, or “write time” vs “read time”

Observability 3.0 will, more than anything else, be measured by the value that non-engineering functions in the business are able to get from it

I agree with the first two — in fact, I think that’s exactly the trajectory that we’re on with 2.0. We are moving fast and accelerating in the direction of data lakehouse architectures, and in the direction of fast, flexible, and cheap querying. There’s nothing backwards-incompatible or breaking about these changes from a 2.0 -> 3.0 perspective.

Which brings us to the final one. This is the only place in the whole essay where there may be some actual daylight between where Hazel and I stand, depending on your perspective.

Other business functions already have nice things; we need to get our own house in order

No, I don’t think success will be measured by non-engineering functions’ ability to interrogate our data. I think it’s the opposite. I think it is engineers who need to integrate data about the business into our own telemetry, and get used to using it in our daily lives.

They’ve had nice things on the business side for years — for decades. They were rolling out columnar stores for business intelligence almost 20 years ago! Folks in sales and marketing are used to being able to explore and query their business data with ease. Can you even imagine trying to run a marketing org if you had to pre-define cohorts into static buckets before you even got started?

No, in this case it’s actually engineering that is the laggard. It’s a very “the cobbler’s children have no shoes” kind of vibe, that we’re still over here warring over cardinality limits and pre-defined metrics and trying to wrestle them into understanding our massively, sprawlingly complex systems.

So I would flip that entirely around. The success of observability 2.0 will be measured by how well engineering teams can understand their decisions and describe what they do in the language of the business.

Other business functions already have nice tools for business data. What they don’t have — can’t have — is observability that integrates systems and application data in the same place as their business data. Uniting all three sources, that’s on us.

If every company is now a technology company, then technology execs need to sit at the big table

Hazel actually gets at this point towards the end of her essay:

We’ve had multiple decades as an industry to figure out how to deliver meaningful business value in a transparent manner, and if engineering leaders can’t catch up to other C-suites in that department soon, I don’t expect them to stick around another decade

The only member of the C-suite that has no standard template for their role is…the CTO. CTOs are all over the freaking map.

Similarly, VPs of Engineering are usually not part of the innermost circles of execs.

Why? Because the point of that inner circle of execs is to co-make and co-own all of the decisions at the highest level about where to invest the company’s resources.

And engineering (and product, and design) usually can’t explain their decisions well enough in terms of the business for them to be co-owned and co-understood by the other members of the exec team. R&D is full of the artistes of the company. We tell you what we think we need to do our jobs, and you either trust us or you don’t.

(This is not a one-way street, of course; the levers of investment into R&D are often opaque, counter-intuitive and poorly understood by the rest of the exec team, and they also have a responsibility to educate themselves well enough to co-own these decisions. I always recommend these folks start by reading “Accelerate”.)

But twenty years of free money has done poorly by us as engineering leaders. The end of the ZIRP era is the best thing that could have happened to us. It’s time to get our house in order and sit at the big table.

“Know your business, run it with data”, as Jeff Gray, our COO, often says.

Which starts with having the right tools.

~charity

“Founder Mode” and the Art of Mythmaking

December 17, 2024 | December 18, 2024 | mipsytipsy | airbnb, brian chesky, culture, founder mode, leadership, management, paul graham | 29 Comments

I’ve never been good at “hot takes”. Anyone who knows anything about marketing can tell you that the best time to share your opinion about something is when everyone is all worked up about it. Hot topics drive clicks and eyeballs and attention en masse.

Unfortunately, my internal combustion engine doesn’t run that way. If anything, my fuel runs the other way. If everybody’s already buzzing about something, I feel like chances are, everything that needs to be said is already being said by someone else, so why should I bother?

Earlier this year I started writing a piece on why “hire great people and get out of their way” is such terrible, dangerous, counterproductive advice to give anyone in a leadership role. Then Paul Graham dropped his famous essay on “founder mode”, inspired by a talk given at a YC event by Brian Chesky. PG called it “a talk everyone who was there will remember…Most founders I talked to afterward said it was the best they’d ever heard.” The internet went nuts for it.

What I should have done: put my head down and finished the fucking piece. 🙄

What I actually did: ragetweeted a long thread from bed, read a bunch of other people’s takes, then went “well, all the bases seem to be covered” and lost all interest in finishing.

For the curious, here are the takes I really liked:

https://x.com/ejames_c/status/1830411301413421552, on why Airbnb’s free cash flow margin is due to their prepayment business model and has nothing whatsoever to do with ‘founder mode’, by Cedric Chin
https://x.com/clairevo/status/1830241899841732846, on why this is not unique to founders and is simply how good leaders operate, by Claire Vo
https://oxide.computer/blog/reflections-on-founder-mode, by Bryan Cantrill
and especially, maybe most of all: https://davekarpf.substack.com/p/paul-graham-and-the-cult-of-the-founder

A month and a half later, we all got to see what the fuss was about. Keith Rabois interviewed Brian Chesky at a Khosla Ventures event in NYC and posted the ensuing 45-minute video to YouTube, calling it “Founder Mode and the Art of Hiring”.

The gripping tale of Airbnb’s dramatic rise, crash, and rebirth

Chesky starts off by relating a story about how Airbnb in its early years hired way too many people, way too fast, and buckled under all the nasty consequences of hypergrowth. Lack of clarity and direction, excessive coordination costs, lack of focus, layers of bureaucracy that added no value or expertise, empire building, you name it.

So it’s 2019, and it’s just starting to dawn on Brian Chesky that he has this massive clusterfuck on his hands. But Airbnb is barrelling towards an IPO, so he feels like his hands are tied. Then COVID hits. Airbnb loses 80% of its business in 8 weeks, going from “the hottest IPO since Uber” to facing possible bankruptcy and dissolution, practically overnight. You never want to let a crisis go to waste, so Chesky seizes the opportunity to restructure the company and make a bunch of massive changes.

This is a fascinating story, right? It is! Or it should be. A young, first-time founder hits it big with his first startup, barrels through a decade of hypergrowth and free money towards a white hot IPO, then belatedly realizes everything he’s done has resulted in a big, bloated, horrendously inefficient company where nobody can get shit done and all the top talent is leaving. Then comes the pandemic. Holy shit! How will he turn things around??

This is an incredible story. I want to hear this story.

The problem is that he somehow manages to tell it in the most aggravating possible way, where he is a lone hero, buffeted by mediocrity and held back by his own employees at every turn. Actual quote:

“Oh my god, I guess I’m not crazy. I’m just made to believe I’m crazy by my own employees. You’re not crazy. Even though people who work for you tell you you are. You’re not crazy.”

He talks about the people who worked for him in supremely belittling terms — “C players”, “incapable”, “mediocre”, “worst people”. And he takes absolutely zero responsibility for the corporate disaster that developed in slow motion under his watch, while taking ALL the credit for its recovery.

How might another person have told this story?

I mean…if it was me, I might have started off by confessing that “Wow, I did not do a good job as CEO for the first decade of running my company. I over-hired, underspecified the roles, did a terrible job of setting expectations and rewarding the skills and behaviors that really mattered, didn’t know what org charts were for, and in general just completely failed to build a company that valued efficiency, or had any kind of effective strategy or culture of high performance”.

If Brian Chesky had done that I would have been like, “THIS MAN IS A HERO, EVERYONE STOP WHAT YOU ARE DOING AND COME HEAR HIS HARD WON WISDOM”. Instead, the way he tells the story, the problem is always everyone else, and the solution is always more Brian Chesky.

But Brian Chesky created the fucking problems, by being bad at running the business!

There is actually no shame in this! He is right: being a CEO is fucking hard. It does not come naturally. Nobody is born good at it. It takes a lot of hard work and pain and suffering to become someone who is good at running a company. I was CEO of Honeycomb for 3.5 years, and it almost killed me. I never got good at it. I have immense respect for the people who do it well.

But this attitude he has, where the buck stops literally everywhere but him — is one I find so fucking repellent. Ethics aside, I also feel like it constitutes a material risk to any company when the CEO is so lacking in humility and self-awareness. (I can leave room for the possibility that he is actually humble as fuck and he just…chose not to share those reflections with us in this talk. 🤷)

It took me a month to make it through the entire recording

I’ll be honest, I made it about three minutes into the video before I blew my fucking top and closed the tab. It made me so angry. This fucking guy. It pushes all my buttons.

But then I had a few conversations with other founders who did watch the whole thing, people I genuinely respect. I kept hearing there was great advice in the piece, if you can just get past the attitude and total lack of accountability.

It took me over a month to make it through the full thing, in fits and starts, but once I finally did, I had to admit that they were right. There is good advice inside, and there are reasonable principles embedded in this talk. Chesky seems to have successfully turned his company around, after all. That’s a really hard thing to do!

In the end, I forced myself to buckle down and get this piece out because … between PG’s “founder mode” essay and the wide distribution of the Chesky interview, these opinions have already imprinted onto generations of Silicon Valley founders and leaders. They have seeped into the water table, and there’s no going back.

I would PREFER the enduring legacy of both “founder mode” and Brian Chesky’s “The Art of Hiring” to be one that moves the industry forward in material ways, and not one that further entrenches the Silicon Valley cult of the founder, Great Man of History, 10x engineer Lone Ranger superhero John Galt type bullshit that has dogged our heels for decades. And there is some decent material here! We can work with this.

So let’s take the major points he makes, one at a time, and mine them for gold nuggets. Here we go!

The story, in Brian Chesky’s words

My apologies for the extremely long quotes, but I think they set the stage well. (Lightly edited for readability.)

“You know, we were one of the first ‘unicorns’, before that was a term. And it was amazing for a bit, from like 2009-2014. It was awesome. It was fun. It was exciting. And then one day it was horrible. And that day went on for like six years (emphasis mine). And basically what happened was I realized you can kind of be born a good founder…I think I was a pretty good founder the day we started the company…But I’m not sure any of us are born good CEOs.

But the other problem with being a CEO is I think almost all the advice and everything they teach at like Harvard Business School…is wrong. For example, the role of a great leader is to hire great people and empower them to do their job…If you do that, your company will be destroyed.”

I’ve never been to Harvard Business School, but I would be pretty surprised to learn that they don’t cover things like organizational structure, span of control, or operational efficiency.

“We had a company where we were like a matrix organization. And so like we had all these different teams. And by the way, there’s no governor of how many teams there are. So teams can create teams, can create sub teams, can create sub teams, that people can decide how many manager levels they create. Like if you’re not careful people do this. And why do they do this? Because they want to have new teams.”

(The “governor of how many teams there are” is whoever leads your People team or HR, btw, who in turn rolls up to the CEO. Again, org design is a pretty traditional and well-studied aspect of operating a company.)

“So let’s take a marketing or creative department. There’s a team in Airbnb doing graphics and different parts of the site need graphics, advertising needs graphics. And when it was five teams, the five teams would ask the graphics department for graphics and they’d have like five requests. And then pretty soon it’s 20 teams and once it’s 20 teams…they’re like the deli, there’s a line out the block, there’s a multi-month wait. And then what happens is the graphics team, the central service, kind of like gives up and everything seems pointless. And the teams waiting forever give up and they say, ‘give me my own people’. So now they get their own graphics team. So now you have 5 or 10 graphics teams. And you can do the same thing with technology. And product. Oh, you can have 10 data teams that have different metrics and we can go down the list.

So now you have 10 divisions. Now those 10 divisions are wanting to go in different directions. And they have general managers. And GMs are like little Russian babushka dolls. They want to create miniature GMs and miniature-miniature GMs. And so now you don’t have 10 teams, you’ve got actually 100 teams, because you’ve got these little babushkas running around and they’re going in 100 directions with different technology…

You end up with a lot of bureaucracy. You end up with a company where there’s meetings about meetings where metrics and strategic priorities are the only thing that bind the company together. There’s no cohesive product roadmap, everything is a different time horizon. It’s all short term oriented. And the biggest problem of all is a CEO gets separated from their own product.

And I noticed this thing where there was more bureaucracy, there were these divisions, the divisions then they have to advocate for resources. That advocacy creates politics. And then you have a situation where it’s hard to track what everyone’s doing. So you have like this free for all. There’s not a lot of accountability, which leads to complacency. The complacency means that, like the bad people, the good people are indistinguishable. So the good people tend to move on. They say the company’s changed, the company slows down, and one day you wake up.”

Sounds like a mess, all right.

(Chesky’s use of the passive voice here is truly spectacular. Who was in charge for those horrible six years while all this organizational fuckery and uncontrolled sprawl was happening? Oh right, you were.)

To sum up: before the pandemic, Airbnb seems to have had multiple business divisions, each of which had its own GM and a whole ass org structure, with its own engineering, design, marketing teams, etc. This seems wildly weird and inefficient and crazy to me, given that Airbnb only has one product, which is Airbnb? But, they did. So yeah, I am unsurprised that this did not work well.

Which brings us to our first lesson on efficiency.

You should have as few employees as possible

“So what did I do? The first thing I did is I went from a divisional structure to a functional organization. Functional organizations are when you have design and engineering and product management or product marketing and sales. So we went back to a functional organization where our goal was to have as few employees as possible…We said we were the Navy Seals, not the Navy. We want a small, lean, elite, highly skilled team, not a team of kind of mid-level battalion type people. And the reason why is that every person brings with them a communication tax.”

Basically, Brian Chesky is rediscovering this graphic and it’s blowing his mind.

I feel like this should be really fucking obvious, but I guess the legacy of hypergrowth companies proves that it is not: You should ALWAYS have as few employees as possible. Always. Hiring more people should never be the first lever you reach for; it’s what you do after exhausting your other options. Doing great things with a small team is always something to brag about.

(Okay…maybe not ALWAYS-always. There are some business models where your revenue scales linearly along with headcount, but for your average VC-funded technology startup, “we want a small, lean, elite, highly skilled team” is like saying “you should eat vegetables”.)
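One rough way to put a number on that "communication tax" is to count pairwise channels. This is an assumption about the arithmetic, not something Chesky spells out: if any two people might need to talk to each other, a team of n people has n(n-1)/2 possible channels, so the tax grows much faster than headcount does. The function name is made up for this sketch.

```python
def communication_channels(people: int) -> int:
    """Number of distinct pairs in a group of `people`: n * (n - 1) / 2."""
    if people < 0:
        raise ValueError("people must be non-negative")
    return people * (people - 1) // 2


for n in (5, 10, 50, 100):
    print(f"{n:>3} people -> {communication_channels(n):>4} possible channels")
# 5 -> 10, 10 -> 45, 50 -> 1225, 100 -> 4950
```

Doubling a team from 50 to 100 people roughly quadruples the number of pairs who might need to coordinate, which is the intuition behind treating headcount as a cost rather than a trophy.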

Your managers should be subject matter experts

“Oh and by the way, you have leaders that are, quote, managers. I don’t like managers. We don’t have a single manager at Airbnb. And I put that in air quotes. A manager that doesn’t know how to do the job is like a cavalry general that can’t ride a horse. A lot of companies do that. So we only allowed managers that were experts but for a long time we had managers. And one day I woke up and I realized I had 50 year olds, managing 40 year olds, managing 30 year olds, managing interns, doing the job with all these layers that weren’t adding any value.”

The disgust in his voice when he says the word “managers” is palpable. And it’s gross. You can talk about the importance of managers being highly skilled in their domain — and I have, many times! — without treating people with contempt, or disparaging them in public for performing the exact jobs that, again, your own company defined and hired them to do, and they faithfully did, for years.

The moral of the story is valid. The tone is unwarranted and disrespectful (and the whiff of ageism is just the rotten little cherry on top).

As for his claim that “A lot of companies do that” — hire managers that aren’t experts in their field, who just do pure people management — no? Maybe? Not that I’m aware of, not in the past decade. Citation needed.

You don’t manage people, you manage people through the work

“I got rid of all quote managers or they left the company and we said you can only manage the function if you’re an expert. So like the head of design has to actually manage the work first. You don’t manage people. You manage people through the work. I learned this from Jony Ive because most heads of design, at most tech companies don’t actually manage design. They manage the people. Jony Ive would say no, my main job is to manage the work and I build a team and we design together. But I’m mostly looking at the work. I’m not like having career conversations all day long. That’s crazy.”

Again, I’m not sure where he gets this idea that at “most tech companies”, the head of design is just like…hired from Starbucks or something for their people management skills? So mystifying.

“The best way to get rid of meetings is to not have so many people”

“The reason there’s too many meetings in a company isn’t because they don’t have no-meeting Wednesdays, it’s because they have too many people. People create meetings, and the best way to get rid of meetings is to not have so many people. There’s no other better way to do that (emphasis mine).”

Um…it might be a mistake to read this too literally, but this is a really stupid thing to say. People do incur coordination costs, but just to be clear, there are lots of ways to get rid of meetings, no matter how many people you do or don’t have, and you should absolutely be investing in some of them in an ongoing way. For example:

Develop a rich written culture and rituals around async work
Make recordings available, use AI transcription and summaries, or take notes and send them around
Use calendar plugins to visualize where your time is going, or even automatically reschedule meetings to compact your calendar and create blocks of focus time (e.g. Clockwise)
Declare calendar bankruptcy for meetings with >3 people every quarter, like Shopify does
Use ‘optional’ invites to be clear whether you’re inviting someone because you need them there vs for awareness purposes, or because you think they might be interested
Simply remind people that they own their calendar, and it’s okay to decline!

Synchronous meetings are one of many, many ways to coordinate between people and groups. There are others. Explore and experiment.

Maybe don’t call your employees “C players”, “incapable people” or “non world class”

“So you end up with this situation where non world class people, you know the old saying ‘A players hire A players, B players hire C players’, I would like to amend it. B players hire LOTS of C players, not just a few but a lot, because those are the kind of people that like building empires. If you can’t capably do your job, you don’t hire people better than you, and a person less capable than you can’t do the job.

So you need three incapable people because one incapable person can’t actually do all the work. But now three incapable people are just going in three different directions, creating all these meetings and all this administrative tax.”

Deep breaths.

Ok. My goal for this piece is NOT to spend the whole time complaining about Brian Chesky and his lack of accountability, empathy, or respect (or as a friend of mine put it: “I am prepared to argue that he has no theory of mind for any actor at the company that is not the CEO. The search for the deep truth can stop, Brian doesn’t actually know what people are.”)

I want to invest my own limited time and energy into plucking out the bits of advice he gives that are solid, practical, and actionable, so I can contextualize and expound upon them.

With that in mind, let’s skip right past the insults and acknowledge the fact that there are real challenges here. It’s extremely difficult to evaluate people who are more skilled than you are in the interview process, and harder still to evaluate those who are skilled in a different domain. Developing these muscles as an organization, figuring out what excellence looks like for each level in each role, maintaining a high bar of quality and employee-role fit…these are investments, and they take time and attention.

Constraints fuel creativity. Constraints also fuel efficiency. One of the biggest pathologies of hypergrowth is that when money is free, and everybody is telling you to “go go go, grow grow grow!”, discipline tends to fly out the window. These things are hard to do well even under the best of circumstances; when everyone’s being given unlimited budgets and told to hire their way out of their backlog, well, can you blame them for doing exactly as they’ve been told?

Pretty shitty to retroactively decide they were all losers, if you ask me.

Great leadership is presence, not absence

“Founder mode at its core, though, is about the single principle to be in the details. Great leadership is presence, not absence. So to go back to my lesson, it is not good for you to hire great people and trust them to do their job. How do you know if they’re doing a good job if you’re not in the details?

You should start in the details. And no one does this (emphasis mine). Everyone hires executives and they let them do their thing, and then they find out a year later, the whole thing has been wrong. They’ve hired people they shouldn’t have hired. Now you got to get in the details. And of course, now their confidence goes down. They always inevitably leave the company. And you should actually start in the details, develop trust, develop muscle memory and then let go. So great leadership is presence not absence.”

A-fucking-men.

…Except for the one small fact that Chesky keeps repeating, “no one does this”. My dude, everyone does this. Nobody just hires an executive and sets them loose and doesn’t look over their shoulder for a year. What the flying fuck? That is lunacy. I love that you are discovering basic leadership principles and it is just fucking flooring you, but have you ever cracked a book about management, or talked to another leader? Ever?

Christine and I learned a long time ago not to tell our execs, “I’m not going to tell you how to run your org.” The goal is to do the work to be in alignment so that you don’t have to tell someone how to run their org, because you have a shared idea of what “great” looks like — and what “good enough” looks like — and you can catch deviations early, while they’re easy enough to fix.

Great leadership is presence, not absence; agreed, absolutely. But what does that mean exactly? Fortunately, he’s about to tell us.

“I review every single thing in the company. If I don’t review it, it doesn’t ship.”

“There was this paradox of CEO involvement. The less involved I got in a project, the more dysfunctional it got; the more dysfunctional it got, the more people assumed the dysfunction came from leadership…And then it would get so screwed up, then I would get involved. So what I ended up doing, I took a playbook of Steve Jobs, Elon Musk does this, Jensen Huang does this, Walt Disney does this, all of them do this. (emphasis mine)

If the CEO is the chief product officer in the company, then you should review all of the work. So I review every single thing in the company. If I don’t review it, it doesn’t ship. I review everything on a cadence…If you’re not actually good at product, you don’t have good judgment and you’re not a super skilled product leader, then maybe you shouldn’t be CEO of the company, I don’t know. So let’s assume you’re actually good at what you do, then I think you should review all the work.”

Whuf.

Let’s back up a second. Brian Chesky has led Airbnb on an incredible journey over the past 17 years — from idea to startup to bloated, sprawling post-unicorn behemoth; through a near-death experience, restructuring and IPO; and emerged on the other side of it all as a public company with a share price of $130. He didn’t do this alone (I really loathe the trope where we treat companies like the extension and embodiment of one man’s will to power), but this also doesn’t happen by accident or happenstance.

He deserves credit for this. It’s more than I’ve done! Who cares what I have to say about any of this, really? I don’t have the same degree of believability as Brian Chesky when it comes to how to build a resilient, enduring, high-quality product company.

So let’s listen to someone who does have believability. Here’s what Reed Hastings says in “No Rules Rules: Netflix and the Culture of Reinvention” (share price: $921):

“There’s a whole mythology about CEOs and other senior leaders who are so involved in the details of the business that their product or service becomes amazing. The legend of Steve Jobs was that his micromanagement made the iPhone a great product…Of course, at most companies, even at those who have leaders who don’t micromanage, employees seek to make the decision the boss is most likely to support.

We don’t emulate those top-down models, because we believe we are fastest and most innovative when employees throughout the company make and own decisions. At Netflix, we strive to develop good decision-making muscles everywhere in our company — and we pride ourselves on how few decisions senior management makes (emphasis mine).”

His co-author, Erin Meyer, chimes in:

“People desire and thrive on jobs that give them control over their own decisions. Since the 1980s, management literature has been filled with instructions for how to delegate more and ‘empower employees to empower themselves’…The more people are given control over their own projects, the more ownership they feel, and the more motivated they are to do their best work.”

OMG, confusing!! Evidently ALL of them do NOT do it. What even IS the moral of the story here?! Well…it’s not a simple one, unfortunately. It turns out that you can’t just copy what Brian Chesky did at Airbnb, or what Reed Hastings did at Netflix, and paste it into your company and expect the same results. Bummer!

There are many paths up the mountain

This is an architecture problem. The Chesky/Airbnb architecture is like a monolith application, or a single-threaded process. Everything goes through the CEO, and that’s how they maintain quality. The Hastings/Netflix architecture is more like a microservices application or a threaded, highly concurrent process.

Either can work. Both have tradeoffs and implications. If you try to import either philosophy wholesale, it will break in unexpected ways; if you try to mix and match, it will probably be an unfettered nightmare.
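To stretch the single-threaded vs. concurrent analogy a little, here is a toy model. It is only a sketch: the hours and item counts are invented, and nothing here comes from either company. Each piece of work needs a review before it ships, and the only thing that varies is how many people are allowed to do the reviewing.

```python
import heapq


def hours_until_everything_ships(review_hours: list[float], reviewers: int) -> float:
    """Time until the last item is reviewed.

    Items are handed out in order to whichever reviewer frees up first.
    """
    if reviewers < 1:
        raise ValueError("need at least one reviewer")
    free_at = [0.0] * reviewers  # min-heap: when each reviewer is next available
    finish = 0.0
    for hours in review_hours:
        start = heapq.heappop(free_at)  # earliest available reviewer
        end = start + hours
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


work = [2.0] * 12  # twelve items, two hours of review each (made-up numbers)

print(hours_until_everything_ships(work, reviewers=1))  # 24.0, everything waits on one person
print(hours_until_everything_ships(work, reviewers=4))  # 6.0, decisions spread across four people
```

The model only measures throughput, which is the easy half. What it leaves out is the actual tradeoff: one reviewer gives you one consistent standard for free, while many reviewers only work if they share an idea of what "great" looks like.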

Your architecture will only work if it solves for your problems, utilizing your resources, values, and contingencies. It needs to be authentic, consistent, and internally coherent. This doesn’t mean you can’t learn anything from either of these companies. You can — I have! But you should probably treat them like reference architectures — just-so stories about how individual cultures have successfully evolved in response to their unique challenges and threats, not recipe books.

And I can tell you right away that as an employee, one of these models looks a whole hell of a lot more appealing than the other.

But wait — it gets worse. 😅

Should the CEO interview every candidate?

“I interviewed the first 400 people and I wish I interviewed longer. Maybe my biggest regret is not interviewing the first thousand. I think you should interview every candidate until the recruiting team stages an intervention. Once they stage an intervention, you should interview for two more years after that until everyone threatens to resign…and then you should step away.”

Well. If this is the kind of company you’re choosing to build, then I suppose you may as well be consistent.

Can you be calibrated as an interviewer on every single opening, for every role? My God, no, not even close.

The thing is…I have talked to so many people who work at companies where the CEO insists on interviewing every candidate. It seems to be a trend that is gaining steam rather than losing steam, much to everyone’s misfortune.

Which means that I have personally heard so many anguished stories from angry, frustrated engineering managers who have had their decisions overturned by arrogant CEOs who lacked the skills to evaluate their candidate’s experience, who were biased in blatant and embarrassing ways, who were so fucking overconfident in their own judgment that their teams are constantly having to compensate and apologize and mop up after them.

Want an example? Sure. I recently heard from a director at a 500-person company who spent six months cultivating and recruiting an exceptional hire with an unusual skill set. The candidate made it through their interview loop with flying colors, only for the CEO to reject them because they had recently had a child and were forthright about the fact that work/life balance was a meaningful consideration for them at this point in time. (The director did their best to do damage control, but even though the CEO ultimately relented, the candidate was no longer willing to leave their job. Can you blame them?!?)

It keeps getting worse! Here comes the low point.

“If they would come work for you, they’re not good enough. They’re only good enough if they come to work for me.”

“Can I give you an example of what I do today that no one else, not no one but maybe 95% of public company CEOs don’t do. I have an executive team, right?…I have like seven execs and 40 or 50 VPs. All the directs to my directs dual report to me. I am the co-hiring manager of all the directs to my directs and so we meet and I often tell my directs, ‘I don’t want somebody that you could hire without me. If they would come to work for you, they’re not good enough. They’re only good enough if they come to work for me. So if you can hire them without my help, they’re not good enough.’”

I just about lost my shit over this. Do you hear yourself, bud?

The irony is…I am actually the world’s hugest proponent of skip level 1x1s. I have two or three half-written blog posts in my drafts folder preaching the value of skip levels. I’ve written MULTIPLE Twitter threads over the years, talking about how important it is to build relationships with your manager’s managers and your direct reports’ direct reports.

I’ve said that I think skip levels are like end-to-end health checks. It’s important to open a line of communication and explicitly invite critical feedback and bad news. It’s a way to verify that managers are doing a good job managing their teams. It’s how you help iron out telephone games and ensure packets are being transmitted and received up and down the org chart. They are such a critical contribution to organizational health and clear communications, and not enough places invest in them.

I’m also a big proponent of promoting from within, of hiring ambitious people — all of it.

But this attitude towards hierarchy that locates the CEO at the center of every universe, and ranks people in importance according to their proximity…it’s just gross. It’s an attitude that’s contagious; it spreads, like syphilis. And I do not think it unlocks intrinsic motivation or excellence in most humans. It mostly incentivizes a bunch of maladaptive behaviors like sucking up to the CEO.

UGH. Okay, this is getting really long. I’m going to jump rapid-fire through a few final nuggets.

Executive hiring fails when you hire someone at the wrong stage

“Probably the number one reason executive hiring fails is because you hire somebody at the wrong stage. And they were managing instead of building, and you didn’t know that. And so you brought in a manager who is an expert or not so expert, but comfortable in a highly political bureaucracy. And now they have to do things themselves and they can’t. They also have the wrong stage instinct, right? Maybe a CMO used to run $500 million marketing budgets. Now they have a $50 million or $5 million budget, and they don’t know what to do and they can’t do anything themselves.”

Yes, execs can fail because they are managing instead of building, but they can ALSO fail because they are building instead of managing. I’ve worked with execs who operated like they were effectively the most senior IC in the room, and they had…extreme limitations as leaders, let’s put it that way.

Overall, this is a solid point. Being a CMO that takes a company from $1-10m or $10-$50m is a very, very different skill set than taking a company from $50 to $250m, or through an IPO.

We look for executives who can both scale up and scale down. Scale up: you can speak credibly to the board, at the right level of abstraction vs detail, you can craft strategy, see around corners etc. Scale down: you know what “good” looks like for work all over your organization, you can get down in the weeds to help coach a struggling IC back to victory, you can debug a flailing campaign or workflow. Both matter.

References are critical for building confidence in your hires

“I actually prioritize references over interviewing…Andreessen Horowitz would tell me, you should do 8 hours of reference checks per employee.”

Agreed. I’ve said many times that if I had to choose between interviews or references, I would pick references every time. (Fortunately, you don’t have to pick!)

“Ask them who the best people are. Say, ‘okay, separate from this topic, I just want to know who’s the best person you’ve ever worked with.’ Do they say the person’s name you just talked about?”

This trick doesn’t fool anybody.

“Then you ask questions like, okay, what do I need to watch out for? If I were to hire them? What is the one area of development you would give them?”

This is good advice. You should always probe into people’s weaknesses and areas of development. Everyone has them, there’s no shame in that. Hearing details about where they are weak can give you confidence, and set you up better to support them. It gives you richer insight into them as a person and coworker.

A basket of interviewing tips and tricks

“Interviewing. My first tip is you ask follow up questions. You ask them how to explain how they did something. And the key is to ask two followups. You never want to get the first answer, you always want the third answer.”

Asking follow-up questions is a classic technique, and a good one. But don’t let them dominate the conversation with a narrative. You want to be intentional about pulling on specific threads and making sure they answer what you asked, not pull a politician’s move and give the answer they feel like answering. Does the answer sound canned, or are they thinking on their feet?

“Often there’s too many people interviewing for too short a time, not going deep enough. Your interview panel should be as few people as possible, going as deep as possible…3 or 4 people going really deep is better than 8 or 10 people giving you their first impression…and they’re actually mostly thinking about what this means for them.”

Yeah, so this is an area where my thinking has actually changed a lot over the years. I used to cast a much wider net, like I felt like people ought to get to interview anyone who was being hired over them. I’ve come to realize that having too many veto points in the system is dangerous and doesn’t actually add more value. Yes, people like being offered the opportunity to affirmatively vet someone, but at a certain point you have to prioritize the candidate experience — and trust your team to make good choices.

It’s usually better to have fewer interviewers, but make sure they are all well calibrated for the role, and that there’s a certain amount of coordination between them so everyone is covering different questions/aspects of the role. If you have 8 or 10 interviewers, that is way, way too many.

“Every potential hire is guilty until proven innocent. It is the opposite of our justice system. Most people, when they interview, they look for the absence of weaknesses and that is innocence. The presumption is someone’s good. You should always presume somebody is not good. You need proof. They don’t work for you. So you need evidence to hire them, not evidence to eliminate them as a candidate and almost every company gets this wrong. And what they end up doing is hiring mediocre people with an absence of weaknesses, not people that have a preponderance of evidence of being really good and spike in a few areas.”

Again, there is a solid principle buried deep under all this repugnant bullshit about “mediocre people” and “guilty until proven innocent”. Here’s how I would put it: you want to hire people for their unique strengths, not their lack of weaknesses. If they’re strong where you need them to be strong, it’s okay if they aren’t equally superpowered at everything — that’s why we build teams, to supplement and balance each other out.

In the Honeycomb interview process, we emphasize that we want to see you at your best — please help us do that! If you don’t feel like we’ve seen your strengths, please tell us, so we can fix it.

See, how hard was that? Same point, zero jackassery.

There is no such thing as the ‘best people’

“Another way to look at it is the quality of the people. People never hire people better than them. So there might be people that are good at their job, but it’s not enough to be good at your job in most large companies. If you are the best in the world at your job, but you can’t hire really great people, then you’re not going to be the best in the world because your team isn’t really good.”

God, he does this over and over again, talking about people like they exist on some index you can stack rank or something.

Here’s one small mental hack that makes a world of difference: remember that you are trying to hire the right people to join your team/org/company.

Not the “best” people.

The right people.

The fact that someone isn’t a superstar employee for this company, this product, this team, at this stage, doesn’t mean they might not be a superstar employee for someone else. And people who aren’t “superstar employees” are still worthy of your respect. Not wanting to work your ass off is a perfectly legitimate life choice and does not make them a lower quality human. Maybe they aren’t the right hire for you, but you don’t have to treat them — or talk about them — like shit.

People who work for big, stable companies are not necessarily bad at their job or incapable of building things. They have a different skill set, they may work at a different tempo, but this doesn’t mean they suck. My god. So fucking condescending.

There’s such a special kind of hubris in these startup kids who are losing tens of millions of dollars a year and looking down their noses at their peers in organizations that are making tens of millions of dollars a year, believing themselves to be categorically better than them just because they can…prototype real fast? Unclear.

Building a world-class team is about more than just hiring

I wrote a piece a few years ago called “The Real 11 Reasons I Don’t Hire You”, where I discussed a few of the many variables that go into deciding who to hire. It’s complicated — it is irreducibly complicated. And it should be.

But it’s also just the beginning. The team, the culture, the sociotechnical systems you hire them into are going to exert a gravitational pull over all of the people you hire. Are you bringing them into an environment that is generative, playful, creative, experimental, intense, competitive, demoralizing, controlling, grinding, aspirational, compliant, hierarchical, passive-aggressive, or aggressive-aggressive? Are standards applied consistently? What behaviors get rewarded or punished, actively or otherwise? Who gets mentored and fast tracked to the top? Who gets the most facetime with the CEO? Is CEO facetime a prized currency? Why? Systems drive behavior.

Social psychologists have a term for the cognitive bias that causes us to predictably, consistently over-emphasize individual agency and attributes and underestimate situational factors: the FAE, or Fundamental Attribution Error. This whole interview is sopping with FAE energy.

It’s not as simple as “just hire great people”. You want to hire people who share your values, want to do the job, have the right skills, are motivated, etc, and then the conditions you create for them to work under will either cause them to flourish and feed their creativity and drive, or will crush them and shut them down. The feedback loop runs both ways.

Hypergrowth is hazardous to your company’s health

“In a hypergrowth company, it could even be 50% of your time is hiring.”

Chesky mentions hypergrowth only once and briefly, towards the end, but it’s a vital piece of context if you want to understand the Airbnb story.

As he says, Airbnb was one of the O.G. unicorns — a unicorn before they coined the term ‘unicorn’. It was born in the era of hypergrowth and free money. That’s the only way to make any sense of the fact that a company could pay such comically little attention to efficiency, for so long. (Thirteen years, to be exact.)

When all you have is a hammer, everything looks like a nail. In hypergrowth mode, you solve every problem by throwing more resources at the system. The tools you learn are weird ones, which map awkwardly to the skills you need to run a normal, sustainable company that’s expected to turn a profit. Hypergrowth encourages a raft of bad habits, and attacking every problem by hiring more people is one of them.

This is not good for anyone, except perhaps venture capitalists. The externalities are dreadful. It’s impossible to scale your culture, your practices, your values, or people’s expectations at an equivalent pace. The correction is brutal, when the time finally comes to worry about efficiency — and eventually, everybody needs to worry about efficiency. The higher the ride, the harder the fall. The bill comes due.

The CEO-centric view of the universe

One of my least favorite things about YC is the way it seems to pursue extremely young and inexperienced founders. If you’ve never been a manager, director, VP, staff or principal engineer, it’s a lot easier to look down on those people and disrespect the role they play in the ecosystem.

It looks like Brian Chesky was about 26 years old when he cofounded Airbnb. He has basically been a CEO for his entire career. And this is, I think, a great example of the kind of blinkered perspective you get from someone who has no real idea what it’s like to sit anywhere else on the org chart.

After watching the first 40 minutes of this talk, one might reasonably wonder if Brian Chesky understands that being CEO of a company means being accountable for its outcomes.

What makes all of this extra frustrating is that in the final five minutes, he shows us that he does know this…at least when it comes to board interactions.

“Oftentimes if you take advice from a VC and it doesn’t work and you don’t have traction…You’re still held responsible. So the only thing that matters is you’re successful, not if you listen to them or not. People sometimes forget and they’re like, well, you shouldn’t have listened to me. They don’t say it that way, but that’s kind of the way it happens. So I would just know that, like, you own the outcome no matter what.”

Yeah, bro. You do.

There Is Only One Key Difference Between Observability 1.0 and 2.0

November 19, 2024 / December 21, 2024 | mipsytipsy | observability 2.0, unified storage

Originally posted on the Honeycomb blog on November 19th, 2024

We’ve been talking about observability 2.0 a lot lately: what it means for telemetry and instrumentation, its practices and sociotechnical implications, and the dramatically different shape of its cost model. With all of these details swimming about, I’m afraid we’re already starting to lose sight of what matters.

The distinction between observability 1.0 and observability 2.0 is not a laundry list, it’s not marketing speak, and it’s not that complicated or hard to understand. The distinction is a technical one, and it’s actually quite simple:

Observability 1.0 has three pillars and many sources of truth, scattered across disparate tools and formats.
Observability 2.0 has one source of truth, wide structured log events, from which you can derive all the other data types.

That’s it. That’s what defines each generation, respectively. Everything else is a consequence that flows from this distinction.

Multiple “pillars” are an observability 1.0 phenomenon

We’ve all heard the slogan, “metrics, logs, and traces are the three pillars of observability.” Right?

Well, that’s half true; it’s true of observability 1.0 tools. You might even say that pillars define the observability 1.0 generation. For every request that enters your system, you write logs, increment counters, and maybe trace spans; then you store telemetry in many places. You probably use some subset (or superset) of tools including APM, RUM, unstructured logs, structured logs, infra metrics, tracing tools, profiling tools, product analytics, marketing analytics, dashboards, SLO tools, and more. Under the hood, this telemetry is stored in a mix of formats and systems: metrics formats, unstructured logs (strings), structured logs, time-series databases, columnar databases, and other proprietary storage systems.

Observability 1.0 tools force you to make a ton of decisions at write time about how you and your team would use the data in the future. They silo off different types of data and different kinds of questions into entirely different tools, as many different tools as you have use cases.

Many pillars, many tools.

An observability 2.0 tool does not have pillars.

Your observability 2.0 tool has one unified source of truth

Your observability 2.0 tool stores the telemetry for each request in one place, in one format: arbitrarily-wide structured log events.

These log events are not fired off willy-nilly as the request executes. They are specifically composed to describe all of the context accompanying a unit of work. Some common patterns include canonical logs, organized around each hop of the request; traces and spans, organized around application logic; or traces emitted as pulses for long-running jobs, queues, CI/CD pipelines, etc.
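To make “specifically composed” a little more concrete, here is a minimal sketch in Python of the canonical-log pattern: context accumulates as the request executes, and exactly one wide event is emitted at the end. The WideEvent class, the handle_checkout handler, and every field name are hypothetical, invented for illustration; this is not how any particular vendor’s SDK works.

```python
import json
import time
import uuid


class WideEvent:
    """Accumulates context for one unit of work and emits it once, at the end."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self._start = time.monotonic()

    def add(self, **fields):
        # Any code path handling the request can attach more context.
        self.fields.update(fields)

    def finish(self):
        # Record the duration, then serialize everything as one structured log line.
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        return json.dumps(self.fields, sort_keys=True)


def handle_checkout(user_id, cart_items):
    # Hypothetical request handler; all field names are assumptions.
    event = WideEvent(
        trace_id=uuid.uuid4().hex,
        service="checkout",
        endpoint="/checkout",
        build_id="2024-11-19.3",
    )
    event.add(user_id=user_id, cart_size=len(cart_items))
    total_cents = sum(item["price_cents"] for item in cart_items)
    event.add(cart_total_cents=total_cents, status_code=200)
    return event.finish()


line = handle_checkout("user-8675309", [{"price_cents": 1200}, {"price_cents": 350}])
print(line)
```

The point of the shape is that nothing is logged mid-request. Each layer adds fields to the same event, so the single line that comes out the other end carries the whole story of that unit of work.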

Structuring your data in this way preserves as much context and connective tissue as possible about the work being done. Once your data is gathered up this way, you can:

Derive metrics from your log events
Visualize them over time, as a trace
Zoom into individual requests, zoom out to long-term trends
Derive SLOs and aggregates
Collect system, application, product, and business telemetry together
Slice and dice and explore your data in an open-ended way
Swiftly compute outliers and identify correlations
Capture and preserve as much high-cardinality data as you want

The beauty of observability 2.0 is that it lets you collect your telemetry and store it—once—in a way that preserves all that rich context and relational data, and make decisions at read time about how you want to query and use the data. Store it once, and use it for everything.
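Here is a rough sketch of what “decisions at read time” can look like, assuming the events are already sitting in one store (a plain Python list stands in for it here). The field names and helper functions are hypothetical. The counter, the latency percentile, and the availability number are all computed from the same stored events, after the fact.

```python
import math

# Stand-in for the single store of wide events; every field name is illustrative.
events = [
    {"endpoint": "/checkout", "status_code": 200, "duration_ms": 120.0, "user_id": "u1"},
    {"endpoint": "/checkout", "status_code": 500, "duration_ms": 950.0, "user_id": "u2"},
    {"endpoint": "/search", "status_code": 200, "duration_ms": 40.0, "user_id": "u1"},
    {"endpoint": "/search", "status_code": 200, "duration_ms": 60.0, "user_id": "u3"},
    {"endpoint": "/checkout", "status_code": 200, "duration_ms": 180.0, "user_id": "u3"},
]


def request_count(events, **where):
    """A 'counter metric', derived by filtering events on any fields you like."""
    return sum(1 for e in events if all(e.get(k) == v for k, v in where.items()))


def percentile(values, pct):
    """Nearest-rank percentile over raw values, not pre-aggregated buckets."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def availability(events):
    """An SLO-style ratio: the share of requests that did not fail with a 5xx."""
    ok = sum(1 for e in events if e["status_code"] < 500)
    return ok / len(events)


checkout_count = request_count(events, endpoint="/checkout")
p50 = percentile([e["duration_ms"] for e in events], 50)
p95 = percentile([e["duration_ms"] for e in events], 95)
slo = availability(events)
print(checkout_count, p50, p95, slo)  # 3 120.0 950.0 0.8
```

None of these questions had to be anticipated when the events were written. A real store would run them as queries over columnar data rather than Python loops, but the order of operations is the same: store first, decide later.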

Everything else is a consequence of this differentiator

Yeah, there’s a lot more to observability 2.0 than whether your data is stored in one place or many. Of course there is. But everything else is unlocked and enabled by this one core difference.

Here are some of the other aspects of observability 2.0, many of which have gotten picked up and discussed elsewhere in recent weeks:

Observability 1.0 is how you operate your code; observability 2.0 is about how you develop your code

Observability 1.0 has historically been infra-centric, and often makes do with logs and metrics software already emits, or that can be extracted with third-party tools
Observability 2.0 is oriented around your application code, the software at the core of your business

Observability 1.0 is traditionally focused on MTTR, MTTD, errors, crashes, and downtime
Observability 2.0 includes those things, but it’s about holistically understanding your software and your users—not just when things are broken

To control observability 1.0 costs, you typically focus on limiting the cardinality of your data, reducing your log levels, and reducing the cost multiplier by eliminating tools
To control observability 2.0 costs, you typically reach for tail-based or head-based sampling

Observability 2.0 complements and supercharges the effectiveness of other modern development best practices like feature flags, progressive deployments, and chaos engineering
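Since sampling is the main cost lever mentioned in the list above, here is a simplified sketch of the two flavors. It assumes a hash of the trace ID decides what to keep, and that each kept event records its sample rate so counts can be re-weighted later. The function names, thresholds, and fields are all assumptions made for illustration, not any specific product’s behavior.

```python
import zlib


def head_sample(trace_id, rate):
    """Head-based: decide up front from the trace id alone, keeping about 1 in `rate`."""
    return zlib.crc32(trace_id.encode()) % rate == 0


def tail_sample(event, rate):
    """Tail-based: decide once the event is complete, so the outcome can inform the choice.

    Returns the sample rate to stamp on the event, or None to drop it.
    """
    if event["status_code"] >= 500 or event["duration_ms"] > 1000:
        return 1  # keep every error and every slow request
    return rate if head_sample(event["trace_id"], rate) else None


def estimated_count(kept):
    # Each kept event stands in for `sample_rate` original events.
    return sum(e["sample_rate"] for e in kept)


# Synthetic traffic: 1,000 requests, 1% of which are errors.
traffic = [
    {"trace_id": f"trace-{i}", "status_code": 500 if i % 100 == 0 else 200, "duration_ms": 50}
    for i in range(1000)
]

kept = []
for event in traffic:
    sample_rate = tail_sample(event, rate=10)
    if sample_rate is not None:
        kept.append({**event, "sample_rate": sample_rate})

print(len(kept), "events stored; estimated original count:", estimated_count(kept))
```

The tradeoff is that head-based sampling is cheap but blind to how the request turns out, while tail-based sampling keeps the interesting events at the cost of buffering each one until it finishes.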

The reason observability 2.0 is so much more effective at enabling and accelerating the entire software development lifecycle is because the single source of truth and wide, dense, cardinality-rich data allow you to do things you can’t in an observability 1.0 world: slice and dice on arbitrary high-cardinality dimensions like build_id, feature flags, user_id, etc. to see precisely what is happening as people use your code in production.
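A tiny illustration of that kind of slicing, assuming the raw events are available to query. The error_rate_by helper and the field names (build_id, flag_new_cart, user_id) are made up for the example. Note that the dimension is chosen when you ask the question, not when the data is written.

```python
from collections import defaultdict

# Illustrative wide events; every field name here is hypothetical.
events = [
    {"build_id": "a1", "user_id": "u1", "flag_new_cart": False, "status_code": 200},
    {"build_id": "a1", "user_id": "u2", "flag_new_cart": False, "status_code": 200},
    {"build_id": "b2", "user_id": "u3", "flag_new_cart": True, "status_code": 500},
    {"build_id": "b2", "user_id": "u1", "flag_new_cart": False, "status_code": 200},
    {"build_id": "b2", "user_id": "u4", "flag_new_cart": True, "status_code": 500},
]


def error_rate_by(events, dimension):
    """Group by any field, however high its cardinality, and compute the 5xx rate."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for e in events:
        key = e.get(dimension)
        totals[key] += 1
        if e["status_code"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


by_build = error_rate_by(events, "build_id")
by_flag = error_rate_by(events, "flag_new_cart")
by_user = error_rate_by(events, "user_id")
print(by_build)
print(by_flag)
```

In this toy data the errors look like a “build b2” problem until you group by the flag, at which point every failure lines up with flag_new_cart being on. That is the kind of question you can only ask if the dimensions were kept together on the same event.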

In the same way that whether a database is a document store, a relational database, or a columnar database has an enormous impact on the kinds of workloads it can do, what it excels at and which teams end up using it, the difference between observability 1.0 and 2.0 is a technical distinction that has enduring consequences for how people use it.

These are not hard boundaries; data is data, telemetry is telemetry, and there will always be a certain amount of overlap. You can adopt some of these observability 2.0-ish behaviors (like feature flags) using 1.0 tools, to some extent—and you should try!—but the best you can do with metrics-backed tools will always be percentile aggregates and random exemplars. You need precision tools to unlock the full potential of observability 2.0.

Observability 1.0 is a dinner knife; 2.0 is a scalpel.

Why now? What changed?

If observability 2.0 is so much better, faster, cheaper, simpler, and more powerful, then why has it taken this long to emerge on the landscape?

Observability 2.0-shaped tools (high cardinality, high dimensionality, explorable interfaces, etc.) have actually been de rigueur on the business side of the house for years. You can’t run a business without them! It was close to 20 years ago that columnar stores like Vertica came on the scene for data warehouses. But those tools weren’t built for software engineers, and they were prohibitively expensive at production scale.

FAANG companies have also been using tools like this internally for a very long time. Facebook’s Scuba was famously the inspiration for Honeycomb. However, Scuba ran on giant RAM disks as recently as 2015, which means it was quite an expensive service to run. The falling cost of storage, bandwidth, and compute has made these technologies viable as commodity SaaS platforms, at the same time as the skyrocketing complexity of systems due to microservices and decoupled architecture patterns has made them mandatory.

Three big reasons the rise of observability 2.0 is inevitable

Number one: our systems are exploding in complexity along with power and capabilities. The idea that developing your code and operating your code are two different practices that can be done by two different people is no longer tenable. You can’t operate your code as a black box, you have to instrument it. You also can’t predict how things are going to behave or break, and one of the defining characteristics of observability 1.0 was that you had to make those predictions up front, at write time.

Number two: the cost model of observability 1.0 is brutally unsustainable. Instead of paying to store your data once, you pay to store it again and again and again, in as many different pillars or formats or tools as you have use cases. The post-ZIRP era has cast a harsh focus on a lot of teams’ observability bills—not only the outrageous costs, but also the reality that as costs go up, the value you get out of them is going down.

Yet the cost multiplier angle is in some ways the easiest to fix: you bite the bullet and sacrifice some of your tools. Cardinality is even more costly, and harder to mitigate. You go to bed Friday night with a $150k Datadog bill and wake up Monday morning with a million dollar bill, without changing a single line of code. Many observability engineering teams spend an outright majority of their time just trying to manage the cardinality threshold—enough detail to understand their systems and solve users’ problems, not so much detail that they go bankrupt.
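Some back-of-the-envelope arithmetic shows why cardinality bites so hard. In many metrics-backed tools, each unique combination of tag values becomes its own time series, so the worst-case series count is the product of the per-tag cardinalities. The tag names and counts below are invented, and real billing models vary, so treat this as a sketch of the shape of the problem rather than a pricing calculator.

```python
import math


def max_series(tag_cardinalities):
    """Upper bound on time series for one metric: the product of distinct values per tag."""
    return math.prod(tag_cardinalities.values())


# Hypothetical tags on a single request-latency metric.
tags = {"endpoint": 50, "status_code": 5, "host": 200}

before = max_series(tags)
after = max_series({**tags, "build_id": 20})

print(f"{before:,} -> {after:,}")  # 50,000 -> 1,000,000
```

One extra tag with twenty values multiplies the ceiling by twenty. That is roughly how a bill can jump without anyone touching the instrumentation code: all it takes is a tag that starts seeing more distinct values.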

And that is the most expensive part of all: engineering cycles. The cost of the time engineers spend laboring below the value line—trying to understand their code, their telemetry, their user behaviors—is astronomical. Poor observability is the dark matter of engineering teams. It’s why everything we do feels so incredibly, grindingly slow, for no apparent reason. Good observability empowers teams to ship swiftly, consistently, and with confidence.

Number three: a critical mass of developers have seen what observability 2.0 can do. Once you’ve tried developing with observability 2.0, you can’t go back. That was what drove Christine and me to start Honeycomb, after we experienced this at Facebook. It’s hard to describe the difference in words, but once you’ve built software with fast feedback loops and real-time, interactive visibility into what your code is doing, you simply won’t go back.

It’s not just Honeycomb; observability 2.0 tools are going mainstream

We’re starting to see a wave of early startups building tools based on these principles. You’re seeing places like Shopify build tools in-house using something like ClickHouse as a backing store. DuckDB is now available in the open-source realm. I expect to see a blossoming of composable solutions in the next year or two, in the vein of ELK stacks for o11y 2.0.

Jeremy Morrell recently published the comprehensive guide to observability 2.0 instrumentation, and it includes a vendor-neutral overview of your options in the space.

There are still valid reasons to go with a 1.0 vendor. Those tools are more mature, fully featured, and most importantly, they have a more familiar look and feel to engineers who have been working with metrics and logs their whole career. But engineers who have tried observability 2.0 are rarely willing to go back.

Beware observability 2.0 marketing claims

You do have to be a little bit wary here. There are lots of observability 1.0 vendors who talk about having a “unified observability platform” or having all your data in one place. But what they actually mean is that you can pay for all your tools in one unified bill, or present all the different data sources in one unified visualization.

The best of these vendors have built a bunch of elaborate bridges between their different tools and storage systems, so you can predefine connection points between e.g. a particular metric and your logging tool or your tracing tool. This is a massive improvement over having no connection points between datasets, no doubt. But a unified presentation layer is not the same thing as a unified data source.

So if you’re trying to clear a path through all the sales collateral and marketing technobabble, you only need to ask one question: how many times is your data going to be stored?

Is there one source of truth, or many?

How Hard Should Your Employer Work To Retain You?

October 11, 2024 / October 12, 2024 | mipsytipsy | companies, culture, leadership, management

Recently we learned that Google spent $2.7 billion to re-hire a single AI researcher who had left to start his own company. As Charlie Brown would say: “Good grief.” 🙄

This is an (incredibly!) extreme example. But back in the halcyon days of the zero interest rate phenomenon (ZIRP), smaller versions of this tale played out daily. Many rank-and-file engineers have stories about submitting their resignation, or threatening to quit, and their managers plying them with stock or cash or promotions to stay. This happened so much that it started to seem like the normal thing to do when you wanted a raise or a promotion. Job hopping for better comp also happened, but people quickly figured out that by merely threatening to leave, you could often get the loot without the hassle of having to actually switch jobs.

Many of these stories have been embellished dramatically over time, as real anecdotes fade into legends of the “my friend knows a person who” or “I read it on Blind” varieties, but the lore is based in reality. It really did happen. The legacy of these episodes is…not great.

To be clear, I do not begrudge employees trying to maximize their wages and comp by changing jobs. It’s the gamification and brinksmanship I object to, and all the ways it ends up distorting company culture and values and outcomes. In the overheated ZIRP environment, lots of companies felt like this is what they were forced to do to compete for talent. Maybe so, maybe not. But money is not the only thing people value, which means that this is not the only way to compete for talent.

After all, the hot air of the inflationary ZIRP bidding wars is what led to the post-ZIRP job market collapse. The boom and bust cycle is stressful and counterproductive, which leads to uneven, disastrously unfair outcomes and an oppositional, extractive mindset on both sides. We can do better. We must do better. Let’s talk about how.

You should stay at your job as long as it fulfills your career priorities

How long should you stay at your job? As long as it’s the best thing you can do for your career, or at least a reasonable, smart career choice, in alignment with your own personal career goals and life priorities.

Maybe this sounds mind-numbingly obvious to you. But far too many people stay far too long at jobs where they aren’t happy, aren’t growing, and aren’t setting their future selves up for success. Hey, I’ve been there…these decisions can be brutal. 💔

Your career is an appreciating, multimillion dollar asset, probably the largest single asset you will ever own. How you define what is best or right for you will inevitably shift over the course of your 40-year career, and that’s fine. This is normal.

But you have to make these decisions based on what is right for you, your career, and your family. Not because, say, you feel responsible for protecting your team from upper management, or you’re afraid of what will happen to the product or the team if you leave, or you feel like you owe them something. Nor should you stay out of fear, whether that be fear of interviews, that this is the best you can do, etc.

Sometimes your top priority might be making the most money, so you can get out of debt. Sometimes it might be a simple, uncomplicated paycheck and low expectations so you can spend a lot of time with your family. Sometimes you may be on a hot streak and raring to go, working like crazy and making a name for yourself in the industry. When in doubt, my advice is to 1) preserve optionality, 2) follow good people and 3) lean into that which energizes you.

The company should employ you as long as it’s a good fit

There are certainly companies where people get fired too quickly or in bad faith. There are also companies where people who are not working out linger on and on and on in the role. It might be tempting to conceive of the latter situation as more worker-friendly, but in all honesty, neither situation is great.

If the wants and needs of the company and the employee are not aligned, you aren’t doing them any favors by dragging it out or keeping them around in a prolonged state of purgatory. If things are decidedly not working out, I promise, they are miserable.

If you are a manager, your number one job is to bring clarity. What are the expectations for the role, what does success look like, what support does the employee need in getting there? When things aren’t going well, your job is to work with them to figure out what is happening, and come up with a plan. Is there a shared understanding of what success looks like in this role? Is it a skills gap, are there relationships that need mending, do they need some time off to deal with personal issues? Are they still interested in the work? Is it still a good fit?

There is an extremely short list of jobs that can only be done by managers, and managing people out (which does sometimes mean firing them, but not always), is at the tippy top of that list. Making sure the right people are on the team is job number one. Figuring this shit out swiftly — we’re talking months, not years — is critical.

Also, none of this happens magically or automatically. This shit is hard. Which is why it is important to invest in these skills and set expectations for your managers.

Your manager should try to make this a great career opportunity for you, for as long as possible

It’s the job of your manager to ensure that this role is a great opportunity for you, for as long as possible. For mid-level engineers this means making sure you are learning and expanding your skill sets, that you have access to mentorship and support systems, that you get to follow your curiosity to some extent and work on things that interest you. For more senior folks, this might mean looking out for opportunities to lead projects or wear new hats.

But that won’t be forever, for anyone — not even your CEO or founders! And that’s okay. This is not a family, it’s a company, and hopefully something of a community.

Sometimes you get an opportunity you can’t refuse. Or life takes you in a different direction. It happens! It is not a tragedy when people leave for a better opportunity or something that excites them.

Real-life example: Paul Osman left Honeycomb because he and his family were moving to NYC and needed a Big Tech salary. He was a wonderful staff engineer (at a time when those were scarce), a high performer, effective across the org, beloved by all; he was even on our board of directors, our first elected employee board member! But when he let us know he was going to leave, we … wished him well. We couldn’t match the salary he needed to pull down; he knew that, we knew that. Nor would it have been fair to all the other staff engineers if we had tried.

Managers need to be actively engaging in career development and planning with their reports. The more you know about someone’s personal values and priorities, the better you can do to try and set them up with opportunities that appeal to them and the trajectory they are on.

Your manager should also be honest if you could find better opportunities elsewhere

I also believe that good managers will be honest with their employees when they feel like this may no longer be the best place for them. Not every opportunity exists at every company, at every time.

It can be hard to admit to your star employee that if you were them, you’d be looking elsewhere for opportunities. Maybe you have an incredible, ambitious senior engineering director who is hungry and chafing to move up, but you don’t expect to see any openings at the VP level over the next year or two. They deserve to know that, I think.

To be clear, you are NOT firing them. Usually, you are holding your breath and praying they will choose to stay. Often they do! Maybe they love their job enough that they’re happy to stick around for another couple years just to see if any openings do arise, or they switch into passive job search mode, taking interesting calls but not actively looking. Maybe you have a conversation about ways they could build their career in other ways, by doing more writing and speaking. Maybe they decide this is a good window of time to have another kid.

But if you can’t honestly look them in the eye and tell them this is the best place for them, given what you know of their ambitions and priorities, you have to say so. It’s on them to decide what to do with that information. But if you want them to trust you when you say this is a great opportunity for their career, you have to be truthful when the opportunity is just not there.

Some amount of employee turnover is natural and healthy

When I worked at Linden Lab in my early twenties, I remember vividly how much pride we took in the fact that people never left. I was there for 4.5 years, and I think we had a single-digit number of departures that entire time. I remember thinking to myself how incredibly special this company must be, because nobody ever wants to leave.

It was a special company. ❤️ But when I look back now, this part makes me cringe. Yep, nobody ever left. No one was ever managed out, even the people who never seemed to do anything but hang out in Second Life or work on whatever the fuck they personally felt like doing. It was a little bit … culty? There were some incredible engineers there, but also a systematic inability to row in the same direction or make a plan and execute on it. In some ways Linden felt more like a social club than a business.

I loved working there, don’t get me wrong, and I learned a lot. But in retrospect, some amount of turnover is good. It’s healthy. It means you have standards for yourselves, and someone is paying attention to whether or not we’re actually making progress and getting shit done, or whether or not the people we need are in the right seats.

Tenure functions somewhat differently at very large companies; it may take years for someone just to come up to speed and learn how to operate within the system, so they do their best to retain people for decades. When you’re a startup in growth mode, though, you become a completely different organism every few years. People who are happy as clams and supremely productive from $0-$1m or 1-50 people may or may not adjust well to the $50m or $200m environment. People who are superstars on one side of the Dunbar number are sometimes bitterly unhappy on the other side.

There is “regrettable” and “non-regrettable” attrition, but the company should be able to go on operating even in the face of “regrettable” departures.

There are, of course, exceptions. So let’s talk about these.

Sometimes people sit in critical roles at critical moments

At any given time, there exists a subset of people who are disproportionately critical to the success of the business at the moment, people whose departure could seriously jeopardize the company’s ability to meet its goals this quarter or even this year. It sucks, but it’s a reality. This happens.

If that’s a very long list of people, however, or if it’s the same people over and over, or if the actual survival of the company would be in jeopardy and not just a subset of your goals, then your leaders are not doing their fucking job.

Part of the job of running a company is developing talent to be successors to key people. Part of their job is to replicate and distribute critical company knowledge and skills. None of us should be irreplaceable — not even the CEO, or CTO, or founders. If the company’s future depends irrevocably on the continued employment of any individual person, the company’s leaders are fucking up, full stop.

There are two types of disproportionally critical employees: superstars and SPOFs

The right time to determine who is on that critical subset is NOT when one of them resigns. You should be asking yourselves somewhat regularly: which people are our superstars, the ones we really, really want to make sure are happy and fulfilled here, and which people are single points of failure, the ones we cannot easily replace or function without?

Note that these are not necessarily the same two lists!

This doesn’t have to be a heavyweight process, but if you are large enough to have a People team or HR team, they should be ensuring that talent reviews and succession planning conversations are happening like clockwork, once per quarter or so.

Your superstars are the people who are standout performers, carrying a ton of load for the company or generating uniquely creative ideas, etc. You should identify these people proactively and make sure they are feeling challenged, supported and valued. What are their values — what lights their fire? Where are they trying to go in their career, in their life? How do they like to receive recognition? How does it manifest when they feel overwhelmed or demotivated?

Managers tend to devote most of their attention to their lowest performers. Be wary of this. Yes, give people the support they need. But the biggest bang for your buck is typically the time you spend on your highest performers. Don’t neglect your superstars just because they are doing well.

Get to know your superstars, and compensate them

And compensate your superstars. Whatever pool of money is set aside for high performers at your company, make sure they get a slice of it — a raise, a bonus, direct equity, etc.

But money isn’t the full story, it’s just the first chapter. This is where you need to dig a little deeper and get to know them better — their values, their love languages, how they like to receive recognition. Make sure other company leaders know who is kicking ass and what kind of opportunities they’d be into.

Being a superstar should earn you more than money: it earns the right to experiment, try a moonshot, be first in line for a lateral role change into another area of interest. Maybe you can line them up with a work coach or continuing education, support them writing or presenting their work at conferences…the list is endless. What do they value? Find out.

It is normal and desirable for your shortlist of superstars to shift over time. If it’s always the same few names on the list, that may reflect a different problem: that you are handing out all of the opportunities to take risks and shine brightly to the same few people, over and over again. It’s your job to cultivate a deep bench of talent, not one or two lead singers with everyone else in the chorus.

Work on a plan to de-risk your SPOFs

And then there are your single points of failure, people who are the only person who knows how to do something, or the only person in a function. In the early days of any startup, you have a ton of these. As you grow, you should steadily pay down this list.

If superstars are the people you want to keep out of joy, SPOFs can be the people you need to keep out of fear. You can’t function without them, even if they’re mediocre contributors. This is bad on several levels.

This is just a risk analysis you need to work through as a leadership team. Have a plan, have backup options, and steer a path out of this state as soon as you can afford to.

I’m not naive. The realities of business are real, and sometimes something takes you by surprise, or you need to try and do a diving save for someone who has just announced they are leaving. But that should not be common. The normal, expected reaction when someone tells you they are leaving should be, “ah, that’s too bad, we’ll miss you! I’m so happy for you and this new opportunity you’re excited about!”

Most jobs will be saved or lost by boring organizational labor, not heroic diving saves.

Here is one important reality that many employees don’t seem to grasp:

The harder your employer is affirmatively working to do right by you, the fewer heroics they will be willing or able to do to retain you. And the harder the company is working to be fair and equitable, the less they will be willing or able to make exceptions to their existing compensation framework.

Here is one good end-to-end test of the system: you should not be able to get a higher salary or a larger stock grant by quitting and getting immediately re-hired. If you can, your company is not doing the work to value the labor of its existing employees by the same yardstick as it values new hires.

A lot of companies fail this test! Because in order for this to be true, your company needs to consistently adhere to pay bands, pegged to market rates, adjusted and reconciled each year. They need to do something like boxcar stock grants. They need to periodically audit their own levels and comp and look for evidence of systematic bias. They REALLY need to not make exceptions to their own god damn rules.

As Emily Nakashima says, “Many companies hemorrhage great employees in underrepresented groups because they do all those things but they fail to bring a DEI lens to them — ‘we have salary bands! we have a fair comp system! we think about ladders and promo paths!’ and then they do zero work to make sure those things are applied equitably to all their employees, including across axes of diversity like race and gender.”

All of these things take organizational willpower, and they are hard. It means a lot of hard conversations. It means saying “no” to people. It’s much easier to give out goodies to the people who complain the loudest or threaten to quit, at least in the short term.

It’s easy to talk about fairness and equity, but it takes a lot of structural labor to walk the walk

A lot of work goes into building and maintaining a system that can pass the sniff test in terms of compensating people fairly and equitably, instead of based on their negotiating skills or how much they made at their previous job.

You need to have a job ladder and levels you believe in, ones that accurately reflect the skills, behaviors and values of your org and have broad buy-in from the team. You need a process for leveling people as new hires and at review time, and for appealing those levels when you get it wrong. You should have salary bands for each level, with compa ratios based on market rates. You should be able to show your work and explain your decisions. (For example, we target the 65th percentile for companies of our size and funding levels, and we pay everyone SF market rates, no matter where they are located in the world.)
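If "compa ratio" is a new term for you: it is commonly defined as a person's salary divided by the midpoint of the salary band for their level. Here is a minimal sketch of that arithmetic. The band and salaries are made-up numbers for illustration, not anyone's real comp data, and `compa_ratio` is a hypothetical helper name.

```python
def compa_ratio(salary, band_min, band_max):
    """Salary divided by the midpoint of the band for that level."""
    midpoint = (band_min + band_max) / 2
    return salary / midpoint


# Hypothetical band and salaries, for illustration only.
band = (150_000, 190_000)  # midpoint is 170,000

for name, salary in [("A", 153_000), ("B", 170_000), ("C", 187_000)]:
    # A ratio below 1.0 means below the band midpoint; above 1.0 means above it.
    print(name, round(compa_ratio(salary, *band), 2))
```

Comparing ratios across people at the same level is one simple way to spot the outliers you then have to be able to explain.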

This is why review-time calibrations are so important. Calibrations are not about calibrating ICs, they are about calibrating managers. Calibrations are to diminish the inequity that results when one manager has a different understanding of the level an engineer is operating at, so the engineer would receive a different level, band, or rating under a different manager.

Obviously, all of these sociotechnical systems are made and operated by human beings, so there will always be some intrinsic messiness and imprecision. This is why it matters that managers show up with humility and work to get aligned with their peers on what truly matters to the company and the org. This is why it is so important that we show our work and engage with ladders and levels as living documents.

A lot of this labor is invisible to employees, and not especially well understood. I think a critical part of making these systems work is helping employees understand the tradeoffs being made, and how having a consistent leveling system ultimately benefits them, even if they are personally frustrated about not getting promoted this half. Which means every manager needs to be equipped to have these hard conversations with their team.

It should be okay to tell your manager you’re thinking about leaving, and talk about your options

HR teams will typically bucket departures into voluntary and involuntary, aka “regrettable” and “nonregrettable”. In reality, almost any time someone leaves their job, it’s some muddled combination of the two.

In the optimal case, voluntary departures are rarely a complete surprise. Surprises suck. They’re hard to plan around, they often leave gaps in coverage or contributions, and they’re a bummer for morale. You should be able to be honest with your manager and tell them if you’re starting to look around, or if you’re finding yourself less happy and motivated these days. However, this requires a lot of trust in the relationship — that the manager won’t retaliate, won’t fire you, etc — and from what I gather, it seems to be fairly uncommon in the wild. 🙁

Employees do not owe their manager a heads up or a conversation in advance, but this is unequivocally the level of relationship trust we should aim for.

Steph Hippo says, “I love being the manager people want to work for, and it took me a while to figure out how to also be the kind of manager people wanted to have ‘fire’ them by helping them move on. I’m really proud of how many people I’ve been able to help move off my team because we found a better fit. Doing this contributes to your reputation as a leader and as an employer. I found it meaningful if someone that moved on from my team did so on good terms, came back to visit, or sent other people to check out our job listings. That’s a sign that you’re parting with folks on good terms.” 💯

Managers can prove themselves worthy of this trust by not reacting, not retaliating, not treating people any differently, not leaping to conclusions, not running ahead and making decisions or commitments ahead of what the employee has stated.

Should you ever try to change someone’s mind about leaving?

Not never…but rarely. You should always try to understand why someone is leaving. Exit interviews are a great tool here, especially in situations where there has been relationship friction. Departures are a trailing indicator, but often a very powerful signal of things managers should be paying attention to, to make things better for those who remain.

If someone has decided to leave, you’re not going to “save” them via bribery alone. I’ve never seen the tactic of throwing money and titles at someone actually get them to stick around in the long run.

However, I have seen departure announcements get turned around when they include some form of development — when you can identify real underlying sources of discontent, and meet them with action.

Another real life example: A couple years ago, Phillip Carter told us he had decided to leave and take another role in the industry. We had some intense conversations about why that was and what was missing, and realized he had been struggling to connect with the reasons behind what we were building, largely because he had never written or supported code in production during his time as a software engineer. He decided not to leave after that, and he is here to this day.

There will be times when someone has decided to leave, and you want to fight for them to stay. In those situations, you need to get really crystal clear with yourself before taking action. What are the underlying risks to the business, and how far are you prepared to go?

On extremely rare occasions, heroic measures may be the lesser of two evils

Sometimes you may have to try for a diving save. That’s just the reality of doing business, esp at startup stages where you have less redundancy, a shorter planning horizon, more overall chaos and a smaller overall operating budget.

Sometimes your goals are at risk, and you feel like you don’t have a choice. But any time you find yourself bargaining or trying to bribe people to stay after they’ve decided to leave, you should take a hard fucking look at yourself and how you got there, and whether or not you can justify your actions.

Exceptions are often the path of least resistance for the manager making the exception in the moment, but they impose a heavy, compounding cost to the business over time. Any time you make an exception to keep someone, you risk breaking your commitments to everyone else. And rumors about exceptions being made will fly fast and furious (sometimes it seems like there is a 10-20x multiplier of rumors to reality). 😣

I will not sit here and tell you no exceptions can ever be made. Systems made of people are systems that are never perfect. Once in a while, making an exception might actually be the way to restore justice to a situation. Other times your ass is well and truly backed into a corner. But exceptions are SO costly to your credibility, you must at least build peer review and consequences for exceptions into the system.

A few checks and balances to consider:

Individual managers should not be able to make an exception without the buy-in of their director, VP, and people team
It should generally trigger some kind of review of the system policy in question, to see if it still serves its purpose
You should be able to look in each other’s eyes and explain your reasoning, and not feel ashamed of it if word gets out

Shit does happen. But if this kind of shit happens on the regular, you can’t blame people for becoming extremely cynical about the way you do business, and you can expect to get way more people trying to game the system to get the same results for themselves.

People should not use threats of leaving to try and effect change or get raises. This should not be an effective tactic — and in order for it not to be an effective tactic, we cannot reward it with results. When you make exceptions, you all but guarantee more people will try this.

People work at jobs for money, but not only money

While writing this piece, a friend told me a story about when he became an engineering manager a decade ago, and soon noticed that his two women engineers were the lowest paid and the lowest leveled people on the team, which didn’t seem to correlate with their actual skills or experience. He asked his own manager what was up with this, and the response he received was: “Well yeah, neither of them has ever been a flight risk.”

This kind of attitude is, to put it politely, a fucking cancer on our industry.

There are two radically different philosophies when it comes to corporate compensation. In the first scenario, you pay people as little as possible, and consider it your job as a manager to extract the most work out of people for the least pay. Information is power, so information asymmetry is endemic in these environments, and people are paid according to their skill at negotiation or brinksmanship. You typically blow your wad trying to compete for the “best” talent in the world.

In the second scenario, you do your best to compensate employees fairly and competitively, balancing their needs and wants against other stakeholders and the overarching mandate for the company as a whole to succeed. You practice transparency and show your work, and actively work to counter systemic biases. You understand you can’t compete for every great hire out there, but you try to equip people with the information they need to evaluate whether or not you are mutually a good fit.

Companies that operate according to the first scenario are so alienating and toxic (and almost certainly illegal, in many cases) that few will openly claim to be this kind of company. Most companies at least pay lip service to equity and fairness. But because everyone is typically mouthing the same kind of things, employees will scrutinize your actions far more than your words, especially when it comes to comp.

At the end of the day, these are jobs. People work at jobs for money, but not only money. I think we would all be better off if we could get better at articulating the tangible and intangible rewards of our labor, treating each other with dignity and honesty, and being straightforward about our needs and wants and goals on both sides, instead of treating comp like some kind of high stakes casino game.

Huge thanks to Steph Hippo, Paul Osman, Phillip Carter, Lesley Cordero, Emily Nakashima for their feedback, critique, and stories.

Is It Time To Version Observability? (Signs Point To Yes)

August 7, 2024 · August 9, 2024 · mipsytipsy · canonical logs, columnar storage, databases, metrics, monitoring, observability 2.0 · 8 Comments

Augh! I am so behind on so much writing, I’m even behind on writing shit that I need to reference in order to write other pieces of writing. Like this one. So we’re just gonna do this quick and dirty on the personal blog, and not bother bringing it up to the editorial standards of…anyone else’s sites. 😬

If you’d rather consume these ideas in other ways:

I gave a keynote at SRECon in March
Here are my slides from CTO Craft Con London in May
A Screaming In The Cloud podcast with Corey Quinn in April
My piece earlier in the year on the Cost Crisis in Observability Tooling touched on some of the concepts too
Matt Sanabria wrote a great piece comparing us and a bunch of other observability vendors in 2024.

What does observability mean? No one knows

In 2016, we first borrowed the term “observability” from the wikipedia entry for control systems observability, where it is a measure of your ability to understand internal system states just by observing its outputs. We (Honeycomb) then spent a couple of years trying to work out how that definition might apply to software systems. Many twitter threads, podcasts, blog posts and lengthy laundry lists of technical criteria emerged from that work, including a whole ass book.

In 2018, Peter Bourgon wrote a blog post proposing that “observability has three pillars: metrics, logs and traces.” Ben Sigelman did a masterful job of unpacking why metrics, logs and traces are just telemetry. However, lots of people latched on to the three pillars language: vendors because they (coincidentally!) had metrics products, logging products, and tracing products to sell, engineers because it described their daily reality.

Since then the industry has been stuck in kind of a weird space, where the language used to describe the problems and solutions has evolved, but the solutions themselves are largely the same ones as five years ago, or ten years ago. They’ve improved, of course — massively improved — but structurally they’re variations on the same old pre-aggregated metrics.

It has gotten harder and harder to speak clearly about different philosophical approaches and technical solutions without wading deep into the weeds, where no one but experts should reasonably have to go.

This is what semantic versioning was made for

Look, I am not here to be the language police. I stopped correcting people on twitter back in 2019. We all do observability! One big happy family. 👍

I AM here to help engineers think clearly and crisply about the problems in front of them. So here we go. Let’s call the metrics, logs and traces crowd (the “three pillars” generation of tooling) “Observability 1.0”. Tools like Honeycomb, which are built on arbitrarily-wide structured log events as a single source of truth, are “Observability 2.0”.

Here is the twitter thread where I first teased out the differences between these generations of tooling (all the way back in December, yes, that’s how long I’ve been meaning to write this 😅).

About two months ago I wrote this thread about how we lost the battle to define ✨observability✨ -- to give it a real, specific, falsifiable technical definition, distinct from monitoring or telemetry.

I complained, I argued, I grieved...and now I'm over it. SO over it. 🙄 https://t.co/QH3U0iZZ8G
— Charity Majors (@mipsytipsy) December 22, 2023

This is literally the problem that semantic versioning was designed to solve, by the way. Major version bumps are reserved for backwards-incompatible, breaking changes, and that’s what this is. You cannot simultaneously store your data across both multiple pillars and a single source of truth.
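If you haven't lived in semver land lately, here is a minimal sketch of the rule I'm leaning on. The `parse` and `is_breaking` helpers are hypothetical names for illustration, and this only handles plain `MAJOR.MINOR.PATCH` strings, not pre-release or build metadata.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse(text: str) -> Version:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release/build tags)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def is_breaking(old: str, new: str) -> bool:
    """A higher major number signals a backwards-incompatible change."""
    return parse(new).major > parse(old).major


print(is_breaking("1.4.2", "1.5.0"))  # minor bump: still compatible
print(is_breaking("1.5.0", "2.0.0"))  # major bump: breaking change
```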

Incompatible. Breaking change. O11y 1.0, meet O11y 2.0.

Small technical changes can unlock waves of powerful sociotechnical transformation

There are a LOT of ramifications and consequences that flow from this one small change in how your data gets stored. I don’t have the time or space to go into all of them here, but I will do a quick overview of the most important ones.

The historical analogue that keeps coming to mind for me is virtualization. VMs are old technology, they’ve been around since the 70s. But it wasn’t until the late 90s that VMware productized it, unlocking wave after wave of change, from cloud computing and SaaS to the very DevOps movement itself.

I believe the shift to observability 2.0 holds a similarly massive potential for change, based on what I see happening today, with teams who have already made the leap. Why? In a word, precision. O11y 1.0 can only ever give you aggregates and random exemplars. O11y 2.0, on the other hand, can tell you precisely what happened when you flipped a flag, deployed to a canary, or made any other change in production.

Will these waves of sociotechnical transformation ever be realized? Who knows. The changes that get unlocked will depend to some extent on us (Honeycomb), and to an even greater extent on engineers like you. Anyway, I’ll talk about this more some other time. Right now, I just want to establish a baseline for this vocabulary.

1.0 vs 2.0: How does the data get stored?

1.0 💙 O11y 1.0 has many sources of truth, in many different formats. Typically, you end up storing your data across metrics, logs, traces, APM, RUM, profiling, and possibly other tools as well. Some folks even find themselves falling back to B.I. (business intelligence) tools like Tableau in a pinch to understand what’s happening on their systems.

Each of these tools is siloed, with no connective tissue, or only a few, predefined connective links that connect e.g. a specific metric to a specific log line. Aggregation is done at write time, so you have to decide up front which data points to collect and which questions you want to be able to ask. You may find yourself eyeballing graph shapes and assuming they must be the same data, or copy-pasting IDs around from logging to tracing tools and back.

2.0 💚 Data gets stored in arbitrarily-wide structured log events (often called “canonical logs”), often with trace and span IDs appended. You can visualize the events over time as a trace, or slice and dice your data to zoom in to individual events, or zoom out to a bird’s-eye view. You can interact with your data using group by, break down, etc.

You aggregate at read time, and preserve raw events for ad hoc querying. Hopefully, you derive your SLO data from the same data you query! Think of it as B.I. for systems/app/business data, all in one place. You can derive metrics, or logs, or traces, but it’s all the same data.
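To make the write-time vs read-time distinction concrete, here is a toy sketch in plain Python. It is not how any real storage engine works; the events, field names (`build_id`, `user_id`, etc.) and the `count_by` helper are all made up for illustration.

```python
from collections import Counter

# Hypothetical request events; field names are illustrative only.
events = [
    {"endpoint": "/export", "status": 200, "duration_ms": 120, "user_id": "u1", "build_id": "b41"},
    {"endpoint": "/export", "status": 500, "duration_ms": 950, "user_id": "u2", "build_id": "b42"},
    {"endpoint": "/home", "status": 200, "duration_ms": 30, "user_id": "u2", "build_id": "b42"},
    {"endpoint": "/export", "status": 500, "duration_ms": 990, "user_id": "u3", "build_id": "b42"},
]

# 1.0 style: aggregate at write time. Only the counters chosen up front survive.
write_time_metrics = Counter()
for e in events:
    write_time_metrics[("requests", e["endpoint"], e["status"])] += 1
# In a real metrics pipeline the raw events are gone at this point,
# and build_id and user_id went with them.


# 2.0 style: keep the raw events and aggregate at read time, with any grouping.
def count_by(rows, *fields, where=lambda row: True):
    """Filter rows, then count them grouped by the given fields."""
    return Counter(tuple(r[f] for f in fields) for r in rows if where(r))


# A question nobody planned for: which builds are producing the 500s?
errors_by_build = count_by(events, "build_id", where=lambda r: r["status"] >= 500)
print(errors_by_build)
```

The write-time counters can tell you that `/export` threw two 500s. Only the raw events can tell you both came from the same build.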

1.0 vs 2.0: on metrics vs logs

1.0 💙 The workhorse of o11y 1.0 is metrics. RUM tools are built on metrics to understand browser user sessions. APM tools are built using metrics to understand application performance. Long ago, the decision was made to use metrics as the source of truth for telemetry because they are cheap and fast, and hardware used to be incredibly expensive.

The more complex our systems get, the worse of a tradeoff this becomes. Metrics are a terrible building block for understanding rich data, because you have to discard all that valuable context at write time, and they don’t support high (or even medium!) cardinality data. The only way to enrich the data is via tags.

Metrics are a great tool for cheaply summarizing vast quantities of data. They are not equipped to help you introspect or understand complex systems. You will go broke and go mad if you try.
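Here is the back-of-the-envelope arithmetic behind the cardinality problem, as a rough sketch. It assumes a metrics store that keeps one time series per unique combination of tag values, which is how many of them work; the tag names and counts are invented for illustration.

```python
from math import prod

# Hypothetical tag cardinalities for one metric (illustrative numbers only).
tag_cardinality = {
    "endpoint": 50,
    "status_code": 10,
    "region": 5,
}


def worst_case_series(cardinalities):
    """Upper bound on time series: one per unique combination of tag values."""
    return prod(cardinalities.values())


print(worst_case_series(tag_cardinality))  # 50 * 10 * 5 = 2500

# Adding one high-cardinality tag multiplies the bound by its cardinality.
tag_cardinality["user_id"] = 100_000
print(worst_case_series(tag_cardinality))  # 2500 * 100000 = 250000000
```

Real systems rarely hit the full worst case, but the multiplication is why one innocent-looking tag can blow up your bill.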

2.0 💚 The building block of o11y 2.0 is wide, structured log events. Logs are infinitely more powerful, useful and cost-effective than metrics are because they preserve context and relationships between data, and data is made valuable by context. Logs also allow you to capture high cardinality data and data relationships/structures, which give you the ability to compute outliers and identify related events.

1.0 vs 2.0: Who uses it, and how?

1.0 💙 Observability 1.0 is predominantly about how you operate your code. It centers around errors, incidents, crashes, bugs, user reports and problems. MTTR, MTTD, and reliability are top concerns.

O11y 1.0 is typically consumed using static dashboards — lots and lots of static dashboards. “Single pane of glass” is often mentioned as a holy grail. It’s easy to find something once you know what you’re looking for, but you need to know to look for it before you can find it.

2.0 💚 If o11y 1.0 is about how you operate your code, o11y 2.0 is about how you develop your code. O11y 2.0 is what underpins the entire software development lifecycle, enabling engineers to connect feedback loops end to end so they get fast feedback on the changes they make, while it’s still fresh in their heads. This is the foundation of your team’s ability to move swiftly, with confidence. It isn’t just about understanding bugs and outages, it’s about proactively understanding your software and how your users are experiencing it.

Thus, o11y 2.0 has a much more exploratory, open-ended interface. Any dashboards should be dynamic, allowing you to drill down into a question or follow a trail of breadcrumbs as part of the debugging/understanding process. The canonical question of o11y 2.0 is “here’s a thing I care about … why do I care about it? What are all of the ways it is different from all the other things I don’t care for?”

When it comes to understanding your software, it’s often harder to identify the question than the answer. Once you know what the question is, you probably know the answer too. With o11y 1.0, it’s very easy to find something once you know what you’re looking for. With o11y 2.0, that constraint is removed.

1.0 vs 2.0: How do you interact with production?

1.0 💙 You deploy your code and wait to get paged. 🤞 Your job is done as a developer when you commit your code and tests pass.

2.0 💚 You practice observability-driven development: as you write your code, you instrument it. You deploy to production, then inspect your code through the lens of the instrumentation you just wrote. Is it behaving the way you expected it to? Does anything else look … weird?

Your job as a developer isn’t done until you know it’s working in production. Deploying to production is the beginning of gaining confidence in your code, not the denouement.
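As a rough illustration of instrumenting as you write, the sketch below wraps a unit of work in a context manager that collects fields and emits one wide event at the end. Everything here is hypothetical: `wide_event`, the `EMITTED` list standing in for a real event pipeline, and the `apply_discount` function with its coupon table.

```python
import json
import time
from contextlib import contextmanager

EMITTED = []  # stand-in sink; a real system would ship these to an event store


@contextmanager
def wide_event(name, **fields):
    """Collect context for one unit of work and emit it as a single event."""
    event = {"name": name, **fields}
    start = time.monotonic()
    try:
        yield event
        event.setdefault("error", None)
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        EMITTED.append(json.dumps(event, sort_keys=True))


def apply_discount(cart_total, coupon, user_id):
    with wide_event("apply_discount", user_id=user_id, coupon=coupon) as ev:
        ev["cart_total"] = cart_total
        rate = {"SPRING10": 0.10}.get(coupon, 0.0)
        # Record the decision you just wrote, so you can inspect it in production.
        ev["discount_rate"] = rate
        ev["coupon_recognized"] = rate > 0
        return round(cart_total * (1 - rate), 2)
```

After deploying something like this, "is it behaving the way I expected?" becomes a query over `coupon_recognized` and `discount_rate` rather than a guess.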

1.0 vs 2.0: How do you debug?

1.0 💙 You flip from dashboard to dashboard, pattern-matching and looking for similar shapes with your eyeballs.

You lean heavily on intuition, educated guesses, past experience, and a mental model of the system. This means that the best debuggers are ALWAYS the engineers who have been there the longest and seen the most.

Your debugging sessions are search-first: you start by searching for something you know should exist.

2.0 💚 You check your instrumentation, or you watch your SLOs. If something looks off, you see what all the mysterious events have in common, or you start forming hypotheses, asking a question, considering the result, and forming another one based on the answer. You interrogate your systems, following the trail of breadcrumbs to the answer, every time.

You don’t have to guess or rely on elaborate, inevitably out-of-date mental models. The data is right there in front of your eyes. The best debuggers are the people who are the most curious.

Your debugging questions are analysis-first: you start with your user’s experience.
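One crude way to ask "what do the weird events have in common?" is to compare how often each field value shows up among the outliers versus everything else. This is only a sketch under simple assumptions: the `what_is_different` helper and the field names in the usage note are invented, and real tools do this far more efficiently.

```python
from collections import Counter


def what_is_different(events, is_outlier, dimensions):
    """Rank (dimension, value) pairs by how over-represented they are among outliers."""
    outliers = [e for e in events if is_outlier(e)]
    baseline = [e for e in events if not is_outlier(e)]
    findings = []
    for dim in dimensions:
        out_counts = Counter(e.get(dim) for e in outliers)
        base_counts = Counter(e.get(dim) for e in baseline)
        for value, n in out_counts.items():
            out_share = n / len(outliers)
            base_share = base_counts[value] / len(baseline) if baseline else 0.0
            findings.append((out_share - base_share, dim, value))
    # Largest gap first: these are the breadcrumbs worth following.
    return sorted(findings, key=lambda f: f[0], reverse=True)
```

Calling it with something like `what_is_different(events, lambda e: e["duration_ms"] > 500, ["build_id", "region"])` surfaces the values that set the slow requests apart, which is the next question to ask rather than the final answer.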

1.0 vs 2.0: The cost model

1.0 💙 You pay to store your data again and again and again and again, multiplied by all the different formats and tool types you are paying to store it in. Cost goes up at a multiple of your traffic increase. I wrote a whole piece earlier this year on the cost crisis in observability tooling, so I won’t go into it in depth here.

As your costs increase, the value you get out of your tools actually decreases.

If you are using metrics-based products, your costs go up based on cardinality. “Custom metrics” is a euphemism for “cardinality”; “100 free custom metrics” actually means “100 free cardinality”, aka unique values.
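The arithmetic behind that is worth seeing once. In a tag-based metrics model, the worst-case number of distinct time series for one metric name is the product of the number of unique values of each tag. The tag names and counts below are invented, and actual billing rules vary by vendor, so treat this as a back-of-the-envelope sketch.

```python
from math import prod


def series_count(tag_cardinalities):
    """Worst-case number of distinct time series one metric name can produce."""
    return prod(tag_cardinalities.values())


# Hypothetical tags on a single request-latency metric.
tags = {"endpoint": 50, "status": 5, "region": 4}
low_cardinality = series_count(tags)

# Add one high-cardinality tag and the worst case multiplies by its size.
with_user_id = series_count({**tags, "user_id": 10_000})
```

With these numbers the metric goes from 1,000 possible series to 10,000,000 by adding a single tag, which is why teams end up stripping out exactly the fields they most want to query on.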

2.0 💚 You pay to store your data once. As your costs go up, the value you get out goes up too. You have powerful, surgical options for controlling costs via head-based or tail-based dynamic sampling.
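Here is a small sketch of dynamic sampling, assuming a made-up policy (keep every error, roughly 1 in 20 of everything else) and a decision made once the event is complete, which is closer in spirit to tail-based sampling. Hashing the trace ID keeps the decision consistent for every event in a trace, and each kept event carries its sample rate so counts can be re-weighted at read time. The function names and the policy are assumptions, not any particular vendor's API.

```python
import hashlib


def sample_rate_for(event):
    """Hypothetical policy: keep every error, about 1 in 20 of everything else."""
    return 1 if event["status"] >= 500 else 20


def keep(event):
    """Deterministic decision: hash the trace ID so a whole trace is kept or dropped together."""
    rate = sample_rate_for(event)
    digest = hashlib.sha256(event["trace_id"].encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket % rate == 0


def sample(events):
    """Return kept events, each annotated with the rate it was sampled at."""
    return [
        {**event, "sample_rate": sample_rate_for(event)}
        for event in events
        if keep(event)
    ]


def estimated_count(kept_events):
    """Each kept event stands in for sample_rate original events."""
    return sum(e["sample_rate"] for e in kept_events)
```

The point of the design is that you throw away boring, repetitive traffic and keep the rare, interesting events in full, while aggregate counts stay approximately right.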

You can have infinite cardinality. You are encouraged to pack hundreds or thousands of dimensions in per event, and any or all of those dimensions can be any data type you want. This luxurious approach to cardinality and data is one of the least well understood aspects of the switch from o11y 1.0 to 2.0.

Many observability engineering teams have spent their entire careers massaging cardinality to control costs. What if you just .. didn’t have to do that? What would you do with your lives? If you could just store and query on all the crazy strings you want, forever? 🌈

Metrics are a bridge to our past

Why are observability 1.0 tools so unbelievably, eyebleedingly expensive? As anyone who works with data can tell you, this is always what happens when you use the wrong tool for the job. Once again, metrics are a great tool for summarizing vast quantities of data. When it comes to understanding complex systems, they flail.

I wrote a whole whitepaper earlier this year that did a deep dive into exactly why tools built on top of metrics are so unavoidably costly. If you want the gnarly detail, download that.

The TLDR is this: tools built on metrics — whether RUM, APM, dashboards, etc — are a bridge to our past. If there’s one thing I’m certain of, it’s that tools built on top of wide, structured logs are the bridge to our future.

Wide, structured log events are the bridge to our future

Five years from now, I predict that the center of gravity will have swung dramatically; all modern engineering teams will be powering their telemetry off of tools backed by wide, structured log events, not metrics. It’s getting harder and harder and harder to try and wring relevant insights out of metrics-based observability tools. The end of the ZIRP era is bringing unprecedented cost pressure to bear, and it’s simply a matter of time.

The future belongs to tools built on wide, structured log events — a single source of truth that you can trace over time, or zoom in, zoom out, derive SLOs from, etc.

It’s the only way to understand our systems in all their skyrocketing complexity. This constant dance with cost vs cardinality consumes entire teams worth of engineers and adds zero value. It adds negative value.

And here’s the weirdest part. The main thing holding most teams back psychologically from embracing o11y 2.0 seems to be the entrenched difficulties they have grappling with o11y 1.0, and their sense that they can’t adopt 2.0 until they get a handle on 1.0. Which gets things exactly backwards.

Because observability 2.0 is so much easier, simpler, and more cost effective than 1.0.

Observability 1.0 is the hard way

It’s so fucking hard. We’ve been doing it so long that we are blind to just how HARD it is. But trying to teach teams of engineers to wrangle metrics, to squeeze the questions they want to ask into multiple abstract formats scattered across many different tools, with no visibility into what they’re doing until it comes out eventually in the form of a giant bill… it’s fucking hard.

Observability 2.0 is so much simpler. You want data, you just toss it in. Format? don’t care. Cardinality? don’t care.

You want to ask the question, you just ask it. Format? don’t care.

Teams are beating themselves up trying to master an archaic, unmasterable set of technical tradeoffs based on data types from the 80s. It’s an unwinnable war. We can’t understand today’s complex systems without context-rich, explorable data.

We need more options for observability 2.0 tooling

My hope is that by sketching out these technical differences between o11y 1.0 and 2.0, we can begin to collect and build up a vendor-neutral library of o11y 2.0 options for folks. The world needs more options for understanding complex systems besides just Honeycomb and Baselime.

The world desperately needs an open source analogue to Honeycomb — something built for wide structured events, stored in a columnar store (or even just Clickhouse), with an interactive interface. Even just a written piece on how you solved it at your company would help move the industry forward.

My other hope is that people will stop building new observability startups built on metrics. Y’all, Datadog and Prometheus are the last, best metrics-backed tools that will ever be built. You can’t catch up to them or beat them at that; no one can. Do something different. Build for the next generation of software problems, not the last generation.

If anyone knows of anything along these lines, please send me links? I will happily collect them and signal boost. Honeycomb is a great, lifechanging tool (and we have a generous free tier, hint hint) but one option does not a movement make.

<3 charity

P.S. Here’s a great piece written by Ivan Burmistrov on his experience using observability 2.0 type tooling at Facebook — namely Scuba, which was the inspiration for Honeycomb. It’s a terrific piece and you should read it.

P.P.S. And if you’re curious, here’s the long twitter thread I wrote in October of 2023 on how we lost the battle to define observability:

So, we lost the battle to define observability. You know it, I know it. Observability was supposed to *mean* something, and in the early days, it did.

"Observability" once meant the kind of exploratory, open ended investigation our systems increasingly demand.
— Charity Majors (@mipsytipsy) October 30, 2023

Pragmatism, Neutrality and Leadership

July 24, 2024 · July 25, 2024 · mipsytipsy · culture, leadership, management, neutrality, politics, tech culture · 2 Comments

Every year or so, some tech CEO does something massively stupid, like declaring “No politics at work!”, or “Trump voters are oppressed and live in fear!”, and we all get a good pained laugh over how out of touch and lacking in self-awareness they are.

We hear a lot about the howlers, and much less about the practical challenges leaders face in trying to create a work environment where people from vastly different backgrounds and belief systems come together in peace to focus on the mission and do good work. Or how that intersects with the deeply polarizing events that now seem to shatter our world every other week — invasions, Supreme Court rulings, elections, school shootings, and the like.

Are we supposed to speak up or stay silent? Share our own beliefs, or take a studiously neutral stance? What do we do if half of the company is numb and reeling with grief, and the other half is bursting with joy? Nothing at all? That feels inhumane. Is the reality that we live in a world where we can only live, work, and interact with people who already agree with us and our political beliefs? God, I hope not. 🙁

This has been on my mind a lot recently. We are 103 days out from a US Presidential election, and it’s going to get worse before it gets better.

So here goes.

Caveats, challenges and cautionary tales

There are some immediate challenges to things I’m trying to say here. A couple:

The term “politics”, much like the term “technical debt”, can mean way too many things. Local, regional or national electoral politics; activities associated with power distribution or resource allocation; influence peddling or status-seeking behaviors; putting your needs above the good of the group; and so much more. Therefore I will use the term sparingly, and prefer more specific language where possible.

I don’t often do this, but I am explicitly addressing this piece to other founders and execs. Not because it doesn’t apply to people in other roles; it does. It just got really wordy trying to account for all the possible variations on role, scope and perspective involved.

As a leader, your job is to succeed

This might sound obvious, possibly to the point of idiocy. Yet I think it bears repeating. For all the mountains of forests of trees worth of books that get written every year on leadership, it remains the case that nobody knows what the fuck they’re doing.

“I think great leaders treat money like oxygen: they make sure there is plenty of it, and understand that if you’re talking about it all of the time you’re in deep shit and better take drastic actions to make sure you have enough.” ~ Mark Ferlatte

As a founder or leader of a venture-backed startup or public company, your #1 job is to make the business succeed. Success comes first. It’s Maslow’s hierarchy of needs all over again; you must ensure your company’s continued existence before you earn the right to tinker.

Success in business is what earns you the right to devote more time, attention, and resources to cultural issues, and to experiment with things that matter to you.

One of the most common ways that leaders fail is that they get so bogged down in the daily chaos of running the company, managing a team, raising money, responding to crises and scoring OKRs that they struggle to keep the focus zeroed in on the most important thing: succeeding at their mission.

Know your mission, craft a strategy, execute

And how do you do that?

Know your mission, craft a strategy, and execute. It’s as simple and straightforward as it is unbelievably difficult and devastatingly complicated.

The system exists to fulfill the mission. I’ve written before about systems thinking in organizations, how hierarchy emerges to benefit the workers, how we look up for purpose and down for function.

Your mission is what brings people together to collectively build something that they could not do as individuals. The more crisp and well articulated your mission, the more employees can tie the work they do back to the mission, the more meaningful their daily work is likely to feel.

Your culture serves the business, not the other way around

A great culture can’t compensate for a weak product that users don’t want. If people want to work at your company more than they want to use your product, that’s a bad sign.

A company culture with tremendous energy, ownership and transparency can be an accelerant to your business, it can grant you unique advantages, and it can help mitigate risks. But it is not why you exist. Your mission is why you exist.

It would be nice to believe that having a warm, supportive culture, with friendly people and four day work weeks, could guarantee success, or at least give you a reliable advantage. Wouldn’t it?

Companies with shitty cultures win all the time

We’ve all watched companies become wildly successful under assholes, while waves of employees leave broken and burned out. I wish this wasn’t true, but it is. People’s lives and careers are just another externality as far as the corporate books are concerned.

Many live through this nightmare and emerge dead set on doing things differently. And so, when they become founders or leaders, they put culture ahead of the business. And then they lose.

Most companies fail, and if you aren’t hungry and zeroed in on the success of your business, your slim chances become even slimmer.

I don’t believe this has to be either/or, cultural success or business success. I think it’s a false dichotomy. I believe that healthy companies can be more successful than shitty ones, all else being equal. Which is why I believe that leaders who care about building a workplace culture rooted in dignity and respect have a responsibility to care even more about success in business. Let’s show these motherfuckers how it’s done. Nothing succeeds like success.

Good culture is rooted in organizational health

Six questions for organizational health, from “The Advantage” by Patrick Lencioni

I feel like a big reason so many leaders get twisted up here is that they try to make employees happy instead of driving organizational health. This is a huge topic, and I won’t go deep on it here, but my understanding of organizational health owes a lot to “The Advantage: Why Organizational Health Trumps Everything Else In Business”, by Patrick Lencioni, with honorable mention going to “Good Strategy/Bad Strategy”, by Richard Rumelt.

A terrific company culture begins with organizational health: a competent, experienced leadership team that trusts each other, a mission, a strategy, clarity, and good communication. If everyone in the company knows what the most important thing is, and their actions align with that, your company is probably pretty healthy.

People’s feelings matter, and you should treat them with dignity and respect, but you can’t be driven by them. You have to let go of underperformers, deliver hard feedback, set high standards and hold people accountable. A lot of this does not feel good.

You will make mistakes. Things will fail. You will have to spin down teams, or entire orgs. People are going to have huge emotional reactions about your decisions and take things personally. They’ll be angry with you and disagree with your decisions. They will blame you, and maybe they should.

If you do your job well, with some luck, many people will be happy, much of the time. But if your goal is to make people happy, you will fail, and then everyone will be unhappy. Feelings are a trailing indicator and only roughly, occasionally a sign that you are doing a good job.

Survive in the short term, but live your values in the longer term

Most companies have seen times where all of the options seem like bad ones, even a betrayal of their values. There are times that hurt your conscience, or rouse up anger and cynicism in the ranks. Some hypothetical examples:

When you’re doing layoffs to save the company, and realize the list is disproportionately made up of marginalized groups 💔
When you have an all-male exec team, and desperately need a new engineering leader, but all of the qualified candidates in your pipeline are men
When you had to let someone go for cause, and they’re going around publicly lying about what happened but you can’t respond

These things happen. And when they do, you have a legal and ethical responsibility to make the decision that is right for the company, every time.

And yet.

You must remind yourself as you do, uneasily, queasily, how easily “I didn’t have a choice” can slip from reason to excuse. How quickly “this isn’t the right time” turns into “never the right time”. You know this, I know this, and I guarantee you every one of your employees knows this.

Don’t expect them to give you the benefit of the doubt. Why should they? They’ve heard this shit a million times. Don’t get mad, just do your job.

Living your values takes planning and sacrifice

No halfway decent leader spends ALL their time reacting to the burning bushes in front of their faces. Being a leader means planning for the future, so you can do better next time.

So you had to make a tough decision, and the optics (and maybe the reality) of it are terrible. Okay. It happens. Don’t just wince and put it behind you. If you don’t take steps to change things, you’re going to face the same bad choices next time.

What will you do differently?
Why were there no good alternatives?
What will the right time look like? How will you know?
How will you do a better job of recruiting, retaining, or setting people up for success?

If you don’t spend time, money, attention, or political capital on it, you don’t care about it, by definition. And it is a thousand times worse to claim you value something, and then demonstrate with your actions that you don’t care, than to never claim it in the first place.

Your resources are limited, and you must spend them with purpose

As an exec, you get a very limited amount of people’s time and attention — maybe a few minutes per week, or per month. Don’t waste them.

Jess Mink, our director of platform engineering, has a lovely story about this. They work with local search and rescue teams, which are staffed by people all over the political spectrum. The mission is crystal clear; all of them know why they’re there, and they don’t talk about things that aren’t tied to the mission. Yet Jess is giving a talk about pronouns at their next training. Why?

“Because there’s a really crisp, clear mission, I can say, I don’t care what your politics are. I’m not asking you to change your beliefs, but this is the impact of what you’re doing on these people that you’ve said you’re here to help.” ~ Jess Mink

There are a million things in the world you could say or do that would have intrinsic value. Why this thing? You should have a reason, and it should connect to your mission or your strategy for achieving it, or you are just muddying the waters.

Should political speech at work be a free-for-all?

Many leaders have opted to ban political speech at work. What’s the alternative, a free-for-all? Trump gifs and ~~Biden~~ Harris banners and a heated debate on the border in #general?

Please. Nobody wants that. Most folks seem to understand that work Slack is not the place for proselytizing or stirring up shit. There’s an element of good judgment here that extends well beyond political speech to include other disruptive actions such as criticizing religious beliefs, oversharing extremely personal info, posting sexy selfies, or good old verbal diarrhea. These are all, shall we say, “good coaching opportunities”. You don’t have to ban all political speech just to enforce reasonable norms.

In general, people want to work in an environment that is relatively peaceful and neutral-feeling, where people can focus on their work and our shared mission. But people also need spaces to talk about what’s going on in their lives and process their reactions.

At Honeycomb, we prefix all non-work Slack channels with #misc. We have #misc-bible-reading-group, #misc-politics, #misc-book-club, #misc-shoes-and-fashion, #misc-so-fuzzy (for pictures of people’s pets).

People don’t join those channels automatically upon being hired; you have to seek them out, and you can leave them just as easily. Nobody has to worry about missing out on critical work conversations co-mingling with off-putting political speech. And it’s easy to redirect non-work chatter out of work channels.

The value (and limitations) of neutrality

Neutral spaces are a good thing, even a societal necessity. However, neutrality becomes a problem when it fails to honor the paradox of tolerance: that if we tolerate the intolerant, intolerance will ultimately dominate. We cannot be equally tolerant of gay people and people who hate gay people, in other words.

At their worst, statements of neutrality punish the victimized and protect the victimizers. As Yonatan Zunger puts it, in one of my favorite essays of all time, “Tolerance is not a moral absolute; it is a peace treaty.”

But even peace treaties have their limits. Some problems are just fucking hard. As Emily put it,

“What does it mean to feel silence from the majority of your coworkers on a topic that feels like life and death to you? In normal times, silence can seem like a lack of political speech; in extraordinary times, silence speaks volumes. This creates division, even if your coworkers have landed there through ignorance or low awareness.” ~ Emily Nakashima

The hard thing about hard things is that they’re really fucking hard. There is no playbook. I can’t solve them for you here. Every situation is unique, and the details matter — details really matter, in fact. You can only take each situation as it comes with humility, sensitivity, and a willingness to listen.

Good leaders don’t invite unnecessary controversy

If you are a CEO or founder, especially, the things you say will be heard as representing the views of your company. Period. Keep this in mind, and try to be extra respectful and responsible. You don’t want your big mouth to accidentally create a wave of distraction and drama for people throughout your company to have to deal with. Your opinions are more than just your opinions.

If you’re thinking that I’m an odd person to be delivering this particular message, I sheepishly acknowledge the truth of this.

If you work at a company where the CEO and leadership openly espouse a particular set of partisan beliefs, you are inevitably going to feel somewhat othered. You wonder uncomfortably whether or not they are aware you hold different beliefs. If so, will you be promoted, will you be given equal opportunities? Would your leaders like you as much as they like employees who share their political convictions? Would they be as willing to chat with you or hang out with you? Does it matter?

People aren’t wrong to be concerned. There’s scads of research that shows how much we automatically prefer people who are more like us. It’s automatic — it’s natural. That doesn’t mean it’s right. Nor is it inevitable. We have to work harder to give an equal shot to those who aren’t like us, and we should do that.

Good leaders don’t make it all about them

One of the hardest parts about being a good leader is managing your ego, and keeping it from taking center stage or making things worse.

I have done and said a lot of dumb things online, but the worst of them was probably during the Black Lives Matter protests of 2020. I was trying to express my support, so I tweeted something about how actions matter more than words, and that we were trying to help by building a workplace where Black employees could thrive, or something like that. I don’t remember exactly (and the tweets are gone), but it was awful. I made it about us; it was super tone deaf. And I got whaled on, in a way that really threw me for a loop. I tried to apologize and made it worse. Friends blocked me.

It took me a long time to process the experience and come to terms with my mistakes: first, framing my comments almost like a promo for how great Honeycomb was, and second, reacting so defensively when called out over it.

You don’t need to have a take on everything. And the more you have a track record of taking stances on issues, the more it’s expected of you, and the more dicey it becomes, because even not taking a stance is taking a stance.

Good leaders look for ways to de-escalate

Any time the conversation sails into the terrain of morals and ethics, it’s an automatic escalation. It raises the stakes, it exacerbates differences. It can transform an ordinary, practical matter into the forces of good versus evil in the blink of an eye.

There are bright lines and moral dilemmas in business. (Should you pay women less than men for the same work? No.) But most of our everyday work doesn’t need to be so emotionally fraught.

An example may help here. When you have a geographically distributed company, you have two basic choices when it comes to comp philosophy:

have a single set of comp bands, which apply no matter where you live
peg each person’s salary to their local cost of living

When this question first came up, back in 2019, I came out swinging for the fences on option 1). I treated it like a moral question, a matter of basic human equity. “What kind of company would dare pay you less money based on where you live? What business is it of theirs where you live?” — that sort of thing.

In this, I was hardly alone. A lot of people have really strong feelings about this (I still have some pretty strong feelings about it 😬). But there are also some pretty reasonable arguments for geo wage arbitrage, and circumstances in which it makes a lot of sense and can offer more opportunities to more people than you could otherwise afford. It’s not as simple as I made it out to be.

Having taken such a strong stance though, I have definitely made it extremely difficult for our finance team to change that policy, should we ever decide to.

Good leaders turn the volume down. They dampen drama, they don’t amplify it. They don’t ratchet up the stakes or the rhetoric, they look for practical solutions where possible.

Good leaders connect the culture to the mission

I started off as one of those leaders who cared more about culture than the business. In honesty, I assumed we’d fail. I never planned to start a company, it was an impulse decision. I really didn’t think I’d have to be the CEO. I wasn’t equipped for the job; I didn’t even know the difference between sales and marketing. I did however have MANY strong opinions on company culture.

The first few years of Honeycomb, any time I thought of some neat thing to try, I did it. Put an employee on the board? Yes! Run regular ethics discussions? Hell yes! Put together cross-functional teams to discuss company values? Cool!

I don’t regret it, precisely; I think it played a role in instilling a culture of curiosity and ownership. I think it helped us figure out who we were.

But as we grow past 200 people, and as the pace of growth accelerates, I am increasingly aware of the opportunity cost of these experiments. It doesn’t mean we don’t do things like this anymore, but there needs to be a much better reason than “Charity thinks it would be cool.” It needs to add up to something bigger.

Good leaders have conviction, and don’t pretend to give a shit when they don’t

I appreciate it when leaders do real talk about their values and how they make decisions. Too many leaders hide behind the bland slogans of corporate piety, in ways that tell you nothing about how they make decisions or where their priorities lie when the chips are down.

Honestly, I would rather work for someone who holds different values than I do, but who seems honest and consistent and fair-minded in their decision-making, than someone who holds the same values but whose decisions seem impulsive and subjective.

This is a business, not a family. If I believe in the mission, and the leaders and I align on the facts, and I respect their integrity and the way they make decisions, that matters more.

As it turns out, all of this has been said before…by my antagonists?!? Oh dear…

As I was wrapping up this article, I went back and read a few of the pieces written by and about the companies who banned political speech, and my mouth literally dropped open.

You could copy-paste entire sections between my article and theirs, without anyone knowing the difference.

Companies exist for the sake of their mission, check. They don’t have to have a take on everything, check. Your work day shouldn’t consist of arguments over abortion and other hot button topics, check. It IS distracting. It’s NOT why you’re here. Uh…

How can I have written the same fucking article as theirs, and come to such a radically different conclusion?

Or is it that radically different? After all, I’m not out here advocating a free-for-all, or that companies should take a stand on every social issue of the day. I actually pretty much agree with most of the sentiments these founders wrote in their official posts on the matter.

Shit?

I was sitting here having a legit internal crisis, and then I stumbled into some other pieces, where rank-and-file employees were talking about the changes and what led up to them.

Employees say the founders’ memos unfairly depicted their workplace as being riven by partisan politics, when in fact the main source of the discussion had always been Basecamp itself.

“At least in my experience, it has always been centered on what is happening at Basecamp,” said one employee. “What is being done at Basecamp? What is being said at Basecamp? And how it is affecting individuals? It has never been big political discussions, like ‘the postal service should be disbanded,’ or ‘I don’t like Amy Klobuchar.’”

The whole article is required reading. It goes on to detail a hair-raising amount of hypocrisy and high-handed behaviors by the Basecamp founders; a bunch of workers who self-organized to improve internal hiring practices and culture, and how they got shut down.

“There’s always been this kind of unwritten rule at Basecamp that the company basically exists for David and Jason’s enjoyment,” one employee told me. “At the end of the day, they are not interested in seeing things in their work timeline that make them uncomfortable, or distracts them from what they’re interested in. And this is the culmination of that.”

Then there was this damning piece from the NYTimes about the appalling way Black employees were treated at Coinbase, and this one, which closes with an anecdote about the Coinbase CEO tweeting out his own (noxious) political views in direct contradiction of his own policies. Oopsie-daisy. 🌼

Are these policies designed to protect the mission, or the CEO?

All of this paints a very different picture. These bans on political speech seem to be less about protecting the commons from wayward employees who won’t stop distracting everyone with hot button political arguments, and more about employees doing their level best to grapple with real tensions and systemic problems at work — problems that their leaders got sick of hearing about and decided to shut down.

There’s a real stench of “politics for me, but not for thee!” in a lot of these cases, which makes it extra galling. At the beginning of this piece, I noted that “politics” is an obscenely broad category — it can mean almost anything. So when the CEO arrogates to himself the right to define it and silence it, it generates a lot of confusion and uncertainty. That’s bad for the mission!

The fact is, this shit is hard. It’s hard to craft a strategy and execute. It’s hard to train managers to have hard conversations with their employees, or gently de-escalate when things get emotionally fraught. It’s hard to reset expectations on how much of a voice employees can expect to have in a given area. It’s hard to know when to take a stand on principle, and back it with your time and treasure, and when to settle or compromise.

But you signed up for this, bro. It’s part of the job, and you’re getting paid a lot of money to do it. You don’t get to just nope out when the going gets rough.

Just because you made a rule that people can’t talk about the hard stuff, doesn’t mean the hard stuff goes away. It mostly just serves to reinforce whatever power structures and inequities already exist in your company. Which means a lot of people will go on doing just fine, while some are totally fucked. You’ve also shut down all of the reasonable routes for people to advocate for change, so good job, you.

You don’t have to agree with them, but you do have to be respectful to your employees

Look, none of us are perfect. That’s why systems need mechanisms for change. Resiliency isn’t about never breaking the system, it’s about knowing your systems will break, and equipping them with the tools to repair.

If you want to lead a company, you have to deal with the people. It comes with the job.

If you want your people to care as much about the mission as you do, to feel personally invested in its success, to devote whole long stretches of their brilliant, creative, busy lives to helping you make that mission come true…you owe them in return.

If a bunch of your employees are waving a flag and urgently saying “we have a problem”, they are very likely doing you a favor. Either way, they deserve to be heard.

You don’t have to do what they want. But you ought to listen to them, and reserve judgment. Open your eyes. Look around. Do some reading. Talk to people. Consider whether you might be missing something. Then make a decision and give an honest answer. They may or may not agree, and they may or may not choose to stay, but that’s what treating them with respect looks like, just like you ask them to treat you, and each other.

To instead say “Sorry, your feedback is a distraction from the mission and will no longer be tolerated” is so unbelievably disrespectful, and wrapping your decision in the noble flag of the mission is dishonest. It’s hard to tell sometimes whether people are deluding themselves or only trying to delude other people, but holy shit, what a doozy.

Good leaders know they will make mistakes, and when they do, they own them, apologize properly, and fix them. They do not use their power to silence people and then swagger around like they own the moral high ground.

Generative AI is not going to build your engineering team for you

June 10, 2024 | December 21, 2024 | mipsytipsy | artificial intelligence, code generation, generative AI, hiring, junior engineers, llms

Originally posted on the Stack Overflow blog on June 10th, 2024

When I was 19 years old, I dropped out of college and moved to San Francisco. I had a job offer in hand to be a Unix sysadmin for Taos Consulting. However, before my first day of work I was lured away to a startup in the city, where I worked as a software engineer on mail subsystems.

I never questioned whether or not I could find work. Jobs were plentiful, and more importantly, hiring standards were very low. If you knew how to sling HTML or find your way around a command line, chances were you could find someone to pay you.

Was I some kind of genius, born with my hands on a computer keyboard? Assuredly not. I was homeschooled in the backwoods of Idaho. I didn’t touch a computer until I was sixteen and in college. I escaped to university on a classical performance piano scholarship, which I later traded in for a peripatetic series of nontechnical majors: classical Latin and Greek, musical theory, philosophy. Everything I knew about computers I learned on the job, doing sysadmin work for the university and CS departments.

In retrospect, I was so lucky to enter the industry when I did. It makes me blanch to think of what would have happened if I had come along a few years later. Every one of the ladders my friends and I took into the industry has long since vanished.

To some extent, this is just what happens as an industry matures. The early days of any field are something of a Wild West, where the stakes are low, regulation nonexistent, and standards nascent. If you look at the early history of other industries—medicine, cinema, radio—the similarities are striking.

There is a magical moment with any young technology where the boundaries between roles are porous and opportunity can be seized by anyone who is motivated, curious, and willing to work their asses off.

It never lasts. It can’t; it shouldn’t. The amount of prerequisite knowledge and experience you must have before you can enter the industry swells precipitously. The stakes rise, the magnitude of the mission increases, the cost of mistakes soars. We develop certifications, trainings, standards, legal rites. We wrangle over whether or not software engineers are really engineers.

Nowadays, you wouldn’t want a teenaged dropout like me to roll out of junior year and onto your pager rotation. The prerequisite knowledge you need to enter the industry has grown, the pace is faster, and the stakes are much higher, so you can no longer learn literally everything on the job, as I once did.

However, it’s not like you can learn everything you need to know at college either. A CS degree typically prepares you better for a life of computing research than life as a workaday software engineer. A more practical path into the industry may be a good coding bootcamp, with its emphasis on problem solving and learning a modern toolkit. In either case, you don’t so much learn “how to do the job” as learn “enough of the basics to understand and use the tools you need to use to learn the job.”

Software is an apprenticeship industry. You can’t learn to be a software engineer by reading books. You can only learn by doing…and doing, and doing, and doing some more. No matter what your education consists of, most learning happens on the job—period. And it never ends! Learning and teaching are lifelong practices; they have to be, the industry changes so fast.

It takes a solid seven-plus years to forge a competent software engineer. (Or as most job ladders would call it, a “senior software engineer”.) That’s many years of writing, reviewing, and deploying code every day, on a team alongside more experienced engineers. That’s just how long it seems to take.

Here is where I often get some very indignant pushback to my timelines, e.g.:

“Seven years?! Pfft, it took me two years!”

“I was promoted to Senior Software Engineer in less than five years!”

Good for you. True, there is nothing magic about seven years. But it takes time and experience to mature into an experienced engineer, the kind who can anchor a team. More than that, it takes practice.

I think we have come to use “Senior Software Engineer” as shorthand for engineers who can ship code and be a net positive in terms of productivity, and I think that’s a huge mistake. It implies that less senior engineers must be a net negative in terms of productivity, which is untrue. And it elides the real nature of the work of software engineering, of which writing code is only a small part.

To me, being a senior engineer is not primarily a function of your ability to write code. It has far more to do with your ability to understand, maintain, explain, and manage a large body of software in production over time, as well as the ability to translate business needs into technical implementation. So much of the work is around crafting and curating these large, complex sociotechnical systems, and code is just one representation of these systems.

What does it mean to be a senior engineer? It means you have learned how to learn, first and foremost, and how to teach; how to hold these models in your head and reason about them, and how to maintain, extend, and operate these systems over time. It means you have good judgment, and instincts you can trust.

Which brings us to the matter of AI.

It is really, really tough to get your first role as an engineer. I didn’t realize how hard it was until I watched my little sister (new grad, terrific grades, some hands-on experience, fiendishly hard worker) struggle for nearly two years to land a real job in her field. That was a few years ago; anecdotally, it seems to have gotten even harder since then.

This past year, I have read a steady drip of articles about entry-level jobs in various industries being replaced by AI. Some of these absolutely have merit. Any job that consists of drudgery such as converting a document from one format to another, reading and summarizing a bunch of text, or replacing one set of icons with another, seems pretty obviously vulnerable. This doesn’t feel all that revolutionary to me; it’s just extending the existing boom in automation to cover textual material as well as mathy stuff.

Recently, however, a number of execs and so-called “thought leaders” in tech seem to have genuinely convinced themselves that generative AI is on the verge of replacing all the work done by junior engineers. I have read so many articles about how junior engineering work is being automated out of existence, or that the need for junior engineers is shriveling up. It has officially driven me bonkers.

All of this bespeaks a deep misunderstanding about what engineers actually do. By not hiring and training up junior engineers, we are cannibalizing our own future. We need to stop doing that.

People act like writing code is the hard part of software. It is not. It never has been, it never will be. Writing code is the easiest part of software engineering, and it’s getting easier by the day. The hard parts are what you do with that code—operating it, understanding it, extending it, and governing it over its entire lifecycle.

A junior engineer begins by learning how to write and debug lines, functions, and snippets of code. As you practice and progress towards being a senior engineer, you learn to compose systems out of software, and guide systems through waves of change and transformation.

Sociotechnical systems consist of software, tools, and people; understanding them requires familiarity with the interplay between software, users, production, infrastructure, and continuous changes over time. These systems are fantastically complex and subject to chaos, nondeterminism and emergent behaviors. If anyone claims to understand the system they are developing and operating, the system is either exceptionally small or (more likely) they don’t know enough to know what they don’t know. Code is easy, in other words, but systems are hard.

The present wave of generative AI tools has done a lot to help us generate lots of code, very fast. The easy parts are becoming even easier, at a truly remarkable pace. But it has not done a thing to aid in the work of managing, understanding, or operating that code. If anything, it has only made the hard jobs harder.

If you read a lot of breathless think pieces, you may have a mental image of software engineers merrily crafting prompts for ChatGPT, or using Copilot to generate reams of code, then committing whatever emerges to GitHub and walking away. That does not resemble our reality.

The right way to think about tools like Copilot is more like a really fancy autocomplete or copy-paste function, or maybe like the unholy love child of Stack Overflow search results plus Google’s “I feel lucky”. You roll the dice, every time.

These tools are at their best when there’s already a parallel in the file, and you want to just copy-paste the thing with slight modifications. Or when you’re writing tests and you have a giant block of fairly repetitive YAML, and it repeats the pattern while inserting the right column and field names, like an automatic template.

However, you cannot trust generated code. I can’t emphasize this enough. AI-generated code always looks quite plausible, but even when it kind of “works”, it’s rarely congruent with your wants and needs. It will happily generate code that doesn’t parse or compile. It will make up variables, method names, function calls; it will hallucinate fields that don’t exist. Generated code will not follow your coding practices or conventions. It is not going to refactor or come up with intelligent abstractions for you. The more important, difficult or meaningful a piece of code is, the less likely you are to generate a usable artifact using AI.

You may save time by not having to type the code in from scratch, but you will need to step through the output line by line, revising as you go, before you can commit your code, let alone ship it to production. In many cases this will take as much or more time as it would take to simply write the code—especially these days, now that autocomplete has gotten so clever and sophisticated. It can be a LOT of work to bring AI-generated code into compliance and coherence with the rest of your codebase. It isn’t always worth the effort, quite frankly.

Generating code that can compile, execute, and pass a test suite isn’t especially hard; the hard part is crafting a code base that many people, teams, and successive generations of teams can navigate, mutate, and reason about for years to come.

So that’s the TLDR: you can generate a lot of code, really fast, but you can’t trust what comes out. At all. However, there are some use cases where generative AI consistently shines.

For example, it’s often easier to ask ChatGPT to generate example code using unfamiliar APIs than to read the API docs—the model was trained on repositories where those APIs are being used for real-life workloads, after all.

Generative AI is also pretty good at producing code that is annoying or tedious to write, yet tightly scoped and easy to explain. The more predictable a scenario is, the better these tools are at writing the code for you. If what you need is effectively copy-paste with a template—any time you could generate the code you want using sed/awk or vi macros—generative AI is quite good at this.
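To make “copy-paste with a template” concrete, here is a minimal Python sketch of the kind of predictable, repetitive output described above. The table name, column names, and YAML-ish layout are all hypothetical, chosen only for illustration. The point is that when a plain template could produce the code, generative AI will usually manage it too.

```python
from string import Template

# Hypothetical template for one repetitive test-fixture stanza.
STANZA = Template(
    "- name: test_${column}_not_null\n"
    "  table: ${table}\n"
    "  column: ${column}\n"
    "  check: not_null\n"
)


def render_checks(table, columns):
    """Return one stanza per column, in the order given."""
    return "".join(STANZA.substitute(table=table, column=col) for col in columns)


if __name__ == "__main__":
    # Illustrative table and column names only.
    print(render_checks("users", ["id", "email", "created_at"]))
```

Anything that fits this shape (same structure, different names) is a low-risk place to let a tool fill in the blanks, because a mistake is easy to spot against the pattern.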

It’s also very good at writing little functions for you to do things in unfamiliar languages or scenarios. If you have a snippet of Python code and you want the same thing in Java, but you don’t know Java, generative AI has got your back.

Again, remember, the odds are 50/50 that the result is completely made up. You always have to assume the results are incorrect until you can verify them by hand. But these tools can absolutely accelerate your work in countless ways.
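One low-tech way to act on that assumption is to pin down the behavior you want with a few hand-written checks before trusting a generated snippet. This is only a sketch: `slugify` stands in for a hypothetical function pasted from an AI tool, and the expected values are my own assumptions about what such a helper should do.

```python
import re


def slugify(title):
    """Hypothetical AI-generated helper: turn a title into a URL slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def verify_slugify():
    """Checks written by hand, independent of the generated code."""
    cases = {
        "Hello, World!": "hello-world",
        "  leading and trailing  ": "leading-and-trailing",
        "already-a-slug": "already-a-slug",
        "": "",
    }
    for title, expected in cases.items():
        actual = slugify(title)
        assert actual == expected, f"{title!r}: got {actual!r}, want {expected!r}"
    return len(cases)


if __name__ == "__main__":
    print(f"{verify_slugify()} cases passed")
```

The checks are deliberately boring. Their job is to encode what you actually need, so that a plausible-looking but wrong snippet fails loudly instead of slipping into the codebase.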

One of the engineers I work with, Kent Quirk, describes generative AI as “an excitable junior engineer who types really fast”. I love that quote—it leaves an indelible mental image.

Generative AI is like a junior engineer in that you can’t roll their code off into production. You are responsible for it—legally, ethically, and practically. You still have to take the time to understand it, test it, instrument it, retrofit it stylistically and thematically to fit the rest of your code base, and ensure your teammates can understand and maintain it as well.

The analogy is a decent one, actually, but only if your code is disposable and self-contained, i.e. not meant to be integrated into a larger body of work, or to survive and be read or modified by others.

And hey—there are corners of the industry like this, where most of the code is write-only, throwaway code. There are agencies that spin out dozens of disposable apps per year, each written for a particular launch or marketing event and then left to wither on the vine. But that is not most software. Disposable code is rare; code that needs to work over the long term is the norm. Even when we think a piece of code will be disposable, we are often (urf) wrong.

In that particular sense—generating code that you know is untrustworthy—GenAI is a bit like a junior engineer. But in every other way, the analogy fails. Because adding a person who writes code to your team is nothing like autogenerating code. That code could have come from anywhere—Stack Overflow, Copilot, whatever. You don’t know, and it doesn’t really matter. There’s no feedback loop, no person on the other end trying iteratively to learn and improve, and no impact to your team vibes or culture.

To state the supremely obvious: giving code review feedback to a junior engineer is not like editing generated code. Your effort is worth more when it is invested into someone else’s apprenticeship. It’s an opportunity to pass on the lessons you’ve learned in your own career. Even just the act of framing your feedback to explain and convey your message forces you to think through the problem in a more rigorous way, and has a way of helping you understand the material more deeply.

And adding a junior engineer to your team will immediately change team dynamics. It creates an environment where asking questions is normalized and encouraged, where teaching as well as learning is a constant. We’ll talk more about team dynamics in a moment.

The time you invest into helping a junior engineer level up can pay off remarkably quickly. Time flies. ☺️ When it comes to hiring, we tend to valorize senior engineers almost as much as we underestimate junior engineers. Neither stereotype is helpful.

People seem to think that once you hire a senior engineer, you can drop them onto a team and they will be immediately productive, while hiring a junior engineer will be a tax on team performance forever. Neither is true. Honestly, most of the work that most teams have to do is not that difficult, once it’s been broken down into its constituent parts. There’s plenty of room for lower level engineers to execute and flourish.

The grossly simplified perspective of your accountant goes something like this. “Why should we pay $100k for a junior engineer to slow things down, when we could pay $200k for a senior engineer to speed things up?” It makes no sense!

But you know and I know—every engineer who is paying attention should know—that’s not how engineering works. This is an apprenticeship industry, and productivity is defined by the output and carrying capacity of each team, not each person.

There are lots of ways a person can contribute to the overall velocity of a team, just like there are lots of ways a person can sap the energy out of a team or add friction and drag to everyone around them. These do not always correlate with the person’s level (at least not in the direction people tend to assume), and writing code is only one way.

Furthermore, every engineer you hire requires ramp time and investment before they can contribute. Hiring and training new engineers is a costly endeavor, no matter what level they are. It will take any senior engineer time to build up their mental model of the system, familiarize themselves with the tools and technology, and ramp up to speed. How long? It depends on how clean and organized the codebase is, past experience with your tools and technologies, how good you are at onboarding new engineers, and more, but likely around 6-9 months. They probably won’t reach cruising altitude for about a year.

Yes, the ramp will be longer for a junior engineer, and yes, it will require more investment from the team. But it’s not indefinite. Your junior engineer should be a net positive within roughly the same time frame, six months to a year, and they develop far more rapidly than more senior contributors. (Don’t forget, their contributions may vastly exceed the code they personally write.)

In terms of writing and shipping features, some of the most productive engineers I’ve ever known have been intermediate engineers. Not yet bogged down with all the meetings and curating and mentoring and advising and architecture, their calendars not yet pockmarked with interruptions, they can just build stuff. You see them put their headphones on first thing in the morning, write code all day, and cruise out the door in the evening having made incredible progress.

Intermediate engineers sit in this lovely, temporary state where they have gotten good enough at programming to be very productive, but they are still learning how to build and care for systems. All they do is write code, reams and reams of code.

And they’re energized…engaged. They’re having fun! They aren’t bored with writing a web form or a login page for the 1000th time. Everything is new, interesting, and exciting, which typically means they will do a better job, especially under the light direction of someone more experienced. Having intermediate engineers on a team is amazing. The only way you get them is by hiring junior engineers.

Having junior and intermediate engineers on a team is a shockingly good inoculation against overengineering and premature complexity. They don’t yet know enough about a problem to imagine all the infinite edge cases that need to be solved for. They help keep things simple, which is one of the hardest things to do.

If you ask, nearly everybody will wholeheartedly agree that hiring junior engineers is a good thing…and someone else should do it. This is because the long-term arguments for hiring junior engineers are compelling and fairly well understood.

We need more senior engineers as an industry
Somebody has to train them
Junior engineers are cheaper
They may add some much-needed diversity
They are often very loyal to companies who invest in training them, and will stick around for years instead of job hopping
Did we already mention that somebody needs to do it?

But long-term thinking is not a thing that companies, or capitalism in general, are typically great at. Framed this way, it makes it sound like you hire junior engineers as a selfless act of public service, at great cost to yourself. Companies are much more likely to want to externalize costs like those, which is how we got to where we are now.

However, there are at least as many arguments to be made for hiring junior engineers in the short term—selfish, hard-nosed, profitable reasons for why it benefits the team and the company to do so. You just have to shift your perspective slightly, from individuals to teams, to bring them into focus.

Let’s start here: hiring engineers is not a process of “picking the best person for the job”. Hiring engineers is about composing teams. The smallest unit of software ownership is not the individual, it’s the team. Only teams can own, build, and maintain a corpus of software. It is inherently a collaborative, cooperative activity.

If hiring engineers was about picking the “best people”, it would make sense to hire the most senior, experienced individual you can get for the money you have, because we are using “senior” and “experienced” as a proxy for “productivity”. (Questionable, but let’s not nitpick.) But the productivity of each individual is not what we should be optimizing for. The productivity of the team is all that matters.

And the best teams are always the ones with a diversity of strengths, perspectives, and levels of expertise. A monoculture can be spectacularly successful in the short term—it may even outperform a diverse team. But monocultures do not scale well, and they do not adapt to unfamiliar challenges gracefully. The longer you wait to diversify, the harder it will be.

We need to hire junior engineers, and not just once, but consistently. We need to keep feeding the funnel from the bottom up. Junior engineers only stay junior for a couple years, and intermediate engineers turn into senior engineers. Super-senior engineers are not actually the best people to mentor junior engineers; the most effective mentor is usually someone just one level ahead, who vividly remembers what it was like in your shoes.

A healthy team is an ecosystem. You wouldn’t staff a product engineering team with six DB experts and one mobile developer. Nor should you staff it with six staff+ engineers and one junior developer. A good team is composed of a range of skills and levels.

Have you ever been on a team packed exclusively with staff or principal engineers? It is not fun. That is not a high-functioning team. There is only so much high-level architecture and planning work to go around, there are only so many big decisions that need to be made. These engineers spend most of their time doing work that feels boring and repetitive, so they tend to over-engineer solutions and/or cut corners—sometimes at the same time. They compete for the “fun” stuff and find reasons to pick technical fights with each other. They chronically under-document and under-invest in the work that makes systems simple and tractable.

Teams that only have intermediate engineers (or beginners, or seniors, or whatever) will have different pathologies, but similar problems with contention and blind spots. The work itself has a wide range in complexity and difficulty—from simple, tightly scoped functions to tough, high-stakes architecture decisions. It makes sense for the people doing the work to occupy a similar range.

The best teams are ones where no one is bored, because every single person is working on something that challenges them and pushes their boundaries. The only way you can get this is by having a range of skill levels on the team.

The bottleneck we face now is not our ability to train up new junior engineers and give them skills. Nor is it about juniors learning to hustle harder; I see a lot of solid, well-meaning advice on this topic, but it’s not going to solve the problem. The bottleneck is giving them their first jobs. The bottleneck consists of companies who see them as a cost to externalize, not an investment in their—the company’s—future.

After their first job, an engineer can usually find work. But getting that first job, from what I can see, is murder. It is all but impossible—if you didn’t graduate from a top college, and you aren’t entering the feeder system of Big Tech, then it’s a roll of the dice, a question of luck or who has the best connections. It was rough before the chimera of “Generative AI can replace junior engineers” rose up from the swamp. And now…oof.

Where would you be, if you hadn’t gotten into tech when you did?

I know where I would be, and it is not here.

The internet loves to make fun of Boomers, the generation that famously coasted to college, home ownership, and retirement, then pulled the ladder up after them while mocking younger people as snowflakes. “Ok, Boomer” may be here to stay, but can we try to keep “Ok, Staff Engineer” from becoming a thing?

Lots of people seem to think we don’t need junior engineers, but nobody is arguing that we need fewer senior engineers, or will need fewer senior engineers in the foreseeable future.

I think it’s safe to assume that anything deterministic and automatable will eventually be automated. Software engineering is no different—we are ground zero! Of course we’re always looking for ways to automate and improve efficiency, as we should be.

But large software systems are unpredictable and nondeterministic, with emergent behaviors. The mere existence of users injects chaos into the system. Components can be automated, but complexity can only be managed.

Even if systems could be fully automated and managed by AI, the fact that we cannot understand how AI makes decisions is a huge, possibly insurmountable problem. Running your business on a system that humans can’t debug or understand seems like a risk so existential that no security, legal or finance team would ever sign off on it. Maybe some version of this future will come to pass, but it’s hard to see it from here. I would not bet my career or my company on it happening.

In the meantime, we still need more senior engineers. The only way to grow them is by fixing the funnel.

Does this mean every company should hire junior engineers? No. You need to be able to set them up for success. Some factors that disqualify you from hiring junior engineers:

You have less than two years of runway
Your team is constantly in firefighting mode, or you have no slack in your system
You have no experienced managers, or you have bad managers, or no managers at all
You have no product roadmap
Nobody on your team has any interest in being their mentor or point person

The only thing worse than never hiring any junior engineers is hiring them into an awful experience where they can’t learn anything. (I wouldn’t set the bar quite as high as Cindy does in this article; while I understand where she’s coming from, it is so much easier to land your second job than your first job that I think most junior engineers would frankly choose a crappy first job over none at all.)

Being a fully distributed company isn’t a complete dealbreaker, but it does make things even harder. I would counsel junior engineers to seek out office jobs if at all possible. You learn so much faster when you can soak up casual conversations and technical chatter, and you lose that working from home. If you are a remote employer, know that you will need to work harder to compensate for this. I suggest connecting with others who have done this successfully (they exist!) for advice.

I also advise companies not to start by hiring a single junior engineer. If you’re going to hire one, hire two or three. Give them a cohort of peers, so it’s a little less intimidating and isolating.

I have come to believe that the only way this will ever change is if engineers and engineering managers across our industry take up this fight and make it personal.

Most of the places I know that do have a program for hiring and training entry-level engineers have it only because an engineer decided to fight for it. Engineers—sometimes engineering managers—were the ones who made the case and pushed for resources, then designed the program, interviewed and hired the junior engineers, and set them up with mentors. This is not an exotic project; it is well within the capabilities of most motivated, experienced engineers (and good for your career as well).

Finance isn’t going to lobby for this. Execs aren’t likely to step in. The more a person’s role inclines them to treat engineers like fungible resources, the less likely they are to understand why this matters.

AI is not coming to solve all our problems and write all our code for us—and even if it was, it wouldn’t matter. Writing code is but a sliver of what professional software engineers do, and arguably the easiest part. Only we have the context and the credibility to drive the changes we know form the bedrock for great teams and engineering excellence.

Great teams are how great engineers get made. Nobody knows this better than engineers and EMs. It’s time for us to make the case, and make it happen.

The Cost Crisis in Observability Tooling

January 24, 2024 / December 21, 2024 · mipsytipsy · cost control, metrics, observability 2.0, three pillars

Originally posted on the Honeycomb blog on January 24th, 2024

The cost of services is on everybody’s mind right now, with interest rates rising, economic growth slowing, and organizational budgets increasingly feeling the pinch. But I hear a special edge in people’s voices when it comes to their observability bill, and I don’t think it’s just about the cost of goods sold. I think it’s because people are beginning to correctly intuit that the value they get out of their tooling has become radically decoupled from the price they are paying.

In the happiest cases, the price you pay for your tools is “merely” rising at a rate several times faster than the value you get out of them. But that’s actually the best case scenario. For an alarming number of people, the value they get actually decreases as their bill goes up.

Observability 1.0 and the cost multiplier effect

Are you familiar with this chestnut?

“Observability has three pillars: metrics, logs, and traces.”

This isn’t exactly true, but it’s definitely true of a particular generation of tools—one might even say definitionally true of a particular generation of tools. Let’s call it “observability 1.0.”

From an evolutionary perspective, you can see how we got here. Everybody has logs… so we spin up a service for log aggregation. But logs are expensive and everybody wants dashboards… so we buy a metrics tool. Software engineers want to instrument their applications… so we buy an APM tool. We start unbundling the monolith into microservices, and pretty soon we can’t understand anything without traces… so we buy a tracing tool. The front-end engineers point out that they need sessions and browser data… so we buy a RUM tool. On and on it goes.

Logs, metrics, traces, APM, RUM. You’re now paying to store telemetry five different ways, in five different places, for every single request. And a 5x multiplier is on the modest side of the spectrum, given how many companies pay for multiple overlapping tools in the same category. You may also be collecting:

Profiling data
Product analytics
Business intelligence data
Database monitoring/query profiling tools
Mobile app telemetry
Behavioral analytics
Crash reporting
Language-specific profiling data
Stack traces
CloudWatch or hosting provider metrics
…and so on.

So, how many times are you paying to store data about your user requests? What’s your multiplier? (If you have one consolidated vendor bill, this may require looking at your itemized bill.)

There are many types of tools, each gathering slightly different data for a slightly different use case, but underneath the hood there are really only three basic data types: the metric, unstructured logs, and structured logs. Each of these has its own distinctive trade-offs when it comes to how much it costs and how much value you can get out of it.

Metrics

Metrics are the great-granddaddy of telemetry formats; tiny, fast, and cheap. A “metric” consists of a single number, often with tags appended. All of the context of the request gets discarded at write time; each individual metric is emitted separately. This means you can never correlate one metric with another from the same request, or select all the metrics for a given request ID, user, or app ID, or ask arbitrary new questions about your metrics data.
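To make the "context gets discarded at write time" point concrete, here is a minimal sketch in Python. The field names and the `emit_metric` helper are hypothetical, not any vendor's API; the only point is that once a number is keyed by name and tags, the request ID and user ID that produced it are gone.

```python
from collections import Counter

# Hypothetical request context; every field name here is illustrative.
request = {
    "request_id": "req-123",
    "user_id": "u-42",
    "endpoint": "/checkout",
    "duration_ms": 187,
    "status": 500,
}

# Metrics style: each number is emitted separately, keyed by name and tags only.
metrics = Counter()

def emit_metric(name, value, tags):
    # The key holds the metric name and its tags. Nothing ties it to a request.
    key = (name, tuple(sorted(tags.items())))
    metrics[key] += value

tags = {"endpoint": request["endpoint"]}
emit_metric("requests.count", 1, tags)
emit_metric("requests.errors", 1, tags)
emit_metric("requests.duration_ms", request["duration_ms"], tags)

# Event style: the same request kept as a single record with its context intact.
events = [dict(request)]

def events_for_user(user_id):
    # Possible with events; there is no equivalent lookup over `metrics`.
    return [e for e in events if e["user_id"] == user_id]
```

With the three metrics you can chart error counts per endpoint, but you cannot ask which user hit the error. With the event you can.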

Metrics-based tools include vendors like Datadog and open-source projects like Prometheus. RUM tools are built on top of metrics to understand browser user sessions; APM tools are built on top of metrics to understand application performance.

When you set up a metrics tool, it generally comes prepopulated with a bunch of basic metrics, but the useful ones are typically the custom metrics you emit from your application.

Your metrics bill is usually dominated by the cost of these custom metrics. At minimum, your bill goes up linearly with the number of custom metrics you create. Which is unfortunate, because to restrain your bill from unbounded growth, you have to regularly audit your metrics, do your best to guess which ones are going to be useful in the future, and prune any you think you can afford to go without. Even in the hands of experts, these tools require significant oversight.

Linear cost growth is the goal, but it’s rarely achieved. The cost of each metric varies wildly depending on how you construct it, what the values are, how often it gets hit, etc. I’ve seen a single custom metric cost $30k per month. You probably have dozens of custom metrics per service, and it’s almost impossible to tell how much each of them costs you. Metrics bills tend to be incredibly opaque (possibly by design).
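One common reason a single metric can get so expensive is tag cardinality. The sketch below assumes a tool that stores one time series per unique combination of tag values, which is a widespread design but an assumption here, and the tag names and counts are made up. It says nothing about how any particular vendor prices those series.

```python
from math import prod

def series_count(tag_cardinalities):
    """Worst-case number of distinct time series one metric can produce,
    assuming one series per unique combination of tag values."""
    return prod(tag_cardinalities.values())

# Hypothetical tags and their number of distinct values.
low_cardinality = {"endpoint": 20, "status_class": 5}
high_cardinality = {"endpoint": 20, "status_class": 5, "customer_id": 10_000}

print(series_count(low_cardinality))
print(series_count(high_cardinality))
```

Adding one tag with ten thousand possible values multiplies the worst-case series count by ten thousand, which is why "how you construct it" matters so much.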

Nobody can understand their software or their systems with a metrics tool alone, because the metric is extremely limited in what it can do. No context, no cardinality, no strings… only basic static dashboards. For richer data, we must turn to logs.

Unstructured logs

You can understand much more about your code with logs than you can with metrics. Logs are typically emitted multiple times throughout the execution of the request, with one or a small number of nouns per log line, plus the request ID. Unstructured logs are still the default, although this is slowly changing.

The cost of unstructured logs is driven by a few things:

Write amplification. If you want to capture lots of rich context about the request, you need to emit a lot of log lines. If you are printing out just 10 log lines per request, per service, and you have half a dozen services, that’s 60 log events for every request.
Noisiness. It’s extremely easy to accidentally blow up your log footprint yet add no value—e.g., by putting a print statement inside a loop instead of outside the loop. Here, the usefulness of the data goes down as the bill shoots up.
Constraints on physical resources. Due to the write amplification of log lines per request, it’s often physically impossible to log everything you want to log for all requests or all users—it would saturate your NIC or disk. Therefore, people tend to use blunt instruments like these to blindly slash the log volume:
- Log levels
- Consistent hashes
- Dumb sample rates
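Here is a rough sketch of two of the ideas above: the write amplification arithmetic, and a consistent-hash sampling decision keyed on request ID. The function names are hypothetical and the hashing scheme is just one reasonable choice, not a description of any specific logging product.

```python
import hashlib

def log_events_per_request(lines_per_service: int, services: int) -> int:
    # 10 lines per service across 6 services means 60 log events per request.
    return lines_per_service * services

def keep_request(request_id: str, sample_rate: int) -> bool:
    """Keep roughly 1 in `sample_rate` requests, decided by hashing the ID.

    The decision depends only on the request ID, so every service that sees
    the same request keeps or drops its log lines together.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket % sample_rate == 0
```

This keeps the surviving requests complete, but it is still blind: the hash has no idea which requests were slow, broken, or otherwise interesting.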

When you emit multiple log lines per request, you end up duplicating a lot of raw data; sometimes over half the bits are consumed by request ID, process ID, timestamp. This can be quite meaningful from a cost perspective.
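To get a feel for how much of each line is repeated bookkeeping, you can measure it. The log prefix and messages below are invented for illustration; your own ratio will depend entirely on your format.

```python
def overhead_fraction(prefix: str, message: str) -> float:
    """Share of a log line's bytes taken up by the repeated prefix."""
    prefix_bytes = len(prefix.encode("utf-8"))
    message_bytes = len(message.encode("utf-8"))
    return prefix_bytes / (prefix_bytes + message_bytes)

# Hypothetical prefix repeated on every line: timestamp, process ID, request ID.
prefix = (
    "2024-01-24T12:00:00.000Z pid=4242 "
    "request_id=3f2a9c1e-7b44-4d0a-9a57-1c2e8f6b0d11 "
)
messages = ["cache miss", "db query ok", "rendered page"]

fractions = [overhead_fraction(prefix, m) for m in messages]
```

For short messages like these, the repeated prefix is well over half of every line.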

All of these factors can be annoying. But the worst thing about unstructured logs is that the only thing you can do to query them is full text search. The more data you have, the slower it becomes to search that data, and there’s not much you can do about it.

Searching your logs over any meaningful length of time can take minutes or even hours, which means experimenting and looking around for unknown-unknowns is prohibitively time-consuming. You have to know what to look for in order to find it. Once again, as your logging bill goes up, the value goes down.

Structured logs

Structured logs are gaining adoption across the industry, especially as OpenTelemetry picks up steam. The nice thing about structured logs is that you can actually do things with the data other than slow, dumb string searches. If you’ve structured your data properly, you can perform calculations! Compute percentiles! Generate heatmaps!
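As a small illustration of "perform calculations, compute percentiles," here is a sketch that parses JSON log lines and computes latency percentiles with a simple nearest-rank method. The log lines and field names are invented, and a real tool would do this over far more data with a proper query engine.

```python
import json
import math

# Hypothetical structured log lines, one JSON object per line.
raw_lines = [
    '{"endpoint": "/checkout", "duration_ms": 120, "status": 200}',
    '{"endpoint": "/checkout", "duration_ms": 95, "status": 200}',
    '{"endpoint": "/checkout", "duration_ms": 2300, "status": 500}',
    '{"endpoint": "/checkout", "duration_ms": 110, "status": 200}',
    '{"endpoint": "/checkout", "duration_ms": 105, "status": 200}',
    '{"endpoint": "/login", "duration_ms": 40, "status": 200}',
]

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

events = [json.loads(line) for line in raw_lines]
checkout_ms = [e["duration_ms"] for e in events if e["endpoint"] == "/checkout"]

p50 = percentile(checkout_ms, 50)
p95 = percentile(checkout_ms, 95)
```

None of this is possible with a string search over unstructured text unless you first parse the numbers back out of each line.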

Tools built on structured logs are so clearly the future. But just taking your existing logs and adding structure isn’t quite good enough. If all you do is stuff your existing log lines into key/value pairs, the problems of amplification, noisiness, and physical constraints remain unchanged—you can just search more efficiently and do some math with your data.

There are a number of things you can and should do to your structured logs in order to use them more effectively and efficiently. In order of achievability:

Instrument your code using the principles of canonical logs, which collect all the vital characteristics of a request into one wide, dense event. It is difficult to overstate the value of doing this, for reasons of usefulness and usability as well as cost control.
Add trace IDs and span IDs so you can trace your code using the same events instead of having to use an entirely separate tool.
Feed your data into a columnar storage engine so you don’t have to predefine a schema or indexes to decide which dimensions future you can search or compute based on.
Use a storage engine that supports high cardinality, with an explorable interface.
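The first two items in that list can be sketched in a few lines. This is a minimal, hand-rolled illustration, not OpenTelemetry or any vendor SDK: the `WideEvent` class, the handler, and every field name are hypothetical. The idea is to accumulate context as the request runs and emit exactly one wide event at the end, carrying trace and span IDs.

```python
import json
import time
import uuid

class WideEvent:
    """Accumulates fields during a request and emits one JSON line at the end."""

    def __init__(self, name, trace_id=None, parent_span_id=None):
        self.fields = {
            "name": name,
            "trace_id": trace_id or uuid.uuid4().hex,
            "span_id": uuid.uuid4().hex[:16],
            "parent_span_id": parent_span_id,
        }
        self._start = time.monotonic()

    def add(self, **fields):
        # Call this anywhere you would otherwise have printed a log line.
        self.fields.update(fields)

    def finish(self):
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        return json.dumps(self.fields)

def handle_checkout(user_id, cart_items):
    # Hypothetical handler: one event per request instead of many log lines.
    event = WideEvent("POST /checkout")
    event.add(user_id=user_id, cart_size=len(cart_items))
    event.add(payment_provider="example-pay", status=200)
    return event.finish()
```

A downstream service would create its own `WideEvent` with the same `trace_id` and this event's `span_id` as its `parent_span_id`, which is what lets the same events double as a trace.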

If you go far enough down this path of enriching your structured events, instrumenting your code with the right data, and displaying it in real time, you will reach an entirely different set of capabilities, with a cost model so distinct it can only be described as “observability 2.0.” More on that in a second.

Ballooning costs are baked into observability 1.0

To recap: high costs are baked into the observability 1.0 model. Every pillar has a price.

You have to collect and store your data—and pay to store it—again and again and again, for every single use case. Depending on how many tools you use, your observability bill may be growing at a rate 3x faster than your traffic is growing, or 5x, or 10x, or even more.

It gets worse. As your costs go up, the value you get out of your tools goes down.

Your logs get slower and slower to search.
You have to know what you’re searching for in order to find it.
You have to use blunt-force sampling techniques to keep log volume from blowing up.
Any time you want to be able to ask a new question, you first have to commit new code and deploy it.
You have to guess which custom metrics you’ll need and which fields to index in advance.
As volume goes up, your ability to find a needle in the haystack—any unknown-unknowns—goes down commensurately.

And nothing connects any of these tools. You cannot correlate a spike in your metrics dashboard with the same requests in your logs, nor can you trace one of the errors. It’s impossible. If your APM and metrics tools report different error rates, you have no way of resolving this confusion. The only thing connecting any of these tools is the intuition and straight-up guesses made by your most senior engineers. Which means that the cognitive costs are immense, and your bus factor risks are very real. The most important connective data in your system—connecting metrics with logs, and logs with traces—exists only in the heads of a few people.

At the same time, the engineering overhead required to manage all these tools (and their bills) rises inexorably. With metrics, an engineer needs to spend time auditing your metrics, tracking people down to fix poorly constructed metrics, and reaping those that are too expensive or don’t get used. With logs, an engineer needs to spend time monitoring the log volume, watching for spammy or duplicate log lines, pruning or consolidating them, choosing and maintaining indexes.

But all the time spent wrangling observability 1.0 data types isn’t even the costliest part. The most expensive part is the unseen costs inflicted on your engineering organization as development slows down and tech debt piles up, due to low visibility and thus low confidence.

Is there an alternative? Yes.

The cost model of observability 2.0 is very different

Observability 2.0 has no three pillars; it has a single source of truth. Observability 2.0 tools are built on top of arbitrarily-wide structured log events, also known as spans. From these wide, context-rich structured log events you can derive the other data types (metrics, logs, or traces).
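A toy sketch of what "derive the other data types" can look like: an error-rate metric computed at read time by grouping on any field, and a lookup of trace IDs matching arbitrary criteria. The events, field names, and helper functions are all hypothetical, and a real system would run this in a columnar store rather than a Python list.

```python
from collections import defaultdict

# Hypothetical wide events, one per request.
events = [
    {"trace_id": "t1", "endpoint": "/checkout", "status": 200, "duration_ms": 120, "app_id": "app-7"},
    {"trace_id": "t2", "endpoint": "/checkout", "status": 500, "duration_ms": 900, "app_id": "app-7"},
    {"trace_id": "t3", "endpoint": "/login", "status": 200, "duration_ms": 40, "app_id": "app-9"},
]

def error_rate_by(events, field):
    """Derive an error-rate 'metric' at query time, grouped by any field."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        totals[event[field]] += 1
        if event["status"] >= 500:
            errors[event[field]] += 1
    return {key: errors[key] / totals[key] for key in totals}

def traces_matching(events, **criteria):
    """Find trace IDs for events matching every given field value."""
    return [
        e["trace_id"]
        for e in events
        if all(e.get(k) == v for k, v in criteria.items())
    ]
```

Because the metric is computed from the same records that hold the trace IDs, going from "the error rate for app-7 spiked" to "show me one of those requests" is a filter, not a guess.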

Since there is only one data source, you can correlate and cross-correlate to your heart’s content. You can switch fluidly back and forth between slicing and dicing, breaking down or grouping by events, and viewing them as a trace waterfall. You don’t have to worry about cardinality or key space limitations.

You also effectively get infinite custom metrics, since you can append as many as you want to the same events. Not only does your cost not go up linearly as you add more custom metrics, your telemetry just gets richer and more valuable the more key-value pairs you add! Nor are you limited to numbers; you can add any and all types of data, including valuable high-cardinality fields like “App Id” or “Full Name.”

Observability 2.0 has its own amplification factor to consider. As you instrument your code with more spans per request, the number of events you have to send (and pay for) goes up. However, you have some very powerful tools for dealing with this: you can perform dynamic head-based sampling or even tail-based sampling, where you decide whether or not to keep the event after it’s finished, allowing you to capture 100% of slow requests and other outliers.
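A minimal sketch of the tail-based decision described above, with made-up thresholds and field names. The function looks at a finished event, always keeps errors and slow requests, and keeps only a fraction of everything else. Returning the sample rate alongside the decision is one common way to let kept events be re-weighted later; treat that as an assumption about your pipeline rather than a requirement.

```python
import random

def keep_after_finish(event, slow_ms=1000, baseline_rate=100, rng=random):
    """Decide whether to keep an event once the request has finished.

    Returns (keep, sample_rate). A sample_rate of N means the kept event
    stands in for roughly N similar events.
    """
    if event["status"] >= 500 or event["duration_ms"] >= slow_ms:
        # Errors and slow requests are the outliers: keep all of them.
        return True, 1
    # Ordinary requests: keep roughly 1 in `baseline_rate`.
    return rng.randrange(baseline_rate) == 0, baseline_rate
```

Unlike a log level or a blind hash, this decision is made with the outcome in hand, which is why it can guarantee the interesting requests survive.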

Engineering time is your most precious resource

But the biggest difference between observability 1.0 and 2.0 won’t show up on any invoice. The difference shows up in your engineering team’s ability to move quickly, with confidence.

Modern software engineering is all about hooking up fast feedback loops. And observability 2.0 tooling is what unlocks the kind of fine-grained, exploratory experience you need in order to accelerate those feedback loops.

Where observability 1.0 is about MTTR, MTTD, reliability, and operating software, observability 2.0 is what underpins the entire software development lifecycle, setting the bar for how swiftly you can build and ship software, find problems, and iterate on them. Observability 2.0 is about being in conversation with your code, understanding each user’s experience, and building the right things.

Observability 2.0 isn’t exactly cheap either, although it is often less expensive. But the key difference between o11y 1.0 and o11y 2.0 has never been that either is cheap; it’s that with observability 2.0, when your bill goes up, the value you derive from your telemetry goes up too. You pay more money, you get more out of your tools. As you should.

Interested in learning more? We’ve written at length about the technical prerequisites for observability with a single source of truth (“observability 2.0” as we’ve called it here). Honeycomb was built to this spec; ServiceNow (formerly Lightstep) and Baselime are other vendors that qualify. Click here to get a Honeycomb demo.

CORRECTION: The original version of this document said that “nothing connects any of these tools.” If you are using a single unified vendor for your metrics, logging, APM, RUM, and tracing tools, this is not strictly true. Vendors like New Relic or Datadog now let you define certain links between your traces and metrics, which allows you to correlate between data types in a few limited, predefined ways. This is better than nothing! But it’s very different from the kind of fluid, open-ended correlation capabilities that we describe with o11y 2.0. With o11y 2.0, you can slice and dice, break down, and group by your complex data sets, then grab a trace that matches any specific set of criteria at any level of granularity. With o11y 1.0, you can define a metric up front, then grab a random exemplar of that metric, and that’s it. All the limitations of metrics still apply; you can’t correlate any metric with any other metric from that request, app, user, etc, and you certainly can’t trace arbitrary criteria. But you’re right, it’s not nothing. 😸

Questionable Advice: “My boss says we don’t need any engineering managers. Is he right?”

January 5, 2024 / January 10, 2024 · mipsytipsy · advice, culture, engineering management, hierarchy, organizations, sociotechnical · 18 Comments

I recently joined a startup to run an engineering org of about 40 engineers. My title is VP Engineering. However, I have been having lots of ongoing conflict with the CEO (a former engineer) around whether or not I am allowed to have or hire any dedicated engineering managers. Right now, the engineers are clustered into small teams of 3-4, each of which has a lead engineer — someone who leads the group, but whose primary responsibility is still writing code and shipping product.

I have headcount to hire more engineers in the coming year, but no managers. My boss says we are a startup and can’t afford such luxuries. It seems obvious to me that we need engineering managers, but to him, it seems just as obvious that managers are unnecessary overhead and that all hands should be on deck writing code at our stage.

I don’t know how to make that argument. It seems so obvious to me that I actually struggle to put it into words or make the case for why we should hire EMs. Help?

— Unnecessary Overhead(?!?)

Oh boy, there’s a lot to unpack here.

It is unsurprising to me that your CEO does not understand why managers exist, given that he does not seem to understand why organizational structures exist. 🙈 Why is he micromanaging how you are structuring your org or what roles you are allowed to fill? He hired you to do a job, and he’s not letting you do it. He can’t even explain why he isn’t letting you do it. This does not bode well.

But I do think it’s an interesting question. So let’s pretend he isn’t holding your ability to do your damn job hostage until you defend yourself to his satisfaction. 😒

I can think of two ways to make the case for engineering managers: one is rather complicated, from first principles, and the other very simple, but perhaps unsatisfying.

I personally have a … vigorous … knee-jerk response to authority; I hate being told what to do. It’s only recently that I’ve found my way to an understanding of hierarchy that feels healthy and practical, and that was by looking at it through the lens of systems theory.

Why does hierarchy exist in organizations?

It makes sense that hierarchy comes with a lot of baggage. Many of us have had bad experiences with managers (indeed, entire organizations) where hierarchy was used as a tool of oppression, where people rose up the leadership ranks by hoarding information and playing dominance games, and decisions got made by pulling rank.

Working at a place like that fucking sucks. Who wants to invest their creativity and life force into a place that feels like a Dilbert cartoon, knowing how little it will be valued or reciprocated, and that it will slowly but surely get crushed out of you?

But hierarchy is not intrinsically authoritarian. Hierarchy did not originate as a political structure that humans invented for controlling and dominating one another; it is a property of self-organizing systems, and it emerges for the benefit of the subsystems. In fact, hierarchy is absolutely critical to the adaptability, resiliency, and scalability of complex systems.

Let’s start with a few basic facts about systems, for anyone who may be unfamiliar.

Hierarchy is a property of self-organizing systems

A system is “a network of interdependent components that work together to try to accomplish a common aim” (W. Edwards Deming). A pile of sand is not a system, but a car is a system; if you take out its gas tank, the car cannot achieve its aim.

A subsystem is a collection of elements with a smaller aim inside a larger system. There can be many levels of subsystems that operate interdependently. The subsystems always work to support the needs of the larger system; if the subsystem instead optimizes for its own best interests, the whole system can fail (this is where the term “suboptimize” comes from 😄).

A system is self-organizing if it has the ability to make itself more complex, by diversifying, adapting, and improving itself. As systems self-organize and their complexity increases, they tend to generate hierarchy — an arrangement of systems and subsystems. In a stable, resilient and efficient system, subsystems can largely take care of themselves, regulate themselves, and serve the needs of the larger system, while the larger system coordinates between subsystems and helps them perform better.

Hierarchy minimizes the costs of coordination and reduces the amount of information that any given part of the system has to keep track of, preventing information overload. Information transfer and relationships within a subsystem are much more dense and have fewer delays than information transfer or relationships between subsystems.

(This should all sound pretty familiar to any software engineer. Modularization, amirite?? 😍)

Applying this definition, we can say that a manager’s job is to coordinate between teams and help their team perform better.

The false binary of sociotechnical systems

You’ve probably heard this canard: “Engineers do the technical work, managers do the people work.” I hate it. ☺️ I think it misconstrues the fundamental nature of sociotechnical systems. The “socio” and “technical” of sociotechnical systems are not neatly separable, they are interwoven and interdependent. There is actually precious little that is purely technical work or purely people work; there is a metric shitload of glue work that draws upon both skill sets.

Consider a very partial list of tasks done by any functional engineering org, besides writing code:

Recruiting, networking, interviewing, training interviewers, synthesizing feedback, writing job descriptions and career ladders
Project management for each project or commitment, prioritizing backlog, managing stakeholders and resolving conflicts, estimating size and scope, running retrospectives
Running team meetings, having 1x1s, giving continuous growth feedback, writing reviews, representing the team’s needs
Architecture, code review, refactoring; capturing DORA and productivity metrics, managing alert volume to prevent burnout

A lot of this work can be done by engineers, and often is. Every company distributes the load somewhat differently. This is a good thing! You don’t WANT an org where this work is only done by managers. You want individual contributors to help co-create the org and have a stake in how it gets run. Almost all of this work would be done more effectively by someone with an engineering background.

So you can understand why someone might hesitate to spend valuable headcount on engineering managers. Why wouldn’t you want everyone in engineering to be writing and shipping code as their primary job? Isn’t that by definition the best way to maximize productivity?

Ehhh… 😉

Engineering managers are a useful abstraction

In theory, you could make a list of all the tasks that need to be done to coordinate with other teams and have each item be picked up by a different person. In practice, this doesn’t work, because then everybody would need to know about everything. One of the primary benefits of hierarchy, remember, is to reduce information overload. Intra-team communication should be high-bandwidth and fast; inter-team communication should be more sparse.

As the company scales, you can’t expect everybody to know everyone else; we need abstractions in order to function. Managers are the point of contact and representative for their teams, and they serve as routers for important information.

Graph theory: for each of n engineers to be connected ("aligned") w each other you need n(n-1)/2 links, i.e. you need 780 interactions just to keep consistency. Without a few engineering managers (and arks) he is allowing velocity to build up tech debt at blazing speed.
— HurtingBrain (@HurtingBrain) January 5, 2024
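The arithmetic in that tweet checks out for a 40-person org. A quick sketch, with a function name of my own choosing:

```python
def pairwise_links(n: int) -> int:
    """Number of distinct pairs among n people: n(n-1)/2."""
    return n * (n - 1) // 2

links_for_40 = pairwise_links(40)  # 40 * 39 / 2
links_for_80 = pairwise_links(80)  # doubling headcount roughly quadruples links
```

The count grows with the square of headcount, which is the real reason "everybody talks to everybody" stops working.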

I sometimes imagine managers as the nervous system of the company body, carrying around messages from one limb to another to coordinate actions. Centralizing many or most of these functions into one person lets you take advantage of specialization, as a manager builds relationships and context and improves at their role, and this massively reduces context switching for everyone else.

Manager calendars vs maker calendars

Engineering labor takes concentration and focus. Context switching is expensive, and too many interrupts can be fatal. Management labor consists of context switching every hour or so, and being available for interruptions throughout the day. These are two very different modes of being, headspaces, and calendar schedules, and do not coexist well.

In general, you want people to be able to spend most of their time working on things that contribute to the success of the outcomes they are directly responsible for. Engineers can only do so much glue work before their calendar turns into Swiss cheese and they can no longer deliver on their commitments. Since managers’ calendars are already Swiss cheese, it’s typically less disruptive for them to take on a larger share of glue labor.

It isn’t up to managers to do all the glue work, but it is a manager’s job to make sure that everything that needs to get done, does get done. It is a manager’s job to try to line up every engineer with work that is interesting and challenging, but not overwhelming, and to ensure that unpleasant labor gets equitably distributed. It’s also a manager’s job to make sure that if we are asking someone to do a job, they are equipped with the resources they need to succeed at that job. Including time to focus.

Management is a tool for accountability

When you’re an engineer, you are responsible for the software you develop, deploy, and maintain. When you’re a manager, you are responsible for your team and the organization as a whole.

Management is one way of holding people accountable for specific outcomes (building teams with the right skills, relationships, and processes to make good decisions and build value for the company), and equipping them with the resources (budget, tools, headcount) to achieve those outcomes. If you aren’t making building the organization someone’s number one job, it won’t be anyone’s number one job, which means it probably won’t get done very well. And whose responsibility will that be, Mr. CEO?

There’s a real upper limit to what you can reasonably expect tech leads, or engineers, or anyone whose actual job is shipping software to do in their “spare time”. If you’re trying to hold your tech leads responsible for building healthy engineering teams, tools, and processes, you are asking them to do two calendarily incompatible jobs with only one calendar. The likeliest scenario is that they will focus on the outcomes they feel comfortable owning (the technical ones), while you pile up organizational debt in the background.

In natural hierarchies, we look up for purpose and down for function. That, in a nutshell, is the more complicated argument for why we need engineering managers.

Choose Boring technology Culture

The simpler argument is this: most engineering orgs have engineering managers. That’s the default. Lots of people much smarter than you or me have spent lots of time thinking and tinkering with org structures over the years, and this is what we’ve got.

As Dan McKinley famously said, we should “choose boring technology”. Boring doesn’t mean bad, it means the capabilities and failure conditions are well understood. You only ever get a few innovation tokens, so you should spend those wisely on core differentiators that could make or break your business. The same goes for culture. Do you really want to spend one of your tokens on org structure? Why??

For better or for worse, the hierarchical org structure is well understood. There are plenty of people on the job market who are proficient at managing or working with managers, and you can hire them. You can get training, coaching, or read a lot of self-help books. There are various management philosophies you can coalesce around or use to rule people out. On the other hand, the manager-free experiments I’m aware of (e.g. holacracy at Medium and GitHub, or “Choose Your Own Work” at Linden Lab) have all been quietly abandoned or outgrown. Not, in my experience, because leaders went mad for power, but due to chaos, lack of focus, and poor execution.

When there is no explicit structure or hierarchy, the result is not freedom and egalitarianism, it’s “informal, unacknowledged, and unaccountable leadership”, as famously detailed in “The Tyranny of Structurelessness”. In reality, sadly, these teams tend to be chaotic, fragile, and frustrating. I know! I’m pissed too! 😭

This argument doesn’t necessarily prove your CEO is wrong, but I should think his bar for proof is much higher than yours. “I don’t want any of my engineers to stop writing code” is not an argument. But I’m also feeling like I haven’t quite addressed the core question of productivity, so let’s pick that up once more.

More lines of code != more productivity

To briefly recap: we were talking about an org with ~40 engineers, broken up into 10 small clusters of 3-4 engineers, each with a tech lead. Your CEO is arguing that you can’t afford to lose any velocity, which he thinks is what would happen if anyone stops writing code full time.

Maybe. But everything I have ever experienced leads me to believe that a smaller number of larger teams, each helmed by an experienced engineering manager, should way outperform this gaggle of tiny groups. It’s not even close. And they can do so in a way that’s more efficient, sustainable, and humane than this scrappy death march.

And systems thinking shows us why! With fewer groups, but larger ones, you have less overall management overhead, and much less of the slow and costly inter-group coordination. You unlock rich, dense knowledge transfer within groups, which gives you more shared coverage of the surface area. With 7-9 engineers per group you can build a real on-call rotation, which means fewer heroics and less burnout. The coordination that you do need to do can be more strategic, less tactical, and much more forward-looking.
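A back-of-the-envelope way to see the effect of fewer, larger groups is to count pairs. This is a deliberately crude model of my own, assuming every pair of teams is a potential coordination path and every pair of teammates is a potential knowledge-sharing path; real orgs are messier.

```python
def pairs(n: int) -> int:
    """Number of distinct pairs among n things."""
    return n * (n - 1) // 2

def coordination_profile(team_count: int, team_size: int) -> dict:
    return {
        # Slow, sparse paths that cross team boundaries.
        "between_team_pairs": pairs(team_count),
        # Fast, dense paths inside a single team.
        "within_team_pairs_per_team": pairs(team_size),
    }

ten_small = coordination_profile(team_count=10, team_size=4)
five_large = coordination_profile(team_count=5, team_size=8)
```

Going from ten teams of four to five teams of eight cuts the cross-team pairs from 45 to 10, while the pairs inside each team go from 6 to 28.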

Would five big teams ship as many lines of code as 10 small teams, even if five engineers become managers and stop writing code? Probably, but who cares? Your customers give zero fucks how many lines of code you write. They care about whether you are building the right things and solving problems that matter to them. What matters is moving the business forward, not churning out code. Don’t forget, the act of churning out code creates costs and externalities in and of itself.

What defines your velocity is that you spend your time on the right things. Learning to make good decisions about what to build is something every organization has to work out for itself, and it is always an ongoing work in progress. Engineering managers don’t do all the work or make all the decisions, but they are absolutely fucking vital, in my experience, to ensuring that work happens and is done well. As I wrote in my last piece, engineering managers are the embodiment of the feedback loops that systems use to learn and improve.

Are managers ever unnecessary overhead?

Sure, absolutely. Management is about coordinating between teams and helping teams run more optimally, so anything that decreases your need for coordination also decreases your need for management. If you are a small company, or if you have really senior folks who are used to working together, you need a lot less coordination. The next most relevant factor is probably the rate of change; if you’re growing fast or have a lot of turnover, or if there’s a lot of time pressure or frequent shifts in strategy, your need for managers goes up. But there are plenty of smaller orgs out there that are doing just fine without a lot of formal management.

Look, I’m not a fan of the word “overhead”, because a) it’s kind of rude and b) people who call managers “overhead” are typically people who disrespect or do not value the craft of management.

But management is, in fact, overhead. 😅 So is a lot of other glue work! By which I mean the work is important, but does not itself move the business forward; we should do as much of it as absolutely necessary and no more. The nature of glue work is such that it too easily expands to consume all available time and space (and then some). Constraints are good. Feeling a bit under-resourced is good, and should be the norm. It is incredibly easy for management to get a bit bloated, and managers can be very loath to acknowledge this, because it’s not like they ever feel any less stressed or stretched.[*]

Management is also very much like operations work in that when it’s being done well, it’s invisible. Evaluating managers can be very hard, especially in the near term, and making decisions about when it’s time to create or pay down organizational debt is a whole nother ball of wax, and way outside the scope of this post.

But yes, managers can absolutely be unnecessary overhead.

However, if you have 40 engineers all reporting to one VP, and nobody else whose number one job is the outcomes related to people, teams and org, I feel pretty safe in saying this is not a risk for you at this time.