Questionable Advice: “How can I drive change and influence teams…without power?”

June 5, 2023July 14, 2023 mipsytipsyadvice, deploys, engineering, influence, management1 Comment

Last month I got to attend GOTO Chicago and give a talk about continuous deployment and high-performing teams. Honestly I did a terrible job, and I’m not being modest. I had just rolled off a delayed redeye flight; I realized partway through that I had the wrong slides loaded, and my laptop screen was flashing throughout the talk, which was horribly distracting and means I couldn’t read the speaker notes or see which slide was next. 😵 Argh!

Anyway, shit happens. BUT! I got to meet some longstanding online friends and acquaintances (hi JJ, Avdi, Matt!) and got to eat some of Hillel Wayne’s homemade chocolates, and the Q&A session afterwards was actually super fun.

My talk was about what high performing teams look like and why it’s so important to be on one (spoiler: because this is the #1 way to become a radically better engineer!!). Most of the Q&A topics therefore came down to some version of “okay, so how can I help my team get there?” These are GREAT questions, so I thought I’d capture a few of them for posterity.

But first… just a reminder that the actual best way to persuade people to listen to you is to make good decisions and display good judgment. Each of us has an implicit reputation score, which formal power can only overcome to an extent. Even the most junior engineer can work up a respectable reputation over time, and even principal engineers can fritter theirs away by shooting off at the mouth. 🥰

“how can I drive change when I have no power or influence?”

This first question came from someone who had just landed their first real software engineering job (congrats!!!):

“This is my first real job as a software engineer. One other junior person and myself just formed a new team with one super-senior guy who has been there forever. He built the system from scratch and knows everything about it. We keep trying to suggest ideas like the things you talked about in your talk, but he always shoots us down. How can we convince him to give it a shot?”

Well, you probably can’t. ☺️ Which isn’t the end of the world.

If you’re just starting to write software every day, you are facing a healthy learning curve for the next 3-5 years. Your one and only job is to learn and practice as much you possibly can. Pour your heart and soul into basic skills acquisition, because there really are no shortcuts. (Please don’t get hooked on chatGPT!!)

I know that I came down hard in my talk on the idea that great engineers are made by great teams, and that the best thing most people can do for their career is to join a high-performing, fast-moving team. There will come a time where this is true for you too, but by then you will have skills and experience, and it will be much easier for you to find a new job, one with a better culture of learning.

It is hard to land your first job as a software engineer. Few can afford to be picky. But as long as you are a) writing code every day, b) debugging code every day, and c) getting good feedback via code reviews, this job will get you where you need to go. When you’re fluent and starting to mentor others, or getting into higher level architecture work, or when you’re starting to get bored … then it’s time to start looking for roles with better teachers and a more collaborative team, so your growth doesn’t stall. (Please don’t fall into the Trap of the Premature Senior.)

This is an apprenticeship industry. You’re like a med student right now, who is just starting to do rounds under the supervision of an attending physician (your super-senior engineer). You can kinda understand why he isn’t inclined to listen to your opinions on his choice of stethoscope or how he fills out a patient chart. A better teacher would take time to listen and explain, but you already know he isn’t one. 🤷

I only have one piece of advice. If there’s something you want to try, and it involves doing engineering work, consider tinkering around and building it after hours. It’s real hard to say no to someone who cares enough to invest their own time into something.

“how can I drive change when I am a tech lead on a new team?”

“I have the same question! — except I’m a tech lead, so in theory I DO have some power and influence. But I just joined a new team, and I’m wondering what the best way is to introduce changes or roll them out, given that there are soooo many changes I’d like to make.”

(I wrote a somewhat scattered post a few years ago on engineers and influence, or influence without authority, which covers some related territory.)

As a tech lead who is new to a team, busting at the seams with changes I want to make, here’s where I’d start:

Understand why things are the way they are and get to know the personalities on your team a bit before you start pitching changes. (UNLESS they are coming to you with arms outstretched, pleading desperately for changes ~fast~ because everything is on fire and they know they need help. This does happen!)
Spend some time working with the old systems, even if you think you already understand. It’s not enough for you to know; you need to take the team on this journey with you. If you expect your changes to be at all controversial, you need to show that you respect their work and are giving it a chance.
Change one thing at a time, and go for the developer experience wins first. Address things that will visibly pay off for your team in terms of shipping faster, saving time, less frustration. You have no credibility in the beginning, so you want to start racking up wins before you take on the really hard stuff.
Roll up your sleeves. Nothing buys a leader more goodwill than being willing to do the scut work. Got a flaky test suite that everybody has been dreading trying to fix? I smell opportunity…
Pitch it as an experiment. If people aren’t sold on your idea for e.g. code review SLAs, ask if they’d be willing to try it out for three weeks just as an experiment.
Strategically shop it around to the rest of the team, if you sense there will be resistance…

At this point in my answer 👆 I outlined a technique for persuading a team and building support for a plan or an idea, especially when you already know it’s gonna be an uphill battle. Hillel Wayne said I should write it up in a blog post, so here it is! (I’ll do anything for free chocolate 😍)

“How can I get people on board with my controversial plan?”

So you have a great idea, and you’re eager to get started. Awesome!!! You believe it’s going to make people’s lives better, even though you know you are going to have to fight tooth and nail to make it happen.

What NOT to do:

Walk into the team meeting and drop your bomb idea on everyone cold:

“Hey, I think we should stop shipping product changes until we fix our build pipeline to the point where we can auto-deploy each merge set to production, one at a time, in under an hour.” ~ (for example)

…. then spend the rest of the hour grappling with everybody’s thoughts, feelings, and intense emotional reactions, before getting discouraged and slinking away, vowing to never have another idea, ever again.

What to do instead:

Suss out your audience. Who will be there? How are they likely to react? Are any of them likely to feel especially invested in the existing solution, maybe because they built it? Are any of them known for their strong opinions or being combative?

Great!!! Your first move is to have a conversation with each of them. Approach them in the spirit of curiosity, and ask what they think of your idea. Talking with them will also help you hash out the details and figure out if it is actually a good idea or not.

Your goal is to make the rounds, ask for advice, identify any allies, and talk your idea through with anybody who is likely to oppose you…before the meeting where you intend to unveil your plan. So that when that happens, you have:

given people the chance to process their reactions and ask questions in private
ensured that key people will not feel surprised, threatened, or out of the loop
already heard and discussed any objections
ideally, you have earned their support!

Even if you didn’t manage to convince every person, this was still a valuable exercise. By approaching people in advance, you are signaling that you respect them and their voice matters. You are always going to get people’s absolute worst reactions when you spring something on them in a group setting; any anxiety or dismay will be amplified tenfold. By letting them reflect and ask questions in private, you’re giving time for their better selves to emerge.

What to do instead…if you’re a manager:

As an engineer or a tech lead, you sometimes end up out front and visible as the owner of a change you are trying to drive. This is normal. But as a manager, there are far more times when you need to influence the group but not be the leader of the change, or when you need to be wary of sounding like you are telling people what to do. These are just a few of the many reasons it can be highly effective to have other people arguing on your behalf.

In the ideal scenario, particularly on technical topics, you don’t have to push for anything. All you do is pose the question, then sit back and listen as vigorous debate ensues, with key stakeholders and influential engineers arguing for your intended outcome. That’s a good sign that not only are they convinced, they feel ownership over the decision and its execution. This is the goal! 🌈

It’s not just about persuading people to agree with you, either. Instead of having a shitty dynamic where engineers are attached to the old way of doing things and you are “dragging them” into the newer ways against their will, you are inviting them to partner with you. You are offering them the opportunity to lead the team into the brave new world, by getting on board early.

(It probably goes without saying, but always start with the smallest relevant group of stakeholders, and not, say, all of engineering, or a group that has no ownership over the given area. 🙃 And … even this strategy will stop working rather quickly, if your controversial ideas all turn out to be disastrous. 😉)

“How do I know where to even start?!? 😱”

Before I wrap up, I want to circle back to the question from the tech lead about how to drive change on a team when you do have some influence or power. He went on to say (or maybe this was from a third questioner?*):

“There is SO MUCH I’d like to do or change with our culture and our tech stack. Where can I even start??”

Yeah, it can be pretty overwhelming. And there are no universal answers… as you know perfectly well, the answer is always “it depends.” ☺️ But in most cases you can reduce the solution space substantially to one of the two following starting points.

1. Can you understand what’s going on in your systems? If not, start with observability.

It doesn’t have to be elegant or beautiful; grepping through shitty text logs is fine, if it does the trick. But do any of the following make you shudder in recognition?:

If I get paged, I might lose the rest of the afternoon trying to figure out what happened
Our biggest problem is performance and we don’t know where the time is going
We have a lot of flaky, flappy alerts, and unexplained outages that simply resolve themselves without our ever truly understanding what happened.

If you can’t understand what’s going on in your system, you have to start with instrumentation and observability. It’s just too deadly, and too risky, not to. You’re going to waste a ton of time stabbing around in the dark trying to do anything else without visibility. Put your glasses on before you start driving down the freeway, please.

2. Can you build, test and deploy software in under an hour? If not, start with your deploy pipeline.

Specifically, the interval of time between when the code is written and when it’s being used in production. Make it shorter, less flaky, more reliable, more automated. This is the feedback loop at the heart of software engineering, which means that it’s upstream from a whole pile of pathologies and bullshit that creep in as a consequence of long, painful, batched-up deploys.

Here’s a talk I’ve given a few times on why this matters so much:

You pretty much can’t fail with one of those two; your lives will materially improve as you make progress. And the iterative process of doing them will uncover a great deal of shit you should probably know about.

Cheers! 🥂

charity.

* My apologies if I remembered anyone’s question inaccurately!

Architects, Anti-Patterns, and Organizational Fuckery

March 9, 2023July 14, 2023 mipsytipsyarchitects, architecture, careers, engineering, management12 Comments

I recently wrote a twitter thread on the proper role of architects, or as I put it, tongue-in-cheek-ily, whether or not architect is a “bullshit role”.

I don't know if this is a troll or not, but with all apologies to my many wonderful, highly skilled "Architect" friends, I tend to think it's a bullshit role. 🙃

I believe that only the people building software systems get to have opinions on how those systems get built. https://t.co/aUShYfyIKY
— Charity Majors (@mipsytipsy) February 22, 2023

It got a LOT of reactions (2.5 weeks later, the thread is still going!!), which I would sort into roughly three camps:

“OMG this resonates; this matches my experiences working with architects SO MUCH”,
“I’m an architect, and you’re not wrong”, and
“I’m an architect and I hate you.”

Some of your responses (in all three categories!) were truly excellent and thought-provoking. THANK YOU — I learned a ton. I figured I should write up a longer, more readable, somewhat less bombastic version of my original thread, featuring some of my favorite responses.

Where I’m Coming From

Just to be clear, I don’t hate architects! Many of the most brilliant engineers I have ever met are architects.

Nor do I categorically believe that architects should not exist, especially after reading all of your replies. I received some interesting and compelling arguments for the architect role at larger enterprises, and I have no reason to believe they are not true.

Also, please note that I personally have never worked at a company with “architect” as a role. I have also never worked anywhere but Silicon Valley, or at any company larger than Facebook. My experiences are far from universal. I know this.

Feel like architecture is a leaky abstraction for organizational disfunction.
— rocpatel (@rocpatel) February 22, 2023

Let me get suuuuuper specific here about what I’m reacting to:

When I meet a new “architect”, they tend toward the extremes: either world class and amazing or useless and out of touch, with precious little middle ground.
When I am interviewing someone whose last job title was “architect”, they often come from long tenured positions, and their engineering skills are usually very, very rusty. They often have a lot of detailed expertise about how their last company worked, but not a lot of relevant, up-to-date experience.
Because of 👆, when I see “architect” on a job ladder, I tend to feel dubious about that org in a way I do not when I see “staff engineer” or “principal engineer” on the ladder.

What I have observed is that the architect role tends to be the locus of a whole mess of antipatterns and organizational fuckery. The role itself can also be one that does not set up the people who hold it for a successful career in the long run, if they are not careful. It can be a one-way street to being obsolete.

I think that a lot of companies are using some of their best, most brilliant senior engineers as glorified project manager/politicians to paper over a huge amount of organizational dysfunction, while bribing them with money and prestige, and that honestly makes me pretty angry. 😡

But title is not destiny. And if you are feeling mad because none of what I’ve written applies to you, then I’m not writing about you! Live long and prosper. 🖖

Architect Anti-patterns and fuckery

people sometimes talk about "flattening" an org structure, but they may really just lack efficient/durable cross-tree links (which doesn't require a reorg)! easier said than done of course
— d@nny "disc@" mcClanahan (@hipsterelectron) March 7, 2023

There is no one right way to structure your org and configure your titles, any more than there is any one right way to architect your systems and deploy your services. And there is an eternal tension between centralization and specialization, in roles as well as in systems.

Most of the pathologies associated with architects seem to flow from one of two originating causes:

unbundling decision-making authority from responsibility for results, and
design becoming too untethered from execution (the “Frank Gehry” syndrome)

But it’s only when being an architect brings more money and prestige than engineering that these problems really tend to solidify and become entrenched.

Skin In The Game

I do agree alot here. You should never stop coding, never stop reviewing, never stop implementing what you design, do not fall in the trap of "just thinking alot", you have to do alot, be there, solve the messy problems, dont expect others will do it, pair with them. https://t.co/frpn0gMCsX
— Lucas Coppio (@lscoppio) February 23, 2023

When that happens, you often run into the same fucking problem with architects and devs as we have traditionally seen with devs and ops. Only instead of “No, I can’t be on call or get woken up, my time is far too valuable, too busy writing important software”, the refrain is, “No, I can’t write software or review code, my time is far too valuable, I’m much too busy telling other people how to do their jobs.”

This is also why I think calling the role “architect” instead of “staff engineer” or “principal engineer” may itself be kind of an anti-pattern. A completely different title implies that it’s a completely different job, when what you really want, at least most of the time, is an engineer performing a slightly different (but substantially overlapping) set of functions as a senior engineer.

Taleb agreeshttps://t.co/oUoeGL9ytO
— Mathias Lafeldt (@mlafeldt) March 6, 2023

My core principle here is simple: only the people responsible for building software systems get to make decisions about how those systems get built. I can opine all I want on your architecture or ours, but if I’m not carrying a pager for you, you should probably just smile politely and move along.

Technical decisions should be ultimately be made by the people who have to live with the consequences. But good architects will listen to those people, and help co-create architectural decisions that take into account local, domain, and enterprise perspectives (a Katy Allred quote).

Architecture is a core engineering skill

When you make architecture “someone else’s problem” and scrap the expectation that it is a core skill, you get weaker engineers and worse systems.

Learning to see the forest as well as the trees, and factor in security, maintainability, data integrity and scale, performance, etc is a *critical* part of growing up as an engineer into senior roles.

"When you make architecture someone else's job and scrap the expectation that it's a core skill, you get weaker engineers & worse systems." - this part resonated with me very much.

You can also replace "architecture" with "security", "testing" or various others.
— C (@_cd83) March 6, 2023

The story of QA is relevant here. Once upon a time, every technical company had a QA department to test their code and ensure quality. Software engineers weren’t expected to write tests for their code — that was QA’s job. Eventually we realized that we wrote better software when engineers were held responsible for writing their own tests and testing their own code.

Developers howled and complained: they didn’t have time! they would never get anything built! But it gradually became clear that while it may take more time up front to write and test code, it saved immensely more time and pain in the longer run because the code got so much better and problems got found so much earlier.

It’s not like we got rid of QA — QA departments still exist, especially in some industries, but they are more like consulting experts. They write test suites and test software, but more importantly they are a resource to make sure that everybody is writing good tests and shipping quality software.

I think the root problem is too many devs don't THINK in terms of systems architecture. So many orgs try to "centralize" that function vs. fix the problem of helping devs to think differently & how they integrate.
— Jason McIntosh (@mc717990) February 22, 2023

This was long enough ago that most people writing code today probably don’t remember this. (It was mostly before my own time as well.) But you hear echoes of the same arguments today when engineers are complaining about having to be on call for their code, or write instrumentation and operate their code in production.

The point is not that every engineer has to do everything. It’s that there are elements of testing, operations, and architecture that every software engineer needs to know in order to write quality code — in order to not make mistakes that will cost you dearly down the line.

Specialists are not here to do the job for you, they’re to help you do the job better.

When embedded in a team, the architects would write an awful lot of code. But when on the arch team, the role was more about talking to other engineers, reading a lot of code, and generating a lot of diagrams and documentation about the current/future systems as your output.
— Pay Tree Arch, Context Dependent (@paytreearch) March 5, 2023

“Architect” Done Right

I’ve had various Enterprise Architect roles over the past 15 years. It’s a common role to dunk on, because we’ve historically done a terrible job with enablement. When done well, the role requires constant movement to stay relevant with teams. You can’t only be high level. https://t.co/LxPTJAU3l5
— Kevin Swiber (@kevinswiber) March 5, 2023

If you must have architects at all, I suggest:

Grow your architects from within. The best high-level thinkers are the ones with a thorough grounding in the context and the particulars.
Be clear about who gets to have opinions vs who gets to make decisions. Having architects who consult, educate, and support is terrific. Having “pigeon architects” who “swoop and poop” — er, make technical decisions for engineers to implement — is a recipe for resentment and weak architectures.
Pay them the same as your staff or principal engineers, not dramatically more. Create an org structure that encourages pendulum swings between (eng, mgr, arch) roles, not one with major barriers in form of pay or level disparities.
Consider adopting one of the following patterns, which do a decent job of evading the two main traps we described above.

If your architects don’t have the technical skills, street cred, or time to spend growing baby engineers into great engineers, or mentoring senior engineers in architecture, they are probably also crappy architects. (another Katy Allred quote)

The “Embedded Architect” (aka Staff+ Engineer)

"Architecture Astronaut" -- n. Someone with 'architect' in their job title, but who doesn't code.
— ./adhddd # 🍁 (@douglasdd) February 22, 2023

The most reliable way I know to align architecture and engineering goals is for them to be done by the same team. When one team is responsible for designing, developing, maintaining, and operating a service, you tend to have short, tight, feedback loops that let you ship products and iterate swiftly.

Here is one useful measure of your system’s complexity and the overhead involved in making changes:

“How long does it take you to ship a one-character fix?”

There are many other measures, of course, but this is one of the most important. It gets to the heart of why so many engineers get fed up with working at big companies, where the overhead for change is SO high, and the threshold for having an impact is SO long and laborious.

https://twitter.com/jetpack/status/1633005928399384576

The more teams have to be involved in designing, reviewing, and making changes, the slower you will grind. People seem to accept this as an inevitability of working in large and complex systems far more than I think they should.

Embedding architecture and operations expertise in every engineering team is a good way to show that these are skills and responsibilities we expect every engineer to develop.

This is the model that Facebook had. It is often paired with,

The “Architecture Group” of Practicing Engineers

twitter had a Twitter Architecture Group (TAG) composed of practicing engineers across the co (two of them were on my team). they met once or twice a month and collabed on docs and i thought it was a great way to ensure that people doing the work are the ones making decisions
— d@nny "disc@" mcClanahan (@hipsterelectron) March 5, 2023

Every company eventually needs a certain amount of standardization and coordination work. Sometimes this means building out a “Golden Path” of supported software for the organization. Sometimes this looks like a platform engineering team. Sometimes it looks like capacity planning years worth of hardware requirements across hundreds of teams.

I’ve seen this function fulfilled by super-senior engineers who come together informally to discuss upcoming projects at a very high level. I’ve seen it fulfilled by teams that are spun up by leadership to address a specific problem, then spun down again. I’ve seen it fulfilled by guilds and other formal meetings.

Although systems design (and thus architecture) are real, I tend to agree with Charity that you shouldn’t be doing it unless you’re also very close to the implementation details. It’s way too easy to miss a critical detail that derails your entire plan otherwise. And not notice.
— apenwarr (@apenwarr) March 5, 2023

These conversations need to happen, absolutely no question about it. The question is whether it’s some people’s full time job, or one of many part-time roles played by your most senior engineers.

I’m more accustomed to the latter. Pro: it keeps the conversations grounded in reality. Con: engineers don’t have a lot of time to spend interfacing with other groups and doing “project management” or “stakeholder management”, which may be a sizable amount of work at some companies.

The “architect-engineer” pendulum

Great topic. Yes and no for me. I definitely grew from within… but I feel like “waves” are needed… You come from deep in the systems/code. You surface use exp to influence and set direction. Then at some point you have to dive deep again.. repeat repeat.
— Brian Chambers (@BriChamb) March 5, 2023

The architect-to-engineer pendulum seems like the only strategy short of embedded architects / shared ownership that seems likely to yield consistently good results, in my opinion.

The reasoning behind this is similar to the reasons for saying that engineering managers should probably spend some time doing hands-on work every few years. You need to be a pretty good engineer before you can be a good engineering manager or a good architect, and 5+ years after doing any hands-on work, you probably aren’t one anymore.

I’ve done the architect to engineer pendulum as many times as the IC to manager pendulum. There are contexts in which architects are useful (eg integrating 3rd party systems like platforms) - requiring them to code (Musk-style) doesn’t mean they take better design decisions.
— Jan Peuker (@janpeuker) March 5, 2023

If you’re the type of architect that is part of an engineering team, partly responsible for a product, shipping code for that product, or on call for that product, this may not apply to you. But if you’re the type of architect that spends little if any time debugging/understanding or building the systems you architect, you should probably make a point of swinging back and forth every few years.

The “Time-Share Architect”

Is there precedent in large organizations for Architects with a capital "A" being more of a "Tour of Duty" type role for senior-most engineers, something akin to a really extended "pager rotation"?

It seems better suited for that vs. a permanent role.
— Eric Nograles (@grales) March 1, 2023

This one has aspects of both the “Architecture Working Group” and the “Architect-Engineer Pendulum”. It treats architecture is a job to be done, not a role to be occupied. Thinking of it like a “really extended pager rotation” is an interesting idea.

Somewhat relatedly — at Honeycomb, “lead engineer” is a title attached to a particular project, and refers to a set of actions and responsibilities for that project. It isn’t a title that’s attached to a particular person. Every engineer gets the opportunity to lead projects (if they want to), and everybody gets a break from doing the project management stuff from time to time. The beautiful thing about this is that everybody develops key leadership skills, instead of embodying them in a single person.

The important thing is that someone is performing the coordination activities, but the people building the system have final say on architecture decisions.

The “Advisor Architect”

I agree with this! I do _some_ coding but mostly I'm product, so I may have suggestions or ask questions about broad architecture/eng choices, but every one comes with a reminder that they can disregard my suggestions entirely. They don't even have to say why.
— Michael Snook (@michaelsnook) March 5, 2023

I honestly have no problem with architects who are not seen as senior to, and do not have opinions overriding those of, the senior engineers who are building and maintaining the system.

Engineers who are making architectural decisions should consult lots of sources and get lots of opinions. If architects provide educated opinions and a high level view of the systems, and the engineers make use of their expertise, well that’s fan fucking tastic.

If architects are handing them assignments, or overriding their technical decisions and walking off, leaving a mess behind … fuck that shit. That’s the opposite of empowerment and ownership.

The most effective architecture organization I've ever seen effectively operated as an internal analyst organization. Think captive Gartner (or Redmonk, ideally), working with senior engineering resources to identify and document internally.
— Jonathan Altman (@async_io) February 22, 2023

The “skin in the game” rule of thumb still holds, though. The less an architect is exposed to the maintenance and operational consequences of decisions, the less sway their opinion should hold with the group. It doesn’t mean it doesn’t bring value. But the limitations of opinions at a distance should be made clear.

The Threat to Architects’ Careers

It’s super flattering to be told you are just too important, your time is too valuable for you to fritter it away on the mundane acts of debugging and reviewing PRs. (I know! It feels great!!!) But I don’t think it serves you well. Not you, or your team, your company, customers, or the tech itself.

And not *every* architect role falls into this trap. But there’s a definite correlation between orgs that stop calling you “engineers” and orgs that encourage (or outright expect) you to stop engineering at that level. In my experience.

Am an architect. 100% agree. Your engineering skills begin to degrade and your project management skills grow slowly but weakly.
— crozierm (@crozierm) February 22, 2023

But your credibility, your expertise, your moral authority to impose costs on the team are all grounded in your fluency and expertise with this codebase and this production system — and your willingness to shoulder those costs alongside them. (All the baby engineers want to grow up to be a principal engineer like this.)

But if you aren’t grounded in the tech, if you don’t share the burden, your direction is going to be received with some (or a LOT of) cynicism and resentment. Your technical work will also be lower quality.

Furthermore, you’re only hurting yourself in the long run. Some of the most useless people I’ve ever met were engineers who were “promoted” to architect many, many years ago, and have barely touched an editor or production shell since. They can’t get a job anywhere else, certainly not with comparable status or pay, and they know it. 🤒

I get paid to be an architect, and at least half of the architects I work with live in fear that someone will discover they've forgotten how to code, the rest of us are devs that get paid a bit more to sometimes draw a picture.
— canihavebiscuit (@canihavebiscuit) February 23, 2023

They may know EVERYTHING about the company where they work, but those aren’t transferable skills. They have become a super highly paid project manager.

And as a result … they often become the single biggest obstacle to progress. They are just plain terrified of being automated out of a job. It is frustrating to work with, and heartbreaking to watch. 💔

Don’t become that sad architect. Be an engineer. Own your own code in production. This is the way.

Coda: On “Solutions Architects”

You might note that I didn’t include solutions architects in this thread. There is absolutely a real and vibrant use for architects who advise. The distinction in my mind is: who has the last word, the engineers or the architect? Good engineering teams will seek advice from all kinds of expert sources, be they managers or architects or vendors.

My complaint is only with “architects” who are perceived to be superior to, and are capable of overruling the judgments of, the engineering team.

Exceptions abound; the title is not the person. My observations do not obviate your existence as a skilled technologist. You obviously know your own role better than I do. 🙃

charity

Deploys Are The ✨WRONG✨ Way To Change User Experience

March 8, 2023July 14, 2023 mipsytipsycontinuous deployment, crossposted, deploys, engineering, feature flags, friday deploys2 Comments

This piece was first published on the honeycomb.io blog on 2023-03-08.

….

I’m no stranger to ranting about deploys. But there’s one thing I haven’t sufficiently ranted about yet, which is this: Deploying software is a terrible, horrible, no good, very bad way to go about the process of changing user-facing code.

It sucks even if you have excellent, fast, fully automated deploys (which most of you do not). Relying on deploys to change user experience is a problem because it fundamentally confuses and scrambles up two very different actions: Deploys and releases.

Deploy

“Deploying” refers to the process of building, testing, and rolling out changes to your production software. Deploying should happen very often, ideally several times a day. Perhaps even triggered every time an engineer lands a change.

Everything we know about building and changing software safely points to the fact that speed is safety and smaller changes make for safer deploys. Every deploy should apply a small diff to your software, and deploys should generally be invisible to users (other than minor bug fixes).

Release

“Releasing” refers to the process of changing user experience in a meaningful way. This might mean anything from adding functionality to adding entire product lines. Most orgs have some concept of above or below the fold where this matters. For example, bug fixes and small requests can ship continuously, but larger changes call for a more involved process that could mean anything from release notes to coordinating a major press release.

A tale of cascading failures

Have you ever experienced anything like this?

Your company has been working on a major new piece of functionality for six months now. You have tested it extensively in staging and dev environments, even running load tests to simulate production. You have a marketing site ready to go live, and embargoed articles on TechCrunch and The New Stack that will be published at 10:00 a.m. PST. All you need to do now is time your deploy so the new product goes live at the same time.

It takes about three hours to do a full build, test, and deploy of your entire system. You’ve deployed as much as possible in advance, and you’ve already built and tested the artifacts, so all you have to do is a streamlined subset of the deploy process in the morning. You’ve gotten it down to just about an hour. You are paranoid, so you decide to start an hour early. So you kick off the deploy script at 8:00 a.m. PST… and sit there biting your nails, waiting for it to finish.

SHIT! 20 minutes through the deploy, there’s a random flaky SSH timeout that causes the whole thing to cancel and roll back. You realize that by running a non-standard subset of the deploy process, some of your error handling got bypassed. You frantically fix it and restart the whole process.

Your software finishes deploying at 9:30 a.m., 30 minutes before the embargoed articles go live. Visitors to your website might be confused in the meantime, but better to finish early than to finish late, right? 😬

Except… as 10:00 a.m. rolls around, and new users excitedly begin hitting your new service, you suddenly find that a path got mistyped, and many requests are returning 500. You hurriedly merge a fix and begin the whole 3-hour long build/test/deploy process from scratch. How embarrassing! 🙈

Deploys are a terrible way to change user experience

The build/release/deploy process generally has a lot of safeguards and checks baked in to make sure it completes correctly. But as a result…

It’s slow
It’s often flaky
It’s unreliable
It’s staggered
The process itself is untestable
It can be nearly impossible to time it right
It’s very all or nothing—the norm is to roll back completely upon any error
Fixing a single character mistake takes the same amount of time as doubling the feature set!

Changing user-visible behaviors and feature sets using the deploy process is a great way to get egg on your face. Because the process is built for distributing large code distributions or artifacts; user experience gets changed only as a side effect.

So how should you change user experience?

By using feature flags.

Feature flags: the solution to many of life’s software’s problems

You should deploy your code continuously throughout the day or week. But you should wrap any large, user-visible behavior changes behind a feature flag, so you can release that code by flipping a flag.

This enables you to develop safely without worrying about what your users see. It also means that turning a feature on and off no longer requires a diff, a code review, or a deploy. Changing user experience is no longer an engineering task at all.

Deploys are an engineering task, but releases can be done by product managers—even marketing teams. Instead of trying to calculate when to begin deploying by working backwards from 10:00 a.m., you simply flip the switch at 10:00 a.m.

Testing in production, progressive delivery

The benefits of decoupling deploys and releases extend far beyond timely launches. Feature flags are a critical tool for apostles of testing in production (spoiler alert: everybody tests in production, whether they admit it or not; good teams are aware of this and build tools to do it safely). You can use feature flags to do things like:

Enable the code for internal users only
Show it to a defined subset of alpha testers, or a randomized few
Slowly ramp up the percentage of users who see the new code gradually. This is super helpful when you aren’t sure how much load it will place on a backend component
Build a new feature, but only turning it on for a couple “early access” customers who are willing to deal with bugs
Make a perf improvement that should be bulletproof logically (and invisible to the end user), but safely. Roll it out flagged off, and do progressive delivery starting with users/customers/segments that are low risk if something’s fucked up
Doing something timezone-related in a batch process, and testing it out on New Zealand (small audience, timezone far away from your engineers in PST) first

Allowing beta testing, early adoption, etc. is a terrific way to prove out concepts, involve development partners, and have some customers feel special and extra engaged. And feature flags are a veritable Swiss Army Knife for practicing progressive delivery.

It becomes a downright superpower when combined with an observability tool (a real one that supports high cardinality, etc.), because you can:

Break down and group by flag name plus build id, user id, app id, etc.
Compare performance, behavior, or return code between identical requests with different flags enabled
For example, “requests to /export with flag “USE_CACHING” enabled are 3x slower than requests to /export without that flag, and 10% of them now return ‘402’”

It’s hard to emphasize enough just how powerful it is when you have the ability to break down by build ID and feature flag value and see exactly what the difference is between requests where a given flag is enabled vs. requests where it is not.

It’s very challenging to test in production safely without feature flags; the possibilities for doing so with them are endless. Feature flags are a scalpel, where deploys are a chainsaw. Both complement each other, and both have their place.

“But what about long-lived feature branches?”

Long-lived branches are the traditional way that teams develop features, and do so without deploying or releasing code to users. This is a familiar workflow to most developers.

But there is much to be said for continuously deploying code to production, even if you aren’t exposing new surface area to the world. There are lots of subterranean dependencies and interactions that you can test and validate all along.

There’s also something very psychologically different between working with branches. As one of our engineering directors, Jess Mink, says:

There’s something very different, stress and motivation-wise. It’s either, ‘my code is in a branch, or staging env. We’re releasing, I really hope it works, I’ll be up and watching the graphs and ready to respond,’ or ‘oh look! A development customer started using my code. This is so cool! Now we know what to fix, and oh look at the observability. I’ll fix that latency issue now and by the time we scale it up to everyone it’s a super quiet deploy.’

Which brings me to another related point. I know I just said that you should use feature flags for shipping user-facing stuff, but being able to fix things quickly makes you much more willing to ship smaller user-facing fixes. As our designer, Sarah Voegeli, said:

With frequent deploys, we feel a lot better about shipping user-facing changes via deploy (without necessarily needing a feature flag), because we know we can fix small issues and bugs easily in the next one. We’re much more willing to push something out with a deploy if we know we can fix it an hour or two later if there’s an issue.

Everything gets faster, which instills more confidence, which means everything gets faster. It’s an accelerating feedback loop at the heart of your sociotechnical system.

“Great idea, but this sounds like a huge project. Maybe next year.”

I think some people have the idea that this has to be a huge, heavyweight project that involves signing up for a SaaS, forking over a small fortune, and changing everything about the way they build software. While you can do that—and we’re big fans/users of LaunchDarkly in particular—you don’t have to, and you certainly don’t have to start there.

As Mike Terhar from our customer success team says, “When I build them in my own apps, it’s usually just something in a ‘configuration’ database table. You can make a config that can enable/disable, or set a scope by team, user, region, etc.”

You don’t have to get super fancy to decouple deploys from releases. You can start small. Eliminate some pain today.

In conclusion

Decoupling your deploys and releases frees your engineering teams to ship small changes continuously, instead of sitting on branches for a dangerous length of time. It empowers other teams to own their own roadmaps and move at their own pace. It is better, faster, safer, and more reliable than trying to use deploys to manage user-facing changes.

If you don’t have feature flags, you should embrace them. Do it today! 🌈

Every Achievement Has A Denominator

October 3, 2022July 14, 2023 mipsytipsyculture, engineering, management, productivity5 Comments

One of the classic failure modes of management is the empire-builder — the managers who measure their own status, rank or value by the number of teams and people “under” them.

Everyone knows you aren’t supposed to do this, but most of us secretly, sheepishly do it anyway to some extent. After all, it’s not untrue — the more teams and people that roll up to you, the wider your influence and the more impact you have on more people, by definition.

The other reason is, well, it’s what we’ve got. How else are we supposed to gauge our influence and impact, or our skill as a leader? We don’t really have any other language, metrics or metaphors readily available to us. 😖

Well… Here’s one:

✨Every achievement has a denominator✨

Organization size can be a liability

Let’s say you have 1,000 people in your org and you collectively achieve something remarkable. Good for you!

What if you achieved the same thing with 10,000 people, instead? What would that say about your leadership?

What if you achieved the same thing with 100 people?

Or even 10 people?

Lots of people take pride in their ability to manage large organizations. And with thousands of people in your org, you kinda better do something fucking great. But what if you instead took pride in your ability to deliver outsize results with a small denominator?

What if comp didn’t automatically bloat with the size of your org, but rather the impact of your work divided by the number of contributors — rewarding leaders for leaner teams, not larger ones?

Bigness itself is costly. There’s the cost of the engineers, managers, product and designers etc, of course. But the bigger it gets, the more coordination costs are incurred, which are the worst costs of all because they do not accrue to any user benefit — and often lead to lack of focus and product surface sprawl.

Constraints fuel creativity. Having “enough” engineers for a project is usually a terrible idea; you want to be constrained, you want to have to make hard decisions about where to spend your time and where to invest development cycles.

More often than not, scope is the enemy

As Ben Darfler wrote earlier this year about our approach to engineering levels at Honeycomb:

There are times when broad scope may be unavoidable, but at Honeycomb, we try to cultivate a healthy skepticism toward scope. More often than not, scope is the enemy. We would rather reward engineers who find clever ways to limit scope by decomposing problems in both time and size. We also want to reward engineers who work on the most important problems for the business, regardless of the size of the project. We don’t want to reward people for gaming out their work based on what will get them promoted.

The same is true for engineering managers, directors and VPs. We would rather reward them for getting things done with small, nimble teams, not for empire building and team sprawl. We want to reward them for working on the most important problems for the business, regardless of what size their teams are.

What was the denominator of the last big project you landed? Could you have done it with fewer people? How will you apply those learnings to the next big initiative?

Can we find more language and ways to talk about, or take pride in how efficiently we do big things? At the very least, perhaps we can start paying attention to the denominator of our achievements, and factor that into how we level and reward our leaders.

charity.

P.S. I did not invent this phrase, but I am unfortunately unable to credit the person I heard it from (a senior Googler). I simply think it’s brilliant, and so helpful.

The Future of Ops is Platform Engineering

September 30, 2022July 14, 2023 mipsytipsycareers, crossposted, culture, engineering, operations, platform engineering, sreLeave a comment

First published on 2022-09-30 at https://www.honeycomb.io/blog/future-ops-platform-engineering.

Two years ago I wrote a piece in The New Stack about the Future of Ops Careers. Towards the end, I wrote:

The reality is that jack-of-all-trades systems infrastructure jobs are slowly vanishing: the world doesn’t need thousands of people who can expertly tune postfix, SpamAssassin, and ClamAV—the world has Gmail. (…)

Building infrastructure and operational expertise used to be bundled together into a single role. But the industry is now bifurcating along an infrastructure fault line, and the overlap between infrastructure-oriented engineers and operationally-minded engineers is swiftly eroding. Engineers who love this work increasingly have a choice to make. Either you can 1) go deep on infrastructure by joining a company that does infrastructure as a service, or 2) go broad on operability by joining a company to help them do as little infrastructure as possible.

I described the second category as “operations engineering minus the infrastructure,” dedicated to evaluating and assembling a production stack of third-party platform providers, enabling software engineers to self-serve their services and own their own code in production. I said:

Your job will be to aggressively minimize the cycles your org devotes to infrastructure by finding effective ways to outsource or minimize infra labor. Your job is to NOT go deep if there is any workable alternative.
Your job will be to work cross-functionally with all the other software engineering teams, looking for ways to speed up their time to value and helping them own their own code in production.
Your job will be to move past the kludgey old models of “outsourcing” to sophisticated understandings of how and where to leverage abstractions that can radically accelerate development.

That second category I was describing now has a name. We call those teams “platform engineering.”

The fifty-year arc of software careers

In the beginning, there were people who wrote and ran software. At some point, we spun away ops skills from dev skills into two different professions, but that turned out to be a ginormous mistake, so along came DevOps to reunify them. Nowadays, ops as an independent profession is in the process of fading out. Companies are spinning down their ops teams left and right. Engineers who formerly identified as sysadmins or operations have turned into DevOps engineers, and soon there will just be “software people” again. This is the way of things.

Please note that this is NOT the same thing as saying “ops is dead,” or “ops skills are no longer valuable or needed¹.” Our systems are only getting more complex, more difficult to operate, and simultaneously more critical to life on earth, which means that operational excellence has never been more desperately needed (and if you don’t respect that, 🌈 you deserve to suffer 🌈).

The industry story of the past three to five years has been us trying to figure out how to help software engineers own their own code in production², phasing out dedicated ops teams, and aggressively outsourcing as much infrastructure as possible.

As we should. Developer cycles are the scarcest resource in your company, and you want to spend as many of those as possible on your core product: the crown jewel, the code that makes you a business. Money is cheaper than engineering cycles, and teams that are focused on their core business will always outperform teams whose focus is spread across dozens of non-revenue-generating projects. Let someone else build and run all the dependencies and adjacencies.

Before: some engineers wrote code, and some engineers ran code.

Now: all engineers write code, and all engineers run the code they write.

Platform engineering is what stands between you and darkness

When you start talking about putting software engineers on call for their own code, and generally being more involved in production, some percentage of the time you will hear back a guttural wail of despair: “You can’t expect me to know EVERYTHING about EVERYTHING!”

Quite right; we can’t. Platform engineering teams are part of the answer to this perfectly reasonable complaint. It’s not that you’re being asked to do or understand more in toto, but the distribution of labor and responsibility is shifting:

Before: some engineers wrote code, and some engineers ran code.

Now: all engineers write code, and all engineers run the code they write—but we divide the areas of responsibility by layer or function.

The emergence of a minimum viable self-serve tier

In the earliest days of a company, your first few engineers end up bootstrapping an infrastructure by reading AWS docs or blog posts, or asking a friend for recommendations to get started. They might start by setting up a managed container service, or configuring Terraform, and for a while everybody deploys and owns their own code, just as god intended.

But cognitive limits kick in pretty quickly. The maze of APIs and SDKs and components out there is simply bewildering, even for an experienced ops hand. Before long, it becomes someone’s job to make good decisions, pick a suite of compute and storage options that serve the team’s needs, and write some tooling that pulls everything into a coherent whole—which, at a minimum, lets you:

Run tests and generate new artifacts
Deploy artifacts, version them, and roll back
Instrument, monitor, and debug
Store data somewhere, manage schemas and migrations
Adjust capacity as needed
Define and commit all components (and their relationships) as code

Once these are built, it should be trivial for an engineer to come along and spin up a new service using templates and components from existing services. It should be much simpler and easier to use the blessed paths than anything else, and there should be friction if you go off the beaten path.

Congratulations! You’ve just been platformed 🎉. One of the key principles of any developer platform is that it should be easy to do the right things, and hard to do the wrong things.

The differences between platform engineering and traditional ops

Platform teams are typically staffed by engineers who are comfortable writing software. Not just scripting and automation, but writing tests and doing code reviews. Platform teams also operate much more like product development teams do, with product managers (and occasionally, designers, developer advocates, or UX researchers).

This doesn’t mean that everybody on a platform team has to have originally been a software engineer; in fact, a super common failure condition for platform teams is simply thinking all they need to do is hire software engineers to build developer tools. A strong platform team has an equally deep grounding in operations experience and software development. Individuals who are experts in both areas are fairly rare, but you can pull together a strong, well-rounded team by assembling a mix of SWEs (with some ops experience) and ops or DevOps engineers (with some software experience) and having them learn and grow from each other.

Platform teams are decidedly cloud-native; they actually mostly involve platforms built atop the cloud itself—PaaS, IaaS, everything-aaS, serverless, and so forth.

Ops/DevOps teams are oriented around managing infrastructure, often several generations of infrastructure. Their turf is everything from data centers and bare metal up through virtualization, containers, and the cloud (they aren’t so much cloud-native as cloud-enabled). They measure themselves on things like SLOs and the DORA metrics. You know they’re doing a good job if the system is up/available and users are happy.

Platform teams are oriented around providing a good experience for developers to self-serve and self-manage their code. The more swiftly and easily developers can move, the better your platform team. Operational excellence, in the platform model, is actually more the responsibility of the other engineering teams (and/or an adjacent SRE team) than that of the platform team.

Platform teams typically work higher up the stack than operations, DevOps, or SRE teams do, and they involve a great deal less infrastructure. On the contrary, platform teams are bent on paying other people to run as much shit as possible, preserving their own scarce development cycles for their core product.

Here is a somewhat tongue-in-cheek table of the similarities and differences between the archetypes.

Platform engineers vs. DevOps engineers

	Platform Engineer	Ops (or DevOps) Engineer
% of job spent writing code	> 50%	< 50%
Rest of time spent	Gathering product requirements, doing user research, architecture discussions, optimizing internal workflows, researching new tools and developer productivity ideas, reviewing other teams’ diffs for impact, performance tuning, helping other engineers own & scale their code, fixing CI/CD pipelines.	Fixing cron jobs, automating old setup docs, converting PXE/rsync to Chef/Puppet, converting Chef/Puppet to Terraform, converting VMs to containers, deploying software, debugging broken deploys, writing monitoring checks, doing retros, building out new services, pairing with software engineers to understand and debug their code, investigating weird shit, documentation, etc.
Responsible for	Enabling internal teams to self-serve their ability to run and own their code in production. Creating standard, reusable components and processes. Defining golden paths.	Infrastructure capacity planning, scaling, performance tuning, upgrading. Reliability and resiliency, SLOs and monitoring/alerting. Delivering quality experience to customers.
Builds for	Internal developer teams	Customers
Development style	Infrastructure as a product	Infrastructure as code
Works with product managers	Yes	No
Works with UX researchers or designers	Sometimes	No
Dashboards & graphs	Uses APM, observability, tracing. Cares a lot about instrumentation and OpenTelemetry.	Uses metrics, logs, dashboards; monitoring, alerting, and agent/sidecar/blackbox telemetry.
What ‘coding’ means to them	Developing new features & services, writing tests. These are (primarily) software people who do systems.	Automation, configuration, DSLs, extending and debugging existing code. These are systems people who do software.
Preferred language	Go, Rust	Python, Ruby
Time spent in Linux	Hardly any	A lot
Succeeds when	Developers can easily choose good defaults, self-serve their infra, and own their own code in production.	Infrastructure is scalable, secure, cost-effective, reliable, and customers are happy.
Native terrain	Serverless, *aaS, APIs for everything (cloud-native and above).	Instances, VMs, containers, regions, multi-cloud (everything “below,” but up to and including the cloud).
Databases	Uses hosted DBs	Runs their own, blending automation & DBA expertise
SSH	No	Yes
Shell	REPL	bash/zsh
Mantra	“Run Less Software”	“Cattle, Not Pets”

What about DevOps vs. SRE?

Countless words have been spilled on the difference between DevOps and SRE³, which I won’t rehash.

Here’s what I’ll say: DevOps, to me, feels like a relevant concept for companies that have a lot of infrastructure to wrangle. Companies that do in fact have dev teams and ops teams, or dev teams and DevOps teams (🙄), tend to have a lot of operational shit to automate, test, and run. They use config management, virtualization, and containers, often managing several generations worth of technology, possibly even down to data centers and bare metal. DevOps is for companies that have some combination of bare metal, VMs, regions, AZs, multi-cloud, networking devices, self-managed databases, etc.

DevOps is capacious. It contains multitudes. DevOps writes code, and DevOps has a fuckload of code to manage.

It is also on its way to becoming irrelevant. We are swiftly entering a post-DevOps world.

SRE, to me, feels different. I associate SRE with very large companies, where they mostly have software engineers owning their own code in production, but maybe still struggle with it a bit. SREs are often embedded within software engineering teams or product groups, and they focus a lot on, well, reliability, as the name suggests.

This means they do less infrastructure jockeying or automating (although they still do some coding). They typically have a lot to say about instrumentation, monitoring and observability, and cross-functional coordination. They run incident response and do blameless retros, and they tend to be experts at scaling.

If a company has both a DevOps team and SRE, typically I expect to see the SRE team more on the frontlines, involved with incidents, telemetry, etc., and DevOps teams more on the backburner, slinging pipes and plumbing.

Observability engineering as a case study

In the same piece I referenced earlier, I also wrote about the role of observability teams. I said they should largely no longer be running their own monitoring and graphing software in-house. Yet there is still a place for observability teams to exist: they remain a critical link between outsourced solutions and internal developer needs.

That team should write libraries, generate examples, and drive standardization; ushering in consistency, predictability, and usability. They should partner with internal teams to evaluate use cases. They should partner with your vendors as roadmap stakeholders. They might also write glue code and helper modules to connect disparate data sources and create cohesive visualizations. Basically, that team becomes an integration point between your organization and the outsourced work.

I originally wrote this about observability, but it could just as easily be used to describe platform engineering as a whole. This is the role—being the bridge between other vendors and your own core software. It’s a very high-leverage place to sit.

Ops is dead, long live ops

I’ve spent a lot of time thinking about this because we’ve had such a hard time nailing down exactly who the Honeycomb customer is. Sometimes our buyer is an ops team buying it for their SWEs, sometimes it’s SREs in the midst of an outage, sometimes it’s a VP or director of engineering, or an architect, or a CTO, or a “full stack” engineering team, or even a product manager. It is hard to form a snappy answer out of that list.

The first couple questions every new go-to-market candidate asks us are “who is your buyer?” and “how do we help them?” To which I respond with a five minute ramble where I list every above persona and each of their pain points. Hardly the concrete answer they would like to receive.

As it goes, sociotechnical trends come and go. A year ago, Christine and I were speculating that platform engineering might be on the verge of consolidating the necessary ingredients that makes up our ideal buyer:

Writing and shipping code, and needing to understand their own code
Positioned to help other teams with their instrumentation patterns and tooling
Firmly cloud-native+ and untethered to hardware or traditional infrastructure

To my delight, since that conversation, these trends have only accelerated—and I, for one, welcome our new platform engineering overlords to the observability table. ☺️

If you’d like to learn more about platform engineering, we’ll be running a Twitter space on ✨ October 20th ✨ at 12:00 p.m. PT. Come join us! I’ll be there along with two colleagues and we’ll be answering your questions and shedding more light on the topic.

¹ I do hear people saying that, and it used to make me fucking furious, but now I just smugly remind myself how much self-inflicted suffering they are in for. Disrespecting operational expertise is the shortest path to never again sleeping through the night.

² It is rather incredible how rapidly this idea has taken off. When we started talking about putting developers on call for their code in 2016, people got seriously angry with us. Before that, the only twitter mention I could find of putting devs on call was one by (of course) Adrian Cockcroft, but by 2019-2020 it had stopped being controversial and soon became common wisdom.

³ I actually wrote one of those myself: DevOps vs SRE: Delayed Coverage of the Dumbest War). LMAO. I think Liz had the final word on this back in … 2017? 2018? … when she said something like class SRE implements DevOps. And yes, DevOps is a philosophy or a methodology and not a job title, etc.

Rituals for Engineering Teams

August 15, 2022July 14, 2023 mipsytipsyculture, engineering, management, operations, rituals, startups13 Comments

Last weekend I happened to pick up a book called “Rituals For Work: 50 Ways To Create Engagement, Shared Purpose, And A Culture That Can Adapt To Change.” It’s a super quick read, more comic book than textbook, but I liked it.

It got me thinking about the many rituals I have initiated and/or participated in over the course of my career. Of course, I never thought of them as such — I thought of them as “having fun at work” 🙃 — but now I realize these rituals disproportionately contribute to my favorite moments and the most precious memories of my career.

Rituals (a definition): Actions that a person or group does repeatedly, following a similar pattern or script, in which they’ve imbued symbolism and meaning.

I think it is extremely worth reading the first 27 pages of the book — the Introduction and Part One. To briefly sum up the first couple chapters: the power of creative rituals comes from their ability to link the physical with the psychological and emotional, all with the benefit of “regulation” and intentionality. Physically going through the process of a ritual helps people feel satisfied and in control, with better emotional regulation and the ability to act in a steadier and more focused way. Rituals also powerfully increase people’s sense of belonging, giving them a stable feeling of social connection. (p. 5-6)

The thing that grabbed me here is that rituals create a sense of belonging. You show that you belong to the group by participating in the ritual. You feel like you belong to the group by participating in the ritual. This is powerful shit!

It seems especially relevant these days when so many of us are atomized and physically separated from our teammates. That ineffable sense of belonging can make all the difference between a job that you do and a role that feeds your soul. Rituals are a way to create that sense of belonging. Hot damn.

So I thought I’d write up some of the rituals for engineering teams I remember from jobs past. I would love to hear about your favorite rituals, or your experience with them (good or bad). Tell me your stories at @mipsytipsy. 🙃

Rituals at Linden Lab

Feature Fish Freeze

At Linden Lab, in the ancient era of SVN, we had something called the “Feature Fish”. It was a rubber fish that we kept in the freezer, frozen in a block of ice. We would periodically cut a branch for testing and deployment and call a feature freeze. Merging code into the branch was painful and time consuming, so If you wanted to get a feature in after the code freeze, you had to first take the fish out of the freezer and unfreeze it.

This took a while, so you would have to sit there and consider your sins as it slowly thawed. Subtext: Do you really need to break code freeze?

Stuffy the Code Reviewer

You were supposed to pair with another engineer for code review. In your commit message, you had to include the name of your reviewer or your merge would be rejected. But the template would also accept the name “Stuffy”, to confess that your only reviewer had been…Stuffy, the stuffed animal.

However if your review partner was Stuffy, you would have to narrate the full explanation of Stuffy’s code review (i.e., what questions Stuffy asked, what changes he suggested and what he thought of your code) at the next engineering meeting. Out loud.

Shrek Ears

We had a matted green felt headband with ogre ears on it, called the Shrek Ears. The first time an engineer broke production, they would put on the Ears for a day. This might sound unpleasant, like a dunce cap, but no — it was a rite of passage. It was a badge of honor! Everyone breaks production eventually, if they’re working on something meaningful.

If you were wearing the Shrek Ears, people would stop you throughout the day and excitedly ask what happened, and reminisce about the first time they broke production. It became a way for 1) new engineers to meet lots of their teammates, 2) to socialize lots of production wisdom and risk factors, and 3) to normalize the fact that yes, things break sometimes, and it’s okay — nobody is going to yell at you. ☺️

This is probably the number one ritual that everybody remembers about Linden Lab. “Congratulations on breaking production — you’re really one of us now!”

Vorpal Bunny

We had a stuffed Vorpal Bunny, duct taped to a 3″ high speaker stand, and the operations engineer on call would put the bunny on their desk so people knew who it was safe to interrupt with questions or problems.

At some point we lost the bunny (and added more offices), but it lingered on in company lore since the engineers kept on changing their IRC nick to “$name-bunny” when they went on call.

There was also a monstrous, 4-foot-long stuffed rainbow trout that was the source of endless IRC bot humor… I am just now noticing what a large number of Linden memories involve stuffed animals. Perhaps not surprising, given how many furries were on our platform ☺️

Rituals at Parse

The Tiara of Technical Debt

Whenever an engineer really took one for the team and dove headfirst into a spaghetti mess of tech debt, we would award them the “Tiara of Technical Debt” at the weekly all hands. (It was a very sparkly rhinestone wedding tiara, and every engineer looked simply gorgeous in it.)

Examples included refactoring our golang rewrite code to support injection, converting our entire jenkins fleet from AWS instances to containers, and writing a new log parser for the gnarliest logs anyone had ever seen (for the MongoDB pluggable storage engine update).

Bonfire of the Unicorns

We spent nearly 2.5 years rewriting our entire ruby/rails API codebase to golang. Then there was an extremely long tail of getting rid of everything that used the ruby unicorn HTTP server, endpoint by endpoint, site by site, service by service.

When we finally spun down the last unicorn workers, I brought in a bunch of rainbow unicorn paper sculptures and a jug of lighter fluid, and we ceremonially set fire to them in the Facebook courtyard, while many of the engineers in attendance gave their own (short but profane) eulogies.

Mission Accomplished

This one requires a bit of backstory.

For two solid years after the acquisiton, Facebook leadership kept pressuring us to move off of AWS and on to FB infra. We kept saying “no, this is a bad idea; you have a flat network, and we allow developers all over the world to upload and execute random snippets of javascript,” and “no, this isn’t cost effective, because we run large multi-terabyte MongoDB replica sets by RAIDing together multiple EBS volumes, and you only have 2.5TB FusionIO (for extremely high-perf mysql/RocksDB) and 40 TB spinning rust volumes (for Hadoop), and also it’s impossible to shrink or slice up replsets”, and so forth. But they were adamant. “You don’t understand. We’re Facebook. We can do anything.” (Literal quote)

Finally we caved and got on board. We were excited! I announced the migration and started providing biweekly updates to the infra leadership groups. Four months later, when the migration was half done, I get a ping from the same exact members of Facebook leadership:

“What are you doing?!?”
“Migrating!”
“You can’t do that, there are security issues!”
“No it’s fine, we have a fix for it.”
“There are hardware issues!”
“No it’s cool, we got it.”
“You can’t do this!!!”

ANYWAY. To make an EXTREMELY long and infuriating story short, they pulled the plug and canned the whole project. So I printed up a ten foot long “Mission Accomplished” banner (courtesy of George W Bush on the aircraft carrier), used Zuck’s credit card to buy $800 of top-shelf whiskey delivered straight to my desk (and cupcakes), and we threw an angry, ranty party until we all got it out of our systems.

Blue Hair

I honestly don’t remember what this one was about, but I have extensive photographic evidence to prove that I shaved the heads of and/or dyed the hair blue of at least seven members of engineering. I wish I could remember why! but all I remember is that it was fucking hilarious.

In Conclusion

Coincidentally (or not), I have no memories of participating in any rituals at the jobs I didn’t like, only the jobs I loved. Huh.

One thing that stands out in my mind is that all the fun rituals tend to come bottoms-up. A ritual that comes from your VP can run the risk of feeling like forced fun, in a way it doesn’t if it’s coming from your peer or even your manager. I actually had the MOST fun with this shit as a line manager, because 1) I had budget and 2) it was my job to care about teaminess.

There are other rituals that it does make sense for executives to create, but they are less about hilarious fun and more about reinforcing values. Like Amazon’s infamous door desks are basically just a ritual to remind people to be frugal.

Rituals tend to accrue mutations and layers of meaning as time goes on. Great rituals often make no sense to anybody who isn’t in the know — that’s part of the magic of belonging. 🥰

Now, go tell me about yours!

charity

Live Your Best Life With Structured Events

August 15, 2022July 14, 2023 mipsytipsyengineering, events, logs, metrics, monitoring, traces6 Comments

If you’re like most of us, you learned to debug as a baby engineer by way of printf(3). By the time you were shipping code to production you had probably learned to instrument your code with a real metrics library. Maybe a tenth of us learned to use gdb and still step through functions on the regular. (I said maybe.)

Printing stuff to stdout is still the Swiss Army knife of tools. Always there when you reach for it, usually helps more than it does harm. (I said usually.)

And then! In case you’ve been living under a rock, we recently went and blew up ye aulde monolythe, and in the process we … lost most of our single-process tools and techniques for debugging. Forget gdb; even printf doesn’t work when you’re hopping the network between functions.

If your tool set no longer works for you, friend, it’s time to go all in. Maybe what you wanted was a faster horse, but it’s time for a car, and the sooner you turn in your oats for gas cans and a spare tire, the better.

Exercising Good Technical Judgment (When You Don’t Have Any)

If you’re stuck trying to debug modern problems with pre-modern tooling, the first thing to do is stop digging the hole. Stop pushing good data after bad into formats and stores that aren’t going to help you answer the right questions.

In brief: if you aren’t rolling out a solution based on arbitrarily wide, structured raw events that are unique and ordered and trace-aware and without any aggregation at write time, you are going to regret it. (If you aren’t using OpenTelemetry, you are going to regret that, too.)

So just make the leap as soon as possible.

But let’s rewind a bit. Let’s start with observability.

Observability: an introduction

Observability is not a new word or concept, but the definition of observability as a specific technical term applied to software engineering is relatively new — about four years old. Before that, if you heard the term in softwareland it was only as a generic synonym for telemetry (“there are three pillars of observability”, in one annoying formulation) or team names (twitter, for example, has long had an “observability team”).

The term itself originates with control theory:

“In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals. The concept of observability was introduced by Hungarian-American engineer Rudolf E. Kálmán for linear dynamic systems.[1][2]”

But when applied to a software context, observability refers to how well you can understand and reason about your systems, just by interrogating them and inspecting their outputs with your tools. How well can you understand the inside of the system from the outside?

Achieving this relies your ability to ask brand new questions, questions you have never encountered and never anticipated — without shipping new code. Shipping new code is cheating, because it means that you knew in advance what the problem was in order to instrument it.

But what about monitoring?

Monitoring has a long and robust history, but it has always been about watching your systems for failures you can define and expect. Monitoring is for known-unknowns, and setting thresholds and running checks against the system. Observability is about the unknown-unknowns. Which requires a fundamentally different mindset and toolchain.

“Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.” — @grepory, in his talk “Monitoring is Dead“.

Monitoring is a third-person perspective on your software. It’s not software explaining itself from the inside out, it’s one piece of software checking up on another.

Observability is for understanding complex, ephemeral, dynamic systems (not for debugging code)

You don’t use observability for stepping through functions; it’s not a debugger. Observability is for swiftly identifying where in your system the error or problem is coming from, so you can debug it — by reproducing it, or seeing what it has in common with other erroring requests. You can think of observability as being like B.I. (business intelligence) tooling for software applications, in the way you engage in a lot of exploratory, open-ended data sifting to detect novel patterns and behaviors.

Observability is often about swiftly isolating or tracking down the problem in your large, sprawling, far-flung, dynamic system. Because the hard part of distributed systems is rarely debugging the code, it’s figuring out where the code you need to debug is.

The need for observability is often associated with microservices adoption, because they are prohibitively difficult to debug without service-level event oriented tooling — the kind you can get from Honeycomb and Lightstep.. and soon, I hope, many other vendors.

Events are the building blocks of observability

Ergh, another overloaded data term. What even is an “event”?

An observability “event” is a hop in the lifecycle of an end-to-end request. If a request executes code on three services separated by network hops before returning to the user, that request generated three observability “events”, each packed with context and details about that code running in that environment. These are also sometimes called “canonical log lines“. If you implemented tracing, each event may be a span in your trace.

If request ID #A897BEDC hits your edge, then your API service, then four more internal services, and twice connects to a db and runs a query, then request ID #A897BEDC generated 8 observability events … assuming you are in fact gathering observability data from the edge, the API, the internal services and the databases.

This is an important caveat. We only gather observability events from services that we can and do introspect. If it’s a black box to us, that hop cannot generate an observability event. So if request ID #A897BEDC also performed 20 cache lookups and called out to 8 external HTTP services and 2 managed databases, those 30 hops do not generate observability events (assuming you haven’t instrumented the memcache service and have no instrumentation from those external services/dbs). Each request generates one event per request per service hop.**

(I also wrote about logs vs structured events here.)

Observability is a first-person narrative.

We care primarily about self-reported status from the code as it executes the request path.

Instrumentation is your eyes and ears, explaining the software and its environment from the perspective of your code. Monitoring, on the other hand, is traditionally a third-person narrative — it’s one piece of software checking up on another piece of software, with no internal knowledge of its hopes and dreams.

First-person narrative reports have the best potential for telling a reliable narrative. And more importantly, they map directly to user experience in a way that third-party monitoring does not and cannot.

Events … must be structured.

First, structure your goddamn data. You’re a computer scientist, you’ve got no business using text search to plow through terabytes of text.

Events … are not just structured logs.

Now, part of the reason people seem to think structured data is cost-prohibitive is that they’re doing it wrong. They’re still thinking about these like log lines. And while you can look at events like they’re just really wide structured log lines that aren’t flushed to disk, here’s why you shouldn’t: logs have decades of abhorrent associations and absolutely ghastly practices.

Instead of bundling up and passing along one neat little pile of context, they’re spewing log lines inside loops in their code and DDoS’ing their own logging clusters.They’re shitting out “log lines” with hardly any dimensions so they’re information-sparse and just straight up wasting the writes. And then to compensate for the sparseness and repetitiveness they just start logging the same exact nouns tens or hundreds of times over the course of the request, just so they can correlate or reconstruct some lousy request that they never should have blown up in the first place!

But they keep hearing they should be structuring their logs, so they pile structure on to their horrendous little strings, which pads every log line by a few bytes, so their bill goes up but they aren’t getting any benefit! just paying more! What the hell, structuring is bull shit!

Kittens. You need a fundamentally different approach to reap the considerable benefits of structuring your data.

But the difference between strings and structured data is ~basically the difference between grep and all of computer science. 😛

Events … must be arbitrarily wide and dense with context.

So the most effective way to structure your instrumentation, to get the absolute most bang for your buck, is to emit a single arbitrarily wide event per request per service hop. At Honeycomb, the maturely instrumented datasets that we see are often 200-500 dimensions wide. Here’s an event that’s just 20 dimensions wide:

{ 

   "timestamp":"2018-11-20 19:11:56.910",
   "az":"us-west-1",
   "build_id":"3150",
   "customer_id":"2310",
   "durationMs":167,
   "endpoint":"/api/v2/search",
   "endpoint_shape":"/api/v2/search",
   "fraud_dur":131,
   "hostname":"app14",
   "id":"f46691dfeda9ede4",
   "mysql_dur":"",
   "name":"/api/v2/search",
   "parent_id":"",
   "platform":"android",
   "query":"",
   "serviceName":"api",
   "status_code":200,
   "traceId":"f46691dfeda9ede4",
   "user_id":"344310",
   "error_rate":0,
   "is_root":"true"
}

So a well-instrumented service should have hundreds of these dimensions, all bundled around the context of each request. And yet — and here’s why events blow the pants off of metrics — even with hundreds of dimensions, it’s still just one write. Adding more dimensions to your event is effectively free, it’s still one write plus a few more bits.

Compare this to a metric-based systems, where you are often in the position of trying to predict whether a metric will be valuable enough to justify the extra write, because every single metric or tag you add contributes linearly to write amplification. Ever gotten billed tens of thousands of dollars for your custom metrics, or had to prune your list of useful custom metrics down to something affordable? (“BUT THOSE ARE THE ONLY USEFUL ONES!”, as every ops team wails)

Events … must pass along the blob of context as the request executes

As you can imagine, it can be a pain in the ass to keep passing this blob of information along the life of the request as it hits many services and databases. So at Honeycomb we do all the annoying parts for you with our integrations. You just install the go pkg or ruby gem or whatever, and under the hood we:

initialize an empty debug event when the request enters that service
prepopulate the empty debug event with any and all interesting information that we already know or can guess. language type, version, environment, etc.
create a framework so you can just stuff any other details in there as easily as if you were printing it out to stdout
pass the event along and maintain its state until you are ready to error or exit
write the extremely wide event out to honeycomb

Easy!

(Check out this killer talk from @lyddonb on … well everything you need to know about life, love and distributed systems is in here, but around the 12:00 mark he describes why this approach is mandatory. WATCH IT. https://www.youtube.com/watch?v=xy3w2hGijhE&feature=youtu.be)

Events … should collect context like sticky buns collect dust

Other stuff you’ll want to track in these structured blobs includes:

Metadata like src, dst headers
The timing stats and contents of every network call (our beelines wrap all outgoing http calls and db queries automatically)
Every raw db query, normalized query family, execution time etc
Infra details like AZ, instance type, provider
Language/environment context like $lang version, build flags, $ENV variables
Any and all unique identifying bits you can get your grubby little paws on — request ID, shopping cart ID, user ID, request ID, transaction ID, any other ID … these are always the highest value data for debugging.
Any other useful application context. Service name, build id, ordering info, error rates, cache hit rate, counters, whatever.
Possibly the system resource state at this point in time. e.g. values from /proc/net/ipv4 stats

Capture all of it. Anything that ever occurs to you (“this MIGHT be handy someday”) — don’t even hesitate, just throw it on the pile. Collect it up in one rich fat structured blob.

Events … must be unique, ordered, and traceable

You need a unique request ID, and you need to propagate it through your stack in some way that preserves sequence. Once you have that, traces are just a beautiful visualization layer on top of your shiny event data.

Events … must be stored raw.

Because observability means you need to be able to ask any arbitrary new question of your system without shipping new code, and aggregation is a one-way trip. Once you have aggregated your data and discarded the raw requests, you have destroyed your ability to ask new questions of that data forever. For Ever.

Aggregation is a one-way trip. You can always, always derive your pretty metrics and dashboards and aggregates from structured events, and you can never go in reverse. Same for traces, same for logs. The structured event is the gold standard. Invest in it now, save your ass in the future.

It’s only observability if you can ask new questions. And that means storing raw events.

Events…are richer than metrics

There’s always tradeoffs when it comes to data. Metrics choose to sacrifice context and connective tissue, and sometimes high cardinality support, which you need to correlate anomalies or track down outliers. They have a very small, efficient data format, but they sacrifice everything else by discarding all but the counter, gauge, etc.

A metric looks like this, by the way.

{ metric: "db.query.time", value: 0.502, tags: Array(), type: set }

That’s it. It’s just a name, a number and maybe some tags. You can’t dig into the event and see what else was happening when that query was strangely slow. You can never get that information back after discarding it at write time.

But because they’re so cheap, you can keep every metric for every request! Maybe. (Sometimes.) More often, what happens is they aggregate at write time. So you never actually get a value written out for an individual event, it smushes everything together that happens in the 1 second interval and calculates some aggregate values to write out. And that’s all you can ever get back to.

With events, and their relative explosion of richness, we sacrifice our ability to store every single observability event about every request. At FB, every request generated hundreds of observability events as it made its way through the stack. Nobody, NOBODY is going to pay for an o11y stack that is hundreds of times as large as production. The solution to that problem is sampling.

Events…should be sampled.

But not dumb, blunt sampling at server side. Control it on the client side.

Then sample heavily for events that are known to be common and useless, but keep the events that have interesting signal. For example: health checks that return 200 OK usually represent a significant chunk of your traffic and are basically useless, while 500s are almost always interesting. So are all requests to /login or /payment endpoints, so keep all of them. For database traffic: SELECTs for health checks are useless, DELETEs and all other mutations are rare but you should keep all of them. Etc.

You don’t need to treat your observability metadata with the same care as you treat your billing data. That’s just dumb.

… To be continued.

I hope it’s now blazingly obvious why observability requires — REQUIRES — that you have access to raw structured events with no pre-aggregation or write-time rollups. Metrics don’t count. Just traces don’t count. Unstructured logs sure as fuck don’t count.

Structured, arbitrarily wide events, with dynamic sampling of the boring parts to control costs. There is no substitute.

For more about the technical requirements for observability, read this, this, or this.

**The deep fine print: it’s one observability event per request per service hop … because we gather observability detail organized by request id. Databases may be different. For example, with MongoDB or MySQL, we can’t instrument them to talk to honeycomb directly, so we gather information about its internal perspective by 1) tailing the slow query log (and turning it up to log all queries if perf allows), 2) streaming tcp over the wire and reconstructing transactions, 3) connecting to the mysql port as root every couple seconds from cron, then dumping all mysql stats and streaming them in to honeycomb as an event. SO. Database traffic is not organized around connection length or unique request id, it is organized around transaction id or query id. Therefore it generates one observability event per query or transaction.

In other words: if your request hit the edge, API, four internal services, two databases … but ran 1 query on one db and 10 queries on the second db … you would generate a total of *19 observability events* for this request.

For more on observability for databases and other black boxes, try this blog post.

The Truth About “MEH-TRICS”

April 13, 2022January 10, 2024 mipsytipsycrossposted, engineering, logs, metrics, monitoringLeave a comment

First published on 2022-04-13 at https://www.honeycomb.io/blog/truth-about-meh-trics-metrics

A long time ago, in a galaxy far, far away, I said a lot of inflammatory things about metrics.

“Metrics are shit salad.”

“Metrics are simply nerfed dimensions.”

“Metrics suck,” “metrics are legacy,” “metrics and time series aggregates will fucking kneecap you.”

I cannot tell a lie; Twitter will testify that I’ve spent the past six years ragging on metrics. So much so that ever since we launched Honeycomb Metrics last year, our poor solution architects have been encountering skeptics in the field who repeat my quotes back to them and ask, dubiously, whether Honeycomb Metrics are any good or not, and whether we genuinely plan on investing in it or not, given our known anti-metrics sympathies.

That’s a great question. 😊

Metrics aren’t worthless; they’re just limited.

Metrics are a mature technology that’s been around for over 30 years, and they have some real advantages. They’re tiny, fast, and cheap; you can hold a bunch of them in memory as counters, summaries, and gauges. They aggregate well and take up a fixed amount of storage space. The entire monitoring industry is built on top of metrics.

When it comes to workloads like, “How heavy is the write load on my hard drive?” or “What is the temperature or fan status inside my chassis?” or “What is the traffic rate in and out of this interface on my switch?” metrics are what you should use. In fact, pretty much any time you want to know the health of a system or component in toto, metrics are the right tool.

Because that’s what metrics do best—report statistics in aggregate, from the perspective of any system or component. They can tell you that your Ruby HTTP worker pool is 70% utilized or that your nginx webserver is returning 502s 1% of the time. What they can’t tell you is what this means for any one of your users, applications, delivery vehicles, and so forth.

Until recently, metrics-based tools or logs were the only game in town. People were trying to sell us metrics tools for observability use cases, and that’s what got my goat so badly. If you simply append “… for observability” to each of my inflammatory statements, then I stand by them completely.

“Metrics are shit salad … for observability.”

Yup, rings true.

You’re never going to make a metrics tool like Prometheus or Datadog into an observability tool. You’re just not. Observability is about unknown-unknowns, while metrics are a tool for known-unknowns.

If you need a refresher on the differences between observability and monitoring, I’ll refer you to pieces like this, this, and this. What I want to talk about here is slightly different. In a post-observability world, what is the true and proper place for metrics tooling?

Metrics and observability have different use cases.

Metrics aren’t completely useless, even if you have a robust observability presence. We still use metrics at Honeycomb to this day for certain workloads—and always will because they’re the right tool for the job.

There are two kinds of workloads, roughly speaking: your code—the code you write, review, ship, debug and maintain on a daily basis. And other people’s code—the code you have to run and use in order to support your code. Some examples of the latter might be: Linux, Docker, MySql, Amazon RDS, Kafka, AWS Lambda, GCP gateways, memcache, CI/CD pipelines, Kubernetes, etc.

Your code is your crown jewels, the code you need to survive and succeed as a business. It changes constantly—many times per week, if not per day. You are expected to understand its inner workings intimately, and spend lots of time chasing down bugs or understanding and reproducing behavior. You care about the way it performs and interacts with each and every individual user, with changing infrastructure state, and under a variety of different load conditions.

That is why your code demands observability. In order to understand your software, you must first instrument it, in a way that collects lots of rich context and bundles it up around each event end-to-end. Then you need to stream those events into a tool that lets you slice and dice and trace and explore with support for high-cardinality and high-dimensionality data. That’s the only way you’re going to be able to correlate errors, track down outliers, and reflect each user’s experience.

But what about the rest of the software? You can’t instrument Amazon RDS, and only crazy people would instrument, rebuild, and repackage things like Kafka or Docker or nginx. The whole point of third-party software is that you DON’T USE IT until it’s stable enough to be taken more or less for granted. Sure, you roll updates, but usually on the order of months or years—not every day. You don’t need to be intimately familiar with its inner workings because you aren’t changing it every day. Those aren’t your crown jewels.

You do care about their health though, only differently. You care about whether you need to provision more capacity or not. You care about knowing how hard you’re hammering on the underlying hardware or hypervisor. That’s why metrics and monitoring are the right tools to use for third-party code. They don’t let you peer under the hood in the same way, or slice and dice in the same way, but that’s okay. You shouldn’t have to.

With third-party stuff, you don’t care about the code, you care about the health of the service. In aggregate.

(There are some kinds of in-between software, like databases, where event-level information is super useful for debugging things like slow queries and lock percentages, and you can use various black box techniques to approximate observability without instrumentation. But in general this model holds up quite well.)

In a post-observability world, what are metrics for?

I’ve often pointed out that observability is built on top of arbitrarily wide structured data blobs, and that metrics, logs, and traces can be derived from those blobs while the reverse is not true—you can’t take a bunch of metrics and reformulate a rich event.

And yes, people who have observability typically find themselves using metrics and dashboards less and less. They’re simply not as versatile or useful as events that you can slice and dice and manipulate in infinite ways. And you can derive aggregates and trends from the events you have stored.

But metrics will always be useful for understanding third-party software, from the perspective of the service, cluster, or node. They will always be the right tool for the job when it comes to software interfacing with hardware. And they can be super complementary when you are investigating your code using events and instrumentation.

If you’re an engineer writing and shipping code, you’re never not going to want to know if your change caused memory usage to triple, or CPU utilization to skyrocket, or disk usage or network throughput to saturate. That’s why we built Honeycomb Metrics as an overlay, a way to enhance or validate your understanding of the impact your code changes have had on the underlying system.

Metrics are also valuable as a bridge to the past. People have been instrumenting software for metrics for 30 years—they’re never going away completely, and not everything can or should be reinstrumented with events. Lots of people already have robust monitoring systems that slurp in millions of metrics. Nobody wants to have to redo all that work just because they’re moving to a different tool, so people tend to point their metrics firehose at Honeycomb as a way of getting started as they roll observability out into their code.

Twin Anxieties of the Engineer/Manager Pendulum

March 24, 2022July 14, 2023 mipsytipsycareers, culture, engineering, management, pendulum, tech culture7 Comments

I have written a lot about the pendulum swing between engineering and management, so I often hear from people who are angsting about the transition.

A quick recap of the relevant posts:

There are two anxieties I hear people express above all the rest.

The first one is something I hear over and over again, particularly from first-time managers as they contemplate the possibility of leaving management and returning to IC (individual contributor) work as an engineer:

“What if I never get another shot at people management?”
“Maybe this is the only chance I’ll ever get … and I’m about to give it up??”
“Am I going to regret this?”

“Will I ever get another shot at management?”

People decide to go back to engineering for lots of reasons. Maybe they’re burned out, or they work someplace with a poisonous management culture, or they’re having a kid and want to return to a role that feels more comfortable for a while. Or maybe they’ve been managing teams for a few years now, and have decided it’s time to go back to the well and refresh their technical skills in the interest of their long-term employability.

Regardless, these are not typically people who disliked being a manager. Rather they tend to be engineers who really enjoyed people management, and find it bittersweet to give up. Maybe they will miss the strategic elements and roadmap work, but they’re excited to clear their calendar and spend time in flow again, or they will miss having 1x1s but can’t wait to have time to mentor people. Whatever. They want to manage teams again someday, and worry they won’t get another chance.

Their anxiety is understandable! Lots of people feel like they waited a long time to be tapped for management, or like they were passed over again and again. Our cultural scripts about management definitely contribute to this sense of scarcity and diminution of agency (i.e. that management is a promotion, it is bestowed on you by your “superiors” as a reward for your performance, and it is pushy or improper to openly seek the role for yourself).

This anxiety is also, in my experience, ridiculously misplaced. ☺️

Once a manager, marked for life as a manager

You may have struggled to get your first opportunity to manage a team. But it’s a whole different story once you’ve done the job. Now you have the skills and the experience, and people can smell it on you.

I’m not joking. If you’re a good manager it’s actually nearly impossible to hide that you have the skills, because of the way it infuses your work and everything that you do as an IC. You get better at prioritization, more attuned to the needs of the business, and restless about work that doesn’t materially move the business forward. You get better at asking questions about why things need to be done and at communicating with stakeholders. You get better at motivating the people you work with, understanding their motivations and your own, and mediating conflicts or putting a damper on drama between peers. People come to you for advice and may seem to just do what you say, or go where you point.

Senior engineers with management experience are worth their weight in gold. They are valuable contributors and influential teammates. It’s a palpable shift! And every experienced manager in their vicinity will sense it.

So yes, you will be tapped for management again. And again and again and again. You are more likely to spend the rest of your career fending off management “opportunities” with a baseball bat than you are to wither away, pining for another shot.

There is a chronic shortage of good engineering managers, just like there is a chronic shortage of good, empathetic managers in every line of work. The challenge you will face from now on will not be about getting the chance to manage a team, but about being intentional and firm in carving out the time you need to recover and recharge your skills as an engineer.

“Am I too rusty to go back to engineering?”

The second anxiety is in some ways a mirror of the first:

“Can I still perform as an engineer?”
“Will anyone hire me for an engineering role?”
“Has it been too long, am I too rusty, will I be able to pull my weight?”

This is a more materially valid concern than the first one, in my opinion. Your engineering skills do wither and erode as time goes on. It will take longer and longer to refresh your skills the longer you go without using them. Management skills don’t decay in the same way that technical ones do, nor do they go out of date every few years as languages, frameworks and technologies tend to do.

If you aren’t interested in climbing the ladder and becoming a director or VP — or rather, if you aren’t actively, successfully climbing the ladder — you should have a strategy for keeping your hands-on skills sharp, because your ability to be a strong line manager is grounded in your own engineering skills.

Never, ever accept a managerial role until you are already solidly senior as an engineer. To me this means at least seven years or more writing and shipping code; definitely, absolutely no less than five. It may feel like a compliment when someone offers you the job of manager — hell, take the compliment 🙃 — but they are not doing you any favors when it comes to your career or your ability to be effective.

When you accept your first manager job, I think you should make a commitment to yourself to stick it out for two years. That’s how long it takes to rewire your instincts and synapses, to learn enough that you can tell whether you’re doing a good job or not.

After two or three years of management, it’s still pretty easy to go back to engineering. After five years, it gets progressively harder. But it can be done. And it should be worth it to your employer to invest in keeping you while you refresh your skills over the six months or whatever it may take. Insist on it, if you must. It’s better to refresh your skills while employed, on a system and codebase you’re familiar with, than to find yourself struggling to brush up enough to pass a coding interview.

Engineering fluency == job security

There is one more reason to refresh your engineering skills from time to time, one I don’t often see mentioned, and that is job security and optionality.

The higher you go up the ladder, the more money you will get paid…but the fewer jobs there be, and the fewer still that match your profile.

As a senior software engineer, there are fifteen bajillion job openings for you. Everyone wants to hire you. You can get a new job in a matter of days, no matter how picky you want to be about location, flexibility, technologies, product types, whatever. You’ve reached Peak Hire.

If you are looking for management roles, there will be an order of magnitude fewer opportunities (and more idiosyncratic hiring criteria), but still plenty for the most part. But for every step up the ladder you go, the opportunities drop by another order of magnitude, and the scrutiny becomes much more intense and particular. If you’re looking for VP roles, it may take months to find a place you want to work at, and then they might not choose you. ¯\_(ツ)_/¯

Maintaining your technical chops is a stellar way to hedge against uncertainties and maintain your optionality.

Software deploys and cognitive biases

August 27, 2021July 14, 2023 mipsytipsycontinuous deployment, deploys, engineering, friday deploys, operations, sre1 Comment

There exist some wonderful teams out there who have valid, well thought through, legitimate reasons for enforcing “NO FRIDAY DEPLOYS” week in and week out, for not hooking CI/CD up to autodeploy, and for not shipping one person’s changes at a time.

And then there are the reasons most people have.

Bad decisions, and the biases they came from

“It is more dangerous to deploy than not to deploy” (prevention bias, zero-risk bias)
“It’s always riskier to do something than to not do something” (omission bias)
“Deploys are scary, so we need to slow down and be careful.”(slow motion bias)
“We’ve always done it this way; it works fine, you’re exaggerating, it doesn’t slow us down like those stories you tell.” (plan continuation bias, mere-exposure effect, time-saving bias, status-quo bias, default effect)
“Maybe that works for some teams, like baby startups, but not real software systems that people rely on” (false-uniqueness bias)
“We’ve already invested a ton of engineering hours into building a deployment framework that doesn’t ship especially fast and doesn’t ship one change at a time, but it works, and we don’t want to have to redo everything from scratch.” (surrogation, ostrich effect, irrational escalation, ikea-effect, law of the instrument)
“I get why it’s important, but the most important thing right now is for us to ship all of these features. At some point we’ll have the spare time to fix our deploys.” (hyperbolic discounting)
“We tried it once, and the whole site went down. Never again.” (non-adaptive choice switching, selective perception)
“Everybody says that no Friday deploys is the safe, sane, and caring choice.” (illusory truth effect, surrogation, continued influence effect, conservatism bias)
“Deploys are just inherently scary; there’s nothing that can be done about that. You should do them sparingly and with someone monitoring it closely.” (availability bias, dread aversion, functional fixedness)
“The best way to protect people’s personal time is to let merged code pile up between Thursday night and Monday morning, and ship all at once.” (continued-influence effect, plan-continuation bias, pseudocertainty effect, zero-risk bias)
“There is nothing anyone could say or do to convince me that the best thing I can do as a manager to protect their weekends is not blocking Friday deploys. I just feel it in my gut. I just know.” (… I got nuthin)

We’re humans. 💜 We leap to conclusions with the wetware we have doing the best it can based on heuristics that feel objectively true, but are ultimately just emotional reactions based on past lived experience. And then we retroactively enshrine those goofy gut feelings with the language of noble motive and moral values.

“I tell people not to deploy to production … because I care so deeply about my team and their ability to have a quiet weekend.”

Barf. 🙄 That’s just like saying you tell your kid not to brush his teeth at night, because you care SO DEEPLY about him and his ability to go to bed calm and happy.

Once the retcon engine in your brain gets running, it comes up with all sorts of reasons. Plausible-sounding reasons! But every single argument of the items in the list above is materially false.

Deploy myths are never going away for good; they appeal to too many of our cognitive biases. But what if there was one simple thing you could do that would invert many of these cognitive biases and cause people to grapple with the question in a new way? What if you could kickstart a recalculation?

My next post will pick up right here. I’ll tell you all about the One Simple Trick you can do to fix your deploys and set you on the virtuous path of high-performing teams.

Til then, here’s what I’ve previously written on the topic.

https://charity.wtf/2018/08/19/shipping-software-should-not-be-scary/

https://charity.wtf/2019/05/01/friday-deploy-freezes-are-exactly-like-murdering-puppies/

https://charity.wtf/2019/10/28/deploys-its-not-actually-about-fridays/

https://charity.wtf/2021/02/19/how-much-is-your-fear-costing-you/

https://charity.wtf/2020/12/31/why-are-my-tests-so-slow-a-list-of-likely-suspects-anti-patterns-and-unresolved-personal-trauma/

Footnotes

Availability bias: The tendency to overestimate the likelihood of events with greater “availability” in memory, which can be influenced by how recent the memories are or how unusual or emotionally charged they may be.

Continued influence effect: The tendency to believe previously learned misinformation even after it has been corrected. Misinformation can still influence inferences one generates after a correction has occurred.

Conservatism bias: The tendency to revise one’s belief insufficiently when presented with new evidence.

Default effect: When given a choice between several options, the tendency to favor the default one.

Dread aversion: Just as losses yield double the emotional impact of gains, dread yields double the emotional impact of savouring

False-uniqueness bias: The tendency of people to see their projects and themselves as more singular than they actually are.

Functional fixedness: Limits a person to using an object only in the way it is traditionally used

Hyperbolic discounting: Discounting is the tendency for people to have a stronger preference for more immediate payoffs relative to later payoffs. Hyperbolic discounting leads to choices that are inconsistent over time – people make choices today that their future selves would prefer not to have made, despite using the same reasoning

IKEA effect: The tendency for people to place a disproportionately high value on objects that they partially assembled themselves, such as furniture from IKEA, regardless of the quality of the end product

Illusory truth effect: A tendency to believe that a statement is true if it is easier to process, or if it has been stated multiple times, regardless of its actual veracity.

Irrational escalation: The phenomenon where people justify increased investment in a decision, based on the cumulative prior investment, despite new evidence suggesting that the decision was probably wrong. Also known as the sunk cost fallacy

Law of the instrument: An over-reliance on a familiar tool or methods, ignoring or under-valuing alternative approaches. “If all you have is a hammer, everything looks like a nail”

Mere exposure effect: The tendency to express undue liking for things merely because of familiarity with them

Negativity bias: Psychological phenomenon by which humans have a greater recall of unpleasant memories compared with positive memories

Non-adaptive choice switching: After experiencing a bad outcome with a decision problem, the tendency to avoid the choice previously made when faced with the same decision problem again, even though the choice was optimal

Omission bias: The tendency to judge harmful actions (commissions) as worse, or less moral, than equally harmful inactions (omissions).

Ostrich effect: Ignoring an obvious (negative) situation

Plan continuation bias: Failure to recognize that the original plan of action is no longer appropriate for a changing situation or for a situation that is different than anticipated

Prevention bias: When investing money to protect against risks, decision makers perceive that a dollar spent on prevention buys more security than a dollar spent on timely detection and response, even when investing in either option is equally effective

Pseudocertainty effect: The tendency to make risk-averse choices if the expected outcome is positive, but make risk-seeking choices to avoid negative outcomes

Salience bias: The tendency to focus on items that are more prominent or emotionally striking and ignore those that are unremarkable, even though this difference is often irrelevant by objective standards

Selective perception bias: The tendency for expectations to affect perception

Status-quo bias: If no special action is taken, the default action that will happen is that the code will go live. You will need an especially compelling reason to override this bias and manually stop the code from going live, as it would by default.

Slow-motion bias: We feel certain that we are more careful and less risky when we slow down. This is precisely the opposite of the real world risk factors for shipping software. Slow is dangerous for software; speed is safety. The more frequently you ship code, the smaller the diffs you ship, the less dangerous each one actually becomes. This is the most powerful and difficult to overcome of all of our biases, because there is no readily available counter-metaphor for us to use. (Riding a bike is the best I’ve come up with. 😔)

Surrogation: Losing sight of the strategic construct that a measure is intended to represent, and subsequently acting as though the measure is the construct of interest

Time-saving bias: Underestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively low speed and overestimations of the time that could be saved (or lost) when increasing (or decreasing) from a relatively high speed.

Zero-risk bias: Preference for reducing a small risk to zero over a greater reduction in a larger risk.