Live Your Best Life With Structured Events

If you’re like most of us, you learned to debug as a baby engineer by way of printf(3). By the time you were shipping code to production you had probably learned to instrument your code with a real metrics library. Maybe a tenth of us learned to use gdb and still step through functions on the regular. (I said maybe.)

Printing stuff to stdout is still the Swiss Army knife of tools. Always there when you reach for it, usually helps more than it does harm. (I said usually.)

And then! In case you’ve been living under a rock, we recently went and blew up ye aulde monolythe, and in the process we … lost most of our single-process tools and techniques for debugging. Forget gdb; even printf doesn’t work when you’re hopping the network between functions.

If your tool set no longer works for you, friend, it’s time to go all in. Maybe what you wanted was a faster horse, but it’s time for a car, and the sooner you turn in your oats for gas cans and a spare tire, the better.

Exercising Good Technical Judgment (When You Don’t Have Any)

If you’re stuck trying to debug modern problems with pre-modern tooling, the first thing to do is stop digging the hole. Stop pushing good data after bad into formats and stores that aren’t going to help you answer the right questions.

0893d048d8361fe632b090b0429ad78b-rainbow-dash-rainbows-e1542789580565.jpgIn brief: if you aren’t rolling out a solution based on arbitrarily wide, structured raw events that are unique and ordered and trace-aware and without any aggregation at write time, you are going to regret it. (If you aren’t using OpenTelemetry, you are going to regret that, too.)

So just make the leap as soon as possible.

But let’s rewind a bit.  Let’s start with observability.

 

Observability: an introduction

Observability is not a new word or concept, but the definition of observability as a specific technical term applied to software engineering is relatively new — about four years old. Before that, if you heard the term in softwareland it was only as a generic synonym for telemetry (“there are three pillars of observability”, in one annoying formulation) or team names (twitter, for example, has long had an “observability team”).

The term itself originates with control theory:

“In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals. The concept of observability was introduced by Hungarian-American engineer Rudolf E. Kálmán for linear dynamic systems.[1][2]”

But when applied to a software context, observability refers to how well you can understand and reason about your systems, just by interrogating them and inspecting their outputs with your tools. How well can you understand the inside of the system from the outside?

Achieving this relies your ability to ask brand new questions, questions you have never encountered and never anticipated — without shipping new code. Shipping new code is cheating, because it means that you knew in advance what the problem was in order to instrument it.

But what about monitoring?

Monitoring has a long and robust history, but it has always been about watching your systems for failures you can define and expect. Monitoring is for known-unknowns, and setting thresholds and running checks against the system. Observability is about the unknown-unknowns. Which requires a fundamentally different mindset and toolchain.

“Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.” — @grepory, in his talk “Monitoring is Dead“.

Monitoring is a third-person perspective on your software. It’s not software explaining itself from the inside out, it’s one piece of software checking up on another.

Observability is for understanding complex, ephemeral, dynamic systems (not for debugging code)

You don’t use observability for stepping through functions; it’s not a debugger.  Observability is for swiftly identifying where in your system the error or problem is coming from, so you can debug it — by reproducing it, or seeing what it has in common with other erroring requests.  You can think of observability as being like B.I. (business intelligence) tooling for software applications, in the way you engage in a lot of exploratory, open-ended data sifting to detect novel patterns and behaviors.

rainbow_dash___no_by_cptofthefriendship-d4erd69Observability is often about swiftly isolating or tracking down the problem in your large, sprawling, far-flung, dynamic system. Because the hard part of distributed systems is rarely debugging the code, it’s figuring out where the code you need to debug is.

The need for observability is often associated with microservices adoption, because they are prohibitively difficult to debug without service-level event oriented tooling — the kind you can get from Honeycomb and Lightstep.. and soon, I hope, many other vendors.

Events are the building blocks of observability

Ergh, another overloaded data term. What even is an “event”?

An observability “event” is a hop in the lifecycle of an end-to-end request. If a request executes code on three services separated by network hops before returning to the user, that request generated three observability “events”, each packed with context and details about that code running in that environment. These are also sometimes called “canonical log lines“. If you implemented tracing, each event may be a span in your trace.

If request ID #A897BEDC hits your edge, then your API service, then four more internal services, and twice connects to a db and runs a query, then request ID #A897BEDC generated 8 observability events … assuming you are in fact gathering observability data from the edge, the API, the internal services and the databases.ponyfm-i7812-original

This is an important caveat. We only gather observability events from services that we can and do introspect. If it’s a black box to us, that hop cannot generate an observability event. So if request ID #A897BEDC also performed 20 cache lookups and called out to 8 external HTTP services and 2 managed databases, those 30 hops do not generate observability events (assuming you haven’t instrumented the memcache service and have no instrumentation from those external services/dbs). Each request generates one event per request per service hop.**

(I also wrote about logs vs structured events here.)

Observability is a first-person narrative.

We care primarily about self-reported status from the code as it executes the request path.

Instrumentation is your eyes and ears, explaining the software and its environment from the perspective of your code. Monitoring, on the other hand, is traditionally a third-person narrative — it’s one piece of software checking up on another piece of software, with no internal knowledge of its hopes and dreams.

First-person narrative reports have the best potential for telling a reliable narrative.  And more importantly, they map directly to user experience in a way that third-party monitoring does not and cannot.

Events … must be structured.

First, structure your goddamn data.  You’re a computer scientist, you’ve got no business using text search to plow through terabytes of text.

Events …  are not just structured logs.

Now, part of the reason people seem to think structured data is cost-prohibitive is that they’re doing it wrong.  They’re still thinking about these like log lines.  And while you can look at events like they’re just really wide structured log lines that aren’t flushed to disk, here’s why you shouldn’t: logs have decades of abhorrent associations and absolutely ghastly practices.

Instead of bundling up and passing along one neat little pile of context, they’re spewing log lines inside loops in their code and DDoS’ing their own logging clusters.They’re shitting out “log lines” with hardly any dimensions so they’re information-sparse and just straight up wasting the writes. And then to compensate for the sparseness and repetitiveness they just start logging the same exact nouns tens or hundreds of times over the course of the request, just so they can correlate or reconstruct some lousy request that they never should have blown up in the first place!

But they keep hearing they should be structuring their logs, so they pile structure on to their horrendous little strings, which pads every log line by a few bytes, so their bill goes up but they aren’t getting any benefit! just paying more! What the hell, structuring is bull shit!giphy

Kittens. You need a fundamentally different approach to reap the considerable benefits of structuring your data.

But the difference between strings and structured data is ~basically the difference between grep and all of computer science. 😛

Events … must be arbitrarily wide and dense with context.

So the most effective way to structure your instrumentation, to get the absolute most bang for your buck, is to emit a single arbitrarily wide event per request per service hop. At Honeycomb, the maturely instrumented datasets that we see are often 200-500 dimensions wide.  Here’s an event that’s just 20 dimensions wide:

{ 

   "timestamp":"2018-11-20 19:11:56.910",
   "az":"us-west-1",
   "build_id":"3150",
   "customer_id":"2310",
   "durationMs":167,
   "endpoint":"/api/v2/search",
   "endpoint_shape":"/api/v2/search",
   "fraud_dur":131,
   "hostname":"app14",
   "id":"f46691dfeda9ede4",
   "mysql_dur":"",
   "name":"/api/v2/search",
   "parent_id":"",
   "platform":"android",
   "query":"",
   "serviceName":"api",
   "status_code":200,
   "traceId":"f46691dfeda9ede4",
   "user_id":"344310",
   "error_rate":0,
   "is_root":"true"
}

So a well-instrumented service should have hundreds of these dimensions, all bundled around the context of each request. And yet — and here’s why events blow the pants off of metrics — even with hundreds of dimensions, it’s still just one write. Adding more dimensions to your event is effectively free, it’s still one write plus a few more bits.

Compare this to a metric-based systems, where you are often in the position of trying to predict whether a metric will be valuable enough to justify the extra write, because every single metric or tag you add contributes linearly to write amplification. Ever gotten billed tens of thousands of dollars for your custom metrics, or had to prune your list of useful custom metrics down to something affordable? (“BUT THOSE ARE THE ONLY USEFUL ONES!”, as every ops team wails)

Events … must pass along the blob of context as the request executes

As you can imagine, it can be a pain in the ass to keep passing this blob of information along the life of the request as it hits many services and databases. So at Honeycomb we do all the annoying parts for you with our integrations. You just install the go pkg or ruby gem or whatever, and under the hood we:

  1. initialize an empty debug event when the request enters that service
  2. prepopulate the empty debug event with any and all interesting information that we already know or can guess.  language type, version, environment, etc.
  3. create a framework so you can just stuff any other details in there as easily as if you were printing it out to stdout
  4. pass the event along and maintain its state until you are ready to error or exit
  5. write the extremely wide event out to honeycomb

Easy!

(Check out this killer talk from @lyddonb on … well everything you need to know about life, love and distributed systems is in here, but around the 12:00 mark he describes why this approach is mandatory. WATCH IT. https://www.youtube.com/watch?v=xy3w2hGijhE&feature=youtu.be)

Events … should collect context like sticky buns collect dust

Other stuff you’ll want to track in these structured blobs includes:

1225287_1370081029072_full

  1. Metadata like src, dst headers
  2. The timing stats and contents of every network call (our beelines wrap all outgoing http calls and db queries automatically)
  3. Every raw db query, normalized query family, execution time etc
  4. Infra details like AZ, instance type, provider
  5. Language/environment context like $lang version, build flags, $ENV variables
  6. Any and all unique identifying bits you can get your grubby little paws on — request ID, shopping cart ID, user ID, request ID, transaction ID, any other ID … these are always the highest value data for debugging.
  7. Any other useful application context.  Service name, build id, ordering info, error rates, cache hit rate, counters, whatever.
  8. Possibly the system resource state at this point in time.  e.g. values from /proc/net/ipv4 stats

Capture all of it. Anything that ever occurs to you (“this MIGHT be handy someday”) — don’t even hesitate, just throw it on the pile. Collect it up in one rich fat structured blob.

Events … must be unique, ordered, and traceable

You need a unique request ID, and you need to propagate it through your stack in some way that preserves sequence. Once you have that, traces are just a beautiful visualization layer on top of your shiny event data.

Events … must be stored raw.

Because observability means you need to be able to ask any arbitrary new question of Rainbow-Dash-is-not-amused-my-little-pony-friendship-is-magic-31088082-900-622your system without shipping new code, and aggregation is a one-way trip. Once you have aggregated your data and discarded the raw requests, you have destroyed your ability to ask new questions of that data forever. For Ever.

Aggregation is a one-way trip.  You can always, always derive your pretty metrics and dashboards and aggregates from structured events, and you can never go in reverse. Same for traces, same for logs. The structured event is the gold standard. Invest in it now, save your ass in the future.

It’s only observability if you can ask new questions. And that means storing raw events.

Events…are richer than metrics

There’s always tradeoffs when it comes to data. Metrics choose to sacrifice context and connective tissue, and sometimes high cardinality support, which you need to correlate anomalies or track down outliers. They have a very small, efficient data format, but they sacrifice everything else by discarding all but the counter, gauge, etc.

A metric looks like this, by the way.

{ metric: "db.query.time", value: 0.502, tags: Array(), type: set }

That’s it. It’s just a name, a number and maybe some tags. You can’t dig into the event and see what else was happening when that query was strangely slow. You can never get that information back after discarding it at write time.

But because they’re so cheap, you can keep every metric for every request! Maybe. (Sometimes.) More often, what happens is they aggregate at write time. So you never actually get a value written out for an individual event, it smushes everything together that happens in the 1 second interval and calculates some aggregate values to write out. And that’s all you can ever get back to.

With events, and their relative explosion of richness, we sacrifice our ability to store every single observability event about every request. At FB, every request generated hundreds of observability events as it made its way through the stack. Nobody, NOBODY is going to pay for an o11y stack that is hundreds of times as large as production. The solution to that problem is sampling.

Events…should be sampled.rainbow_dash___no_by_cptofthefriendship-d4erd69

But not dumb, blunt sampling at server side. Control it on the client side.

Then sample heavily for events that are known to be common and useless, but keep the events that have interesting signal. For example: health checks that return 200 OK usually represent a significant chunk of your traffic and are basically useless, while 500s are almost always interesting. So are all requests to /login or /payment endpoints, so keep all of them. For database traffic: SELECTs for health checks are useless, DELETEs and all other mutations are rare but you should keep all of them. Etc.

You don’t need to treat your observability metadata with the same care as you treat your billing data. That’s just dumb.

… To be continued.

I hope it’s now blazingly obvious why observability requires — REQUIRES — that you have access to raw structured events with no pre-aggregation or write-time rollups. Metrics don’t count. Just traces don’t count. Unstructured logs sure as fuck don’t count.

Structured, arbitrarily wide events, with dynamic sampling of the boring parts to control costs. There is no substitute.

For more about the technical requirements for observability, read this, this, or this.

IMG_4619
**The deep fine print: it’s one observability event per request per service hop … because we gather observability detail organized by request id.  Databases may be different.  For example, with MongoDB or MySQL, we can’t instrument them to talk to honeycomb directly, so we gather information about its internal perspective by 1) tailing the slow query log (and turning it up to log all queries if perf allows), 2) streaming tcp over the wire and reconstructing transactions, 3) connecting to the mysql port as root every couple seconds from cron, then dumping all mysql stats and streaming them in to honeycomb as an event.  SO.  Database traffic is not organized around connection length or unique request id, it is organized around transaction id or query id.  Therefore it generates one observability event per query or transaction. 
In other words: if your request hit the edge, API, four internal services, two databases … but ran 1 query on one db and 10 queries on the second db … you would generate a total of *19 observability events* for this request.
For more on observability for databases and other black boxes, try this blog post.
Live Your Best Life With Structured Events

The Truth About “MEH-TRICS”

First published on 2022-04-13 at https://www.honeycomb.io/blog/truth-about-meh-trics-metrics

A long time ago, in a galaxy far, far away, I said a lot of inflammatory things about metrics.

“Metrics are shit salad.”

“Metrics are simply nerfed dimensions.”

“Metrics suck,” “metrics are legacy,” “metrics and time series aggregates will fucking kneecap you.”

I cannot tell a lie; Twitter will testify that I’ve spent the past six years ragging on metrics. So much so that ever since we launched Honeycomb Metrics last year, our poor solution architects have been encountering skeptics in the field who repeat my quotes back to them and ask, dubiously, whether Honeycomb Metrics are any good or not, and whether we genuinely plan on investing in it or not, given our known anti-metrics sympathies.

That’s a great question. 😊

Metrics aren’t worthless; they’re just limited.

Metrics are a mature technology that’s been around for over 30 years, and they have some real advantages. They’re tiny, fast, and cheap; you can hold a bunch of them in memory as counters, summaries, and gauges. They aggregate well and take up a fixed amount of storage space. The entire monitoring industry is built on top of metrics.

When it comes to workloads like, “How heavy is the write load on my hard drive?” or “What is the temperature or fan status inside my chassis?” or “What is the traffic rate in and out of this interface on my switch?”  metrics are what you should use. In fact, pretty much any time you want to know the health of a system or component in toto, metrics are the right tool.

Because that’s what metrics do best—report statistics in aggregate, from the perspective of any system or component. They can tell you that your Ruby HTTP worker pool is 70% utilized or that your nginx webserver is returning 502s 1% of the time. What they can’t tell you is what this means for any one of your users, applications, delivery vehicles, and so forth.

Until recently, metrics-based tools or logs were the only game in town. People were trying to sell us metrics tools for observability use cases, and that’s what got my goat so badly. If you simply append “… for observability” to each of my inflammatory statements, then I stand by them completely.

“Metrics are shit salad … for observability.

Yup, rings true.

You’re never going to make a metrics tool like Prometheus or Datadog into an observability tool. You’re just not. Observability is about unknown-unknowns, while metrics are a tool for known-unknowns.

If you need a refresher on the differences between observability and monitoring, I’ll refer you to pieces like thisthis, and this. What I want to talk about here is slightly different. In a post-observability world, what is the true and proper place for metrics tooling?

Metrics and observability have different use cases.

Metrics aren’t completely useless, even if you have a robust observability presence. We still use metrics at Honeycomb to this day for certain workloads—and always will because they’re the right tool for the job.

There are two kinds of workloads, roughly speaking: your code—the code you write, review, ship, debug and maintain on a daily basis. And other people’s code—the code you have to run and use in order to support your code. Some examples of the latter might be: Linux, Docker, MySql, Amazon RDS, Kafka, AWS Lambda, GCP gateways, memcache, CI/CD pipelines, Kubernetes, etc.

Your code is your crown jewels, the code you need to survive and succeed as a business. It changes constantly—many times per week, if not per day. You are expected to understand its inner workings intimately, and spend lots of time chasing down bugs or understanding and reproducing behavior. You care about the way it performs and interacts with each and every individual user, with changing infrastructure state, and under a variety of different load conditions.

That is why your code demands observability. In order to understand your software, you must first instrument it, in a way that collects lots of rich context and bundles it up around each event end-to-end. Then you need to stream those events into a tool that lets you slice and dice and trace and explore with support for high-cardinality and high-dimensionality data. That’s the only way you’re going to be able to correlate errors, track down outliers, and reflect each user’s experience.

But what about the rest of the software? You can’t instrument Amazon RDS, and only crazy people would instrument, rebuild, and repackage things like Kafka or Docker or nginx. The whole point of third-party software is that you DON’T USE IT until it’s stable enough to be taken more or less for granted. Sure, you roll updates, but usually on the order of months or years—not every day. You don’t need to be intimately familiar with its inner workings because you aren’t changing it every day. Those aren’t your crown jewels.

You do care about their health though, only differently. You care about whether you need to provision more capacity or not. You care about knowing how hard you’re hammering on the underlying hardware or hypervisor. That’s why metrics and monitoring are the right tools to use for third-party code. They don’t let you peer under the hood in the same way, or slice and dice in the same way, but that’s okay. You shouldn’t have to.

With third-party stuff, you don’t care about the code, you care about the health of the service. In aggregate.

(There are some kinds of in-between software, like databases, where event-level information is super useful for debugging things like slow queries and lock percentages, and you can use various black box techniques to approximate observability without instrumentation. But in general this model holds up quite well.)

In a post-observability world, what are metrics for?

I’ve often pointed out that observability is built on top of arbitrarily wide structured data blobs, and that metrics, logs, and traces can be derived from those blobs while the reverse is not true—you can’t take a bunch of metrics and reformulate a rich event.

And yes, people who have observability typically find themselves using metrics and dashboards less and less. They’re simply not as versatile or useful as events that you can slice and dice and manipulate in infinite ways. And you can derive aggregates and trends from the events you have stored.

But metrics will always be useful for understanding third-party software, from the perspective of the service, cluster, or node. They will always be the right tool for the job when it comes to software interfacing with hardware. And they can be super complementary when you are investigating your code using events and instrumentation.

If you’re an engineer writing and shipping code, you’re never not going to want to know if your change caused memory usage to triple, or CPU utilization to skyrocket, or disk usage or network throughput to saturate. That’s why we built Honeycomb Metrics as an overlay, a way to enhance or validate your understanding of the impact your code changes have had on the underlying system.

Metrics are also valuable as a bridge to the past. People have been instrumenting software for metrics for 30 years—they’re never going away completely, and not everything can or should be reinstrumented with events. Lots of people already have robust monitoring systems that slurp in millions of metrics. Nobody wants to have to redo all that work just because they’re moving to a different tool, so people tend to point their metrics firehose at Honeycomb as a way of getting started as they roll observability out into their code.

The Truth About “MEH-TRICS”

Observability is a Many-Splendored Definition

Last weekend, @swyx posted a great little primer to instrumentation titled “Observability Tools in JavaScript”.  A friend sent me the link and suggested that I might want to respond and clarify some things about observability, so I did, and we had a great conversation!  Here is a lightly edited transcript of my reply tweet storm.

First of all, confusion over terminology is understandable, because there are some big players out there actively trying to confuse you!  Big Monitoring is indeed actively trying to define observability down to “metrics, logs and traces”.  I guess they have been paying attention to the interest heating up around observability, and well… they have metrics, logs, and tracing tools to sell?  So they have hopped on the bandwagon with some undeniable zeal.

But metrics, logs and traces are just data types.  Which actually has nothing to do with observability.  Let me explain the difference, and why I think you should care about this.

“Observability? I do not think it means what you think it means.”

Observability is a borrowed term from mechanical engineering/control theory.  It means, paraphrasing: “can you understand what is happening inside the system — can you understand ANY internal state the system may get itself into, simply by asking questions from the outside?”  We can apply this concept to software in interesting ways, and we may end up using some data types, but that’s putting the cart before the horse.

It’s a bit like saying that “database replication means structs, longints and elegantly diagrammed English sentences.”  Er, no.. yes.. missing the point much?

This is such a reliable bait and switch that any time you hear someone talking about “metrics, logs and traces”, you can be pretty damn sure there’s no actual observability going on.  If there were, they’d be talking about that instead — it’s far more interesting!  If there isn’t, they fall back to talking about whatever legacy products they do have, and that typically means, you guessed it: metrics, logs and traces.

❌ Metrics

Metrics in particular are actually quite hostile to observability.  They are usually pre-aggregated, which means you are stuck with whatever questions you defined in advance, and even when they aren’t pre-aggregated they permanently discard the connective tissue of the request at write time, which destroys your ability to correlate issues across requests or track down any individual requests or drill down into a set of results — FOREVER.

Which doesn’t mean metrics aren’t useful!  They are useful for many things!  But they are useful for things like static dashboards, trend analysis over time, or monitoring that a dimension stays within defined thresholds.  Not observability.  (Liz would interrupt here and say that Google’s observability story involves metrics, and that is true — metrics with exemplars.  But this type of solution is not available outside Google as far as we know..)

❌ Logs

Ditto logs.  When I say “logs”, you think “unstructured strings, written out to disk haphazardly during execution, “many” log lines per request, probably contains 1-5 dimensions of useful data per log line, probably has a schema and some defined indexes for searching.”  Logs are at their best when you know exactly what to look for, then you can go and find it.

Again, these connotations and assumptions are the opposite of observability’s requirements, which deals with highly structured data only.  It is usually generated by instrumentation deep within the app, generally not buffered to local disk, issues a single event per request per service, is schemaless and indexless (or inferred schemas and autoindexed), and typically containing hundreds of dimensions per event.

❓ Traces

Traces?  Now we’re getting closer.  Tracing IS a big part of observability, but tracing just means visualizing events in order by time.  It certainly isn’t and shouldn’t be a standalone product, that just creates unnecessary friction and distance.  Hrmm … so what IS observability again, as applied to the software domain??

As a reminder, observability applied to software systems means having the ability to ask any question of your systems — understand any user’s behavior or subjective experience — without having to predict that question, behavior or experience in advance.

Observability is about unknown-unknowns.

At its core, observability is about these unknown-unknowns.

Plenty of tools are terrific at helping you ask the questions you could predict wanting to ask in advance.  That’s the easy part.  “What’s the error rate?”  “What is the 99th percentile latency for each service?”  “How many READ queries are taking longer than 30 seconds?”

  • Monitoring tools like DataDog do this — you predefine some checks, then set thresholds that mean ERROR/WARN/OK.
  • Logging tools like Splunk will slurp in any stream of log data, then let you index on questions you want to ask efficiently.
  • APM tools auto-instrument your code and generate lots of useful graphs and lists like “10 slowest endpoints”.

But if you *can’t* predict all the questions you’ll need to ask in advance, or if you *don’t* know what you’re looking for, then you’re in o11y territory.

  • This can happen for infrastructure reasons — microservices, containerization, polyglot storage strategies can result in a combinatorial explosion of components all talking to each other, such that you can’t usefully pre-generate graphs for every combination that can possibly degrade.
  • And it can happen — has already happened — to most of us for product reasons, as you’ll know if you’ve ever tried to figure out why a spike of errors was being caused by users on ios11 using a particular language pack but only in three countries, and only when the request hit the image export microservice running build_id 789782 if the user’s last name starts with “MC” and they then try to click on a particular button which then issues a db request using the wrong cache key for that shard.

Gathering the right data, then exploring the data.

Observability starts with gathering the data at the right level of abstraction, organized around the request path, such that you can slice and dice and group and  look for patterns and cross-correlations in the requests.

To do this, we need to stop firing off metrics and log lines willynilly and be more disciplined.  We need to issue one single arbitrarily-wide event per service per request, and it must contain the *full context* of that request. EVERYTHING you know about it, anything you did in it, all the parameters passed into it, etc.  Anything that might someday help you find and identify that request.

Then, when the request is poised to exit or error the service, you ship that blob off to your o11y store in one very wide structured event per request per service.

In order to deliver observability, your tool also needs to support high cardinality and high dimensionality.  Briefly, cardinality refers to the number of unique items in a set, and dimensionality means how many adjectives can describe your event.  If you want to read more, here is an overview of the space, and more technical requirements for observability

You REQUIRE the ability to chain and filter as many dimensions as you want with infinitely high cardinality for each one if you’re going to be able to ask arbitrary questions about your unknown unknowns.  This functionality is table stakes.  It is non negotiable.  And you cannot get it from any metrics or logs tool on the market today.

Why this matters.

Alright, this is getting pretty long. Let me tell you why I care so much, and why I want people like you specifically (referring to frontend engineers and folks earlier in their careers) to grok what’s at stake in the observability term wars.

We are way behind where we ought to be as an industry. We are shipping code we don’t understand, to systems we have never understood. Some poor sap is on call for this mess, and it’s killing them, which makes the software engineers averse to owning their own code in prod.  What a nightmare.

Meanwhile developers readily admit they waste >40% of their day doing bullshit that doesn’t move the business forward.  In large part this is because they are flying blind, just stabbing around in the dark.

We all just accept this.  We shrug and say well that’s just what it’s like, working on software is just a shit salad with a side of frustration, it’s just the way it is.

But it is fucking not.  It is un fucking necessary.  If you instrument your code, watch it deploy, then ask “is it doing what I expect, does anything else look weird” as a habit?  You can build a system that is both understandable and well-understood.  If you can see what you’re doing, and catch errors swiftly, it never has to become a shitty hairball in the first place.  That is a choice.

🌟 But observability in the original technical sense is a necessary prerequisite to this better world. 🌟

If you can’t break down by high cardinality dimensions like build ids, unique ids, requests, and function names and variables, if you cannot explore and swiftly skim through new questions on the fly, then you cannot inspect the intersection of (your code + production + users) with the specificity required to associate specific changes with specific behaviors.  You can’t look where you are going.

Observability as I define it is like taking off the blindfold and turning on the light before you take a swing at the pinata.  It is necessary, although not sufficient alone, to dramatically improve the way you build software.  Observability as they define it gets you to … exactly where you already are.  Which of these is a good use of a new technical term?

 

Do better.

And honestly, it’s the next generation who are best poised to learn the new ways and take advantage of them. Observability is far, far easier than the old ways and workarounds … but only if you don’t have decades of scar tissue and old habits to unlearn.

The less time you’ve spent using monitoring tools and ops workarounds, the easier it will be to embrace a new and better way of building and shipping well-crafted code.

Observability matters.  You should care about it.  And vendors need to stop trying to confuse people into buying the same old bullshit tools by smooshing them together and slapping on a new label.  Exactly how long do they expect to fool people for, anyway?

Observability is a Many-Splendored Definition

Logs vs Structured Events

I got an interesting tweet the other day from @evntdrvn in response to this thread of mine. Paraphrasing,

“So I’ve almost got our group at work up to Step 1 in your observability maturity model, but some of the devs that I work with want to turn OFF our lovely structured logging in prod for informational-level msgs due to their legacy philosophy (‘we only log errors in prod’). The reasons given are mostly philosophical (“I’m a dev and only interested when things error out, I don’t want any other noise in prod logs”, “I don’t want to slow my app down in prod”). Help?!?”

As I was reading this, I was itching to fly out and dive into battle with Eric. I know exactly where his opinionated devs are coming from. I used to say the same things! I even wrote a whole blog post about it.

These developers have internalized a set of rules and best practices for dealing with output data, in the context of “monolith application development in the early 2000s”.

Monolithic systems assumptions

Those systems had many common constraints and assumptions, such as:

  • We have a monolith service, or a very small number of services. We can model the system in our heads.
  • Logging is done to local disk, which can impact performance
  • Disks are expensive
  • Log lines are spat out inline with execution.  A poorly placed printf can take the whole system down.
  • Investigation is rare, and usually means a human reading error logs.
  • Logging is of poor utility for understanding internal states or execution paths; you should just read the code or use a debugger.  (There are few or network hops between functions.)
  • Logging is mostly useful for detecting certain terminal crash states or connection errors.

Monolithic logging best practices

Therefore:

  • We should be very stingy in what we log
  • Debuggers should be used for understanding internal states of the code
  • Logs are a last resort and record of crash dumps.  We do not expect to use log data in the course of our daily work.  We assume log-related manual investigation will be infrequent and of limited utility.

These were exactly the right lessons to learn in the era of expensive hardware and monolithic repos/artifacts. Many people still work in environments like this, and follow logging best practices like these. God bless, more power to em.

Distributed systems assumptions

But more and more of us face systems that are very different.

  • We have many services, possibly many MANY services. A representative request will have “many” hops across “many” services and routers and proxies and meshes and storage systems.
  • We cannot model the system in our heads; it would be a mistake to try. We rely on tooling as the source of truth for those systems.
  • You may or may not have access to those services, or the systems your code runs on. There may or may not be a logging facility, or a centralized log aggregator. Your only view of the system is through the instrumentation of your code.
  • Disks and system resources are cheap, ephemeral, all but disposable.
  • Data services are similarly cheap.  We can almost entirely silo application performance off from the cost of writing perf data out.
  • Investigation is prohibitively slow and expensive for a human to do by hand. Many of the nodes or processes we need to inspect may no longer even exist, but their past states may still be relevant to us in understanding patterns to the present time.
  • Investigation should usually be done distributedly, across all instantiations of your code, however many there might be — and in real time
  • Investigation requires computation — not just string search. We need to ask on the fly involving math and percentiles and breakdowns and group by’s.  And we need access to the raw requests in order to run accurate computations — no pre-aggregates.
  • The hardest part isn’t usually debugging the code, it’s figuring out where is the code you need to debug. Or what the errors or outliers have in common from the perspective of the code.  Fixing the code itself is often comparatively trivial, once found.
  • What even is ‘logging’?
  • What even is ‘local disk’?

This isn’t optional: at some point of complexity or scale or distributedness, it becomes necessary if you want to work with these systems.

Logs can’t help you here.

And you aren’t going to get that kind of explorable data out of loglevel:ERROR, or by chopping up your telemetry into disconnected metrics devoid of context.

You are only going to get this kind of explorable, ad hoc, computation-friendly data if you take a radically new approach to how you output and aggregate telemetry.  You’re going to need to replace your log lines and log levels with a different sort of beast: arbitrarily wide structured events that describe the request and its context, one event per request per service.

If it helps, don’t think of them as log files any more. Think of them as events. Yes, you can stash this stream in a file, but why would you?  on what disk?  will that work for your serverless functions too?  Just stream them over the network to wherever you want to put them.

Log levels are another confusing and unnecessary artifact of yesteryear that you no longer really need. The more you think of structured events as logs, the more tempted you may be to apply the old set of best practices. So just don’t think of them as logs at all.

How to gather and structure your data

Instead of dribbling little pebbles of log effluvia throughout your code, do this.  (If you’re a honeycomb user, our beelines do it all automatically for you *and* pre-propagate the blobs with everything we know of your context.)

  1. Initialize an empty blob at the beginning, when the request first enters the service.
  2. Stuff any and all interesting detail about the request into that blob throughout the lifetime of the request.
    • Any unique id, any high-cardinality variable, any headers passed in, every full query, normalized query, and query execution time; every http call out to a remote service, every http execution time; any shopping cart id, first and last name, execution time — literally anything interesting, append to blob.
  3. Then, when the request is about to exit or error, write the blob off to honeycomb or another service or disk somewhere.

You can see immediately how this method has radically different performance implications and risks than the earlier shotgun spray approach. No more “oops i accidentally put a print line INSIDE a for loop”. The write amplification profile is compressed. Most importantly, the incremental cost of capturing more detail about the request per service is nearly zero.

And now you have the kind of structured data that you can feed into something like a columnar store, or honeycomb, and run ad hoc queries to your heart’s delight.

Distributed systems logging events best practices:

Let’s sum up.  (I’m including links to other past rants on this topic):

Just think.

No more doing multi-line regexps trying to look for the same request ID or user ID doing five suspicious things in a row.

No more regexps at all, for fuck’s sake.

No more bullshit percentiles that were computed at write time by averaging over a bunch of other averages

No more having to jump around from dashboards to logs trying to vainly eyeball correlate one spike with another. No more wondering why no two tools can agree if anything even exists or not

Just gather the detail you need to ask the questions when you need them, and store it in a single source of truth.  It’s that simple.

No need to shame people from learning best practices that worked perfectly well for a long time.  You can either let them learn the hard way that this transformation is non optional, or you can help them learn the easy way that it’s simply much better and easier to invest in this telemetry up front.  You seem like a nice enough chap, which is probably why you chose door 2.  (If you wanted to get tougher about it, have a few reformed folks in to tell their horror stories.  Try some ex-twitter engineers.)

The hardest part seems to be getting people to unlearn all the best practices they once learned for dealing with logs.  So just don’t call it logs anymore, if that helps. Call it “structured events”.

– charity.

img_4817

Logs vs Structured Events