
Introduce semantic conventions for modern AI (LLMs, vector databases, etc.) #327

@cartermp


As many are likely aware, our entire industry is building with LLMs and their associated technologies.

Those of us who have built features with them have quickly realized that traditional debugging and testing of these systems is impossible. You must rely on user input, LLM output, and dozens of other metadata fields associated with a request and/or pipeline (such as in a RAG system) to act as your guide for what's wrong, why it's wrong, what's worth fixing, and how to make sure your fix (whether via prompting, fine-tuning, or both) didn't regress something that already worked.

In other words, observability is essential to iterating with LLMs. But here's the catch: how you actually capture that data is a complete mess right now.

At best, you've developed some internal standards and instrumented everything meticulously by hand. Or maybe you're using a tool like LangSmith or Baserun or Langfuse or ... you get the idea ... to understand the LLM, but you're still likely missing the bigger picture (such as how a decision made upstream of an LLM request produces data that gets placed into the request, thus influencing the LLM's behavior). Or maybe you're logging things to a database and pulling them up in a Google spreadsheet, scanning lines that represent a request by hand, trying to divine what happened. Or, worst of all, you're throwing your hands in the air, doing nothing, and praying that your prompt changes helped.

The case for tracing

LLM applications share some notable attributes:

  • They deal in request information, whether that request is made to OpenAI or a self-hosted model you've fine-tuned
  • Latency is important for the end-user experience, and it can be influenced by many factors: the prompt you're using, the amount of output you're requesting, how you dynamically build a prompt at runtime, and more
  • Much of the relevant data involved is inherently high-cardinality: natural language inputs and outputs, prompt versions, etc.
  • The part of an application that uses an LLM is typically connected to other systems, especially when a RAG pipeline is involved
  • LLM calls can be composed together as chains, where the output of one is fed as input to another (see the tracing sketch after this list)
  • Agents that perform an unknown number of steps (or iterations) can run until an end state is reached and a final result is returned to a user
  • All of the above is mixed and matched
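
To make that concrete, here's a minimal sketch of what hand-rolled tracing for a chained, RAG-style call looks like today with the OpenTelemetry Python SDK. The span and attribute names (rag.answer, llm.prompt, and so on) are placeholders I made up for illustration, which is exactly the problem this proposal addresses: without conventions, everyone picks their own.

```python
# Minimal sketch: hand-rolled tracing for a RAG-style chain with the
# OpenTelemetry Python SDK. All span/attribute names are ad-hoc placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("rag-pipeline")

def retrieve(question: str) -> list[str]:
    return ["doc-1", "doc-2"]  # stand-in for a vector database lookup

def call_model(prompt: str) -> str:
    return "stubbed answer"  # stand-in for the actual LLM request

def answer(question: str) -> str:
    # One parent span ties retrieval and the LLM call into a single trace, so
    # an upstream decision (which documents were retrieved) stays visible next
    # to the request it influenced.
    with tracer.start_as_current_span("rag.answer") as span:
        span.set_attribute("rag.question", question)
        with tracer.start_as_current_span("rag.retrieve") as retrieval:
            docs = retrieve(question)
            retrieval.set_attribute("rag.documents.count", len(docs))
        with tracer.start_as_current_span("llm.completion") as llm_span:
            prompt = f"Context: {docs}\nQuestion: {question}"
            llm_span.set_attribute("llm.prompt", prompt)
            return call_model(prompt)

print(answer("What does span composition buy us?"))
```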

If you haven't already realized it, this is a cookie-cutter case for distributed tracing, the most robust and stable signal we have in OpenTelemetry. And if you zoom out enough, there's not a whole lot that's unique about LLMs compared to other components, such as databases, for which we already have very well-established ways of instrumenting.

What we can do

In short, we can do the following:

  • Define semantic conventions for LLM operations and related technologies
  • Build instrumentation libraries that automatically instrument requests, client libraries, or even common frameworks
  • Use conventions to work with tool authors, framework authors, and LLM vendors such that their artifacts (some of which already have an internal tracing model) can output OTLP
  • Push this emerging side of software development to be interconnected and have data be portable across tools

Proposed semantic conventions

To start, I've begun creating semantic conventions in my own branch of this repository: https://github.com/cartermp/semantic-conventions/blob/cartermp/ai/docs/ai/README.md
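
For a flavor of what a single LLM request span might carry under such conventions, here's a rough sketch. The attribute names below are illustrative guesses in the spirit of the draft, not the draft itself; the README linked above is the source of truth.

```python
# Illustrative only: attribute names here are guesses in the spirit of the
# draft conventions; see the linked README for the actual proposal.
from opentelemetry import trace

tracer = trace.get_tracer("llm-client")

with tracer.start_as_current_span("llm.completion") as span:
    span.set_attribute("llm.vendor", "openai")
    span.set_attribute("llm.request.model", "gpt-4")
    span.set_attribute("llm.request.max_tokens", 256)
    # ... make the request, then record response metadata:
    span.set_attribute("llm.response.finish_reason", "stop")
    span.set_attribute("llm.usage.total_tokens", 412)
```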

Furthermore, there are three projects that can make use of these conventions starting today:

  • https://github.com/traceloop/openllmetry
  • https://github.com/cartermp/opentelemetry-instrument-openai-py
  • https://github.com/fxchen/opentelemetry-instrument-anthropic-py

Each offers an approach to automatic instrumentation that would immediately benefit from having established semantic conventions.
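
They all follow the common OpenTelemetry instrumentor pattern: patch the client library once at startup, and every subsequent call emits spans automatically. With shared conventions, any backend could interpret those spans the same way. The import path and class name in this sketch are assumptions for illustration; check each project's README for its actual entry point.

```python
# Assumed import path and class name, following the common OpenTelemetry
# instrumentor pattern; each project documents its own entry point.
from opentelemetry.instrumentation.openai import OpenAIInstrumentor  # hypothetical

OpenAIInstrumentor().instrument()  # patch the client library once, at startup

# Every call made through the instrumented client now emits spans
# automatically, with no per-call tracing code:
# import openai
# openai.ChatCompletion.create(model="gpt-4", messages=[...])
```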

But who would do the work?!

As you can see, this is not a small lift. There are several open questions we'd need to resolve (such as whether vector databases should live under the database semantic conventions or not!), the LLM space is fast-paced and we would need to keep up with it, and there's considerable work involved in collaborating with LLM vendors, framework authors, and tool authors to get their artifacts emitting and accepting OTLP traces.

I'm able to maintain this work, including via employer sponsorship. I would also like to get at least one other person involved in maintaining these semantic conventions.

But what about non-generative AI?!

Yeah, y'know, that stuff that's been around for decades before LLMs. There's likely still value in defining semantic conventions for the wide variety of other ML systems, since there are existing systems and products using them today. I would submit that, while those are important, the sheer scale of generative AI dictates we start with it, and in particular with LLMs and the constellation of technologies typically used alongside them. Their adoption in our industry is so high that I'd rather start there.
