[Proposal] Adding OpenTelemetry Trace Support to MCP #269
-
observed 👀
-
Hi @altryne, solid proposal! Adding standardized OTel tracing is definitely the right move for production MCP. I've been building the Ithena Governance SDK, so here are my thoughts on the specifics:

This fits perfectly with the approach in the Ithena SDK; we already handle propagating the incoming trace context. Really valuable direction for the protocol. Happy to share insights from the governance/observability layer perspective as this moves forward.
-
This takes a different approach to telemetry than is typical for OpenTelemetry, where each node sends its telemetry to the backend via its own export mechanism. Why should MCP be any different from other distributed systems, such as HTTP API servers, where that pattern is the norm? It also has the potential to expose internal implementation details as part of the output, which can be both a security and a competitive concern. I agree that MCP needs telemetry support, but I look at it as another RPC mechanism, and it should be modelled as such.

What I think is missing is:

The C# SDK has a first round of OTel support: modelcontextprotocol/csharp-sdk#183
-
Speaking from the OpenTelemetry side, there is no need to wrap OTLP into MCP; just instrument the server and let it send OTLP to a user-specified endpoint.
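To make that concrete, here is a minimal sketch (not from this thread) of a Node-based MCP server instrumenting itself with the OTel SDK and exporting spans to whatever OTLP endpoint the user configures; the package choices and the service name are assumptions for illustration.

```ts
// Minimal sketch: the MCP server process instruments itself and exports
// spans directly to a user-specified OTLP endpoint, with no MCP changes.
// Assumes @opentelemetry/sdk-node and @opentelemetry/exporter-trace-otlp-http.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'my-mcp-server', // hypothetical name for illustration
  // With no explicit URL, the exporter honors OTEL_EXPORTER_OTLP_ENDPOINT
  // (and related env vars), so the user picks the destination.
  traceExporter: new OTLPTraceExporter(),
});

sdk.start();
```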
-
This should be the title. The current title makes it sound much broader.

@samsp-msft and @lmolkova are right. #246 in particular is needed even in this proposal: spans still need the right context even if the client is responsible for sending them to their 'final' destination. With that, nothing else is needed to have full observability. Configuring OTel exporters in the servers isn't that hard.

I can see how it could be useful for the SDKs to take care of exporter configuration so that you only need to configure once in the client. It's also possible that you want to export spans somewhere that's inaccessible to the servers, e.g. you may have a locally hosted OTel backend and remote servers. But it needs to be clearer that this is what's being proposed and why.

There are probably also other ways to achieve the same goal. I can imagine generic OTel components that could be reused in other similar scenarios instead of building something specific to MCP. You're basically just using the MCP client as an OTel collector/proxy.
-
I thought about this approach too, but I agree with @samsp-msft that it seems like a mistake.

Overall I think there are two cases:

Let's concentrate on #246, and on making logs better if required, as a first step.
-
TL;DR: focus first on propagation. Release that and get experience with it; then tackle the narrower OTel bits, possibly in the OTel org.

My 2p is to focus on propagation as a start, and not to assume a specific instrumentation approach or data layout, or constrain to a specific signal like traces, metrics, or logging. In other words, first make it possible to correlate/join a trace. This solves the most important part, and other things can follow after practice.

For example, for what's currently called traceToken, you can use that field, or add headers, or add a specific field for the W3C traceparent header. An instrumented SDK can then inject and extract the trace context from that header, placing it as the current span. Not only does this keep things simpler, it also provides a path for anything that supports the W3C propagation spec but isn't strictly OTel. This would include other open source projects like Zipkin, as well as vendor SDKs. This inject/extract part has limited overhead and doesn't require a specific data model to be applied. The highest impact is that you can achieve the same trace today when converting a local function to one split over stdio.

Note: we don't need MCP clients to become OTel collectors, because with some configuration, stdio subprocesses can inherit the same auto-instrumentation as the parent, either directly or implicitly. For example, if you run node like this, the resulting subprocess will get the same auto-instrumentation hooks as the parent:

```sh
# using elastic distribution of otel, but I think this works with normal also
node --env-file .env -r @elastic/opentelemetry-node index.js
```

```js
// MCP server is a subprocess, and we want all arguments given to node to
// be visible, notably anything like '-r @elastic/opentelemetry-node'.
const transport = new StdioClientTransport({
  command: process.execPath,
  args: [...process.execArgv, ...process.argv.slice(1), SERVER_ARG],
});
```
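To illustrate the inject/extract idea, here is a rough sketch using the standard @opentelemetry/api propagation API with the request's `_meta` object as the carrier; treating `_meta` as the carrier is an assumption for illustration, not something the spec defines today.

```ts
// Sketch: W3C trace context propagation over MCP request metadata.
// Assumes only @opentelemetry/api; using _meta as the carrier is an
// illustrative assumption, not part of the MCP spec.
import { context, propagation } from '@opentelemetry/api';

// Client side: copy the active trace context (traceparent/baggage) into _meta.
function injectTraceContext(meta: Record<string, string> = {}): Record<string, string> {
  propagation.inject(context.active(), meta);
  return meta;
}

// Server side: restore the caller's context before running the handler,
// so any spans created inside join the caller's trace.
function runWithTraceContext<T>(meta: Record<string, string>, handler: () => T): T {
  const extracted = propagation.extract(context.active(), meta);
  return context.with(extracted, handler);
}
```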
-
Hey @altryne, just checked out your OTel tracing proposal for MCP. Interesting approach to have servers send trace spans back to clients! I've been implementing MCP tools with observability lately, and this could really solve the "black box" problem we're all facing.

I see both sides of this debate. On one hand, the standard OTel approach, where each component exports its own telemetry, makes sense for traditional distributed systems. But for MCP, where servers are essentially "tool calls" in a client's workflow, having the client stitch together the full trace feels more natural. The concern about exposing internal details is valid, but servers could control what spans they expose. I'm thinking this could start with basic timing data and gradually add more detail as needed.

Have you considered a hybrid approach? Maybe keep the context propagation from #246 but also allow this optional span return mechanism for servers that want deeper integration with client observability tools? Either way, getting proper tracing into MCP is crucial as we build more complex agentic workflows. Nice to see this getting attention!
-
Hi all, for reference Dagger implements both MCP and OTel, for full observability of tools. It works great and required no extension of either protocol.

IMO if you want a unified trace across the LLM and tools (to see the context around the tool call), then you should unify your observability stack across the MCP client and server. This can be done at the runtime layer: either by literally having the same runtime on both sides (e.g. agent SDKs and frameworks); by executing stdio MCP servers with injected OTel collector configuration; or by configuring your MCP clients and remote MCP servers to send to the same OTel collectors, then reconciling context when rendering traces.
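One way to do the "injected OTel collector configuration" for stdio servers is to have the client pass the exporter settings to the child process when it spawns it. A rough sketch with the TypeScript MCP SDK's StdioClientTransport, where the entry point, endpoint, and service name are placeholders:

```ts
// Sketch: launching a stdio MCP server with OTel exporter configuration
// injected via environment variables, so both sides report to the same
// collector. Paths and endpoint values are placeholders.
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

const transport = new StdioClientTransport({
  command: 'node',
  args: ['./my-mcp-server.js'], // hypothetical server entry point
  env: {
    ...(process.env as Record<string, string>),
    // Point the server's OTel SDK at the same collector the client uses.
    OTEL_EXPORTER_OTLP_ENDPOINT: 'http://localhost:4318',
    OTEL_SERVICE_NAME: 'my-mcp-server',
  },
});
```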
-
(Placeholder) I will be writing a response to this in the next 24 hours on behalf of Comet. We welcome OTel support but need to be conscious of the mechanics. Thanks for bringing this up 🙏
-
Clean core type definitions. Looks like a solid foundation for the protocol. Great direction.
-
There's already been plenty of discussion on security concerns; I'll mostly leave that to the respective comment threads. What I'd like to add on top of that is that having the client aggregate and emit traces from servers seems unnecessary compared to standard OTel patterns, if servers are already propagating trace IDs for association purposes (the focus of #246).

This proposal makes some amount of sense if the end-user happens to also control all of the MCP servers in use, which is more or less the status quo. I don't believe that will continue to be the case in the future, however: clients and servers will regularly interact with peers outside of their trust boundary that they do not want to share raw traces with, and they may still want observability relative to other nodes within their trust boundary. What this proposal would force servers to do is make a binary decision to either emit or not emit spans to all clients, and make clients responsible for sending those aggregated spans to a collector. Instead, if #246 is adopted, servers will individually send spans to a collector (the usual OTel pattern), which makes more sense when the end-user does not control all MCP servers in use. Each server or client owner would be responsible for sending spans to their own collector, ensuring that data isn't inadvertently leaked.

The key detail I want to highlight here is that when the end-user does own all servers they're using, these two models behave in the same way. The end-user will still have their own span collector that they would individually point all of their servers to, enabling the same degree of observability at the cost of slightly more configuration (an extra environment variable on every server, perhaps).
-
I think this is a bit of a handwave for two reasons:
I can easily see this turning into a situation where many MCP servers wind up running two parallel OTel pipelines, one for sending spans to clients and one for internal collectors, to reconcile those differences. Not only would that be error-prone for server implementors, it (again) raises the question of what value this proposal adds over #246.
-
traceparent and baggage are the way to go here
-
Pre-submission Checklist
Your Idea
Abstract
This proposal outlines a mechanism to integrate OpenTelemetry (OTel) tracing capabilities into the Model Context Protocol (MCP). The goal is to address the "black box" nature of MCP server interactions within agentic workflows, enabling end-to-end observability for debugging, performance analysis, and system understanding. We propose adding a standardized way for MCP servers to emit OTel trace spans back to the calling MCP client, leveraging the existing protocol structure for notifications while adhering to open standards and maintaining semantic clarity between different observability signals.
Motivation
As MCP gains traction as the potential "HTTP for agents," the complexity of applications built upon it will increase. Agent developers rely heavily on observability tools (like Weights & Biases Weave, LangSmith, Braintrust, Arize Phoenix etc.) to understand and debug the flow of execution, especially within complex chains involving multiple tool calls and LLM interactions.
Currently, when an agent (MCP client) invokes a tool or resource via an MCP server, the internal operations of that server are opaque. The client only sees the request and the final response. This lack of visibility presents significant challenges.
This proposal aims to solve these issues by establishing a standard, OpenTelemetry-based mechanism for MCP servers to report their internal trace information back to the client, enabling full-stack observability for agentic applications. This aligns with the goal of making MCP a robust and production-ready protocol for the growing agent ecosystem.
Diagram
Proposal Details
We propose extending MCP to support the transmission of OpenTelemetry trace data from the server back to the client.
Standard: OpenTelemetry (OTel) will be the standard for representing observability data due to its vendor-neutrality, widespread adoption, and rich semantic conventions (including evolving conventions for GenAI). This proposal focuses initially on Traces. Support for Metrics could be considered in future proposals.
Transmission Mechanism: Trace data generated by the server during the execution of a client request (e.g., `tools/call`, `resources/read`) will be sent back to the client via MCP notifications. This aligns with MCP's existing mechanism for asynchronous server-to-client communication, such as the `notifications/message` notification used for logging. This approach is preferred over requiring servers to push data directly to an OTel collector.

New Notification Type (Rationale): OpenTelemetry defines three primary observability signals: Logs, Metrics, and Traces. MCP already supports structured logging via the `notifications/message` mechanism. While it is technically possible to overload this existing notification for trace data (see Alternative/Interim Mechanism below), we propose adding a new, dedicated notification type specifically for OTel trace data: `notifications/otel/trace`. Its `params` carry:
- `traceToken`: The token provided by the client in the original request's `_meta` field. This MUST be included if the server is sending traces correlated to a specific client request that included a `traceToken`.
- `resourceSpans`: An array of OTel ResourceSpans objects, serialized according to the OTel OTLP/JSON format. (Exact schema details based on OTLP/JSON.)

Justification for a Dedicated Type: A dedicated notification keeps trace data semantically distinct from structured logging and leaves room for parallel notification types should dedicated metrics (`notifications/otel/metrics`) or a richer OTel log format (`notifications/otel/logs`) become desirable later.

Example Notification Payload (Conceptual OTLP/JSON):
{ "jsonrpc": "2.0", "method": "notifications/otel/trace", "params": { "traceToken": "client-req-abc-789", // Echoed from the originating request "resourceSpans": [ { "resource": { /* OTel Resource Attributes */ }, "scopeSpans": [ { "scope": { /* OTel InstrumentationScope Attributes */ }, "spans": [ { "traceId": "a1b2c3d4...", // Hex encoded "spanId": "e5f6a7b8...", // Hex encoded "parentSpanId": "c9d0e1f2...", // Optional, Hex encoded "name": "internal_api_call", "kind": "SPAN_KIND_CLIENT", "startTimeUnixNano": "1678886400123456789", "endTimeUnixNano": "1678886400987654321", "attributes": [ /* OTel Attributes */ ], "status": { /* OTel Status */ } // ... other OTel Span fields } // ... more spans from this scope ] } // ... more scopeSpans ] } // ... more resourceSpans (though likely just one per notification) ] } }Alternative/Interim Mechanism (using Logging): As mentioned, MCP currently supports structured logging via
Alternative/Interim Mechanism (using Logging): As mentioned, MCP currently supports structured logging via `notifications/message`. It is technically possible to transmit OTel span data using this existing mechanism by encoding the span data within the `data` field, perhaps using a specific `logger` name (e.g., `otel_trace`). However, we recommend the dedicated `notifications/otel/trace` mechanism for long-term clarity, standardization, and alignment with OTel principles.

Trace Correlation & Stitching (Using `traceToken`):
To reliably associate server-side trace notifications with the specific client request that triggered them, especially in concurrent scenarios, we propose adapting the existing `progressToken` pattern.

A client wishing to receive correlated trace data for a specific request (e.g., `tools/call`, `resources/read`) MUST include a `traceToken` field within the `_meta` object of that request's `params`.

Example Client Request with `traceToken`:
{ "jsonrpc": "2.0", "id": 123, "method": "tools/call", "params": { "_meta": { "traceToken": "client-req-abc-789" // Client-generated unique token // "progressToken": "client-prog-xyz-123" // Could also exist }, "name": "my_tool", "arguments": { /* ... */ } } }If a server supports the
otel.tracescapability and receives a request containing a traceToken, it SHOULD attempt to generate and send trace data related to that request's execution.Any
notifications/otel/tracemessages sent by the server that directly result from processing that specific request MUST include the identical traceToken value in their params.The client uses the received
traceTokenin the notification to unambiguously associate the contained OTel spans with the correct originating client request and its corresponding client-side span.If a client does not include a
traceTokenin its request, a server supportingotel.tracesMAY still emit trace notifications (e.g., for background server activity), but these notifications MUST NOT include a traceToken and cannot be directly correlated by the client using this mechanism.New Server Capability: Servers supporting this feature MUST declare a new capability during initialization:
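For illustration, the client side of this correlation could look roughly like the following sketch, which keeps a map from the `traceToken` values it generated to the originating requests; the types and helper names here are assumptions, not SDK APIs.

```ts
// Sketch: correlating incoming `notifications/otel/trace` notifications with
// the originating requests via traceToken. Types are simplified; the
// notification shape follows this proposal, not an existing SDK API.
interface OtelTraceNotificationParams {
  traceToken?: string;
  resourceSpans: unknown[]; // OTLP/JSON ResourceSpans
}

const pendingTraces = new Map<string, { toolName: string; startedAt: number }>();

// Before sending a request, remember the token placed in _meta.traceToken.
function registerRequest(traceToken: string, toolName: string): void {
  pendingTraces.set(traceToken, { toolName, startedAt: Date.now() });
}

// When a trace notification arrives, look up the originating request and
// hand the spans to whatever observability backend the client uses.
function onOtelTraceNotification(params: OtelTraceNotificationParams): void {
  if (!params.traceToken) {
    // Uncorrelated server activity: still exportable, just not stitched
    // under a specific client request.
    exportSpans(params.resourceSpans, undefined);
    return;
  }
  exportSpans(params.resourceSpans, pendingTraces.get(params.traceToken));
}

// Placeholder for the client's own export/stitching logic.
function exportSpans(resourceSpans: unknown[], origin?: { toolName: string }): void {
  console.log(`received ${resourceSpans.length} resourceSpans`, origin?.toolName ?? '(uncorrelated)');
}
```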
{ "capabilities": { // ... other capabilities "otel": { "traces": true // Indicates support for sending trace notifications // "metrics": false, // Future placeholder // "logs": false // Future placeholder } } }Clients can check for
capabilities.otel.traces === trueto know if a server might send these notifications.Schema Changes
Add the following to the `definitions` section of the MCP JSON schema:

(Note: Fully defining the OTLP/JSON structure within the MCP schema might be overly verbose. Alternatively, the schema could simply state that `params` is an object conforming to the OTLP/JSON ResourceSpans specification and link to it.)

Use Case / User Story
As Sarah, an Agent Developer, I'm debugging my customer support agent. It uses an MCP-based Notion tool to fetch KB articles. Users report intermittent slowness.
Without MCP Observability: My Weave trace shows a long duration for the `tools/call` to the Notion tool, but I don't know why it's slow. Is it network latency to the tool? Slow database queries within the tool? An inefficient internal function?

With MCP Observability (this proposal):
- My client sees that the Notion server declares the `otel.traces` capability.
- While handling the `tools/call`, the Notion server generates internal OTel spans (e.g., `notion_api_request`, `process_results`).
- The server sends those spans back to my client via `notifications/otel/trace`.
- My Weave trace for the `tools/call` now expands to show the nested server-side spans received from the Notion tool.
- I can see that the `notion_api_request` span within the server took 90% of the time.

Security Considerations
Backwards Compatibility
This proposal is additive:
- Existing clients and servers that do not support the `otel.traces` capability will ignore the related messages and capability flags.

Alternatives Considered
- Transmitting span data over the existing logging notification, `notifications/message`. A dedicated `notifications/otel/trace` provides better long-term clarity and structure; the logging route is considered viable only as a temporary workaround or for initial proofs-of-concept.

Open Questions / Future Work
- Exact payload encoding for `notifications/otel/trace` (OTLP/JSON recommended). Confirming full schema definition vs. referencing the external spec.
- Which request types should support `traceToken` correlation (e.g., `tools/call`, `resources/read`).
- Future signal support (`notifications/otel/metrics`, `notifications/otel/logs`).

References
Scope