[OTEP] Telemetry Policy#4738
Conversation
|
Thank you for opening this! Feel free to assign me here as I begin working on this. |
|
@jaronoff97 Honestly - any piece you're comfortable taking, feel free to expand. It still needs a lot of fleshing out, and I know you had a lot of great ideas on the high-level design, so let's start there and then create break-out sections for details. I was thinking the following divisions (with the caveat that I'm happy if more folks want to participate in this proposal, but we need to flesh out the idea much further before I think that would work):
|
|
|
||
| Note that mitigations do not need to be complete *solutions*, and that they do not need to be accomplished directly through your proposal. A suggested mitigation may even warrant its own OTEP! | ||
|
|
||
| ## Prior art and alternatives |
There was a problem hiding this comment.
I would love to see alternatives here. We've discussed things like dynamically-reloadable or merge rules for declarative config, and it would help reinforce why we need a new concept to solve the problems you are interested in.
There was a problem hiding this comment.
Agreed, I need to write down the treatment of why dynamically reloaded config doesn't solve the problems that motivate the proposal.
My answer to your other comment, hopefully, hints at that, but it'll be a longer write-up.
| - TODO - *implicitly* a policy has a target resource / signal it is aimed at. | ||
| This will be used to route policies to destinations. | ||
|
|
||
| Example policy types include: |
There was a problem hiding this comment.
Could we think of this as new "subtypes" of declarative config that can be used in a standalone way? E.g. if we think of the current declarative config as configuration as type "SDK", we could define sub-types like "sampler", "view", or "log-record-processor"?
If we can, I would love to keep the same yaml structure / definitions for these policies that we currently have in the declarative config so we avoid introducing another structured definition of what a "sampler" is. Or do you think because this is targeted at the collector as well that isn't feasible?
There was a problem hiding this comment.
I'd expect the declarative config for a policy-component to be used directly in declarative config:
So something like:
- my_policy_component:
    - default_policies
    - type: xyz
      ... the policy yaml ...
The primary difference between the policy for sampling and a "sampler" will actually be in flexibility. A sampler component could be written in any language, can allow any code, and its configuration must be open. A sampler policy MUST have a well-defined behavior and have the same configuration and behavior in all languages and implementations.
So primarily, a policy is highly limited in a way extension points are not.
| control the sampling of spans in SDKs. However, File-based configuration does | ||
| not require dynamic reloading of configuration. This means attempting to | ||
| provide a solution like Jaeger-remote-sampler with just OpAMP + file-based | ||
| config is impossible, today. |
There was a problem hiding this comment.
Dynamic control of SDKs is something that should be able to be built on top of, or as an evolution of, declarative config. I/we have been conscious of this eventuality while building declarative config and I don't think anything will get in the way. Also, I hope that, minimally, the declarative config data model can be used as a way for servers to communicate the desired configuration state of components in a dynamic config scenario.
There was a problem hiding this comment.
While I agree to a degree, the type of control and abstraction this proposal seeks to enable is NOT possible without agreement on semantics and use-cases across diverse implementations.
E.g. the declarative config + OpAMP could be used to send any config to any component. What it doesn't do, and what we need to sort out, is how to understand what config can be sent to what component, and how to drive control / policy independent of implementation or pipeline set-up, e.g.
Imagine a world where we can control the reporting of metrics across OpenTelemetry SDKs, custom implementations, and Prometheus SDKs because we agreed to the semantics of policy independent of configuration.
- In Declarative config I'd expect things that cannot be shared between different implementations:
- Queue/Buffer sizes that are specific to my pipeline setup
- Threading / GC configuration specific to my language
- In a Policy we should be limited to ONLY things that can be shared broadly, across
implementations and have well-defined semantics for how to enforce them.
So I see Declarative config as encompassing more than just policies; policies would be a subset of what you'd find. Additionally, policies can be independent things that you can bundle together. I should be able to "add" a policy at any point without needing to understand how it interacts with other components. An example of this: if I have a configuration reporting metrics, that configuration would have a MetricReader -> MetricExporter, right? What if there are multiple? How would I know what to change generically if I just wanted to say "stop producing metric X"? Policies are ignorant of this. They just push a policy down, and the SDK would be expected to enforce it via a PolicyMetricReader that's configured to pay attention to a metric-filter policy.
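The PolicyMetricReader idea above could be sketched roughly like this. All class, field, and policy names here are hypothetical - neither `PolicyProvider` nor `PolicyMetricReader` exists in any SDK today, and the policy shape is invented for illustration:

```python
# Hypothetical sketch: an SDK component enforces a metric-filter policy
# without the control plane knowing anything about the pipeline shape.

class PolicyProvider:
    """Holds the currently active policies, pushed from e.g. OpAMP."""
    def __init__(self):
        self._policies = []

    def push(self, policy):
        self._policies.append(policy)

    def get(self, policy_type):
        return [p for p in self._policies if p["type"] == policy_type]

class PolicyMetricReader:
    """Wraps metric collection and drops metrics named in any active
    'metric-filter' policy (invented policy type/shape)."""
    def __init__(self, provider):
        self._provider = provider

    def collect(self, metrics):
        blocked = set()
        for policy in self._provider.get("metric-filter"):
            blocked.update(policy["definition"]["deny_metric_names"])
        return [m for m in metrics if m["name"] not in blocked]

provider = PolicyProvider()
reader = PolicyMetricReader(provider)
metrics = [{"name": "http.server.duration"}, {"name": "noisy.metric"}]

assert reader.collect(metrics) == metrics  # no policy yet: everything passes

# A control plane "pushes a policy down"; the reader enforces it.
provider.push({"type": "metric-filter",
               "definition": {"deny_metric_names": ["noisy.metric"]}})
assert [m["name"] for m in reader.collect(metrics)] == ["http.server.duration"]
```

The point of the sketch is the division of labor: the control plane only speaks "policy", and the insertion point (the reader) is fixed by local configuration.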
Apologies not all of this is fleshed out, as it's a working draft, and one we're working on in the repo. Please continue to ask questions and I'll use that to flesh out the motivation more.
There was a problem hiding this comment.
This is an exact case I have, turning off metrics. And turning them back on. I implement this by having a flag in a custom exporter which stops/restarts exports. A generic solution to turning it off would be to change the exporter config to none, then I guess you could re-enable by setting again to otlp, but that implies a much more complex action in the SDK rather than switching a boolean on/off
There was a problem hiding this comment.
I added this to the alternatives considered discussion
There was a problem hiding this comment.
Imagine a world where we can control the reporting of metrics across OpenTelemetry SDKs, custom implementations, and Prometheus SDKs
A bold vision. I think I was definitely misunderstanding the scope. I'll revise my position: If we want dynamic control solutions specifically for otel SDKs, the declarative config data model should play a role, because not using it means introducing yet another config interface (YACI 😛). With a broader scope targeting other tools besides otel SDKs, we would of course need something not loaded with otel SDK vocabulary / baggage.
Should this type of thing even live in otel or in some neutral territory? (Reminds me of the relationship between W3C Trace Context and OpenTelemetry.) Are there other ecosystems that have expressed interest, or that we've reached out to for collaborating?
There was a problem hiding this comment.
Great questions!
If we want dynamic control solutions specifically for otel SDKs, the declarative config data model should play a role
100%!
Should this type of thing even live in otel or in some neutral territory?
Great question. I personally think this belongs in OTEL and should "feel native" to otel, but allow any component in o11y space to interact with it. This can increase the reach of "effective opentelemetry" as components which support writing OTLP can also participate with policies. However, to your question above, if this wasn't first-class in otel, how would we make sure our declarative config data model plays an important role?
Are there other ecosystems that have expressed interest in or that we've reached out to for collaborating?
The idea is the outcome of discussions with both Envoy (and their xDS control plane folks) and Google's Monarch team (see #4672). I would love to pull in more folks to collaborate for sure. First, I want to make sure we all understand the vision, scope and goals.
This PR was meant to be a place for those of us who started discussing to flesh out the proposal in place (as a draft), collecting that interest and refining the message. Apologies it was rough when you first reviewed it.
| - `log-filter`: define how logs are sampled/filtered | ||
| - `attribute-redaction`: define attributes which need redaction/removal. | ||
| - `metric-aggregation`: define how metrics should be aggregated (i.e. views). | ||
| - `exemplar-sampling`: define how exemplars are sampled |
There was a problem hiding this comment.
This reads like a subset of declarative configuration capabilities. Wouldn't it be easier to unify on one data model (i.e. declarative config) for expressing the desired configuration, and build tooling to detect / apply diffs when a change is pushed from a remote server?
I.e. an app starts with:
file_format: 1.0
tracer_provider:
  processors:
    - batch:
        exporter:
          otlp_http:
  sampler:
    parent_based:
      root:
        trace_based:
          ratio: 1.0
Later, a remote server pushes a new configuration state with an updated ratio for the trace id ratio sampler:
file_format: 1.0
tracer_provider:
  processors:
    - batch:
        exporter:
          otlp_http:
  sampler:
    parent_based:
      root:
        trace_based:
          ratio: 0.5 # reduce ratio from 1.0 to 0.5
Some controller is responsible for evaluating the diff between the current state and the desired state, and computing / executing update steps as allowed. In this case, substitute the sampler.
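The controller idea above might be sketched as a recursive diff over the two config trees. This is illustrative only: real update steps would be SDK operations rather than strings, and `diff_config` is an invented name:

```python
# Minimal sketch of a "diff controller": compare current vs. desired
# declarative config and emit an update step for every changed leaf.

def diff_config(current, desired, path=""):
    """Walk two nested dicts and yield a step for each changed leaf."""
    steps = []
    for key in sorted(set(current) | set(desired)):
        here = f"{path}/{key}" if path else key
        cur, des = current.get(key), desired.get(key)
        if isinstance(cur, dict) and isinstance(des, dict):
            steps += diff_config(cur, des, here)
        elif cur != des:
            steps.append(f"set {here}: {cur!r} -> {des!r}")
    return steps

current = {"tracer_provider": {"sampler": {"parent_based": {"root": {
    "trace_based": {"ratio": 1.0}}}}}}
desired = {"tracer_provider": {"sampler": {"parent_based": {"root": {
    "trace_based": {"ratio": 0.5}}}}}}

assert diff_config(current, desired) == [
    "set tracer_provider/sampler/parent_based/root/trace_based/ratio: 1.0 -> 0.5"
]
```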
There was a problem hiding this comment.
You can read some of my rationale at the bottom of the OTEP.
Effectively:
- I think policies will be used as-is in declarative config. There would be a component in declarative config that can be configured with a default set of policies.
- I think policies will be highly limited in expected behavior / security profile vs. declarative config.
- I think SDK configuration will opt-in to allow remote-policy control, with explicit permissions per-policy
- I do not think policies will alter pipeline setup or shape. Policies should have well defined insertion points already defined via Config where they will be enforced.
- Policies will need to have a mechanism via OpAMP to advertise they can be accepted and handled - we can use "custom capabilities" for this.
So there are a lot of similarities, but the key difference is the limitations.
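The "custom capabilities" advertisement mentioned in the bullets above could look roughly like this. The capability strings are invented for illustration; OpAMP only defines the generic custom-capabilities mechanism, not these names:

```python
# Sketch: a server checks what policy types an agent advertised
# (e.g. via OpAMP custom capabilities) before pushing a policy.
# Capability URIs below are hypothetical.

AGENT_CAPABILITIES = {
    "org.opentelemetry.policy/trace-sampling",
    "org.opentelemetry.policy/metric-filter",
}

def can_accept(policy_type):
    """True if the agent advertised support for this policy type."""
    return f"org.opentelemetry.policy/{policy_type}" in AGENT_CAPABILITIES

assert can_accept("metric-filter")
assert not can_accept("attribute-redaction")  # never advertised
```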
There was a problem hiding this comment.
Imagine you are an enterprise running a large fleet of SDKs. You want to, or can only, control this fleet with remote configuration. It would be great to not be limited to the mentioned policy types. I also want to be able to control which instrumentations are enabled; maybe I need to pause sending of signals during an incident, control the SDK's internal logging level, etc.
There was a problem hiding this comment.
@hegerchr I do expect we could provide more policies than the ones listed. My main point is why configuration, as it stands, doesn't solve the goals of this proposal.
Regarding managing a fleet of SDKs / Controllers - That's where this policy proposal originates: controlling Google's collection fleet and lessons we've learned in the process. @menderico has a talk at KubeCon this year that will cover more details if you're curious what we're looking for.
Co-authored-by: Jack Berg <[email protected]>
| meter_provider: | ||
| readers: | ||
| - my_custom_metric_filtering_reader: | ||
| my_filter_config: # defines what to filter |
There was a problem hiding this comment.
You want to filter metrics using a filtering reader (this component doesn't exist in the SDK spec and so would have to be custom) vs. views or meter config?
There was a problem hiding this comment.
I'm not sure; I can update this to use views instead as well. I was taking from the proposed OTEP, where you can control both the reporting of a metric and the report interval (i.e. a periodic metric reader would need configuration for how often to report each set of metrics).
| Here, I've created a custom component in java to allow filtering which metrics are read. | ||
| However, to insert / use this component I need to have all of the following: | ||
|
|
||
| - Know that this component exists in the java SDK |
There was a problem hiding this comment.
If this is a popular use case we should extend the SDK spec to add an additional built in component. We're too reluctant to do this right now.
There was a problem hiding this comment.
That still won't tell me if it's safe to send configuration to an SDK or not. I need to know, at runtime, that the version of the SDK I'm trying to control will support that config or if I'll crash a key component.
Additionally, it doesn't help me ignore the implementation detail. E.g. what if I also want to control a Prometheus client library? We don't own their config or their specification. However, we could build something that interacts with remote policies, similar to the Jaeger Remote Sampler of today for traces.
There was a problem hiding this comment.
Can the SDK be self-protecting? When the SDK receives a configuration that it does not fully support, it can reject the configuration and send an error or other information about the rejection back using OpAMP.
There was a problem hiding this comment.
Can the SDK be self-protecting? When the SDK receives a configuration that it does not fully support, it can reject the configuration and send an error or other information about the rejection back using OpAMP.
That still doesn't answer my second question - do I need to translate configuration to both OTel + Prometheus if I want to control a Prometheus client library and an OTel SDK?
What if all I care about is stopping the reporting of metric X, and I don't care what implementation is generating it? Having some minor flake/discrepancy in the config I push to an OpAMP server lead to rejection of config in HALF of my SDK clients would be problematic. Having to push multiple config files (and understand if I've broken the entire pipeline by accident, in ways the SDK cannot reject) is risky.
| - `PolicyProvider` | ||
| - Can "push" policies into the provider. | ||
| - Provides "observable" access to policies (e.g. notify on change) |
There was a problem hiding this comment.
I would foresee 2 SDK components:
- Policy Provider: constructs the identity of the agent and contains a collection of policy detectors. It exposes methods to access the collection of policies provided by the detectors and to notify a component of an updated policy/profile.
- Policy Agent: provides a detect method enabling components to report their policy. Designed to be embedded in components.
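A rough sketch of that two-component split - all names and signatures here are speculative, nothing like this exists in any SDK yet:

```python
# Speculative sketch of the Policy Provider / Policy Agent split.

class PolicyAgent:
    """Embedded in a component; reports what that component
    currently supports via a detect() method."""
    def __init__(self, scope, policy_types):
        self.scope = scope                # mirrors instrumentation scope
        self.policy_types = policy_types  # e.g. {"trace-sampling"}

    def detect(self):
        return {"scope": self.scope, "supports": self.policy_types}

class PolicyProvider:
    """Owns the agent identity plus the collection of embedded agents,
    and routes policy updates to the correct audience."""
    def __init__(self):
        self._agents = []

    def register(self, agent):
        self._agents.append(agent)

    def audience_for(self, policy_type):
        return [a for a in self._agents
                if policy_type in a.detect()["supports"]]

provider = PolicyProvider()
sampler_agent = PolicyAgent("io.example.sampler", {"trace-sampling"})
reader_agent = PolicyAgent("io.example.reader", {"metric-filter"})
provider.register(sampler_agent)
provider.register(reader_agent)

# Only the sampler is in the audience for a trace-sampling policy.
assert provider.audience_for("trace-sampling") == [sampler_agent]
```

This also hints at the `applyPolicy` simplification suggested later in the thread: the provider can use the detect() data (scope & policy type) to invoke only the correct audience.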
| Every policy is defined with the following: | ||
|
|
||
| - A `type` denoting the use case for the policy | ||
| - A json schema denoting what a valid definition of the policy entails. |
There was a problem hiding this comment.
Why don't we define a policy as follows:
- Policy Definitions: an array of policy definitions
- Instrumentation scope: identifies the component who this policy is for and mirrors otlp definition.
The policy definition would contain the type & schema property as above.
Having the scope allows for restriction of who the policy applies to. Information about the agent/resource is left out as OpAMP natively provides this info.
| - Extension Points | ||
| - `PolicySampler`: Pulls relevant `trace-sampling` policies from | ||
| PolicyProvider, and uses them. | ||
| - `PolicyLogProcessor`: Pulls Relevant `log-filter` policies from | ||
| PolicyProvider and uses them. | ||
| - `PolicyPeriodicMetricReader`: Pulls Relevant `metric-rate` policies | ||
| from PolicyProvider and uses them to export metrics. | ||
| - TODO: SDK-wide attribute processors | ||
| - TODO: SDK-view policies |
There was a problem hiding this comment.
Could this be simplified to instead use an applyPolicy method on the PolicyAgent? The PolicyProvider can use the data (scope & policy type) from the detect method to only invoke apply on the correct audience.
| - OpAmp Interaction | ||
| - Policy = custom extension | ||
| - Can we safely "roll back" a policy if it caused a breakage? |
There was a problem hiding this comment.
The OpAMP agent/client should be able to report supported policies back to the OpAMP server.
The OpAMP server should be able to inform the client when a policy is updated, including the scope it applies to.
Signed-off-by: jaronoff97 <[email protected]>
smith
left a comment
There was a problem hiding this comment.
Just typo and OpAMP capitalization suggestions. Makes sense reading through it. Sounds like a useful concept.
| the configuration layout of an OpenTelemetry collector. If a user asked the | ||
| server to "filter out all attributes starting with `x.`", the server would | ||
| need to understand/parse the OpenTelemetry collector configuration. If the | ||
| controlling sever was also managing an OpenTelemetry SDK, then it would need |
There was a problem hiding this comment.
| controlling sever was also managing an OpenTelemetry SDK, then it would need | |
| controlling server was also managing an OpenTelemetry SDK, then it would need |
| "guaranteed" to be usable due to specification language. For example, today | ||
| one can use the Jaeger Remote Sampler specified for OpenTelemetry SDKs and the | ||
| jaeger remote sampler extension in the OpenTelemetry collector to dynamically | ||
| control the sampling of spans in SDKs. However, File-based configuration does |
There was a problem hiding this comment.
| control the sampling of spans in SDKs. However, File-based configuration does | |
| control the sampling of spans in SDKs. However, file-based configuration does |
| control the sampling of spans in SDKs. However, File-based configuration does | ||
| not require dynamic reloading of configuration. This means attempting to | ||
| provide a solution like Jaeger-remote-sampler with just OpAMP + file-based | ||
| config is impossible, today. |
There was a problem hiding this comment.
| config is impossible, today. | |
| config is impossible today. |
| config is impossible, today. | ||
|
|
||
| However, we believe there is a way to achieve our goals without changing | ||
| the direction of OpAmp or File-based configuration. Instead we can break apart |
There was a problem hiding this comment.
| the direction of OpAmp or File-based configuration. Instead we can break apart | |
| the direction of OpAMP or file-based configuration. Instead we can break apart |
|
|
||
| * A `FileProvider` will either use a default merger from the format (like the default proto merge), or accept a parameter that specifies which merger is expected when reading the specific file format (for example, for JSON). | ||
| * A HTTP provider can use different file formats to decide which merger to use, as specified in the RFCs for JSON patch formats. | ||
| * OpAmp providers could add a field specifying the merger as well as the data being transmitted, plus a mechanism for systems to inform each other which mergers are avaliable and how the data is expected to be merged. |
There was a problem hiding this comment.
| * OpAmp providers could add a field specifying the merger as well as the data being transmitted, plus a mechanism for systems to inform each other which mergers are avaliable and how the data is expected to be merged. | |
| * OpAMP providers could add a field specifying the merger as well as the data being transmitted, plus a mechanism for systems to inform each other which mergers are avaliable and how the data is expected to be merged. |
|
Looking at the example policy types mentioned, they all seem to be about the transformation of telemetry. Conceptually, I'd love to see OTTL serving as the universal language for describing telemetry transformations. Caveats:
|
| - Should this specification give recommendations for the server protobufs | ||
| - How should we handle policy merging? | ||
| - (jacob) Could policies contain a priority and it's up to the providers to design around this? | ||
|
|
There was a problem hiding this comment.
@andrzej-stencel I want to continue the OTTL discussion in a thread here.
Looking at the example policy types mentioned, they all seem to be about the transformation of telemetry. Conceptually, I'd love to see OTTL serving as the universal language for describing telemetry transformations. Caveats:
Likely not all transformations are currently supported (i.e. implemented) in OTTL; some might never be (trace sampling?)
Perhaps there are other policy types that are not telemetry transformations, just not mentioned in the examples?
IMO, we shouldn't tie ourselves to OTTL just yet for a few big reasons:
- No formal grammar (not very portable, limits implementations)
- Not versioned, no guarantees about breaking changes
Currently, none of the example policy types have to do with transformation (save the redaction policy); most are focused on filtration or minor configuration. I have a POC for the policy proto, and one of the biggest things for this project to be successful is that policies can be run efficiently based on matching conditions (i.e. run policy X when condition Y is met). For that matching logic, I think it makes sense to build on semantic conventions, which are a well-defined standard at this point. For now, given we're proposing the overarching Policy model, I don't want to get too caught up in a single policy type and would rather focus on the concept as a whole. You could imagine in the future, as we design specific policy types, one would be of type `ottl`, allowing an implementation (such as a collector processor) to run OTTL based on its loaded policies. I think that should be okay, but let me know if any part of that doesn't track. Thank you 🙇
There was a problem hiding this comment.
You raise very good points about OTTL's lack of grammar and stability guarantees. Also limiting policy types to OTTL is probably too restrictive - this was my suspicion too. Thank you for bringing these arguments.
| - **Standalone**: I don't need to understand how a pipeline is configured to define | ||
| policy. | ||
| - **Dynamic**: We expect policies to be defined and driven outside the lifecycle | ||
| of a single collector or SDK. This means the SDK behavior needs the ability |
There was a problem hiding this comment.
I would really love to see things getting more dynamic in the SDKs, to open up more possibilities for remote configuration support.
| telemetry-plane safely. E.g. if both an SDK and collector obtain an | ||
| attribute-filter policy, it would only occur once. | ||
|
|
||
| Every policy is defined with the following: |
There was a problem hiding this comment.
Does a policy have a target in the sense of a service name (thinking in the context of SDKs) that denotes the policy applies only to that service?
There was a problem hiding this comment.
I'm still working through the actual message for the policy, and making it service-specific was one design. The one I'm going with has a list of matchers, where a matcher can function on the resource, scope, attributes, or telemetry-specific data. This should be flexible enough to allow a user to match on just a service.name, a log body, or a trace scope.
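A toy sketch of that matcher design - the matcher and policy shapes below are invented, since the OTEP hasn't fixed a message format:

```python
# Illustrative sketch: a policy carries matchers over resource, scope,
# or attributes, and applies only when all of them match.

def matches(matcher, telemetry):
    """Check one matcher against one section of a telemetry item."""
    section = telemetry.get(matcher["on"], {})
    return section.get(matcher["key"]) == matcher["value"]

def policy_applies(policy, telemetry):
    return all(matches(m, telemetry) for m in policy["matchers"])

policy = {
    "type": "log-filter",  # invented type from the OTEP's examples
    "matchers": [{"on": "resource", "key": "service.name",
                  "value": "checkout"}],
}
log_record = {"resource": {"service.name": "checkout"},
              "attributes": {"level": "debug"}}

assert policy_applies(policy, log_record)
assert not policy_applies(policy, {"resource": {"service.name": "cart"}})
```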
| be commutative and applied in any order. For example, we could have a policy | ||
| where a given provider value always overrides another provider if one of the | ||
| fields is the provider name and the post-merge algorithm takes this information | ||
| into account. |
There was a problem hiding this comment.
I think we should add a recommendation that, if possible, overrides should be supported, in the order OpAMP > HTTP > file, i.e. a change specified by an OpAMP provider would override other provider changes - including if the other providers have subsequent updates. If we don't specify this detail, we're going to have free-for-all inconsistent implementations of overrides.
This follows the natural pattern (used, I think, by most existing implementations for non-otel agent management) that central config overrides all other changes.
There was a problem hiding this comment.
yup this makes sense to me, I've been working through the same thought/issue, and would be good to mandate this.
There was a problem hiding this comment.
Here's a suggestion
Overrides SHOULD be supported. A policy should retain the provider source as one of OPAMP (1), HTTP (2), FILE (3), or CUSTOM (?) (priority in brackets; custom priority is specified by the implementer; there can be multiple custom providers; 1 is higher priority than 2, etc.). To be implemented, any new policy field has to have a priority at least as high as the already-implemented policy field, otherwise it is dropped. Where fields cannot be merged consistently while still applying the priority override, the lower-priority policy is dropped in its entirety.
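That priority rule could be sketched as follows (illustrative only; the numeric priorities and field names follow the suggestion above, nothing here is specified):

```python
# Sketch of the suggested provider-priority rule: OPAMP(1) > HTTP(2) >
# FILE(3). A field update only takes effect if its source priority is
# at least as high as (numerically <=) the priority that last set it.

PRIORITY = {"OPAMP": 1, "HTTP": 2, "FILE": 3}

def apply_update(state, field, value, source):
    """state maps field -> (value, priority of the provider that set it)."""
    new_prio = PRIORITY[source]
    current = state.get(field)
    if current is None or new_prio <= current[1]:
        state[field] = (value, new_prio)
    # otherwise the lower-priority update is dropped

state = {}
apply_update(state, "sampling.ratio", 1.0, "FILE")
apply_update(state, "sampling.ratio", 0.5, "OPAMP")  # central config wins
apply_update(state, "sampling.ratio", 0.9, "FILE")   # dropped: outranked
assert state["sampling.ratio"] == (0.5, 1)
```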
| ```proto | ||
| message Policy { | ||
| // Unique identifier for this policy | ||
| string id = 1; |
There was a problem hiding this comment.
Can you give an example, I don't get id vs name
There was a problem hiding this comment.
id is a global alphanumeric stable identifier, name is more like metadata for the user.
id: abd123noifoiqwnfl
name: Filter healthcheck spans
There was a problem hiding this comment.
So name could be used as ID. I don't really see the need for an extra field
| agent-config payload, reusing OpAMP's existing change-detection mechanisms. | ||
|
|
||
| #### Runtime conflict resolution | ||
|
|
There was a problem hiding this comment.
This section assumes perfect transport order, or at least doesn't make it clear to me what happens with out of order updates.
For example, policy "export-traces: on" is sent, but gets delayed. "export-traces: off" is sent and received, and the agent turns off export of traces. Then the delayed "export-traces: on" gets pushed through. I have 2 problems now:
- The order of application is clearly wrong, the intention is to have no exporting but the last policy turns it on
- Even if I wanted to combine these, it's impossible to come up with a commutative reduction for these two operations
I would favour a naive last timestamp approach (since we've got the prioritization, it's reasonable to assume that a given policy source has a specific consistent timestamper)
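The naive last-timestamp approach amounts to last-write-wins keyed on the source's timestamp. A sketch (names invented), assuming each policy source has a single consistent timestamper as noted above:

```python
# Sketch: ignore a policy update whose timestamp is older than the
# one already applied, so delayed messages cannot regress state.

def receive(state, policy_id, value, timestamp):
    current = state.get(policy_id)
    if current is None or timestamp > current[1]:
        state[policy_id] = (value, timestamp)
    # else: stale update that arrived late - ignored

state = {}
receive(state, "export-traces", "off", timestamp=2)  # arrives first
receive(state, "export-traces", "on", timestamp=1)   # delayed + older
assert state["export-traces"] == ("off", 2)          # intent preserved
```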
There was a problem hiding this comment.
So, at a high level, we did not want to constrain it to timestamps only, since there are operations that can be resolved without them, but it is fair to add an example with it as well. It is up to the specific policy to decide how to resolve conflicts.
I'll add one ASAP.
| When using OpAMP with an OpenTelemetry Collector, the controlling server needs | ||
| to understand the configuration layout of that specific collector. If a user | ||
| asks the server to "filter out all attributes starting with `x.`", the server | ||
| must understand and parse the collector configuration. If the same server also |
There was a problem hiding this comment.
Can you please clarify why should the server parse configuration? I think the server is supposed to generate configuration and send to the Collector, not parse it.
There was a problem hiding this comment.
Let's imagine we try to create a UI to control collectors and configure them. Don't I need to understand the commands I'm sending to collectors, or do you envision this always being "blob of bytes"?
Let's then imagine I try to create some kind of abstraction around controlling collectors, where I want to support more than just specifically an exact version of one otel collector - how does that work?
There was a problem hiding this comment.
I think both approaches are possible. I can envision a system where the UI gives you an edit box where the user is free to input a config, and then that config is sent via OpAMP to the agent without the server attempting to interpret it in any way. What you describe is also possible, i.e. a UI that is more knowledgeable about the specifics of the config and gives you richer controls to compose the config.
I now understand what you are saying but reading about "parsing" was confusing. Specifically, the rich UI functionality you describe doesn't necessarily need to parse a config. It does need to know the schema of the config and be able to generate valid config blobs that match that schema. Parsing may be part of the picture for example to visualize the effective config but that seems to be a topic that is outside of this OTEP's scope.
| does not require dynamic reloading. A solution combining OpAMP with file-based | ||
| configuration cannot provide the same dynamic behavior. |
There was a problem hiding this comment.
A config pushed through OpAMP can be applied dynamically if the recipient (the Collector or SDK) chooses to apply it dynamically. What inherently in OpAMP or file-based config prevents the dynamic behavior?
There was a problem hiding this comment.
I can implement the OpAMP spec + configuration spec without the requirement to support dynamic adaptation, right?
That's what this is saying. There's nothing required in OpAMP to allow this to work; we need a component that requires dynamic behavior.
There are nuances to this point as well, but the basic gist is: we aren't requiring dynamic loading, so we'd need to change our stance or create something that DOES require it.
There was a problem hiding this comment.
OK, that makes sense. I think the part that says "cannot" confused me.
There was a problem hiding this comment.
I updated the language based on this conversation.
| change post-instantiation. | ||
| - **Idempotent**: I can give a policy to multiple components in a | ||
| telemetry-plane safely. E.g. if both an SDK and collector obtain an | ||
| attribute-filter policy, it would only occur once. |
There was a problem hiding this comment.
Can you please clarify idempotency on a different example. For example if the policy is "multiply all numeric gauge values by 100" (a policy that converts from fractions to percentages), can I expect it to do the multiplication only once if the policy is applied both in the SDK and Collector?
There was a problem hiding this comment.
If you wanted to enforce such a policy, YES - you'd need a way to track it was performed in the SDK so the Collector would not also do the work.
Policies are not generic transformations though, so I'm not sure we'd have a "multiply all gauge values by 100", but we MAY have a "convert milliseconds to seconds" where you can determine, via the contents of OTLP, if the policy was already enforced.
There was a problem hiding this comment.
What I want to primarily understand is whether we are implying a communication mechanism that helps ensure idempotency (e.g. signal that the policy was already applied and should not be applied again) or we are limiting policies to only be designed in such a way that without such communication mechanism idempotency is still ensured (just like it is in the example you brought where you only apply the transform if unit=="milliseconds").
The OTEP does not answer this question but I think it is an important one since it defines the design space.
There was a problem hiding this comment.
This is an interesting scenario.
A simple rule that says "multiply all gauges by 100" would indeed not be idempotent. But let's say hypothetically we are handling metrics and there is a "unit" attribute on the metric, so we would have something like:
Metric ("my attribute" = "my value", "unit" = "fraction") = 0.05 as double
In this example, I could theoretically have a policy that says move any metric that has unit=fraction to unit=percentage and multiply it by 100, leading to the following result.
Metric ("my attribute" = "my value", "unit" = "fraction") = 0.05 as double -> Metric ("my attribute" = "my value", "unit" = "percentage") = 5 as double
The policy transformation can be applied multiple times; any application after the first will not match, and therefore the end result is always the same, i.e., the fraction is transformed into a percentage only once and always has a value that is 100 times larger than the original fraction.
I admit this is a very weird example and probably not what policies are originally intended, but just wanted to illustrate that there are mechanisms to make an operation that is not idempotent into one that is by shaping the data and the operation to meet the requirements.
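A minimal sketch of this idempotent transform, assuming a metric is represented as a plain dictionary (a hypothetical representation for illustration, not anything defined by the OTEP):

```python
def normalize_unit(metric):
    """Convert fraction-valued metrics to percentages.

    Idempotent by construction: the transform only fires while
    unit == "fraction", so re-applying it is a no-op.
    (The dict representation here is hypothetical.)
    """
    if metric.get("unit") == "fraction":
        return {**metric, "unit": "percentage", "value": metric["value"] * 100}
    return metric

m = {"my attribute": "my value", "unit": "fraction", "value": 0.05}
once = normalize_unit(m)
twice = normalize_unit(once)
assert once == twice  # applying the policy again changes nothing
```

Because the guard is part of the data being transformed, no side channel is needed to record that the policy already ran.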
| - A `type` denoting the use case for the policy
| - A schema denoting what a valid definition of the policy entails, describing
|   how servers should present the policy to customers.
| - A specification denoting behavior the policy enforces, i.e., for a given JSON
What does "JSON entry" refer to here?
I made this clearer, let me know what you think.
| will. Now - if we want to drop a particular metric from being reported, which
| pipeline do we modify? Should we construct a new processor for that purpose?
| Should we always do so?
I am not sure I understand how the policies solve this particular problem. Can you add more details here? How are policies applied by the Collector? Is there an expectation that there is a policy "processor" that is always in all pipelines and knows how to apply any policy it receives?
Yes - we can update this.
The idea is that you'd have a policy-processor in all pipelines, and then you no longer care about pipeline setup from a remote-control perspective. You can define your intent in the policy and it will impact all pipelines.
OK, I think it makes sense to me. The follow-up question is: what is the process through which policies are designed, published, and implemented in processors, and what happens if a new policy type is published after the processor is released? Is there a versioning concept, a check of supported policies in the processor, etc.?
I assume we don't expect policies to be expressed in some general-purpose "language" that the processor implements once, so that all policies released in the future just work without needing to update the processor implementation (I can indeed imagine an approach like that, but this OTEP doesn't seem to suggest it).
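To make the policy-processor idea concrete, a hypothetical Collector configuration could place a single policy-aware processor in every pipeline, so remotely delivered policies take effect regardless of pipeline layout. The `policy` processor name and its `providers` setting below are illustrative assumptions, not components defined by this OTEP or shipped in the Collector today:

```yaml
processors:
  # Hypothetical processor that evaluates remotely delivered policies.
  policy:
    providers:
      - opamp        # e.g. policies pushed over an OpAMP connection

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [policy, batch]   # policies apply before batching
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [policy, batch]
      exporters: [otlp]
```

With a layout like this, dropping a particular metric no longer requires deciding which pipeline to modify: the policy is delivered once, and the processor applies it wherever relevant data flows.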
| 1. [Tero edge](https://github.com/usetero/edge)
|    1. a zig implementation of a proxy that applies policies.
|    2. later we will show our policy representation as a sample of this OTEP.
If you have a prototype in a more well-known language, that would be great to attach here.
I am especially interested in seeing what an implementation looks like in the Collector.
@tigrannajaryan I added the other implementations I have completed.
| Example policy types include:
|
| - `trace-sampling`: define how traces are sampled
So just to be clear, the idempotency requirement means that a random sampler which unconditionally samples a random population is not possible. Is this correct?
I think we could accomplish this - it depends on how we construct the policy definition to begin with.
Again - each policy has SOME control over how "merging" works - so we could do this. Would be worth investigating and putting together a prototype.
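One way to reconcile sampling with the idempotency requirement is to make the decision deterministic, e.g. by hashing the trace id with a seed (in the spirit of the `hash_seed` field shown in the demonstration proto). The sketch below uses a hypothetical function name and a simplified hashing scheme rather than the spec's consistent-probability sampling:

```python
import hashlib

def keep_trace(trace_id: str, percentage: float, hash_seed: int = 0) -> bool:
    """Deterministic, idempotent sampling decision.

    The verdict is a pure function of (trace_id, percentage, hash_seed),
    so every component that evaluates the policy - SDK or Collector -
    reaches the same decision, no matter how many times it is applied.
    Illustrative only; not the sampler defined by the OTel specification.
    """
    digest = hashlib.sha256(f"{hash_seed}:{trace_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < percentage / 100.0

# Re-evaluating the policy never flips the decision:
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
assert keep_trace(trace_id, 25.0) == keep_trace(trace_id, 25.0)
```

A sampler that unconditionally draws fresh randomness would not have this property, which is why the policy definition matters.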
| Policies are designed to be straightforward objects with little to no logic tied
| to them. Policies are also designed to be agnostic to the transport,
| implementation, and data type. It is the goal of the ecosystem to support
What does agnostic to data type mean? Does it mean a sampler policy will work for metrics, traces and logs in some agnostic way?
A few things:

- This means we don't want "policy" tied to a specific `{Signal}Provider` in an SDK. I may have different policy types per signal, but I don't have a completely different mental model of how they work or are distributed per signal, and I could use the same policy between signal types (e.g., as you state, if we wanted a sampling policy to apply uniformly across signal types).
- This means we can add a new policy definition without breaking any existing policy definitions in the ecosystem.
| Policies are designed to be straightforward objects with little to no logic tied
| to them. Policies are also designed to be agnostic to the transport,
| implementation, and data type. It is the goal of the ecosystem to support
| policies in various ways. Policies MUST be additive and MUST NOT break existing
What does "additive" mean in this context? "Additive" to other policies or to something else?
Additive in the sense that I could successfully use OpenTelemetry without policies. If I want to use policies as my remote-control mechanism for my fleet, I "add" to the appropriate hooks to enable them.
I.e. I can be 100% successful with pure file-based configuration, or OpAMP, or both. If I want to use policies, I can choose to, and then leverage the benefits therein.
atoulme
left a comment
Discussed with @jaronoff97 and this proposal looks very promising to me.
Cross-linking from this OTEP an RFC on the Collector which touches on similar use cases: open-telemetry/opentelemetry-collector#14469
Some don't apply cleanly, such as the horizontal scaling of scrapers. |
@tigrannajaryan updated the OTEP with lots of changes, please take a look when you get a chance, thank you. |
| ```proto
| message Policy {
|   // Unique identifier for this policy
Quick questions on this unique identifier:
If migrating a policy from one environment into another environment, what is the strategy or recommendation around this unique identifier?
And if comparing policies of two different environments, e.g. QA vs Production, is this unique ID considered and/or important?
the id should be a globally unique id. if you were to deploy the same policy to qa and production it would be the same id, however if they have different bodies they would have two different ids. I may not be understanding the question though.
[..] if you were to deploy the same policy to qa and production it would be the same id, however if they have different bodies they would have two different ids.
Thanks - that is what I was looking for.
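One possible scheme that satisfies "same body, same id; different bodies, different ids" is to derive the id from a canonical hash of the policy body. This is an illustrative assumption, not something the OTEP mandates:

```python
import hashlib
import json

def policy_id(policy_body: dict) -> str:
    """Derive a globally unique, content-based id for a policy.

    Identical bodies yield identical ids across environments (QA vs
    production); any change to the body changes the id. One possible
    scheme for illustration, not mandated by the OTEP.
    """
    # Canonical serialization so key order does not affect the id.
    canonical = json.dumps(policy_body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

qa = {"type": "trace-sampling", "percentage": 25}
prod = {"type": "trace-sampling", "percentage": 25}
assert policy_id(qa) == policy_id(prod)          # same body, same id
assert policy_id(qa) != policy_id({"type": "trace-sampling", "percentage": 10})
```

Content-derived ids also make comparing two environments trivial: matching ids imply byte-identical policy bodies.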
|   int32 hash_seed = 4; // hash seed for deterministic sampling
|   bool fail_closed = 5; // reject items on sampling errors
| }
| ```
Yes - however, these protos are mostly for demonstration purposes and not finalized. They will be worked on and discussed separately after this OTEP is merged.
Co-authored-by: Florian Lehner <[email protected]>
| - Specify configuration relating to the underlying policy applier
|   implementation.
| - A policy cannot know where the policy is going to be run.
Could you elaborate a bit more on "where the policy is going to be run"?
For example, I could imagine a policy to say: "the user email address must be removed before the telemetry leaves the customer owned machine/cluster".
This is about ensuring that a policy doesn't have configuration targeting where it will run. A user will want that exact policy, but it's up to them to deploy the policy to achieve that. I.e. the policy would say "remove the user email address from all telemetry" and they would run that in their telemetry infrastructure to ensure the guarantee that it runs prior to leaving customer owned machines/clusters. The intent of this rule is to not mix the "control plane" and the "data plane". The control plane determines where the policies run, while the data plane's purpose is just to run the policy. Policies may have labels / resource attributes to help inform the control plane on where a policy should be delegated, but that's not a requirement.
You can see in the architectural section, we will have multiple policy providers. This is the link from the data plane to the control plane and is what would allow this type of delegation to work.
Let me know if you have more questions about this, happy to elaborate further.
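The label-based delegation described above could be as simple as a subset match between a policy's labels and a target's resource attributes. The function and attribute names below are hypothetical, sketched only to show the control-plane/data-plane split:

```python
def applies_to(policy_labels: dict, target_attributes: dict) -> bool:
    """Return True if a target's resource attributes satisfy a policy's labels.

    A control plane could use a check like this to decide where a policy
    is delegated; the policy itself stays agnostic to where it runs.
    Label names and matching semantics here are hypothetical.
    """
    return all(target_attributes.get(k) == v for k, v in policy_labels.items())

policy = {"labels": {"deployment.environment": "prod"}}
edge_gateway = {"deployment.environment": "prod", "service.name": "gateway"}
laptop_sdk = {"deployment.environment": "dev", "service.name": "checkout"}

assert applies_to(policy["labels"], edge_gateway)
assert not applies_to(policy["labels"], laptop_sdk)
```

The policy body itself ("remove the user email address from all telemetry") never changes; only the control plane's routing decision does.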
Co-authored-by: Reiley Yang <[email protected]>
A proposal to introduce a new concept in OpenTelemetry: "Telemetry Policy".
Policies are independent rules to enforce on generated telemetry. Each policy is atomic, self-contained, and understandable in isolation. The execution model supports tens of thousands of policies without degradation. A policy works the same way whether it runs in an SDK, a Collector, or any other component that implements the OpenTelemetry specification. Policies are intended to be defined centrally and distributed to running systems.
This is an evolution of many discussions during KubeCon NA 2025.