Proposed updates to #184 #1
jmacd wants to merge 3 commits into carlosalberto:add-span-dropped-metrics
leakage. Multi-level collector topologies should allow configuration
of distinct domains (e.g., `agent` and `gateway`).

### Basic level of detail
What is the value of having this level? It saves a single binary attribute, but there are plenty of other attributes that are required (domain, name, signal, etc), so the complexity of the spec doesn't seem to be warranted.
Saving a boolean attribute means having half as many (i.e., one fewer) timeseries. The information available in the attribute is almost redundant, so I think having a way to avoid the one additional timeseries matters.
When you have metrics on a pipeline, the information available by having a success attribute (i.e., one additional series) can be inferred by comparing the subsequent component's totals. This is admittedly a recursive definition -- for the subsequent component to establish its success/failure rate it will need its own subsequent component's totals, and the final stage in a pipeline will likely not want to use basic-level metrics for this reason. If Total(X) is the sum of the single metric for a component X, the recursive rule for deriving Success/Failure of that component is:
Dropped(this) = Total(this) - Total(next)
Failed(this) = Dropped(this) + Failed(next)
Success(this) = Total(this) - Failed(this)
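As a rough illustration of the recursive rule, here is a minimal Python sketch. It assumes Dropped counts items that entered a component but never reached the next one (so Dropped(this) = Total(this) - Total(next)), and that the final stage reports its own failure count directly, since it has no successor to compare against. The function name `derive_outcomes` and the `final_failed` parameter are hypothetical and not part of the proposal.

```python
from typing import List, Tuple


def derive_outcomes(totals: List[int], final_failed: int) -> List[Tuple[int, int, int]]:
    """Return (dropped, failed, success) for each stage, first stage first.

    totals[i] is the single basic-level total for stage i.
    final_failed is the failure count reported by the last stage itself
    (an assumption: the last stage uses more than basic-level detail).
    """
    if not totals:
        return []

    last = len(totals) - 1
    results: List[Tuple[int, int, int]] = [(0, 0, 0)] * len(totals)

    # Base case: the final stage has no successor, so it cannot infer drops.
    results[last] = (0, final_failed, totals[last] - final_failed)

    # Walk backwards so each stage can use its successor's failure count.
    for i in range(last - 1, -1, -1):
        dropped = totals[i] - totals[i + 1]   # entered here, never reached next
        failed = dropped + results[i + 1][1]  # Dropped(this) + Failed(next)
        success = totals[i] - failed          # Total(this) - Failed(this)
        results[i] = (dropped, failed, success)

    return results


# Hypothetical three-stage pipeline: SDK -> processor -> exporter.
outcomes = derive_outcomes([100, 90, 85], final_failed=5)
```

With these made-up numbers every stage ends up with the same success count, which is what the rule implies: an item only counts as a success for a component if it succeeds all the way down the pipeline.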
the exporter's `success=false` to determine the number of items
dropped by the processor, for example.

### Detailed metrics
Kinda similar comment to basic: why have a separate level? What objective is worth complicating the spec this way?
This is about letting users trade costs based on what they need/want to observe: more metrics may be useful, but they're just additional expense when they're not being used.
I mentioned a personal side-story that led me to this realization in today's Spec SIG: monitoring a water system is similar to monitoring a telemetry pipeline, and it's also a situation where each individual metric is a substantial expense. The minimum number of meters necessary to calculate total leakage in the system is 1 meter for (total) system production and 1 meter per user with a service connection. From total in and total out we can compute leakage, which is equivalent to the calculation for dropped items.
This leads to the conclusion that the minimum-cost configuration for a telemetry pipeline capable of computing a global Dropped statistic would use Basic-level detail in each SDK, disabled metrics for all intermediate collectors, and Normal-level detail for the final component of the final collector in the pipeline. If the user is in a situation where the metrics from the SDKs are not comparable with the metrics from subsequent stages in the pipeline for any reason, they should use Normal-level detail in the SDK.
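To make the "total in, total out" arithmetic concrete, here is a small Python sketch of that minimum-cost setup. It assumes each SDK reports one Basic-level total and that the final component reports a success count at Normal-level detail. The function name `global_dropped` and the sample numbers are hypothetical.

```python
from typing import Iterable


def global_dropped(sdk_totals: Iterable[int], final_success: int) -> int:
    """Items that entered the pipeline at any SDK but were not delivered.

    sdk_totals: one Basic-level total per SDK ("total in").
    final_success: items the final component reports as successful ("total out").
    """
    return sum(sdk_totals) - final_success


# Two hypothetical SDKs feeding one gateway whose exporter delivered 1400 items.
dropped = global_dropped([1000, 500], final_success=1400)
```

This is the telemetry analogue of the water-meter case: one meter per source plus one at the end is enough to compute leakage, as long as the counts at both ends are comparable.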
I'm also aware of tracing pipelines where there are rate limits enforced at the destination. This is a scenario where if the response code is resource_exhausted I should turn up sampling, if it's timeout I should complain to my backend team about an SLO violation, and if it's queue_full it means I should reconfigure the SDK.
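The value of the more detailed attribute can be sketched as a lookup from failure reason to operator response. This is only an illustration in Python: the three reason strings come from the comment above, while the `SUGGESTED_ACTIONS` table, the `suggest_action` helper, and the fallback text are hypothetical.

```python
# Failure reasons mentioned above, mapped to the response each one calls for.
SUGGESTED_ACTIONS = {
    "resource_exhausted": "turn up sampling",
    "timeout": "raise an SLO violation with the backend team",
    "queue_full": "reconfigure the SDK",
}


def suggest_action(reason: str) -> str:
    """Return the suggested response for a failure reason.

    Unknown reasons get a generic fallback (an assumption of this sketch).
    """
    return SUGGESTED_ACTIONS.get(reason, "inspect the failure manually")


action = suggest_action("queue_full")
```

Without the reason attribute, all three cases collapse into a single failure count and the operator cannot tell which response applies.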
But why shouldn't this be handled with just the pre-aggregation rules ("views"?) instead of making it a problem for the exporters / components to know about different levels?
(This content will appear in a new location, I'm writing an OTEP.)
My assumption is that this would be implemented using views, and the text of a semantic convention would be explaining which views to configure at which level of detail.
@carlosalberto
After reviewing the collector's equivalent metrics and auditing the code framework there, I came up with these recommendations in the form of a (large) change to yours.
This is also very speculative--I edited the generated YAML file to demonstrate the outcome I would like, and it will take a small improvement to the build tools to achieve the intended results.
Following from the collector's example, I am proposing three levels of detail, which would be called "basic", "normal", and "detailed". In the current tooling, we have "required" and "recommended" which I take as equivalent to "basic" and "normal". To this I would add "detailed", so the tools need a small change. Please review the generated file as if I had generated it, and the work needs to be put back into `otel.yaml`, IIUC.