Metrics: Requirements for safe attribute removal #1297

@jmacd

Description

What are you trying to achieve?

Make it safe to remove labels from OTLP metrics in a way that
preserves their semantics. We have identified two scenarios where it
is not safe to simply strip labels away from metrics data.

  1. When all identifying resource attributes have been removed from a
    CUMULATIVE metric, the resulting timeseries has overlapping time
    intervals (i.e., their StartTimeUnixNanos and TimeUnixNanos
    overlap).
  2. When SumObserver and UpDownSumObserver data points are stripped
    of labels that were used to subdivide an observed sum, the resulting
    points lose meaning.

An example for both is provided below.

CUMULATIVE points and the loss of unique labels

For the first item, OTLP does not specify how to interpret CUMULATIVE
data in this scenario. For reference, the current OTLP Metrics Sum data
point definition describes AggregationTemporality as follows:

  // DELTA is an AggregationTemporality for a metric aggregator which reports
  // changes since last report time. Successive metrics contain aggregation of
  // values from continuous and non-overlapping intervals.
  //
  // The values for a DELTA metric are based only on the time interval
  // associated with one measurement cycle. There is no dependency on
  // previous measurements like is the case for CUMULATIVE metrics.

and:

  // CUMULATIVE is an AggregationTemporality for a metric aggregator which
  // reports changes since a fixed start time. This means that current values
  // of a CUMULATIVE metric depend on all previous measurements since the
  // start time. Because of this, the sender is required to retain this state
  // in some form. If this state is lost or invalidated, the CUMULATIVE metric
  // values MUST be reset and a new fixed start time following the last
  // reported measurement time sent MUST be used.

Simply translating overlapping CUMULATIVE data into Prometheus results in incorrect interpretation
because timeseries are not meant to have overlapping points (e.g.,
open-telemetry/opentelemetry-collector#2216).
For this reason, we should specify that CUMULATIVE timeseries MUST
always retain at least one uniquely identifying resource attribute.

This point suggests that service.instance.id should be a required
resource attribute (see #1034).

We should also specify how consumers are expected to handle the
situation where, because of a mis-configuration, data is presented with
some degree of overlap. This condition sometimes results from unsafe
label removal, but it also arises from a so-called "zombie" process:
the case where multiple writers unintentionally produce overlapping
points.

This issue proposes that metrics systems SHOULD treat overlapping
points as duplicates, not as valid points. When overlapping points are
stored and queried, the system SHOULD ensure that only one data point
is taken. We might specify that metrics systems SHOULD apply a
heuristic that prefers the data point with the youngest
StartTimeUnixNanos under this condition, and that they SHOULD warn the
user about zombies or potentially unsafe label removal.
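The duplicate-resolution heuristic above can be sketched as follows. This is an illustrative sketch, not Collector code; the tuple-based point representation and the function name are assumptions made for the example.

```python
def deduplicate_overlapping(points):
    """Sketch of the proposed heuristic: given CUMULATIVE points as
    (series_key, start_time_unix_nanos, time_unix_nanos, value) tuples,
    keep only one point per (series_key, time), preferring the point
    with the youngest (i.e., largest) StartTimeUnixNanos."""
    best = {}
    for key, start, t, value in points:
        existing = best.get((key, t))
        if existing is None or start > existing[0]:
            # A younger start time wins over an older one.
            best[(key, t)] = (start, value)
    return [(key, start, t, value)
            for (key, t), (start, value) in best.items()]
```

For example, two writers reporting the same series at the same timestamp with start times 10 and 20 would be resolved in favor of the point whose start time is 20.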

(UpDown)SumObserver points and subdivided sums

As an example of this scenario, consider observations being made
from a SumObserver callback:

process.cpu.usage{state=idle}
process.cpu.usage{state=user}
process.cpu.usage{state=system}

Now consider that we want to subdivide the process CPU usage by CPU
core number, making many more series:

process.cpu.usage{state=idle,cpu=0}
process.cpu.usage{state=idle,cpu=1}
process.cpu.usage{state=idle,cpu=...}
process.cpu.usage{state=user,cpu=0}
process.cpu.usage{state=user,cpu=1}
process.cpu.usage{state=user,cpu=...}
process.cpu.usage{state=system,cpu=0}
process.cpu.usage{state=system,cpu=1}
process.cpu.usage{state=system,cpu=...}

If these were expressed as DELTA points (which requires remembering the
last value and performing subtraction), we can correctly erase the cpu
label by simply dropping it and summing the coincident points. If these
are expressed as CUMULATIVE points, erasing the cpu label requires
performing a summation, somehow.
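The DELTA case can be sketched as a drop-and-sum over coincident points. This is an illustrative sketch with an assumed point representation (a labels dict, an interval, and a value), not an existing Collector API; the safety follows from DELTA intervals being non-overlapping.

```python
from collections import defaultdict

def erase_label_delta(points, label):
    """Sketch: erase one label from DELTA points by dropping it and
    summing the points that become coincident. Each point is
    (labels_dict, interval, value)."""
    summed = defaultdict(float)
    for labels, interval, value in points:
        # Drop the unwanted label; sort the rest into a hashable key.
        reduced = tuple(sorted((k, v) for k, v in labels.items()
                               if k != label))
        summed[(reduced, interval)] += value
    return dict(summed)
```

Applied to the per-cpu series above, the points for cpu=0, cpu=1, ... within the same interval collapse into a single point per state, with their deltas summed.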

Proposed collector algorithm

I'll propose an algorithm that I think we can use in the
OTel-Collector to correctly erase labels in both of these cases. It
requires introducing a delay over which aggregation takes place for
both DELTA and CUMULATIVE points.

The OTel-Go SDK already implements a correct label "reducer"
Processor, but the SDK's job is much simpler than the Collector's in
this case, for two reasons: (1) the SDK never resets, and (2) its
collection is naturally aligned (i.e., all SumObservers have the same
TimeUnixNanos).

Step 1:

  • Input: DELTA or CUMULATIVE timeseries data with start time S.
  • Output: the same timeseries with one added label StartTime=S

Step 2: windowing / reducing

  • over a short window of time, aggregate each metric by resource and label set
  • for DELTA points, take sum
  • for CUMULATIVE points, take last value

Step 3: spatial aggregation

  • Process one interval
  • Remove any unwanted labels
  • Remove any unwanted resource attributes
  • Remove the StartTime label
  • Sum the resulting timeseries in each window
  • Output w/ a new start time (of the aggregator)

This mimics the behavior of the OTel-Go SDK by introducing a temporary
StartTime label to keep distinct timeseries separate until they are
summed. I will follow up with a more thorough explanation of this algorithm.
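The three steps above can be sketched end-to-end for CUMULATIVE input. This is a simplified illustration of the proposed algorithm, not Collector code; the point representation, function name, and parameters are assumptions made for the example.

```python
from collections import defaultdict

def reduce_labels(points, drop_labels, window_end, new_start):
    """Sketch of the proposed algorithm for CUMULATIVE points, each
    given as (labels_dict, start_time, time, value).
    Step 1 folds the start time into a temporary StartTime label;
    Step 2 keeps the last value per (label set, StartTime) within the
    window; Step 3 drops the unwanted labels plus StartTime and sums,
    emitting points with the aggregator's own start time."""
    # Steps 1 and 2: last value per (full label set, StartTime) key.
    last = {}
    for labels, start, t, value in points:
        if t > window_end:
            continue  # outside this window
        key = (tuple(sorted(labels.items())), start)  # StartTime label
        prev = last.get(key)
        if prev is None or t > prev[0]:
            last[key] = (t, value)
    # Step 3: remove unwanted labels and the StartTime label, then sum.
    summed = defaultdict(float)
    for (labels, _start), (_t, value) in last.items():
        reduced = tuple(kv for kv in labels if kv[0] not in drop_labels)
        summed[reduced] += value
    return [(dict(k), new_start, window_end, v)
            for k, v in summed.items()]
```

The temporary StartTime label is what keeps otherwise-identical series (after resource stripping) from being merged prematurely; it is discarded only at the final summation, mirroring the OTel-Go SDK's behavior.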

Labels: area:data-model (data model issues), priority:p1 (highest
priority level), spec:metrics (specification/metrics directory),
triage:accepted:needs-sponsor (ready to be implemented, but does not
yet have a specification sponsor)

Status: Spec - In Progress