
Add reason dimension to exporter and receiver failure metrics#10158

Closed
0x006EA1E5 wants to merge 1 commit into open-telemetry:main from 0x006EA1E5:exporter-failure-reason-attribute

Conversation

@0x006EA1E5

Description

Adds a reason attribute to the otelcol_exporter_send_failed_* and otelcol_receiver_refused_* metrics.

Link to tracking issue

Fixes #10157

Testing

TODO

Documentation

TODO

@0x006EA1E5 0x006EA1E5 requested review from a team and djaglowski May 15, 2024 13:57
@linux-foundation-easycla

linux-foundation-easycla Bot commented May 15, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: 0x006EA1E5 / name: Greg Eales (59f17fd)

@0x006EA1E5 0x006EA1E5 force-pushed the exporter-failure-reason-attribute branch from 80426d9 to 47a2db5 Compare May 15, 2024 13:59
@0x006EA1E5 0x006EA1E5 force-pushed the exporter-failure-reason-attribute branch from 47a2db5 to 59f17fd Compare May 15, 2024 14:03
@bogdandrutu
Member

In general things that are high cardinality like generic "errors" are not best suited for metrics, and usually they should just be recorded like logs or span attributes.

@0x006EA1E5
Author

0x006EA1E5 commented May 16, 2024

> In general things that are high cardinality like generic "errors" are not best suited for metrics, and usually they should just be recorded like logs or span attributes.

The suggestion is to use the gRPC status code, not the actual error text; see https://github.com/grpc/grpc-go/blob/master/codes/codes.go#L37

So the cardinality is bounded at around 17, the number of defined gRPC status codes.

A status code is commonly used as a metric dimension, for example, for HTTP metrics.

And typically (in my experience with the collector), the number of distinct statuses actually seen in responses is much lower, so time series will not be generated for most of the possible values. In practice, the error is normally UNAVAILABLE, with far fewer UNKNOWN, DEADLINE_EXCEEDED, and RESOURCE_EXHAUSTED, so I wouldn't expect overall cardinality to increase by much.

I'm also suggesting that we only add this dimension when the telemetry level is configured as LevelDetailed, so users worried about cardinality have some control here.

@bogdandrutu
Member

> The suggestion is to use the gRPC status code, not the actual error text, i.e., https://github.com/grpc/grpc-go/blob/master/codes/codes.go#L37

Why not accept the code then?

@0x006EA1E5
Author

> Why not accept the code then?

I don't understand. Do you mean use the numeric status code instead of the status code text?

@0x006EA1E5
Author

I don't mind using either the numeric code or the equivalent name, although I think the name is more informative and easier to read.

And since the exporter could be using either gRPC or HTTP (or potentially another protocol), a gRPC status code number may be a bit confusing.

@0x006EA1E5
Author

Is there anything I can do to progress this?

@atoulme
Contributor

atoulme commented Jun 5, 2024

You can join a SIG meeting; we have one for the collector in 10 minutes. It runs weekly.

@codeboten
Contributor

There's an OTEP that discusses additional details around monitoring a telemetry pipeline, open-telemetry/oteps#259; it might be worth taking a look there as well.

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions Bot added the Stale label Jun 21, 2024
@github-actions
Contributor

github-actions Bot commented Jul 6, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions Bot closed this Jul 6, 2024
Development

Successfully merging this pull request may close these issues.

Add 'reason' attribute to otelcol_exporter_send_failed_* metrics

4 participants