Stop removing __name__ from results: prevent vector cannot contain metrics with the same labelset #11397

@samjewell

This issue was previously named:

Identify highest DPM series via PromQL

Problem / Use-Case

As an admin, manager, or operator of a Prometheus instance, one of my use-cases is to understand, and then reduce, the usage on the instance and the cost of running it. Sometimes a few metrics have a very high DPM (data points per minute, i.e. a high scrape frequency / small scrape interval), which increases "usage" on the instance and increases costs. For a multi-tenant project that implements PromQL (such as Mimir), this appears as a DPM overage and an increase in the dollar costs for that tenant.
The admin then needs to identify which metrics are responsible for the high DPM, i.e. which are the worst offenders. I've tried to find these with the following query:

topk(10, count_over_time({__name__!=""}[1m]))

Unfortunately this fails with:

execution: vector cannot contain metrics with the same labelset

Boiling this example down a bit, the error is already present with the query count_over_time({__name__!=""}[1m]), and it occurs even on my small instance, which only has 4,200 metrics. Google suggests that a typical workaround for this error is to use label_replace to copy the original metric name into a temporary label. Incorporating this fix, I get the following query:

topk(10, count_over_time(label_replace({__name__!=""},"name_label","$1","__name__", "(.+)")[1m:]))

But I've now introduced another problem. (Can you spot it?) I had to switch from [1m] to [1m:], i.e. convert to a subquery, to avoid parse error: ranges only allowed for vector selectors. But with a subquery, count_over_time no longer counts the DPM of the underlying metrics/series; instead it counts the subquery evaluations during the interval in question!
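To make the difference concrete, here is a toy model in Python. It is not Prometheus code: `raw_count`, `subquery_count`, the 15s step, and the 5-minute lookback are assumptions chosen for illustration, and staleness handling is ignored. It only shows that a range selector counts stored samples, while a subquery counts evaluation steps.

```python
def raw_count(sample_times, end, window):
    """Toy count_over_time over a range selector: samples in (end - window, end]."""
    return sum(1 for t in sample_times if end - window < t <= end)


def subquery_count(sample_times, end, window, step, lookback=300):
    """Toy count_over_time over a subquery: evaluation steps that return a point."""
    # First step-aligned evaluation time strictly after (end - window).
    first = ((end - window) // step + 1) * step
    count = 0
    for eval_t in range(first, end + 1, step):
        # An instant evaluation returns a point if any sample lies in the lookback window.
        if any(eval_t - lookback < t <= eval_t for t in sample_times):
            count += 1
    return count


# Two hypothetical series: one scraped every 5s, one every 30s (timestamps in seconds).
fast = list(range(0, 601, 5))
slow = list(range(0, 601, 30))
end, window, step = 600, 60, 15

fast_raw = raw_count(fast, end, window)
slow_raw = raw_count(slow, end, window)
fast_sub = subquery_count(fast, end, window, step)
slow_sub = subquery_count(slow, end, window, step)
print(fast_raw, slow_raw, fast_sub, slow_sub)
```

In this model the raw counts differ between the two series, while the subquery counts are identical, because the subquery result depends only on the step and not on how often each series was scraped.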

So my question is this:
How can I identify the series that have the highest DPM via PromQL, so I can effectively manage cost and usage on my instance? (Or, if not via PromQL, how would we change Prometheus to better support this use-case, and how would operators of multi-tenant Prometheus services such as Mimir help their tenants achieve the same?)

Can we make changes to Prometheus to better support this use-case?

Note: I'm aware that the query I'm trying to run can be very computationally expensive, and may fail if the result-set is too big. But I'm not aware of alternatives for the case I'm describing.

Proposal / Ideas

I believe vector cannot contain metrics with the same labelset happens because Prometheus removes the metric name when performing count_over_time. It does this for any function which changes the 'dimensions' of the metric. This is an opinionated decision taken by the project in the past. I think the opinion is that 'no metric name is better than the wrong metric name', and metric names typically do include dimensions (such as …_seconds). But personally (as an instance admin, and as someone who builds features for the instance-admin persona) I've hit this error message a few times, and found it a barrier to learning and using Prometheus.
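A small Python sketch of that suspected mechanism, under the assumption that a series is identified purely by its label set. The series, `drop_name`, and `duplicate_labelsets` are hypothetical and are not taken from the Prometheus codebase.

```python
from collections import Counter


def drop_name(labels):
    """Return the label set without __name__, as a name-dropping function would."""
    return {k: v for k, v in labels.items() if k != "__name__"}


def duplicate_labelsets(series):
    """List the label sets that occur more than once after dropping __name__."""
    counts = Counter(frozenset(drop_name(s).items()) for s in series)
    return [dict(key) for key, n in counts.items() if n > 1]


# Hypothetical series: the first two differ only in their metric name.
series = [
    {"__name__": "http_requests_total", "job": "api", "instance": "a:9090"},
    {"__name__": "http_errors_total", "job": "api", "instance": "a:9090"},
    {"__name__": "up", "job": "db", "instance": "b:9090"},
]

dupes = duplicate_labelsets(series)
print(dupes)
```

Any two metrics exposed by the same target share every label except `__name__`, so a selector such as `{__name__!=""}` would be expected to produce such collisions on almost any instance.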

Some ideas then, to start some discussion:

  • Somehow always stop users from hitting the error vector cannot contain metrics with the same labelset, while also retaining the rule that "we mustn't show the wrong metric name".

    • Retain the metric name, but add a suffix "…DIMS_CHANGED" (either under __name__ or possibly a different label-key, such as original_name)
  • Somehow always stop users from hitting the error parse error: ranges only allowed for vector selectors when they have been forced to use label_replace (by the metric name being removed), i.e. have Prometheus treat label_replace not as a true part of the query which needs to be evaluated, but as a simple rename which could be applied separately at the start/end of the querying process...? I lack the words to describe this as I don't know the codebase, but I discussed this with @beorn7 a few months ago, and he thought it was a possibility (although maybe not until Prom v3).

  • Could we implement the keep_metric_names option that VictoriaMetrics has implemented?

  • “extend the cardinality API to cover time ranges” (Bryan Boreham’s suggestion - I think for a Mimir implementation)
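The first idea above (keeping the name under a different label key such as original_name) can be sketched as a toy model in Python. This is only an illustration of the intent, not a proposed implementation: `move_name`, `has_duplicates`, and the sample series are hypothetical.

```python
def move_name(labels, target="original_name"):
    """Hypothetical: move __name__ into another label instead of discarding it."""
    out = {k: v for k, v in labels.items() if k != "__name__"}
    out[target] = labels["__name__"]
    return out


def has_duplicates(series):
    """True if two series in the result share exactly the same label set."""
    keys = [frozenset(s.items()) for s in series]
    return len(keys) != len(set(keys))


# Hypothetical series that differ only in their metric name.
series = [
    {"__name__": "http_requests_total", "job": "api"},
    {"__name__": "http_errors_total", "job": "api"},
]

# Current behaviour as described above: the name is dropped entirely.
dropped = [{k: v for k, v in s.items() if k != "__name__"} for s in series]
# Sketched alternative: the name survives under original_name.
moved = [move_name(s) for s in series]

print(has_duplicates(dropped), has_duplicates(moved))
```

In this model the result never carries a misleading `__name__`, yet the series stay distinguishable, which is the property the idea is after.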
