Stop removing __name__ from results: prevent vector cannot contain metrics with the same labelset #11397
Description
This issue was previously named:
Identify highest DPM series via PromQL
Problem / Use-Case
As an Admin/Manager/Operator of a Prometheus instance, one use-case is to understand, then manage down, the usage on the instance and the cost of running it. Sometimes a few metrics have a very high DPM (data points per minute, i.e. a high scrape frequency / small scrape interval), which increases "usage" on the instance and increases costs. For a multi-tenant project which implements PromQL (such as Mimir), this appears as a DPM overage and an increase in the dollar costs for that tenant.
The Admin then needs to identify which metrics are responsible for the high DPM; which are the worst offenders. I've tried to find these with the following query:
topk(10, count_over_time({__name__!=""}[1m]))
Unfortunately this fails with:
execution: vector cannot contain metrics with the same labelset
Boiling this example down a bit, the error is also present with the plain query count_over_time({__name__!=""}[1m]), even on my small instance, which only has 4,200 metrics. A Google search suggests that a typical workaround for this error is to use label_replace to copy the original metric name into a temporary label. Incorporating this fix, I get to this query:
topk(10, count_over_time(label_replace({__name__!=""},"name_label","$1","__name__", "(.+)")[1m:]))
But I've now introduced another problem. (Can you spot it?) I had to switch from [1m] to [1m:], i.e. convert to a subquery, to avoid parse error: ranges only allowed for vector selectors. But after switching to a subquery, count_over_time no longer counts the DPM of the underlying metrics/series; instead it counts the subquery evaluations during the interval in question!
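To make the difference concrete, here is a toy Python model (not Prometheus code) of the two counting behaviours. It assumes timestamps in seconds, a left-open window, a subquery step of 15s standing in for the default resolution, and the default 5m lookback; the function names and sample data are hypothetical.

```python
# Toy model only: timestamps are plain integers in seconds.

def count_raw_samples(sample_times, eval_time, window):
    """Model count_over_time(metric[window]): count stored samples in the window."""
    return sum(1 for t in sample_times if eval_time - window < t <= eval_time)

def subquery_steps(eval_time, window, step):
    """Step-aligned evaluation timestamps inside (eval_time - window, eval_time]."""
    first = ((eval_time - window) // step + 1) * step
    return list(range(first, eval_time + 1, step))

def count_subquery_points(sample_times, eval_time, window, step, lookback=300):
    """Model count_over_time(expr[window:step]): count steps that yield a value.

    At each step the inner instant selector returns the latest sample within
    the lookback period, so one point is produced per step, however many
    samples were actually scraped.
    """
    count = 0
    for ts in subquery_steps(eval_time, window, step):
        if any(ts - lookback < t <= ts for t in sample_times):
            count += 1
    return count

fast = list(range(0, 601, 5))    # hypothetical series scraped every 5s
slow = list(range(0, 601, 30))   # hypothetical series scraped every 30s

for name, samples in (("fast", fast), ("slow", slow)):
    print(name,
          count_raw_samples(samples, 602, 60),
          count_subquery_points(samples, 602, 60, 15))
```

In this model the raw counts differ (12 versus 2 samples in the last minute), while the subquery counts are both 4, because they only reflect the number of evaluation steps.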
So my question is this:
How can I identify the series that have the highest DPM via PromQL, so I can effectively manage cost and usage on my instance? (Or if it weren't via PromQL, how would we change Prometheus to better support this use-case, and how would operators of multi-tenant Prometheus services such as Mimir support their tenants to achieve the same).
Can we make changes to Prometheus to better support this use-case?
Note: I'm aware that the query I'm trying to run can be very computationally expensive, and may fail if the result-set is too big. But I'm not aware of alternatives for the case I'm describing.
Proposal / Ideas
I believe vector cannot contain metrics with the same labelset happens because Prometheus removes the metric name when performing count_over_time. It does this for any function which changes the 'dimensions' of the metric. This is an opinionated decision taken by the project in the past. I think the opinion is that 'no metric name is better than the wrong metric name', and metric names typically do include dimensions (such as …_seconds). But personally (as an instance admin, and as someone who builds features for the instance-admin persona) I've hit this error message a few times, and found it a barrier to learning and using Prometheus.
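A minimal Python sketch of that suspected mechanism, assuming series are modelled as plain label dictionaries (the metric names and label values below are made up for illustration): once __name__ is dropped, two series that differed only by name end up with identical label sets.

```python
def drop_metric_name(labels):
    """Return a copy of the label set without __name__."""
    return {k: v for k, v in labels.items() if k != "__name__"}

def find_duplicate_labelsets(series):
    """Return pairs of metric names whose label sets collide once __name__ is gone."""
    seen = {}
    duplicates = []
    for labels in series:
        key = tuple(sorted(drop_metric_name(labels).items()))
        if key in seen:
            duplicates.append((seen[key], labels["__name__"]))
        else:
            seen[key] = labels["__name__"]
    return duplicates

# Hypothetical series: the first two differ only in their metric name.
series = [
    {"__name__": "http_requests_total", "job": "api", "instance": "a:9090"},
    {"__name__": "http_request_duration_seconds_count", "job": "api", "instance": "a:9090"},
    {"__name__": "up", "job": "node", "instance": "b:9100"},
]

print(find_duplicate_labelsets(series))
```

With a selector as broad as {__name__!=""}, such collisions are almost guaranteed, since most targets expose many metrics that share the same job and instance labels.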
Some ideas then, to start some discussion:
- Somehow always stop users from hitting the error vector cannot contain metrics with the same labelset, while also retaining the rule that "we mustn't show the wrong metric name".
  - Retain the metric name, but add a suffix "…DIMS_CHANGED" (either under __name__ or possibly a different label-key, such as original_name)
- Somehow always stop users from hitting the error parse error: ranges only allowed for vector selectors when they have been forced to use label_replace (by the metric name being removed), i.e. have Prometheus consider label_replace not as a true part of the query which needs to be evaluated, but as a simple rename which can happen somehow differently at the start/end of the querying process...? I lack the words to describe this as I don't know the codebase, but I discussed this with @beorn7 a few months ago, and he thought it was a possibility (although maybe not until Prom v3).
- Could we implement the keep_metric_names option that VictoriaMetrics has implemented?
- “extend the cardinality API to cover time ranges” (Bryan Boreham’s suggestion - I think for a Mimir implementation)
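As a rough illustration of the first idea, the following Python sketch models the variant in which the name is kept under a different label key. It is not how Prometheus behaves today; the helper names and sample series are hypothetical, and original_name is simply the label key floated above.

```python
def move_name_to_label(labels, target="original_name"):
    """Drop __name__ but preserve its value under another label key (hypothetical behaviour)."""
    out = {k: v for k, v in labels.items() if k != "__name__"}
    if "__name__" in labels:
        out[target] = labels["__name__"]
    return out

def all_unique(labelsets):
    """True if no two label sets in the result vector are identical."""
    keys = [tuple(sorted(ls.items())) for ls in labelsets]
    return len(keys) == len(set(keys))

# Hypothetical series that differ only in their metric name.
series = [
    {"__name__": "http_requests_total", "job": "api"},
    {"__name__": "http_request_duration_seconds_count", "job": "api"},
]

dropped = [{k: v for k, v in s.items() if k != "__name__"} for s in series]
moved = [move_name_to_label(s) for s in series]

print(all_unique(dropped))  # label sets collide once the name is simply dropped
print(all_unique(moved))    # label sets stay distinct when the name is preserved
```

The result vector stays free of duplicates without claiming that the output is still the original metric, which is the trade-off the first idea is aiming for.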