Evaluate independent rules concurrently#12946
Conversation
|
Thank you for your contribution. Historically, Prometheus has ensured sequential execution of rules within a group, which is a significant aspect to consider. Introducing parallel evaluation could potentially lead to scenarios where large groups dominate the full concurrency lock of the PromQL engine, as opposed to the current guarantee of a maximum of one per group. A possible alternative could be enhancing the […]. Additionally, certain rules depend on the ALERT and ALERTS_* metrics, and it's unclear whether this proposal has accounted for such dependencies. |
Absolutely, I call this out in the description as well. I think this is a solvable problem, though. For example, we could limit the overall concurrency of the rule engine, introduce a priority queue, cooperative scheduling, etc. Would you be open to discussing those possibilities?
Can you elaborate on why this would be a problem? Do you mean a recording rule ordered after an alerting rule could rely on the changes to […]? |
|
I am not against this change but I think it should be clear to everyone that this can't be the default behaviour in Prometheus 2.x. I would also tend for it not to be the default in 3.x. |
|
That's fair. It's currently behind a flag in any case so the current behaviour will not be affected. |
|
@roidelapluie can you please take another look? I've added bounded concurrency; I think this should work quite nicely. |
Fixed in cf135c8 |
|
Relevant notes from the past that I don't see mentioned here yet:
That second point would strictly make it impossible to say for sure whether one rule depends on another, but it's a rare case for sure. |
gouthamve
left a comment
Haven't taken a close look, but it looks OK from first glance.
How do we handle cases where the query itself overrides the metric name? For example, if the expr is: `label_replace(<EXPR>, "custom_name", "$1", "__name__", ".+")`
```yaml
- record: TooManyHTTPRequests
  expr: job:http_requests:rate15m > 100
```

```yaml
- name: grpc
```
Can we have them spliced in one test?
As in:
- job:http_requests:rate1m
- job:http_requests:rate15m
- job:http_requests:rate5m
- TooManyHTTPRequests
@gouthamve I'm not understanding what you're asking, can you elaborate please?
They're all part of the same group and tested in TestBoundedRuleEvalConcurrency.
|
@roidelapluie @gouthamve following up on this, could I please request a review? Both the Loki & Mimir projects are pretty keen to see this merged so we can vendor it into our projects. As an aside, the Mimir team did an analysis and found that only 1% of their customers' rules have dependent rules. I think this is a good sign that this change will have a large impact with minimal disruption. |
|
About the name changing via `label_replace`: the result is metrics with the name […]. So I guess the moment some […] |
|
About remote-read: remote read can do weird things with a bunch of assumptions. Maybe let's not worry too much about it, but make clear that users should only enable the feature flag when they know that no remote-read endpoint doing those weird things is involved. |
|
@beorn7 Doesn't the name always get overwritten by the rule name anyway? I assume this was about the dependency rule, not the dependent (where we only care about the actually selected data and not how it is renamed either). |
|
I think you are right @juliusv. If recording rules always create a series with the rule name, then the […] |
|
Sounds like this is ready for a code-level review then. I am not very familiar with the rules engine and have a huge review backlog already… |
pracucci
left a comment
Nice work Danny! The overall design LGTM. I left several minor comments and a couple of major ones.
docs/feature_flags.md
Outdated
> Rule groups execute concurrently, but the rules within a group execute sequentially; this is because rules can use the output of a preceding rule as its input. However, if there is no detectable relationship between rules then there is no reason to run them sequentially. This can improve rule reliability at the expense of adding more concurrent query load. The number of concurrent rule evaluations can be configured with `--rules.max-concurrent-rule-evals` which is set to `4` by default.
[nit] I would clarify that you're talking about the default behaviour, not the behaviour when the feature flag is enabled.
Suggested change:

> By default, rule groups execute concurrently, but the rules within a group execute sequentially; this is because rules can use the output of a preceding rule as its input. However, if there is no detectable relationship between rules then there is no reason to run them sequentially.
>
> When the `concurrent-rule-eval` feature flag is enabled, rules without any dependency on other rules within a rule group will be evaluated concurrently. This can improve rule reliability at the expense of adding more concurrent query load. The number of concurrent rule evaluations can be configured with `--rules.max-concurrent-rule-evals` which is set to `4` by default.
```go
	defer c.mu.Unlock()

	// Clear out the memoized dependency maps because some or all groups may have been updated.
	c.depMaps = map[*Group]dependencyMap{}
```
[question] Given groups are indexed by pointer in the map, and given that whenever a group's config changes a new Group is created, can you confirm that the purpose of `concurrentRuleEvalController.Invalidate()` (I mean the actual implementation, not the interface) is to clear the cached dependency map to avoid it growing indefinitely, rather than to avoid stale-cache issues? I want to make sure I correctly understand it.
Yes that's correct, to avoid the map growing indefinitely.
rules/manager.go
Outdated
```go
	defer m.mtx.Unlock()

	if m.opts.RuleConcurrencyController != nil {
		m.opts.RuleConcurrencyController.Invalidate()
```
The `Invalidate()` contract (from the interface doc) is:

```go
// Invalidate instructs the controller to invalidate its state.
// This should be called when groups are modified (during a reload, for instance), because the controller may
// store some state about each group in order to more efficiently determine rule eligibility.
```

I think calling `Invalidate()` here (even while `m.mtx` is locked) doesn't prevent race conditions on stale cached data, because old rule groups will keep being evaluated. For example, consider this scenario:

1. The controller caches the group based on the rule group filepath.
2. `Manager.Update()` calls `Invalidate()`.
3. The old rule group at filepath X evaluates, calling `RuleEligible()`; the info is cached.
4. `Manager.Update()` reaches the point (below) where the old rule group is stopped and the new one is loaded.
5. The next `RuleEligible()` call for the rule group at filepath X will return stale data.

I intentionally picked "the controller caches the group based on the rule group filepath" as the example, because I think your implementation `concurrentRuleEvalController` doesn't suffer from this problem, given it caches the Group by pointer. However, since `RuleConcurrencyController` can be configured in the options and the interface contract doesn't say how a Group is supposed to be cached, I think it may lead to potential misbehaviour by upstream projects.

I think solving this problem here in `Manager.Update()` is non-trivial. Even if you move the call to `Invalidate()` below, there will be such a race condition anyway.

I'm wondering if we should change the `RuleConcurrencyController` interface contract instead.
I'm wondering if we should change the RuleConcurrencyController interface contract instead.
> I'm wondering if we should change the RuleConcurrencyController interface contract instead.

I may argue that we don't need `Invalidate()` at all. The controller implementation may use `Group.Equals()` to check whether the input Group is equal to the cached one. However, the annoying thing would be removing stale entries from the cache (e.g. you may need an LRU cache instead of a very simple map like the one in `concurrentRuleEvalController`).
We agreed to leave this as-is for now, and follow up with a subsequent change; we'll move the independent flag to the rule itself at first evaluation.
> we'll move the independent flag to the rule itself at first evaluation

I will take care of that follow-up PR, if this PR gets accepted.
rules/manager_test.go
Outdated
```go
		},
	}

	inflightTracker := func(ctx context.Context) {
```
Why do we need this at all? Can't we just keep track of the high watermark in `QueryFunc()`, right after the call to `inflightQueries.Add(1)`?
pracucci
left a comment
Nice work @dannykopping! The changes LGTM. I've started testing these changes in Mimir, as part of a bigger piece of work we're doing, and I haven't found any issues.
Overall, this LGTM code-wise! Great job @dannykopping on making this complicated change easy to follow.
Most of my suggestions are very minor nits that you can choose to implement or not. The only ones I expect to be addressed are the comments and the extra unit tests for the dependency map building.
@dannykopping, there's one more thing: could we move the code within manager.go and group.go to a separate file? Given this behaviour is completely optional, I don't think someone looking to understand or modify rule evaluation within Prometheus should need to read it unless they want to cover this specific use-case.
rules/group.go
Outdated
```go
	}
}

if g.metrics != nil {
```
Great question; we already have a nil check on line 97:

```go
if metrics == nil {
	metrics = NewGroupMetrics(o.Opts.Registerer)
}
```

This should be removed.
NOTE: Rebased from main after refactor in prometheus#13014 Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
…ch request Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Updated & added tests
Review feedback nits
Return empty map if not indeterminate
Use highWatermark to track inflight requests counter
Appease the linter
Clarify feature flag
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
…aluation has done Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
```go
	}

	concurrencyController := o.Opts.RuleConcurrencyController
	if concurrencyController == nil {
```
Note to reviewers: this is useful for unit tests, to avoid having to explicitly set it everywhere.
```go
			}
		}(i, rule)
	}
	if g.metrics != nil {
```
Note to reviewers: the check has been removed because it was superfluous.
gotjosh
left a comment
LGTM - thank you for taking the time to address my comments!
|
I'll let this bake for the next 24 hours in case any other maintainers have more thoughts. The way @pracucci has chosen to handle the comments (addressing each comment with a dedicated commit) indicates the direction of the feature has not changed much over the past couple of weeks. |

This PR aims to improve the problem of missed group evaluations by parallelizing work which can be parallelized to prevent groups from overrunning their interval.
Groups execute concurrently, while their constituent rules execute sequentially. Rules are evaluated sequentially because they can have dependencies on one another: if one rule depends on the result of another, the operator places it later in the execution order.
However, when no explicit relationship exists between rules (i.e. a rule does not depend on the resulting metric of a preceding recording rule), there is no reason to run these rules sequentially.
This PR determines whether an explicit relationship can be found between two rules and, if not, evaluates the rule concurrently. The group will still wait for all rules to complete in order to maintain its current behaviour (i.e. if all rules cumulatively overrun the group interval, skip the next interval). The detection is relatively simple: enumerate all metrics used in each rule, and if any of those metrics are produced by a recording rule in the group, neither rule can execute concurrently.
This change does introduce some risk that multiple rules evaluating concurrently could put more bursty strain on the query engine. We may want to introduce some jitter, but I wanted to open this up for discussion first. I've since implemented bounded concurrency, which allows operators to control how much additional query load this puts on their Prometheus server. If the concurrency limit is reached, rules will evaluate sequentially as before.