Evaluate independent rules concurrently#12946

Merged
gotjosh merged 17 commits into prometheus:main from dannykopping:dannykopping/rule-deps
Jan 30, 2024

Conversation

@dannykopping
Contributor

@dannykopping dannykopping commented Oct 6, 2023

This PR aims to improve the problem of missed group evaluations by parallelizing work which can be parallelized to prevent groups from overrunning their interval.

Groups execute concurrently, while their constituent rules execute sequentially. Rules are evaluated sequentially because rules can have dependencies on one another. If one rule depends on the result of another rule, it is placed later in the order of execution by the operator.

However, when no explicit relationship exists between rules (i.e. a rule does not depend on the metric produced by a preceding recording rule), there is no reason to run them sequentially.

This PR determines whether an explicit relationship exists between two rules and, if not, evaluates those rules concurrently. The group still waits for all rules to complete in order to maintain its current behaviour (i.e. if all rules cumulatively overrun the group interval, skip the next interval). The check is relatively simple: enumerate all metrics used in each rule, and if any of those are metrics produced by a recording rule in the group, the two rules cannot execute concurrently.
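The check described above can be sketched roughly as follows. This is an illustrative simplification (the `rule` type, `independent` function, and sample metric names are all made up here); the real implementation walks the parsed PromQL AST of each expression to collect the selected metric names.

```go
package main

import "fmt"

// rule is a simplified stand-in for a Prometheus rule: the metric it
// produces (empty for alerting rules) and the metric names its
// expression selects.
type rule struct {
	produces string
	reads    []string
}

// independent reports whether two rules may be evaluated concurrently:
// neither rule may read a metric that the other produces.
func independent(a, b rule) bool {
	for _, m := range a.reads {
		if m != "" && m == b.produces {
			return false
		}
	}
	for _, m := range b.reads {
		if m != "" && m == a.produces {
			return false
		}
	}
	return true
}

func main() {
	recording := rule{produces: "job:http_requests:rate15m", reads: []string{"http_requests_total"}}
	alerting := rule{reads: []string{"job:http_requests:rate15m"}}
	unrelated := rule{produces: "job:grpc_requests:rate5m", reads: []string{"grpc_requests_total"}}

	fmt.Println(independent(recording, alerting)) // false: the alert reads the recorded metric
	fmt.Println(independent(recording, unrelated)) // true: no shared metrics
}
```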

This change does introduce some risk that multiple rules evaluating concurrently could put more bursty strain on the query engine. We may want to introduce some jitter, but wanted to open this up for discussion first.

I've implemented bounded concurrency, which would allow operators to control how much additional query load this would put on their Prometheus server. If the concurrency limit is in force, rules will evaluate sequentially as before.

@roidelapluie
Member

Thank you for your contribution.

Historically, Prometheus has ensured a sequential execution of rules, which is a significant aspect to consider. Introducing parallel evaluation could potentially lead to scenarios where large groups dominate the full concurrency lock of the PromQL engine, as opposed to the current guarantee of a maximum of one per group.

A possible alternative could be enhancing the promtool lint to provide suggestions on rule splitting, rather than attempting to deduce rule independence which might introduce complexity.

Additionally, certain rules are dependent on the ALERT and ALERTS_* metrics, and it's unclear whether this proposal has accounted for such dependencies.

@dannykopping
Contributor Author

Introducing parallel evaluation could potentially lead to scenarios where large groups dominate the full concurrency lock of the PromQL engine, as opposed to the current guarantee of a maximum of one per group.

Absolutely, I call this out in the description as well. I think this is a solvable problem, though. For example, we could limit the overall concurrency of the rule engine, introduce a priority queue, cooperative scheduling, etc. Would you be open to discussing those possibilities?

Additionally, certain rules are dependent on the ALERT and ALERTS_* metrics, and it's unclear whether this proposal has accounted for such dependencies.

Can you elaborate for me why this would be a problem? Do you mean a recording rule ordered after an alerting rule could rely on the changes to ALERTS.* for some behaviour? If so, I think we can mark that group as indeterminate which will force it to be synchronous (like if a rule does a wildcard metric query).

@roidelapluie
Member

I am not against this change but I think it should be clear to everyone that this can't be the default behaviour in Prometheus 2.x. I would also tend for it not to be the default in 3.x.

@dannykopping
Contributor Author

That's fair. It's currently behind a flag in any case so the current behaviour will not be affected.

@dannykopping
Contributor Author

dannykopping commented Oct 8, 2023

@roidelapluie can you please take another look? I've added bounded concurrency; I think this should work quite nicely.
If the concurrency limit is in force, rules will evaluate sequentially as before.

@dannykopping dannykopping marked this pull request as ready for review October 8, 2023 09:21
@dannykopping
Contributor Author

Additionally, certain rules are dependent on the ALERT and ALERTS_* metrics, and it's unclear whether this proposal has accounted for such dependencies.

Fixed in cf135c8

@beorn7 beorn7 requested a review from gouthamve October 24, 2023 11:49
@dannykopping dannykopping force-pushed the dannykopping/rule-deps branch from cf135c8 to 8f0ac08 Compare October 25, 2023 20:34
@dannykopping
Contributor Author

NOTE: I have rebased my branch after some refactoring I did in #13014

I also took the opportunity to add a feature flag after discussing with @bboreham, and made the concurrency control mechanism simpler.

@dannykopping dannykopping force-pushed the dannykopping/rule-deps branch 3 times, most recently from d9adbf4 to e5eac61 Compare October 28, 2023 11:17
@juliusv
Member

juliusv commented Nov 2, 2023

Relevant notes from the past that I don't see mentioned here yet:

That second point would strictly make it impossible to say for sure whether one rule depends on another, but it's a rare case for sure.

Member

@gouthamve gouthamve left a comment


Haven't taken a close look, but it looks OK from first glance.

How do we handle cases where the query itself overrides the metric name? E.g. if the expr is: label_replace(<EXPR>, "custom_name","$1","__name__",".+")

- record: TooManyHTTPRequests
expr: job:http_requests:rate15m > 100

- name: grpc
Member


Can we have them spliced in one test?

As in:

  1. job:http_requests:rate1m
  2. job:http_requests:rate15m
  3. job:http_requests:rate5m
  4. TooManyHTTPRequests

Contributor Author


@gouthamve I'm not understanding what you're asking, can you elaborate please?
They're all part of the same group and tested in TestBoundedRuleEvalConcurrency.

@dannykopping
Contributor Author

How do we handle cases where the query itself overrides the metric name? If the expr: label_replace(<EXPR>, "custom_name","$1","__name__",".+")

Is this valid in PromQL? I tried it but it doesn't seem to work.

[screenshot omitted: the query returns an error]

It'll be trivial to add a case for it though, if necessary.

@dannykopping
Contributor Author

dannykopping commented Nov 27, 2023

@roidelapluie @gouthamve following up on this, could I please request a review? Both the Loki & Mimir projects are pretty keen to see this merged so we can vendor it in to our projects.

As an aside, the Mimir team did an analysis and found that only 1% of their customers' rules have dependent rules. I think this is a good sign that this change will have a large impact with minimal disruption.

@beorn7
Member

beorn7 commented Nov 28, 2023

About the name changing via label_replace mentioned above: The syntax wasn't quite right, but it's possible. For example:

label_replace(up{container="prometheus"}, "__name__", "feep", "__name__", ".+")

The result is metrics with the name feep instead of up.

So I guess the moment some __name__ manipulation shows up, concurrent rule evaluation should be disabled.

@beorn7
Member

beorn7 commented Nov 28, 2023

About remote-read: Remote read can do weird things with a bunch of assumptions. Maybe let's not worry too much about it but make clear that users should only enable the feature flag when they know that no remote-read endpoint doing those weird things is involved.

@juliusv
Member

juliusv commented Nov 28, 2023

@beorn7 Doesn't the name always get overwritten by the rule name anyway? I assume this was about the dependency rule, not the dependent (where we only care about the actually selected data and not how it is renamed either).

@beorn7
Member

beorn7 commented Nov 28, 2023

I think you are right @juliusv . If recording rules always create a series whose name is the rule name, then the __name__ mangling isn't a problem.

@beorn7
Member

beorn7 commented Nov 29, 2023

Sounds like this is ready for a code-level review then.
@gouthamve @juliusv @roidelapluie could any of you take that?

I am not very familiar with the rules engine and have a huge review backlog already…

Contributor

@pracucci pracucci left a comment


Nice work Danny! The overall design LGTM. I left several minor comments and a couple of major ones.

Comment on lines +196 to +200
Rule groups execute concurrently, but the rules within a group execute sequentially; this is because rules can use the
output of a preceding rule as its input. However, if there is no detectable relationship between rules then there is no
reason to run them sequentially. This can improve rule reliability at the expense of adding more concurrent query
load. The number of concurrent rule evaluations can be configured with `--rules.max-concurrent-rule-evals` which is set
to `4` by default.
Contributor


[nit] I would clarify you're talking about the default behaviour, not when the feature flag is enabled.

Suggested change
Rule groups execute concurrently, but the rules within a group execute sequentially; this is because rules can use the
output of a preceding rule as its input. However, if there is no detectable relationship between rules then there is no
reason to run them sequentially. This can improve rule reliability at the expense of adding more concurrent query
load. The number of concurrent rule evaluations can be configured with `--rules.max-concurrent-rule-evals` which is set
to `4` by default.
By default, rule groups execute concurrently, but the rules within a group execute sequentially; this is because rules can use the
output of a preceding rule as its input. However, if there is no detectable relationship between rules then there is no
reason to run them sequentially.
When the `concurrent-rule-eval` feature flag is enabled, rules without any dependency on other rules within a rule group will be evaluated concurrently.
This can improve rule reliability at the expense of adding more concurrent query load. The number of concurrent rule evaluations can be configured
with `--rules.max-concurrent-rule-evals` which is set to `4` by default.

defer c.mu.Unlock()

// Clear out the memoized dependency maps because some or all groups may have been updated.
c.depMaps = map[*Group]dependencyMap{}
Contributor


[question] Given groups are indexed by pointer in the map, and given that whenever a group config changes a new Group is created, can you confirm the purpose of concurrentRuleEvalController.Invalidate() (I mean the actual implementation, not the interface) is to clear the cached dep map to avoid it growing indefinitely, rather than to avoid stale cache issues? I want to make sure I correctly understand it.

Contributor Author


Yes that's correct, to avoid the map growing indefinitely.

rules/manager.go Outdated
defer m.mtx.Unlock()

if m.opts.RuleConcurrencyController != nil {
m.opts.RuleConcurrencyController.Invalidate()
Contributor


The Invalidate() contract (from the interface doc) is:

	// Invalidate instructs the controller to invalidate its state.
	// This should be called when groups are modified (during a reload, for instance), because the controller may
	// store some state about each group in order to more efficiently determine rule eligibility.

I think calling Invalidate() here (even while m.mtx is locked) doesn't guard against race conditions on stale cached data, because old rule groups will keep being evaluated.

For example, consider this scenario:

  1. The controller caches the group based on the rule group filepath
  2. Manager.Update() calls Invalidate()
  3. The old rule group at filepath X evaluates, calling RuleEligible(), the info is cached
  4. Manager.Update() reaches the point (below) where the old rule group is stopped and the new is loaded
  5. The next RuleEligible() call for the rule group at filepath X will return stale data

I intentionally picked "The controller caches the group based on the rule group filepath" example, because I think your implementation concurrentRuleEvalController doesn't suffer this problem, given it caches the Group by pointer.

However, since RuleConcurrencyController can be configured in the options and the interface contract doesn't say how a Group is supposed to be cached, I think it may lead to a potential misbehaviour by upstream projects.

I think solving this problem here in Manager.Update() is non-trivial. Even if you move the call to Invalidate() below, there will be such a race condition anyway.

I'm wondering if we should change the RuleConcurrencyController interface contract instead.

Contributor


I'm wondering if we should change the RuleConcurrencyController interface contract instead.

I may argue that we don't need Invalidate() at all. The controller implementation may use Group.Equals() to check whether the input Group is equal to the cached one. However, the annoying thing would be removing stale entries from the cache (e.g. you may need a LRU cache, instead of a very simple map like you're doing in concurrentRuleEvalController).
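The Equals()-based alternative suggested above can be sketched like this. This is a hypothetical design sketch, not the PR's implementation: `depCache`, `entry`, and the `group` stand-in are all made up for illustration. Each cache entry stores a snapshot of the group it was built for and is revalidated on lookup, so no explicit Invalidate() is needed; as noted in the comment, stale entries would still accumulate without an LRU or similar eviction.

```go
package main

import "fmt"

// group is a stand-in for rules.Group; Equals mirrors the role of
// Group.Equals() in the real codebase.
type group struct{ name, file string }

func (g group) Equals(o group) bool { return g == o }

type entry struct {
	snapshot group           // the group as it was when the entry was built
	deps     map[string]bool // the cached dependency information
}

// depCache caches per-group dependency maps keyed by pointer, but
// validates each hit against the stored snapshot instead of relying
// on an explicit Invalidate() call.
type depCache struct {
	entries map[*group]entry
}

func (c *depCache) get(g *group, build func() map[string]bool) map[string]bool {
	if e, ok := c.entries[g]; ok && e.snapshot.Equals(*g) {
		return e.deps // cache hit: the group is unchanged
	}
	deps := build() // miss or changed group: rebuild
	c.entries[g] = entry{snapshot: *g, deps: deps}
	return deps
}

func main() {
	c := &depCache{entries: map[*group]entry{}}
	g := &group{name: "a", file: "rules.yml"}
	builds := 0
	build := func() map[string]bool { builds++; return map[string]bool{} }

	c.get(g, build)
	c.get(g, build) // hit: snapshot still matches
	g.name = "b"    // simulate a modified group (in-place, for brevity)
	c.get(g, build) // changed: rebuilt
	fmt.Println(builds) // 2
}
```

Note that in Prometheus a config change produces a new Group object rather than mutating the old one, which is why the pointer-keyed cache in the PR avoids staleness in the first place; the snapshot check only matters for implementations that key the cache some other way.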

Contributor Author


We agreed to leave this as-is for now, and follow up with a subsequent change; we'll move the independent flag to the rule itself at first evaluation.

Contributor

@pracucci pracucci Jan 19, 2024


we'll move the independent flag to the rule itself at first evaluation

I will take care of that follow up PR, if this PR will get accepted.

Contributor

@pracucci pracucci Feb 2, 2024


See my proposal here: #13527

},
}

inflightTracker := func(ctx context.Context) {
Contributor


Why do we need this at all? Can't we just keep track of the high watermark in QueryFunc(), right after the call to inflightQueries.Add(1)?

Contributor

@pracucci pracucci left a comment


Nice work @dannykopping ! The changes LGTM. I've started testing these changes in Mimir, as part of a bigger work we're doing and I haven't found any issue.

rules/manager.go Outdated
defer m.mtx.Unlock()

if m.opts.RuleConcurrencyController != nil {
m.opts.RuleConcurrencyController.Invalidate()
Contributor

@pracucci pracucci Jan 19, 2024


we'll move the independent flag to the rule itself at first evaluation

I will take care of that follow up PR, if this PR will get accepted.

Member

@gotjosh gotjosh left a comment


Overall, this LGTM code-wise! Great job @dannykopping on making this complicated change easy to follow.

Most of my suggestions are very minor nits that you can choose to implement or not. The only ones I expect to be addressed are the comments and the extra unit tests for the dependency map building.

@dannykopping, there's one more thing - could we move the code within manager.go and group.go to a separate file? Given this behaviour is completely optional, I don't think someone looking to understand or modify rule evaluation within Prometheus should have a need to read this unless they want to cover that specific use-case.

rules/group.go Outdated
}
}

if g.metrics != nil {
Member


Great question, we already have a nil check in line 97:

	if metrics == nil {
		metrics = NewGroupMetrics(o.Opts.Registerer)
	}

This should be removed.


dannykopping and others added 10 commits January 29, 2024 10:07
NOTE:
Rebased from main after refactor in prometheus#13014

Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Updated & added tests
Review feedback nits
Return empty map if not indeterminate
Use highWatermark to track inflight requests counter
Appease the linter
Clarify feature flag

Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci force-pushed the dannykopping/rule-deps branch from 8d429da to 6bbb03b Compare January 29, 2024 09:14
}

concurrencyController := o.Opts.RuleConcurrencyController
if concurrencyController == nil {
Contributor


Note to reviewers: this is useful for unit tests, to avoid having to explicitly set it everywhere.

}
}(i, rule)
}
if g.metrics != nil {
Contributor


Note to reviewers: the check has been removed because it is superfluous.

Member

@gotjosh gotjosh left a comment


LGTM - thank you for taking the time to address my comments!

@gotjosh
Member

gotjosh commented Jan 29, 2024

I'll let this bake for the next 24 hours in case any other maintainers have any more thoughts. The way that @pracucci has chosen to handle the comments (addressing each comment by commit) indicates the direction of the feature has not changed too much over the past couple of weeks.
